Mining Quantitative Association Rules in Practice

Building a working data mining system: architecture, workflow and presentation of data mining results

DMITRY PALAGIN

Master's Thesis in Computer Science (30 ECTS credits) at the School of Computer Science and Engineering, Royal Institute of Technology, Stockholm, Sweden 2009. Supervisor at CSC was Stefan Nilsson. Examiner was Stefan Arnborg.

TRITA-CSC-E 2009:005
ISRN-KTH/CSC/E--09/005--SE
ISSN-1653-5715

Royal Institute of Technology
School of Computer Science and Communication
KTH CSC
SE-100 44 Stockholm, Sweden
URL: www.csc.kth.se

Abstract

In this thesis, we present a practical framework for mining quantitative association rules and give recommendations regarding the presentation of the discovered rules. A system for mining and working with quantitative association rules was designed. We suggest a general system architecture that covers a range of practical applications, spreads logically separate functionality across several independent modules, and scales well with the number of clients served by the system. Further, we motivate the choice of a suitable mining technique and present an alternative mechanism for generating rules from trend rule candidates. A workflow allowing users without a previous background in data mining to make effective use of mined rules in their everyday work is described. Two models for assessing the interestingness of rules are suggested. Impact is an interestingness measure based on the difference between the observed average and the expected average, as well as on the number of data records behind the rule. Hotness is a user-specific interestingness value combining the statistical significance of the rule with the collective intelligence of all users and the personal preferences of the current user.
The segment-to-item technique for visualizing multiple rules in a compact and accessible format was developed, as well as the metasegment-to-item approach, which is a further generalization of this idea. In addition, ways of using data visualization to facilitate the analysis of individual rules are described. A working system based on these ideas was implemented and integrated into an industrial software product, proving that the developed techniques and designs are feasible in a practical application.

Sammanfattning

Title: Quantitative association rules in practice. Building a working data mining system: architecture, workflow and presentation of results.

In this thesis we present a practical framework for mining quantitative association rules and give recommendations on how the discovered rules can be presented. A system for mining and working with quantitative association rules has been developed. We propose a general system architecture that covers several practical application areas, divides logically separate functionality between several independent modules, and scales with the number of clients served by the system. We then motivate the choice of a suitable mining technique and present an alternative mechanism for generating rules from trend rule candidates. Further, we describe a workflow that enables users without a previous background in data mining to make effective use of the mining results in their daily work. Two models for assessing the relevance of rules are proposed. Impact is a relevance measure based on the difference between the observed and the expected average, as well as on the number of data cases behind the rule. Hotness is a user-specific relevance measure that combines the statistical significance of the rule with the aggregated knowledge of all users and the individual preferences of the current user.

The segment-to-item technique, which can be used to visualize a number of rules in a compact and accessible format, has been developed, as has the metasegment-to-item technique, a more general extension of this idea. We also describe different ways of using visualization as a means of analyzing individual rules. A working system based on these ideas has been implemented and integrated into a commercial software product, thereby demonstrating the feasibility of the presented techniques and models in practical applications.

Preface

I would like to express my gratitude to everybody who helped me, directly or indirectly, in the course of the work presented in this paper. In particular, I want to thank (in alphabetical order):

· Al for keeping my email box filled up.
· Amit for making me feel welcome at the Red Cottage Inn.
· David for giving up his office so easily and for the wonderful stickers on his laptop.
· Den for knowing everything about chemistry, and for his research on the electronic structure and geometry of anionic copper clusters in particular.
· Elise for teaching me to drink coffee without milk and sugar, for letting me use the municipal grand piano, and for repeatedly picking me up from work late at night.
· Erik for patiently answering my countless configuration questions and for giving me shelter during my time as homeless in Oslo.
· Erin for showing me around.
· Erling for nearly magically patching a running application on a remote server before a workshop with an important client.
· Georg and Guro for giving me reasons to stay late at work and come to the office on weekends.
· Juan Pablo for the discussion regarding career.
· Lars Christian for lending me his guidebook to San Francisco.
· Lise for being a delightful shopping mate.
· Max for consistently beating me in table tennis and for cleaning out the office kitchen pipes with me on a beautiful summer Saturday (and for helping me out with this project as well).
· Michael for helping me out with the abstract in the nick of time.
· My family for all the inspiration and support.
· Øyvind for always having a solution to any problem that I thought would doom the whole project, as well as for telling me the story about the tourists and a hot dog.
· Rostik for taking me and Max out swimming.
· Rune for the brainstorming sessions and for his signature Chinese serve.
· Sarah for her energy and sprightly spirit.
· Snorre for teaching me the Trondheim dialect.
· Stefan for eagerly agreeing to be the academic supervisor of the project.
· Stephen and Hugh for “Jeeves and Wooster”.
· Vidar for saying “The data mining module is up and running... Cough...”
· West Records for being so selflessly inadequate.

Contents

1 Introduction
  1.1 Background
    1.1.1 Data mining
    1.1.2 Association rules
  1.2 Problem statement
    1.2.1 Purpose
    1.2.2 Limitations
  1.3 Report outline
2 Related work
  2.1 Data mining concepts
    2.1.1 Decision trees
    2.1.2 Clustering
    2.1.3 Naive Bayes classifier
    2.1.4 Artificial neural networks
    2.1.5 Nearest-neighbour techniques
    2.1.6 Support vector machines
    2.1.7 Collective intelligence
  2.2 Rule mining
    2.2.1 Association rules
    2.2.2 Apriori
    2.2.3 Quantitative association rules
  2.3 Rule interestingness
    2.3.1 Evaluating unexpectedness
    2.3.2 User-defined interestingness evaluator
    2.3.3 Using users’ interactive feedback
    2.3.4 Selecting an interestingness measure
    2.3.5 Using click-through data
  2.4 Visualization
    2.4.1 Two-dimensional matrix
    2.4.2 Directed graph
    2.4.3 Other techniques
    2.4.4 Data visualization
  2.5 Mining survey data
3 Design of the system
  3.1 General vision
  3.2 Choice of a mining technique
  3.3 System requirements
  3.4 Perceived complexity and workflow
  3.5 System architecture
    3.5.1 Application server
    3.5.2 Data mining engine
    3.5.3 Rule server
    3.5.4 Control module
    3.5.5 Discussion
4 Interestingness models
  4.1 Impact
  4.2 Pruning trend rules
  4.3 Diff vs. p-value
  4.4 Collective interestingness
  4.5 Personal interestingness
  4.6 Combined relevance
  4.7 Hotness
5 Rule visualization
  5.1 Displaying multiple rules
  5.2 Data visualization for individual rules
6 Results and discussion
  6.1 Simplified architecture
  6.2 User interface
  6.3 Similarity between rule attributes
  6.4 Comments mining
  6.5 Streamlined workflow
  6.6 Difference value in the GUI
  6.7 Interestingness
  6.8 Visualization
  6.9 General remarks
7 Conclusions
References

1 Introduction

The problem of analyzing large amounts of data to find hidden and previously unknown information is commonly referred to as data mining. Association rule mining is an unsupervised form of data mining that deals with locating relations between items in a relational database. There are numerous areas where this type of mining can be utilized. In this work, however, we will focus on business applications of association rule mining, and specifically on mining customer satisfaction data.
Data warehousing in one form or another is today employed in most businesses, and the idea to go beyond using the data for purely operational purposes and to start learning from it is not only logical, but also realistic as the amounts of gathered data increase. On the one hand, it is always useful to gain new general insights into customer behaviour. On the other hand, profit can be gained from being able to notice and react to specific changes in customer satisfaction early on.

The remainder of this chapter provides the necessary background on data mining, explains the purpose of the current project, and gives an overview of the rest of the thesis.

1.1 Background

This section gives a short introduction to data mining in general, and to association rules in particular, to the extent necessary to understand the problem statement.

1.1.1 Data mining

Data mining is the exploration and analysis of large quantities of data in order to discover relevant information. The term “knowledge discovery in databases” is also used, and data mining in this context is defined as “the science of extracting useful information from large data sets or databases” [17]. As obscure as it sounds, data mining employs a wide range of techniques and has a variety of practical applications. In [8], Berry and Linoff suggest that many practical problems can be phrased in terms of the following six major tasks that can be solved with data mining:

· Classification, or examining the features of a presented object and assigning it to a class from a pre-defined set. The input of a classification task is a set of classes and a set of pre-classified objects (the so-called training set).
· Estimation assigns objects values for some variable that is typically continuous. Examples include estimating a response probability, price, income or number of children in the family.
· Prediction resembles classification and estimation, the major difference being that objects are classified or estimated according to some predicted future behaviour or value.
· Affinity grouping is used to determine which things go together, for example which goods are bought together in a supermarket.
· Clustering, or segmenting a heterogeneous input population into a number of more homogeneous subgroups. Clustering resembles classification, but it does not require predefined classes or training examples.
· Profiling, i.e. describing the data in a way that increases our understanding of the factors or processes behind it. This is useful because a good enough description of a phenomenon may often suggest an explanation for it as well.

Directed data mining attempts to explain, categorize or estimate a particular target field, whereas undirected data mining works without a specified target field or a predefined set of classes. Classification, estimation and prediction are examples of directed data mining, while affinity grouping and clustering are undirected data mining tasks. Profiling can be directed or undirected. The terms supervised and unsupervised learning are related to directed and undirected mining. Supervised learning uses example inputs and outputs to learn how to make predictions, while the purpose of unsupervised learning techniques is to find structure within a set of data where no one piece of data is the answer [30].

1.1.2 Association rules

A prototypical example of affinity grouping is market basket analysis in the retail business. Market basket analysis is used to find patterns in the purchase behaviour of customers by looking at which products are purchased together. For example, this can be used to optimize selling efforts by arranging such products together physically in a store, or by presenting a selection of popular related items to a user who buys a product in an electronic store.
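As a minimal sketch of this kind of affinity grouping, frequently co-purchased product pairs can be found by direct counting (the product names and the support threshold below are invented for illustration; the full Apriori algorithm, discussed in section 2.2.2, generalizes this idea to itemsets of any size):

```python
from itertools import combinations
from collections import Counter

# Hypothetical transactions: each is the set of products in one basket.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"beer", "crisps"},
    {"bread", "butter", "jam"},
]

# Count how many baskets contain each pair of products.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Keep pairs appearing in at least 3 of the 5 baskets (60% support).
frequent = {pair: n for pair, n in pair_counts.items() if n >= 3}
print(frequent)  # {('bread', 'butter'): 3}
```

A frequent pair such as (bread, butter) can then be read as the rule “People who buy bread also buy butter”.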
A typical output of market basket analysis can be stated in the form of a rule, for example: “People who buy product A also buy product B”. Such rules are called association rules and were introduced by Agrawal, Imielinski and Swami in [2]. In [3], Agrawal and Srikant developed the Apriori algorithm for mining this type of rule. In [5], Aumann and Lindell introduced the concept of a quantitative association rule, later labelled an impact rule in [35]. In an impact rule, the rule antecedent describes a subset of the population, and the consequent consists of one or more quantitative attributes that have an unexpected mean value for this subset. For example:

Gender = Female => Mean wage = $7.90 (Overall mean wage = $9.02)

An extension of the Apriori algorithm capable of finding impact rules was suggested in [5]. For a more detailed survey of research within association rule mining, consult [6] and [16], as well as sections 2.2 and 2.5.

1.2 Problem statement

This section outlines the purpose and scope of the current thesis. The focal points and limitations of the project are listed in the corresponding subsections.

Data mining is generally regarded as an interactive process, requiring an analyst to repeatedly calibrate and run the mining software and thereupon analyze and process the results. While association rule discovery is a relatively well-studied area, making the results useful is much less so. As such, the huge benefits of data mining are not directly available to end-users.

1.2.1 Purpose

The objective of this thesis is to develop a practical framework for mining quantitative association rules, and to give concrete recommendations on how to present the discovered rules in a useful and comprehensible way to end-users without a technical background in data mining. The ideas suggested in the current paper should be complete enough to enable the implementation of a functional data mining system.
The setting is assumed to be an application with a large and continuously growing database with millions of records, and a set of users using the application regularly, on a daily or weekly basis. More specifically, the database contains demographic, behavioural and satisfaction data about clients, while the users of the application represent the business on different levels. The target association rules are supposed to link client segments with satisfaction values. We are interested in using the results obtained from data mining over such a database in the application. The topics to study and the most immediate considerations are listed below.

1. Development of a sensible workflow in the abovementioned setting. Frequent users could be given the ability to mark, hide and track rules. In particular, effort needs to be put into preventing the same rules from being shown time after time to such users.
   a. Provide a design for the recommended solution. The solution should be scalable with both the number of users and the number of rules.
   b. For a better presentation of results, a scheme for converting a rule into a regular sentence should be suggested, and suitable graphs should be used to visualize each rule.
2. A system architecture for the whole association rule mining system. The suggested architecture should provide a solid ground for a real implementation.
3. Whether and how explicit and implicit user feedback can be used to make the results more interesting, for instance by hiding, emphasizing or reordering the rules that are shown.
4. Methods to display a set of association rules. A list of hundreds of rules in text format may deter many a user. Perhaps a tree or some other graphical format can be used to convey the same information in a more user-friendly way.
1.2.2 Limitations

No matter how accurate a data mining system is and how intuitive its results are, they are of little use unless the organization that uses the system is determined to continuously work with these results and act accordingly. The methodology of adjusting the target organization to make effective use of data mining is outside the scope of this thesis.

Data collection methods are not discussed here. We assume that the available data is suitable for mining, is not significantly biased, and represents the underlying data set well enough. Privacy and ethical concerns associated with data mining are not taken into consideration in this thesis. It is assumed that the data for mining is collected in accordance with applicable legislation.

Throughout the thesis, the discussion will be held on a general level wherever possible. However, the solution as a whole is first and foremost intended to be used in the setting described in the section Purpose. In particular, this thesis concerns quantitative association rules specifically, and not all types of association rules.

1.3 Report outline

The rest of the document is organized as follows. Chapter 2 introduces relevant existing work that influenced the design decisions in the current project. It includes a summary of data mining concepts, techniques for association rule mining, approaches to estimating the interestingness of rules, and known rule visualization techniques. A system design for the suggested rule mining solution is presented in Chapter 3. A suitable mining technique is chosen, important requirements on the target system are listed, and a workflow suitable for this setting is described. Chapter 4 introduces and motivates two models for rule interestingness calculation. For a given rule, its impact is based on the difference between the observed average satisfaction and the expected satisfaction value, as well as on the number of responses the rule is based on. The so-called hotness is a user-specific value combining the statistical significance of the rule with the collective intelligence of all users and the personal preferences of the current user.

In Chapter 5, the segment-to-item visualization scheme for presenting multiple rules in a compact and informative graphical format, as an alternative to a list of rules in text form, is introduced and analyzed. Data visualization for individual rules is also discussed. The solution implemented in practice is described in Chapter 6. Identified issues and possible remedies are discussed. Among other topics, the use of the suggested interestingness measures is discussed, and ways to further improve the implemented visualization tools are presented. Chapter 7 wraps up the thesis, compares the achieved results with the goals of the project, and points out directions for future work.

2 Related work

In the following sections, relevant existing work that influenced the design decisions in the project is summarized. The summary starts with a brief explanation of popular data mining concepts, such as decision trees, clustering, the naive Bayes classifier, nearest-neighbour methods, and support vector machines. The concept of collective intelligence is introduced and a model for applying it is outlined. Further, association rules, quantitative association rules and ways of mining them are discussed, followed by a review of ways of measuring the interestingness of association rules and a survey of known rule visualization techniques. The use of click-through data for evaluating and learning retrieval functions is also considered. A separate subsection is dedicated to research on mining categorical-to-quantitative impact rules from survey data.

2.1 Data mining concepts

2.1.1 Decision trees

A decision tree is a structure that divides up a collection of observations into successively smaller sets by applying simple decision rules.
With each division, the sets become more homogeneous with respect to a predefined target variable. Figure 1 shows a simple decision tree that classifies fruit type by colour, shape and other parameters. Thus, fruit type is the target variable in this example.

[Figure 1. Example of a decision tree. Internal nodes test attributes such as Colour, Shape, Diameter and Has stone; the leaves are fruit types such as Apple, Banana, Cherry, Grape, Grapefruit, Lemon and Watermelon.]

The target variable is usually categorical, and the tree is used either to calculate the probability that a given record belongs to a specific class, or to classify the record by assigning it to the most likely class. It is also possible to use decision trees to estimate the values of continuous variables, but this is not their usual use.

Decision trees need to be trained. To build a decision tree with respect to a given target variable from a set of pre-classified examples, a recursive best-split approach is used. In each step, the set is divided into two subsets such that a single class predominates in each group. When searching for a binary split on a numeric input variable, each value that the variable takes in the training set can be treated as a candidate split value, and the final split takes the form X < a. When splitting on a categorical variable, it is typical either to split on a belongs-to-a-specific-class-or-not basis, or to group together classes that predict similar outcomes. Creating a new branch for each class is usually avoided, as it makes the tree more complicated and, since fewer training records are available at each node on the lower levels of the tree, often makes further splits less reliable.

There are several ways of choosing the best split. The Gini criterion is based on the probability that two items chosen at random from the same population belong to the same class. The score of a split is the sum of the scores of each child node, each multiplied by the proportion of records belonging to it. Another popular criterion is the entropy reduction produced by the split.

A decision tree keeps growing as long as new splits can be found that separate the training set into increasingly pure subsets. The resulting tree can not only become very complex, but can also become over-fitted to the training set, which can increase the number of errors when new data sets are classified. To address this problem, the final tree is usually simplified through pruning, a process that eliminates the least stable splits in the tree. CART and C5 are the most popular pruning techniques [8].

The strongest advantage of decision trees is that they are easy to read and understand, and it is straightforward to motivate their predictions. Further, decision trees require relatively little data preparation: there is no need to normalize data or to remove null values. Decision trees have been used in a wide variety of applications, such as customer profiling, financial risk analysis, assisted diagnosis, and traffic prediction, to name a few. In practice, an automatically generated decision tree is often used by domain experts to better understand the key factors behind the studied phenomenon, and helps to further refine the research focus [30]. Because decision trees are a combination of data exploration and modelling, they can be used as a first step in the modelling process even when the final model will be built using other techniques. See Berry and Linoff [8] for a deeper review of research on decision trees.

2.1.2 Clustering

Clustering essentially means partitioning data into groups of items that are similar to each other in some way. To be able to cluster a data set, it is necessary to have a similarity measure, or a distance function returning a numerical distance value between two records.
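As a minimal sketch of such a distance function (assuming, for illustration, that records are represented as numeric vectors), Euclidean distance together with a helper that finds the closest pair of records could look as follows; an agglomerative method would repeatedly merge such closest groups:

```python
import math

def distance(a, b):
    # Euclidean distance between two records given as numeric vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def closest_pair(records):
    # Indices of the two records with the smallest mutual distance.
    pairs = [(i, j) for i in range(len(records))
                    for j in range(i + 1, len(records))]
    return min(pairs, key=lambda p: distance(records[p[0]], records[p[1]]))

records = [(0.0, 0.0), (3.0, 4.0), (3.0, 4.5)]
print(closest_pair(records))  # (1, 2)
```

Any other distance function (e.g. one that also handles categorical attributes) can be substituted without changing the rest of the clustering procedure.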
Hierarchical clustering, also known as agglomerative clustering, is a simple clustering method. It works by continuously merging the two most similar groups. Each group starts as a single item, and the closest groups are merged together until only one group is left.

One of the most commonly used clustering algorithms is k-means clustering. The algorithm starts by randomly placing k centroids, i.e. points representing the centre of each cluster, and assigns each record to the nearest centroid. After that, the centroids are replaced with the average of all the points assigned to them, and the assignments are redone. This process repeats until the assignments stop changing.

Like decision trees, clustering can be used to gain an understanding of a given data set. Unlike decision trees, clustering does not explain why data records belong together, other than that the distance between them is small enough. Also, decision trees maximize the leaf purity of a target variable, while clustering only uses the distance between data records. A popular application of clustering is customer segmentation [8]. Other uses include similarity search, pattern recognition, trend analysis and classification [1].

2.1.3 Naive Bayes classifier

The naive Bayes classifier is based on Bayes’ theorem and assumes that the presence or absence of a feature of a class is unrelated to the presence or absence of the other features. More precisely, the probability that the input belongs to a specific class is calculated according to the formula:

p(C | F1, ..., Fn) = p(C) × ∏(i=1..n) p(Fi | C) / p(F1, ..., Fn)

The classifier uses supervised learning. Among the advantages of the model are that it is simple to understand, that training is straightforward and can be incremental, and that the classification process is fast. The downside is the assumption that all features are independent.
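As a sketch of how this formula is applied (the toy weather data below is invented for illustration, and Laplace smoothing and tie-breaking are omitted for brevity), a minimal classifier over categorical features could look like:

```python
from collections import Counter, defaultdict

def train(examples):
    # Count class frequencies and, per (class, feature position),
    # the frequencies of each feature value.
    class_counts = Counter()
    feature_counts = defaultdict(Counter)
    for features, label in examples:
        class_counts[label] += 1
        for i, value in enumerate(features):
            feature_counts[(label, i)][value] += 1
    return class_counts, feature_counts

def classify(features, class_counts, feature_counts):
    # Pick the class maximizing p(C) * prod_i p(F_i | C); the common
    # denominator p(F1, ..., Fn) can be ignored when comparing classes.
    total = sum(class_counts.values())
    best_class, best_score = None, -1.0
    for c, count in class_counts.items():
        score = count / total
        for i, value in enumerate(features):
            score *= feature_counts[(c, i)][value] / count
        if score > best_score:
            best_class, best_score = c, score
    return best_class

examples = [(("sunny", "hot"), "no"), (("sunny", "mild"), "no"),
            (("rainy", "mild"), "yes"), (("rainy", "hot"), "yes"),
            (("rainy", "cool"), "yes")]
cc, fc = train(examples)
print(classify(("rainy", "hot"), cc, fc))  # "yes"
```

Training amounts to counting, which is why it can be done incrementally; classification is a single pass over the features per class.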
The naive Bayes classifier is easy to construct and reportedly performs surprisingly well in text classification [38], even though the conditional independence assumption is rarely true in real-world applications. On the other hand, naive Bayes has been found to produce poor probability estimates [7].

2.1.4 Artificial neural networks

An artificial neural network is a computational model based on the biological model of how a brain works. It consists of artificial neurons that are connected to each other by synapses. Typically, neural networks are adaptive, i.e. they change their internal structure based on the examples fed into the network during the learning phase. There exist various kinds of artificial neural networks. The general idea is well illustrated by the feed-forward multi-layer perceptron network. The basic structure of such a network is shown in Figure 2. The network maps a layer of input variables to a layer of output variables using hidden neuron layers. Each synapse has an associated weight. The outputs from one set of neurons are fed to the next layer through the synapses. The higher the weight of the synapse leading from one neuron to the next, the more influence it will have on the output of that neuron. Outputs from the different synapses are typically combined using a sigmoid-type function. For example, the combined input can be defined by the formula tanh(Σ xi × wi), where xi are the values at the input synapses, and wi are the corresponding weights.

Figure 2. Basic structure of an artificial neural network (three inputs, a hidden layer of four neurons, and two outputs).

Neural networks can be used as both a supervised (directed) and unsupervised (undirected) learning method [6]. With supervised learning, the network must be trained on a number of examples. This can for example be done using back-propagation [30].
The input values are fed to the network, the outputs are compared with the output values from the example, and the weights are adjusted by going backwards through the network to make the result closer to the right output. It is important to understand that the network will only be as good as the training set used to generate it, and that it needs to be retrained in order to stay up-to-date and useful [8]. In the unsupervised setting, the network adapts itself to the input data; this is used for clustering and dimensionality reduction. Tasks that can be solved with supervised learning include classification and prediction problems, pattern recognition and function approximation [6]. Many examples of successful application of artificial neural networks are known [8], and the concept is popular. One major issue with this technology is that it is difficult to explain the reasoning that happens inside the network. A trained neural network can be seen as a black box or an oracle giving answers without explanation. In many business situations this may be considered a problem.

2.1.5 Nearest-neighbour techniques

Nearest-neighbour techniques are based on the concept of similarity. A closely related concept is memory-based reasoning, a nearest-neighbour technique that uses analogous examples from the past to make predictions about new situations, similarly to how humans approach new problems by using experiences and memories from the past. Memory-based reasoning can be used to solve various estimation and classification problems [8]. To be able to employ memory-based reasoning, it is necessary to have a distance function capable of determining the distance between two records, and a combination function capable of bringing together results from several neighbours to produce a result for the given query.
For example, if a database of prices for various wine sorts, a way of calculating a distance between two wine sorts, and a way of combining prices of several wine sorts are available, it is possible to predict prices for arbitrary sorts of wine. The quality of the prediction will of course depend on how well the database represents the space of possible wine sorts, as well as on the quality of the two functions. Memory-based reasoning is a highly adaptive technique. There is usually no need to explicitly teach the system; simply incorporating new data into the historical database is enough. The downside is that classifying new records can be resource-intensive, as it typically requires processing all available historical data. Also, finding good distance and combination functions can be difficult [30]. An important strength of memory-based reasoning is that the results it suggests are easy to understand and interpret. k-Nearest neighbours (frequently denoted k-NN) is an example of such a technique. To make a prediction for the input record, the k nearest records from the historical database are found and the average of their corresponding values is returned. Clearly, such a simple calculation can result in low-quality predictions when the nearest neighbours are far away from each other. Instead of averaging the values, they can be weighted using a weight function that is inversely proportional to the distance, leading to the principal calculation in the form:

Σ xi × w(di) / Σ w(di)

In the formula above, i iterates over the k nearest neighbours of the input record, xi is the value of the corresponding neighbour, and di is the distance from a neighbour to the input record. The technique is illustrated in Figure 3.

Figure 3. Using 4-nearest neighbours to determine a target value.

2.1.6 Support vector machines

Support vector machines, or SVMs, are a popular technique for solving classification tasks.
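The weighted-average calculation above can be sketched as follows (my own illustrative code; a small epsilon is added to avoid division by zero when a neighbour coincides with the query point):

```python
def knn_predict(query, data, k=4):
    """Predict a numeric target for `query` as the distance-weighted average
    of the k nearest records: sum(x_i * w(d_i)) / sum(w(d_i))."""
    def distance(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b)) ** 0.5

    def w(d):
        return 1.0 / (d + 1e-9)  # weight inversely proportional to distance

    # data is a list of (record, value) pairs; keep the k closest records.
    neighbours = sorted(data, key=lambda rv: distance(query, rv[0]))[:k]
    numerator = sum(v * w(distance(query, r)) for r, v in neighbours)
    denominator = sum(w(distance(query, r)) for r, v in neighbours)
    return numerator / denominator

data = [((0.0,), 10.0), ((1.0,), 20.0), ((10.0,), 100.0)]
print(knn_predict((0.0,), data, k=2))  # close to 10.0: the exact match dominates
```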
In its basic form, an SVM takes a set of n-dimensional vectors, each of which belongs to one of two predefined classes. A hyperplane separating the vectors of the two classes is then constructed, such that the margin between the two subsets is maximized. When that is achieved, new vectors can be classified according to which side of the hyperplane they fall on. The mathematics behind this is relatively simple and employs basic linear algebra and multi-variable calculus (in particular, the Lagrange method for locating extreme points of a function is used). It can be shown that the arising Lagrange optimization problem can be reformulated (the so-called Wolfe dual), and in the equivalent formulation it becomes clear that only the training points that are closest to the optimal hyperplane are actually necessary, while the rest of the points can be left out of the training set [11]. These points are called support vectors (see Figure 4), hence the term "support vector machine".

Figure 4. The optimal hyperplane dividing two classes of vectors, with the support vectors defining the margin.

The idea above applies to the simplest case: a linear support vector machine and separable data (a training set for which the optimal hyperplane can be found). In the case of non-separable training data, it is possible to relax the constraints of the optimization problem by introducing positive slack variables that account for the possible training errors [13]. In most practical situations, the decision function (the function deciding which of the two classes an input vector belongs to) is not linear, i.e. the training data cannot be separated by a simple hyperplane. In this case, it may be possible to map the training data to another space where the two classes can be separated linearly. The question remains how to find such a mapping. Another difficulty is that calculations in a higher-dimensional space are more computationally expensive.
The observation that calculations in an SVM only involve dot products of the input vectors suggests that if a function of the form K(xi, xj) = Φ(xi) ∙ Φ(xj) can be found, where Φ(x) is the mapping from the original space to the new one, then the original dot products can be replaced by K(xi, xj), and the new SVM will work roughly as fast as the original one without explicitly knowing what Φ(x) is. This is the essence of the so-called kernel trick [10]. Finding an appropriate kernel function is difficult; in practice, one of the well-studied kernel functions is typically chosen. In the case when it is necessary to distinguish between more than two separate classes, several approaches can be used. Firstly, an SVM can be trained for each class that determines whether the input vector belongs to the corresponding class or not. Secondly, an SVM can be trained for each pair of classes, and all outputs compared. Finally, it is possible to modify the SVM to allow classifying multiple classes at once [29]. More detailed information about support vector machines can be found in the excellent tutorial on the basic ideas behind SVMs by Burges [11].

2.1.7 Collective intelligence

In [4], Alag emphasises the growing importance of the user-centric approach in application construction. Harnessing collective intelligence, one of the seven principles of Web 2.0, is by many considered to be the heart of this approach. In essence, the collective intelligence of users is the intelligence that can be extracted from the collective set of interactions and contributions made by the users, and, even more importantly, the use of this intelligence to act as a filter for what is valuable in the application for a user. This filter takes into account a user's preferences and interactions to provide relevant information to the user. Alag suggests the following model for applying collective intelligence:
· Learn about each user through individual interactions and contributions.
· Learn about all users in the aggregate through their interactions and contributions.
· Build models that can recommend relevant content to a user using the information learnt about the user and the users in aggregate.

2.2 Rule mining

Here, several types of association rules are discussed together with mining algorithms. The section starts with categorical association rules and a summary of the Apriori algorithm, one of the most widely known rule mining algorithms. Further, quantitative association rules are introduced and a corresponding three-stage mining process is explained. A closely related issue, namely the application of quantitative association rules to mining survey data, is discussed separately in Section 2.5. Among other issues, ways of generating time-related and trend rules are highlighted.

2.2.1 Association rules

Association rules were introduced by Agrawal, Imielinski and Swami in [2]. Let I be a set of items and D a set of transactions, where each transaction T ⊆ I. An association rule is an implication of the form X ⇒ Y, where X ⊂ I, Y ⊂ I, and X ∩ Y = ∅. The rule X ⇒ Y holds in the transaction set D with confidence c if c % of the transactions in D that contain X also contain Y. The rule X ⇒ Y has support s in the transaction set D if s % of the transactions in D contain X ∪ Y. In [2], the process of discovering association rules is decomposed into two steps:
· Firstly, all sets of items with transaction support above a certain minimum support threshold are located. These sets are referred to as large itemsets. All other itemsets are called small itemsets.
· Secondly, the generated large itemsets and a user-specified minimum confidence level are used to create a list of association rules.

2.2.2 Apriori

In [3], Agrawal and Srikant published the Apriori algorithm for discovering large itemsets. Given a database of sales transactions where each item has a binary state (present or absent), the algorithm makes multiple passes over the data.
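As a rough sketch of this level-wise search (my own simplified version, counting supports directly over the transaction list rather than with the hash tree used in the original algorithm), the passes might look like this:

```python
def apriori(transactions, min_support):
    """Level-wise search for large (frequent) itemsets.
    transactions: a list of sets of items; min_support: a fraction in [0, 1]."""
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # First pass: keep the single-item itemsets whose support is large enough.
    items = {i for t in transactions for i in t}
    large = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]
    result = list(large)
    k = 2
    while large:
        # Candidate generation: unions of large (k-1)-itemsets that have size k.
        candidates = {a | b for a in large for b in large if len(a | b) == k}
        # Count each candidate's actual support; the survivors seed the next pass.
        large = [c for c in candidates if support(c) >= min_support]
        result.extend(large)
        k += 1
    return result

baskets = [{"a", "b"}, {"a", "b", "c"}, {"a", "c"}, {"b"}]
print(sorted(sorted(s) for s in apriori(baskets, 0.5)))
# [['a'], ['a', 'b'], ['a', 'c'], ['b'], ['c']]
```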
In the first pass, the support of individual items is calculated and all small single-item itemsets are dropped. In each subsequent pass, a list of potentially large itemsets (called candidate itemsets) is generated from the itemsets found to be large in the previous pass, and the actual support for these candidate itemsets is calculated. At the end of the pass, it is determined which of the candidate itemsets are actually large, and these become the seed for the next pass. The process continues until no new large itemsets are found. To count candidate itemsets efficiently, a hash tree is used.

2.2.3 Quantitative association rules

A quantitative association rule, or an impact rule, is a rule in which the left-hand side describes a subset of the population with a set of categorical variables, and the right-hand side consists of a quantitative attribute that has an unexpected mean value for this subset. Impact rules were introduced in [5] by Aumann and Lindell together with the corresponding rule generation algorithm. Aumann and Lindell point out that impact rules are easily understood and interpreted, even when they describe complex relations. In practice, the information in most databases is not limited to categorical attributes but also contains much quantitative data. In [32], Srikant and Agrawal extended the categorical definition of association rules to include quantitative data. Their idea is to build categorical events from the quantitative data by considering intervals of the numeric values. They also provide an algorithm which approximately finds all rules by employing a discretization technique. Clustering methods can be used to improve the partitioning of the quantitative attributes in the algorithm [39]. In [5], Aumann and Lindell generalize the categorical definition further and suggest a new definition of quantitative association rules based on the distribution of values of the quantitative attributes.
The authors note that, generally speaking, an association rule comprises a left-hand side that is a description of a subset of the population, and a right-hand side that is a description of an interesting behaviour particular to the population described on the left-hand side. Thus, the general structure of a rule is "population subset → interesting behaviour". Further, for categorical attributes, behaviour is naturally described by a list of items and the probability of their appearance. The authors argue that for a set of quantitative values, the best description of its behaviour is its distribution, and therefore choose to describe the behaviour of a set of quantitative values by calculating their mean and variance. A subset of the population displaying a distribution significantly different from that of its complement, either in terms of the mean or the variance, is recognized as interesting. Two types of rules are introduced. A (mean-based) categorical-to-quantitative association rule is of the form X ⇒ Mean_J(T_X), where X is a profile of categorical attributes (a set of attribute-value pairs), T_X is the set of transactions with profile X, and J is a set of quantitative attributes. In quantitative-to-quantitative rules, instead of categorical attribute-value pairs, the left-hand side contains triplets (e, r1, r2) consisting of a quantitative attribute and two real values denoting the interval of allowed values for the attribute. Aumann and Lindell suggest using a three-stage process to find impact rules. In the first stage, all large itemsets are found, for example using the Apriori algorithm. In the second stage, the mean value of each quantitative attribute is calculated for each large itemset. In the third and most important stage, a lattice structure is built, linking together each itemset with all its fathers (immediate subsets) and sons (immediate supersets). X is a father of Y if and only if X ⊂ Y and |Y| = |X| + 1.
The lattice is then traversed, and cases where the mean value between a set and one of its subsets is significantly different (such sets are known as exceptional groups) are reported as impact rules. To confirm the validity of a rule, the Z-test is employed. The authors' model of an exceptional group is known as the separate effects model.

2.3 Rule interestingness

Association rules represent important regularities in databases and are found to be useful in practical applications. However, association rule mining algorithms (for example, the Apriori algorithm described in Section 2.2.2) tend to produce large numbers of rules, most of which are of little interest to the user. Some of the rules have little business significance; others represent already known facts. Due to the large number of generated rules, it is difficult for the user to analyze them manually in order to identify the truly interesting ones. Typically, statistical measures are used as criteria for rule selection. In this section, several existing approaches to assessing the interestingness of association rules are described. All of them try to identify the most significant association rules from a set of rules generated by a mining algorithm.

2.3.1 Evaluating unexpectedness

The approach to finding interesting rules from a set of discovered association rules taken by Liu et al. in [24] is to analyze the unexpectedness of rules. Rules are considered interesting if they are unknown to the user or contradict the user's existing knowledge (or expectations). The proposed technique is characterized by analyzing the discovered association rules using the user's existing knowledge about the domain and then ranking the discovered rules according to various interestingness criteria, e.g. conformity and various types of unexpectedness.
The basic idea of this technique is as follows: the user specifies existing knowledge about the domain and the system then analyzes the discovered rules for unexpectedness based on the specified knowledge. To register the user’s existing knowledge about the domain, a special specification language is used. The language is able to represent facts of three degrees of preciseness: general impressions, reasonably precise concepts, and precise knowledge. The interestingness analysis system analyzes the discovered association rules using the user’s specifications and classifies the set into conforming rules (rules that conform to the existing knowledge), unexpected consequent rules, unexpected condition rules, and both-side unexpected rules. A visualization system is proposed as well. The discussion is held in terms of generalized association rules that are different from the original association rule model given in [2] in that they allow associations not only between nominally separate items, but rather between nodes in a taxonomy. Research by Jaroszewicz and Scheffer [20] is another example of assessing interestingness via unexpectedness with respect to a user-specified model. In their model, background knowledge is available in terms of a Bayesian network [28]. Bayesian networks can represent intrinsic dependencies between the included variables. Another advantage of specifying background knowledge as a Bayesian network is that such networks are relatively easy to understand when visualized. The algorithm presented in [20] uses sampling-based approximate inference in the Bayesian network to find the (approximately) most interesting attribute sets. 
2.3.2 User-defined interestingness evaluator

In [27], Obata and Yasuda describe a data mining apparatus for finding interesting association rules among a large number of association rules discovered through data mining, by setting evaluation criteria on the association rules which differ depending on the user's purpose. This involves creating a user-defined association rule evaluator that calculates an evaluation value for each rule. The rules are then sorted by this value, and the re-ranked, truncated list of rules is presented to the user. More specifically, the idea is to evaluate the usefulness of rules. The authors observe that the value of association rules varies depending on how the rules are intended to be used. They suggest that the evaluation criterion should be based on the cost incurred upon applying the association rule, the profit gained when the association rule holds, as well as on the confidence and support of the rule.

2.3.3 Using users' interactive feedback

Xin et al. suggest a different approach to evaluating rule interestingness in [37], where the problem of discovering interesting rules through the user's interactive feedback is studied. The discussion is kept generic and is held in terms of patterns. The authors claim that in many cases other popular interestingness models, such as minimum support constraints or rule unexpectedness with respect to a user-specified model, fail to model the interestingness measure intended by the user really well: the minimum support constraint is often too general to capture the prior knowledge of the user, whereas the typical unexpectedness model requires users to construct reasonably precise background knowledge explicitly, which is found to be difficult in many real applications. An alternative idea is to discover interesting patterns through interactive feedback from the user.
Instead of requiring the user to explicitly construct the prior knowledge precisely beforehand, the user is asked to rank a small set of sample patterns according to his interest. From that feedback, a model of the user's prior knowledge is created. Another small set of patterns is selected thereafter, and the model is refined using the new feedback. This idea is illustrated in Figure 5.

Figure 5. Model for discovering interesting patterns from interactive feedback by Xin et al.: the user ranks sample patterns drawn from the frequent patterns, the feedback updates a model of the user's prior knowledge, and the model re-ranks the patterns and selects new samples.

Two tasks arise in connection with the proposed approach. Firstly, the model that the system uses to represent a user's prior knowledge should be defined. Secondly, it is desirable to find a balance between the number of sample patterns that need to be ranked by the user and the amount of information learned from the iterative learning process. Xin et al. discuss two models to represent a user's prior knowledge. One is the log-linear model that works for itemset patterns only, and the other is the more generic biased belief model that is also applicable to sequential and structural patterns. In both cases, the interestingness of a rule is calculated with the help of a weight vector of unknown variables and a set of constraints derived from the user's rankings. It is further suggested that the resulting optimization problem can be solved either by using a support vector machine or by using one of the existing mathematical programming tools. The system should collaborate with the user throughout the whole interactive process to improve the ranking accuracy and reduce the number of interactions. The authors present a two-stage approach, progressive shrinking and clustering, to select sample patterns for feedback.
Starting with two fundamental observations (one is that since similar patterns naturally rank close to each other, presenting similar patterns for feedback does not maximize the learning benefit and increases user overhead; the other is that since a user generally has a preference for higher-ranked patterns, the relative ranking among uninteresting patterns is not important), the authors elaborate an approach similar to that suggested earlier in [31]. Their idea is to divide the N most interesting patterns into k clusters, and to add the centre of each cluster to the sample feedback set. The number N initially equals the number of discovered patterns, and decreases at a constant rate with each iteration. The Jaccard distance [19] is used for clustering.

2.3.4 Selecting an interestingness measure

Numerous metrics are used to determine the interestingness of association patterns. However, many such measures provide conflicting information about the interestingness of a pattern, and the best metric to use for a given application domain is rarely known. All measures have different properties that make them useful for some application domains, but not for others. In [33], Tan et al. focus on classical association rules and interestingness measures defined in terms of frequency counts given as contingency tables. They present an overview of various measures proposed in the statistics, machine learning and data mining literature. The authors describe several key properties one should examine in order to select the right measure for a given application domain. A comparative study of these properties is made using more than twenty of the existing measures. It is shown, inter alia, that most of the named measures agree with each other when support-based pruning is used in the process of rule mining.
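As a small illustration (my own, not from [33]), three common measures (support, confidence and lift) can be computed directly from such a 2×2 contingency table:

```python
def measures(n11, n10, n01, n00):
    """Support, confidence and lift of a rule X -> Y computed from a 2x2
    contingency table: n11 = #(X and Y), n10 = #(X, not Y),
    n01 = #(not X, Y), n00 = #(neither X nor Y)."""
    n = n11 + n10 + n01 + n00
    support = n11 / n                      # p(X and Y)
    confidence = n11 / (n11 + n10)         # p(Y | X)
    lift = confidence / ((n11 + n01) / n)  # p(Y | X) relative to p(Y)
    return support, confidence, lift

print(measures(n11=30, n10=10, n01=20, n00=40))  # (0.3, 0.75, 1.5)
```

A lift above 1 indicates that X and Y occur together more often than they would under independence, which is one reason different measures can rank the same set of rules quite differently.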
2.3.5 Using click-through data

In [21], Joachims proposes a method for evaluating the quality of retrieval functions that is based entirely on click-through data, unlike traditional methods that require relevance judgements by experts or explicit user feedback. In particular, the author focuses on comparing retrieval functions used in web search engines. The key idea is to design the user interface so that the resulting click-through data conveys meaningful information about the relative quality of two retrieval functions. With Joachims' approach, the user is not required to answer any questions. Instead, the system observes the user's behaviour and infers implicit preference information automatically. This is considered to be a key advantage, since click-through data can be collected easily, at low cost and without overhead for the user. Joachims recognizes that user feedback can provide powerful information for analyzing and optimizing the performance of information retrieval systems. However, in practice users are rarely willing to give explicit feedback. Moreover, especially for large and dynamic document collections, it becomes difficult to get accurate relevance estimates, since they require relevance judgements for the full document collection. Joachims claims that schemes for evaluating retrieval functions that only use statistics about the document collection and do not require any human judgements can only give approximate solutions and may fail to capture the preferences of the users. Joachims is inspired by Frei and Schäuble, who in [14] argue that humans are more consistent at giving relative relevance statements than absolute relevance statements. They recognize that relevance assessments are dependent on the user and context, so that relevance judgements by experts are not necessarily a good standard to compare against. Therefore, their method relies on relative preference statements from users.
Given two sets of retrieved documents for the same query, the user is asked to judge the relative usefulness of pairs of documents in each set. These user preferences are then compared against the orderings imposed by the two retrieval functions, and the respective number of violations is used as a score. While this technique eliminates the need for relevance judgements for the whole document collection, it still relies on manual relevance feedback from the user. Furthermore, Joachims argues that although the resulting scores can be approximately the same, the perceived quality of the result sets can be quite different. Users would click on the relatively most promising links at the top of the list, independent of their absolute relevance. A list can have a proper ordering while the overall quality of the results it presents is low. Other statistics, like the number of links the user clicked on, are difficult to interpret as well. It is not clear if more clicks are due to the fact that the user found more relevant documents, i.e. indicate a better ranking, or because the user had to look at more documents to fulfil the information need, i.e. indicate a worse ranking. Joachims suggests the following experiment setup for eliciting unbiased data. The user types a query into a unified interface. The query is sent to search engines A and B. The returned rankings are mixed so that at any point the top links of the combined ranking contain almost the same number of links from rankings A and B (the two numbers may differ by at most 1). The combined ranking is presented to the user, and the ranks of the links the user clicked on are recorded. If one assumes that users scan the combined ranking from top to bottom without skipping links, this setup ensures that at any point during the scan the user has observed almost equally many links from the top of ranking A as from ranking B. In this way, the combined ranking gives equal presentation bias to both search engines.
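The mixing step could be sketched roughly as follows (my own simplified reading of the setup, alternating between the two rankings and skipping links already shown; Joachims' actual mixing rule is more carefully specified):

```python
def combine_rankings(ranking_a, ranking_b):
    """Interleave two rankings so that any top-k prefix of the result contains
    almost equally many links drawn from A and from B."""
    combined, seen = [], set()
    ia = ib = 0
    take_from_a = True  # alternate between the two sources
    while ia < len(ranking_a) or ib < len(ranking_b):
        if take_from_a:
            while ia < len(ranking_a) and ranking_a[ia] in seen:
                ia += 1  # skip links already contributed by the other ranking
            if ia < len(ranking_a):
                combined.append(ranking_a[ia])
                seen.add(ranking_a[ia])
                ia += 1
        else:
            while ib < len(ranking_b) and ranking_b[ib] in seen:
                ib += 1
            if ib < len(ranking_b):
                combined.append(ranking_b[ib])
                seen.add(ranking_b[ib])
                ib += 1
        take_from_a = not take_from_a
    return combined

print(combine_rankings(["a", "b", "c"], ["b", "d", "c"]))  # ['a', 'b', 'c', 'd']
```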
The above experiment setup is a blind test, transparent for the user, in which clicks demonstrate the user's relative preferences in an unbiased way. Furthermore, in the worst case, if one result set is perfect and the other is useless, the user needs to scan twice as many links as for the better individual ranking, so the usability impact is relatively low. Joachims analyzes the statistical properties of the click-through data generated according to his experiment setup. He makes two assumptions. The first is that users click on a relevant link more frequently than on a non-relevant link. The second assumption is that the only reason for a user clicking on a particular link is the relevance of the link, and not other influence factors connected with a particular retrieval function. This is reasonable if the abstract for each link provides enough information to judge relevance better than random. Under these two assumptions, Joachims shows that A retrieves more relevant links than B if and only if the click-through for A is higher than the click-through for B (and vice versa), i.e. if the user clicks on the links from A more often than on the links from B. In [22], Joachims takes this idea further and presents a method for learning retrieval functions using click-through data. Click-through data in search engines can be thought of as triplets (q, r, c) consisting of the query q, the ranking r presented to the user, and the set c of links the user clicked on. As such, click-through data does not convey absolute relevance judgements. However, partial relative relevance judgements for the links the user browsed can be extracted from such data. For example, if a user after performing a search query is presented with a list of 10 links, and subsequently clicks on links 1, 3 and 7, it is logical to conclude that link 3 is more relevant to the user than link 2, and that link 7 is more relevant than links 2, 4, 5 and 6.
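This extraction of relative preferences can be sketched as follows (illustrative code of my own, using 1-based ranks):

```python
def extract_preferences(clicked_ranks):
    """Given the (1-based) ranks of the clicked-on links, return pairs (i, j)
    meaning 'the link at rank i is preferred over the link at rank j', for
    every clicked rank i and every non-clicked rank j above it."""
    preferences = []
    for i in sorted(clicked_ranks):
        for j in range(1, i):
            if j not in clicked_ranks:
                preferences.append((i, j))
    return preferences

# The user clicked on links 1, 3 and 7 of the presented ranking.
print(extract_preferences({1, 3, 7}))  # [(3, 2), (7, 2), (7, 4), (7, 5), (7, 6)]
```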
This idea is formalized in the following algorithm for extracting preferences from click-through data: for a ranking (link1, link2, …, linkn) and a set C containing the ranks of the clicked-on links, extract a preference example link_i <_{r*} link_j for all pairs (i, j) such that 1 ≤ j < i, i ∈ C and j ∉ C. As this type of feedback is not suitable for standard machine learning algorithms, Joachims derives a new learning algorithm that can be trained with this type of relative feedback. For a query q and a document collection D, the optimal retrieval system should return a ranking r* that orders the documents in D according to their relevance to the query. Joachims uses Kendall's τ to compare the observed ordering and the ideal one. In total, he arrives at a convex optimization problem similar to those that support vector machines are designed to solve. In practice, the ideal ordering r* is not observable, but it can be inferred from the observed preferences. Joachims adapts his optimization problem accordingly and uses a support vector machine to learn a ranking function. Joachims' experimental results show that the algorithm performs well in practice, successfully adapting the retrieval function of a meta-search engine to the preferences of a group of users. In particular, in his experiments the ranking function trained on slightly over a hundred observations outperformed Google.

2.4 Visualization

Several ways of visualizing association rules are presented in this section, together with a brief account of data visualization and visual data mining. A lot of research has been put into visualizing classical association rules of the type {Xi}→Y, where the Xi are antecedent items and Y is the consequent item. Typically, at least the following five parameters are visualized: sets of antecedent items, consequent items, associations between antecedents and consequents, as well as the support and the confidence of the displayed rules.
The two prevailing approaches to visualizing association rules are the two-dimensional matrix and the directed graph [36].

2.4.1 Two-dimensional matrix

In a two-dimensional association matrix, the antecedent and the consequent items are positioned along the two axes of the graph, and an image in a cell depicts an association rule that links an antecedent with the corresponding consequent. Various attributes of the image correspond to various rule parameters, such as the support and the confidence. In Figure 6a, the association rule B→C is shown, with the support and confidence values drawn as column segments stacked on top of each other; the two values are mapped to the heights of the segments. To visualize rules with several antecedents, antecedent sets instead of single items can be put on one axis, as is done in the commercial data mining product SGI Mineset. The situation is illustrated in Figure 6b. It is easy to see that this graph becomes more cumbersome as the size of the antecedent sets grows.

2.4.2 Directed graph

When visualizing association rules in a directed graph, antecedent and consequent items are depicted as nodes, and the associations are represented as edges. For rules with multiple antecedent items, special types of edge arcs are used. See Figure 6c for an illustration. The main difficulty with using directed graphs for rule visualization is that the graphs can become complicated even with a small number of nodes and edges [15].

[Figure 6. Conventional ways of visualizing association rules: (a) B→C in a 2D matrix; (b) A+B→C in a 2D matrix; (c) A→C, B→C and A'+B'→C in a directed graph.]

2.4.3 Other techniques

Instead of using the tiles of a two-dimensional matrix to show item-to-item association rules, Wong et al. in [36] use a matrix to depict the rule-to-item relationship. The rows of their matrix represent items and the columns represent item associations.
Antecedent items and the consequent for each column (i.e. rule) are represented by blocks of two colours. The confidence and support levels of the rules are given by bar charts in different scales at the top of the matrix. An illustration is given in Figure 7. The main advantage of this approach is that it allows rules with many antecedent items, while the identity of individual items within an antecedent group remains clear. Also, no screen swapping, animation, or human interaction is required. All the metadata appear as separate columns at the edge of the matrix in a way that makes the rule clearly visible (in contrast with the standard two-dimensional matrix technique, where rule columns can easily overshadow each other).

In [9], Blanchard et al. argue that most known representations suffer from a lack of interactivity, meaning that knowledge discovery should be considered not from the point of view of a mining algorithm but from that of the user. Inspired by research on users' behaviour in a knowledge discovery process and by cognitive principles of information processing in the context of decision models, the authors suggest that when faced with a large amount of information, a decision-maker focuses attention on a limited subset of potentially useful data and changes the subset of focus until a decision is reached. Although based on this potentially interesting insight, the visualization scheme suggested by Blanchard et al. appears too experimental to be useful in practice.

Figure 7. Visualizing rule-to-item relationships as suggested by Wong et al.

2.4.4 Data visualization

Advances in technology stimulate research on more advanced visualization environments. Nagel et al. also recognize the importance of interactivity and present a system for visual mining of data in virtual reality in [25]. The system includes several data exploration tools for visualizing four to five data set variables in a three-dimensional space.
There are two principal views of the role of data visualization tools. In [1], Aggarwal suggests that such tools should be an integrated part of the data mining process and that they should be used for leveraging human visual perception on intermediate data mining results. According to the other perspective, data mining can use information visualization technology for improved data analysis [23].

2.5 Mining survey data

In [6], Bennedich sets out to find an efficient automatic method of finding interesting trends and relations in a survey database, a large data set containing demographic, behavioural and attitudinal data. Attitudinal values are responses to Likert-type questions (such questions allow the respondents to state their opinions using a scale) and are in effect ordinal data. Each record in the database corresponds to a specific business unit (a single hotel or an individual store, for example). He concludes that Aumann and Lindell's impact rules are an appropriate concept for this setting, as mining such rules requires relatively little involvement from the user, no initial idea of what to look for is necessary, and the results can be readily understood and interpreted by non-technical users.

Itemsets with small support

Bennedich proposes several important modifications to the mining method by Aumann and Lindell [5]. In the context of Bennedich's work, the term itemset corresponds to a customer group, and the support of an itemset is defined as the number of customers in that group. It was desirable to be able to find very specific rules, and the minimum support level was thus suggested to be as low as two responses. To make this realizable in practice, such small groups are looked for only within individual business units. In the first stage of the algorithm, it is suggested to use the Apriori algorithm to find large itemsets within each unit.
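The first stage can be sketched as follows: a minimal, unoptimized Apriori pass run over the records of a single business unit with a minimum support of two. The record layout and the attribute=value items are illustrative assumptions, not the actual data model of [6]:

```python
from itertools import combinations

def apriori(records, min_support=2):
    """Find all large itemsets in a list of records (frozensets of items).

    Classic level-wise Apriori: a candidate of size k is kept only if
    all of its (k-1)-subsets were large, and is then counted against
    the data.
    """
    # Level 1: frequent single items.
    counts = {}
    for rec in records:
        for item in rec:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    large = dict(current)
    k = 2
    while current:
        # Generate size-k candidates by joining frequent (k-1)-itemsets.
        candidates = set()
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) == k and all(
                    frozenset(sub) in current
                    for sub in combinations(union, k - 1)):
                candidates.add(union)
        counts = {c: sum(1 for rec in records if c <= rec)
                  for c in candidates}
        current = {s: c for s, c in counts.items() if c >= min_support}
        large.update(current)
        k += 1
    return large

# Toy data: each record is one customer's categorical profile
# within a single business unit.
records = [frozenset(r) for r in [
    {"age=18-25", "gender=F"},
    {"age=18-25", "gender=F"},
    {"age=18-25", "gender=M"},
    {"age=26-35", "gender=M"},
]]
large = apriori(records, min_support=2)
print(large[frozenset({"age=18-25", "gender=F"})])  # 2
```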
Restrictive model for rule unexpectedness

When all large itemsets have been found and the statistical measures for them have been collected, they will be used to form impact rules. A lattice structure is built that links together each itemset with its fathers and sons. However, Bennedich argues that the separate effects model for finding exceptional groups (see Section 2.2.3) is inadequate: when a group consists of more than one category and no category is completely dominant, the group mean is in general not expected to equal the mean of any of its fathers. Bennedich looks at the combined effects model by Chen [12], where the expected value μ_G for the cell G is calculated according to the following formulas:

λ_∅ = x̄
μ_G = Σ_{H ⊂ G} λ_H
λ_G = x̄_G − μ_G

Here, x̄ is the overall mean, x̄_G is the observed mean of the cell G, and λ_H is the effect of the cell H. However, this model fails to calculate an adequate μ_G if one category dominates another, if two categories express the same thing and lack a combined effect, or if there is a combination of these events. Bennedich suggests his own restrictive model, in which an interval of values for the expected group mean is looked for rather than a single value. This interval is defined as [μ_min, μ_max], where:

μ_min = min(μ_exp, μ_1, ..., μ_n)
μ_max = max(μ_exp, μ_1, ..., μ_n)

Here, μ_exp is the expected group mean under the combined effects model and μ_i are the means of the fathers. Let μ_act be the actual group mean. If μ_act < μ_min or μ_act > μ_max, the group is considered exceptional. This model is more restrictive than both the separate effects model and the combined effects model. Bennedich underscores that although his model will commit more type II errors (failure to detect existing relations) than the two models it is based on, there is also a guaranteed decrease in type I errors (finding spurious relations).

Non-parametric statistical test

When an exceptional group has been found, a statistical test is performed to verify that the observed difference is statistically significant.
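The restrictive model itself is straightforward to express in code. A minimal sketch (the function names and all figures are our own, for illustration only):

```python
def expected_interval(mu_exp, father_means):
    """Interval [mu_min, mu_max] of acceptable group means in the
    restrictive model: mu_exp is the expectation under the combined
    effects model, father_means are the means of the group's fathers
    in the itemset lattice."""
    values = [mu_exp] + list(father_means)
    return min(values), max(values)

def is_exceptional(mu_act, mu_exp, father_means):
    """A group is exceptional if its actual mean falls outside the
    interval of expected values."""
    mu_min, mu_max = expected_interval(mu_exp, father_means)
    return mu_act < mu_min or mu_act > mu_max

# Hypothetical figures: fathers' means 3.8 and 4.1, combined-effects
# expectation 4.0, actual group mean 4.4 -> outside [3.8, 4.1].
print(is_exceptional(4.4, 4.0, [3.8, 4.1]))  # True
print(is_exceptional(3.9, 4.0, [3.8, 4.1]))  # False
```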
Neither the Z-test nor the t-test is suitable because of the potentially small sample sizes (both tests require the population to be normally distributed, a requirement that can be relaxed only if the sample size is sufficiently large). Instead, it is recommended in [6] to calculate the exact p-value that the difference in mean values between the subgroup and the population is due to chance. This is possible because the quantitative attributes appearing in the consequent of impact rules can only assume a finite set of integer values. In this case, an efficient method exists for calculating the p-value exactly; it has a run time proportional to the sample size. When calculating the p-value, the null hypothesis is not that the means of two populations are equal, but that the mean of a population equals a given expected value. To accommodate this, it is suggested to shift the exceptional group so that its mean becomes equal to the mean of the parent group. In the case when the exceptional group has several fathers, it should be tested against all of them, and the final p-value is chosen as the maximum of all the computed p-values.

Trend rules

Various types of time-related rules can be found by defining corresponding categorical attributes. For example, if survey data describes a customer's visit to a store, the date of the visit will probably be available. From this date, the categorical attribute Day of week (of visit) can be constructed and corresponding rules found by the methods described above. Another type of time-related rule is the trend rule. Such rules concern contiguous time periods from some point in time until the most recent data (contiguous time periods up to a point in the past are found less relevant in [6]). Bennedich adds a trend category to the data set, which can take any non-negative integer value and shows the number of days between the relevant time parameter of the survey and the current point in time.
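The flavour of the exact test described above can be sketched as follows, under the assumption of i.i.d. integer responses drawn from a known parent distribution. Note that this naive dynamic program is quadratic in the sample size, whereas the actual method in [6] is reported to run in time proportional to the sample size; the function name and figures are our own:

```python
from collections import defaultdict

def exact_p_value(pop_dist, n, observed_mean):
    """Two-sided exact p-value that the mean of n i.i.d. draws from
    pop_dist deviates from its expectation at least as much as observed.

    pop_dist maps each integer response value to its probability in the
    parent population. The distribution of the sample sum is built up
    by dynamic programming, one draw at a time.
    """
    expected_mean = sum(v * p for v, p in pop_dist.items())
    # sum_dist[s] = probability that the draws so far sum to s
    sum_dist = defaultdict(float)
    sum_dist[0] = 1.0
    for _ in range(n):
        nxt = defaultdict(float)
        for s, p in sum_dist.items():
            for v, q in pop_dist.items():
                nxt[s + v] += p * q
        sum_dist = nxt
    delta = abs(observed_mean - expected_mean)
    return sum(p for s, p in sum_dist.items()
               if abs(s / n - expected_mean) >= delta - 1e-12)

# Hypothetical figures: Likert responses 1..5 uniformly distributed in
# the parent population; a subgroup of 8 responses averages 4.5.
pop = {v: 0.2 for v in range(1, 6)}
print(0.0 <= exact_p_value(pop, 8, 4.5) <= 1.0)  # True
```

Testing the shifted group against each of its fathers and taking the maximum of the resulting p-values, as described above, would then be a simple loop over this function.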
An efficient method for finding rules concerning the last d days using the trend category and dynamic programming is described in [6]. In a nutshell, trends are analyzed by stepping back one day at a time. Furthermore, a method of selecting rules from a list of trend rule candidates is discussed. If a rule concerning data up to 14 days ago has been found, it is likely that similar rules will also be found for 13 and 12 as well as 15 and 16 days ago, etc. Bennedich chooses to report all trend rules that have the largest mean difference of all the rules with a p-value less than or equal to their own. Mean difference here is the difference between the mean response for the trend period and the mean response for the whole period.

Distance between rules

In [6], Bennedich also looks for a method of measuring distance between rules that resembles the perceived distance. For the case of impact rules, standard distance measures based on the number of underlying data records (such as the Jaccard distance mentioned earlier in this text) are found to be of little use. Instead, the idea of calculating the distance between two rules from the attributes constituting them is explored. The suggested definition of distance is D = 1 − s, where s is the similarity function defined as s = c × d. In the latter formula, c is the similarity between the categorical attributes of the two rules, and d is the similarity between the quantitative attributes. Bennedich wants to minimize user involvement and settles for a fully automatic solution. Measuring similarity between two quantitative survey questions is relatively simple. Pairwise Pearson correlation coefficients between the survey questions are computed over the entire set of survey responses, and their absolute values are used as similarity values (these all lie between 0 and 1). However, it is noted that even if two questions correlate well, this does not necessarily imply that they are perceived as being similar.
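The similarity computation for quantitative attributes is a one-liner on top of the Pearson coefficient. A minimal sketch (the function names and response data are illustrative):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def question_similarity(responses_a, responses_b):
    """Similarity between two survey questions: the absolute value of
    their Pearson correlation over all survey responses (in [0, 1])."""
    return abs(pearson(responses_a, responses_b))

# Hypothetical response columns for two Likert questions;
# q2 is perfectly anti-correlated with q1, so similarity is 1.
q1 = [1, 2, 3, 4, 5, 4, 3]
q2 = [5, 4, 3, 2, 1, 2, 3]
print(round(question_similarity(q1, q2), 6))  # 1.0
```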
For categorical attributes, a different method is necessary. When observing generated rule sets, Bennedich notices that categorical attributes that are intuitively similar to each other tend to occur with the same set of survey questions in the resulting rule sets. Inspired by this insight, he uses Hebbian learning to build a profile for each categorical attribute that describes how strongly it is connected to each question. To convert two such profiles to a measure of similarity, the connection weights are converted to ranks, and the Kendall τ rank correlation coefficient is computed to compare the orders in which the questions are ranked. Negative coefficients are replaced by 0, and the final values all lie between 0 and 1.

Rule distance and interestingness of rules

The case of several categorical attributes is not studied in general. Instead, the discussion is confined to the setting of memory-based reasoning. In that setting, it is necessary to be able to compare a query rule Rq with a rated rule Rr. Let 0 ≤ t ≤ 1 be the interestingness rating of Rr, let Cr be the set of categorical attributes of Rr, and let Cq be the set of categorical attributes of Rq. Bennedich makes the assumption that if a set of attributes is considered uninteresting, it would likely remain uninteresting if more attributes were added. On the other hand, if a set was rated interesting, then each of its subsets, specifically each individual attribute, is likely to also be considered interesting. It is emphasized that although not always true, this assumption helps to create more accurate interestingness estimates, as it is expected to hold in most cases. Then, if Cr is considered interesting, and if a categorical attribute q of the rule Rq is similar to any of the attributes of Cr, we would like to say that q is similar to Cr in order to allow Rr to have a larger effect on the interestingness of Rq.
Hence, for each q, its most similar attribute in Cr is found, and the total similarity sq between Cq and Cr is computed as the product of all of these similarities. Conversely, if Cr is considered uninteresting, it is difficult to say anything about subsets of Cr, since it is possible that some subsets are in fact interesting, but the combination of all attributes is not. Thus, in order for Cr to be similar to Cq, and thereby for Rr to have a larger effect on the interestingness of Rq, it is necessary for each r ∈ Cr to be similar to some attribute of Cq. To sum up all of the above, if s(a, b) is the similarity between two attributes, the distance D between Rq and Rr is defined in [6] by the following formulas:

D = 1 − c × d
c = t × sq + (1 − t) × sr
sq = ∏_{q ∈ Cq} max( s(q, r) | r ∈ Cr )
sr = ∏_{r ∈ Cr} max( s(q, r) | q ∈ Cq )

Here, d is the similarity between the quantitative attributes, as before. Bennedich points out that as more user ratings become available, it should be possible to refine the similarity measures used between attributes by comparing the predictions of the model to the accumulated user ratings and adjusting the similarity values accordingly.

Memory-based reasoning vs. naive Bayes vs. neural network

Further, Bennedich compares a naive Bayes classifier, a multi-layer perceptron neural network and the memory-based reasoning model based on the distance formulas above in terms of their ability to predict the interestingness of rules from sample user rankings. He defines a synthetic user model that assigns interestingness ranks to rules based on how many other rules with the same attributes there are in the target database, and uses this model to see how well the three estimation techniques can learn it. In his experiments with this setting, the memory-based reasoning model proved superior to the two other methods both in terms of prediction accuracy and stability.
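The distance formula underlying the memory-based reasoning model translates almost directly into code. A sketch (the attribute names and the similarity table are invented for illustration):

```python
def rule_distance(Cq, Cr, t, d, s):
    """Distance between a query rule and a rated rule, following the
    formula from [6].

    Cq, Cr -- categorical attribute sets of the query and rated rule
    t      -- interestingness rating of the rated rule, in [0, 1]
    d      -- similarity between the quantitative attributes
    s      -- s(a, b), similarity function between two attributes
    """
    def product(vals):
        out = 1.0
        for v in vals:
            out *= v
        return out
    # Each query attribute matched to its most similar rated attribute...
    s_q = product(max(s(q, r) for r in Cr) for q in Cq)
    # ...and each rated attribute to its most similar query attribute.
    s_r = product(max(s(q, r) for q in Cq) for r in Cr)
    c = t * s_q + (1 - t) * s_r
    return 1 - c * d

# Hypothetical similarity table between a few categorical attributes.
table = {("age", "age"): 1.0, ("age", "region"): 0.2,
         ("gender", "age"): 0.1, ("gender", "region"): 0.3}
s = lambda a, b: table.get((a, b), table.get((b, a), 0.0))
D = rule_distance({"age", "gender"}, {"age", "region"}, t=0.8, d=0.9, s=s)
print(round(D, 4))  # 0.73
```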
3 Design of the system

In this chapter, the choice of the data mining technique is made, the requirements on the target rule mining system are listed and a workflow suitable for working with the system is outlined. Finally, we describe and motivate an architectural design for the target system in the context of these requirements and the suggested workflow.

3.1 General vision

A general summary of the target system: what are we trying to achieve? Given a data warehouse storing customer satisfaction data, it is desirable to find rules describing exceptional, unexpected phenomena in the data. As suggested by the business context, the rules should link customer segments with satisfaction levels. It is necessary to develop a logical and intuitive workflow in this setting that will allow non-expert users to work with the discovered rules. Among other things, an approach for estimating the interestingness of rules is sought. To make rules easier to understand, some form of rule visualization can be used. The goal is to come up with a comprehensive solution stretching from setting up data mining parameters to making the rules available to the end-user.

3.2 Choice of a mining technique

Here, we motivate the choice of the core data mining technique. In essence, we choose to use an extension of Aumann and Lindell's mechanism for mining categorical-to-quantitative association rules. Aumann and Lindell's categorical-to-quantitative association rules (Section 2.2.3) clearly constitute a suitable and direct solution to our problem, and their three-stage mining process can be used as is for rule mining. Furthermore, the context of Bennedich's work (Section 2.5) is in fact identical to ours. We too want to discover rules with small support as well as rules describing a large population. The non-parametric statistical test can be used to support this. The reasoning behind the restrictive model for assessing the unexpectedness of rule candidates is convincing.
One difficulty with this method is that it is hard to explain to the users what the expected response value is: it is not trivial to understand why it is an interval of values of the form [μ_min, μ_max]. This issue is discussed in more detail in Section 6.6. The concept of trend rules is highly useful. It was suggested by the project managers of the target application that this type of rule might be the most interesting for the clients. However, we will use a different method of selecting rules from a list of trend rule candidates. Bennedich reports all trend rules that have the largest mean difference of all the rules with a p-value less than or equal to their own. This means that several trend rules for the same itemset (excluding the trend attribute) can be found. Doing so has several disadvantages. To begin with, more rules are generated, which places additional load on the system. More importantly, the user needs to analyze more rules, and this is directly incompatible with the general idea of keeping things simple for the user. Instead, we suggest that a single trend rule should be reported per itemset. The heuristic we use for selecting the one rule to report from a list of trend rule candidates is described in Section 4.2.

On the other hand, reporting several trend instances of the same rule increases the chance that the rule will not go unnoticed. For example, one of the pruned instances could have a large mean difference and would pop up once the rules were ordered by difference. As such, selecting a single rule from a series of candidates will inevitably reduce the amount of information available about the trend phenomenon. However, if this becomes an issue, the possibility of keeping the candidate rules in the system and using them behind the scenes, while still showing a single instance to the user, can be investigated.
All in all, we settle on using Bennedich's mining scheme with an updated trend rule selection procedure. Theoretically, decision trees (Section 2.1.1) could also be employed to search for rules, as each path from the root to a leaf makes up a rule. However, to build a tree, it is necessary to specify a target variable. Decision trees are most suitable for categorical target variables, while we are interested in rules with quantitative attributes in the consequent. Typically, decision trees are used more in a research context, for example when an analyst expects a connection between certain factors and wants to test whether it is supported by real data. Often it is necessary to change input parameters iteratively and study the resulting trees repeatedly to find something interesting. This data mining project has a different focus. We want to make rule detection an automatic process and let it find unexpected rules rather than prove or refute an existing hypothesis. Further, in practice decision trees are often technical in their presentation and are likely to discourage users who lack a technical background, especially keeping in mind that the trees tend to get complicated for larger data sets with many parameters. A typical tree is "bushy" even after pruning, which makes the advertised simplicity and comprehensibility of decision trees look less convincing. In this sense, quantitative association rules are a more predictable alternative, as they are consistently easy to understand. Still, using decision trees in the target application can prove useful. Studying the associated tree can give new insights into the nature of a rule. Also, trees can be used as an alternative way of looking at the data. However, we choose not to focus on decision trees further in this work.

3.3 System requirements

This section lists more specific requirements on the target system from both the technical perspective and the user's point of view.
The focus is on what the system should be able to do and how the user is supposed to interact with the system. Below we present the initial requirements on the system that were found during requirements elicitation and the background study carried out in an early project phase, and explain why they were chosen as stated. Several topics to study are outlined. Emphasis lies on building a solid architecture to base the implementation upon and on finding a method of effectively determining the relevance of the discovered rules, ideally with as little explicit input from the user as possible.

[Figure 8. Grouping of business units in several levels: a top node All, intermediate group nodes such as Region 1 and Brand 1, and individual business units (Business unit 1–4) at the lowest level.]

1. Development of a sensible workflow in the setting described earlier. Considerations and desired features:
   a. Without loss of generality, let each satisfaction record concern a specific business unit. Business units can be grouped in various ways. Figure 8 illustrates grouping of properties in several levels. Following the design of the target application, let the grouping be hierarchical up to (and excluding) the lowest level, where single units can be included in several groups. The nodes are grouped in a way appropriate for the business. Further, let each user of the system have access to a single node in this structure. Note that this does not limit the allowed combination space, as new complex group nodes can be introduced.
   b. Grouping of business units should be supported.
      i. Rule sets will be separate for the different levels. Following the example in Figure 8, it should be possible to find rules concerning Business unit 1 as well as rules concerning Region 1 as a whole.
      ii. Different rules will be shown to different users, depending on which nodes in the group structure they have access to.
In particular, users directly responsible for one or several business units shall be able to work with the rules associated with these units.
      iii. Users at higher levels should be able to explore lower-level rules. For example, a user who has access to Region 1 should be able to browse rules for Business unit 1 and Business unit 2.
      iv. It should also be possible to configure the display so that a summary of the most important rules on the next lower level is shown (by nodes, or in a combined list).

The system will be targeted at non-expert users. However, having a functioning data mining tool opens possibilities for research, so it is desirable to prepare for more advanced usage scenarios.
   c. Two rule viewing modes should be supported – simple and advanced.
      i. In simple mode, default settings will be used to display rules.
      ii. In advanced mode, the user will be able to sort and logically filter the rules, as well as adjust the settings that determine which rules are to be displayed. For example, the user should be able to choose to see only rules concerning a certain satisfaction parameter at a specific unit.
   d. To let users quickly understand a rule, its basic characteristics should be made explicit early on.
      i. For each rule, the expected and the actual satisfaction level should be indicated.
      ii. The strength (for example, the corresponding p-value) of each rule should be indicated. In simple mode, the scale should be simplified (for example, it can be graded as "weak", "medium" and "strong"). In advanced mode, a numerical representation can be used.
   e. To enable in-depth analysis of rules, the following features should be available:
      i. Textual representation of the rule, as a means of reducing the complexity of the notion of association rules and making them more intuitive.
      ii. Appropriate diagrams that help visualize the rule.
      iii. It should be possible to explore the individual satisfaction responses behind the rule.
In particular, it might prove useful to study the text comments corresponding to the satisfaction records, if such are available. For a negative rule, for example, the responses without a comment should be filtered away and the rest ordered by increasing satisfaction value. The idea here is that when studying the background of a low mean response, it helps to know what the people who were unhappy with the selected parameter had to say.
      iv. A list of other interesting rules should be suggested. These might be rules relevant in the context of the given rule or related to it in some way, for example rules concerning the same or a similar customer segment.

Depending on the nature of the underlying data, the number of possible categories, the number of satisfaction attributes and the significance thresholds used for mining, the number of discovered association rules can vary from a few to thousands. To help cope with a potentially large number of rules, users can be given the possibility to hide uninteresting rules and to mark specifically interesting ones.
   f. It should be possible to mark rules as uninteresting. This can be implemented as a trash list to which users can move irrelevant rules.
   g. Each user should be able to form a personal list containing rules that are highly relevant, so that these rules can be accessed at any time in order to catch up with the latest changes in watched phenomena.
      i. Improvement and degradation of satisfaction levels should be indicated. A graphical representation is suitable, for example a graph showing the development of the tracked value over time, coloured red or green depending on the current trend.
      ii. The point in time since when the rule has been watched should be clearly indicated.
   h. Too many rules should not be shown. However, it should be possible to see more rules on demand.
   i. Trashed and watched lists should not disappear the next time rules are mined.
A mechanism for distinguishing new rules from old ones after the rule base has been rebuilt should be developed. For example, two rules with the same client segment and with the satisfaction value on the same side of the expected value could be considered the same.

Actionable rules are of high practical value. It is important to help users act on discovered rules more efficiently by providing basic tools for sharing and discussing rules.
   j. It should be possible to share rules with other users/categories of users. (It remains to be decided how sharing will work between users and groups of users.)
   k. It should be possible to leave comments on rules.
      i. The comments will be visible to anyone who has access to the rule.
      ii. It should also be possible to assign a user to be responsible for tracing a certain rule.

When deciding on the layout of rule lists, a typical email-client folder structure can be used as a starting point:
· Inbox: Lists the most relevant rules. When in advanced mode, the user should be able to sort and filter this list.
· Trash: Rules that the user does not want to see again are placed here.
· Watch: All watched rules are displayed in this list independently of their current difference from the expected value.
· Shared: Shared rules.
· Assigned: Rules assigned personally to the current user.

As mentioned earlier, the number of rules can be large. Displaying all rules to the user may not be appropriate. In any case, it is desirable to find ways to automatically select the rules that are potentially most interesting for the given user.

2. Develop an automatic filtering and ordering mechanism for displaying rules.
   a. The most relevant/interesting rules should be displayed first.
   b. Statistical significance of rules (the p-value, the difference from the expected satisfaction level, potentially other factors) should be taken into account.
   c.
Can information about trashed and watched rules be utilized to deduce the interestingness of such rules and of rules related to them?
   d. Interestingness of rules on a certain level need not be the same for users at different levels. However, it is assumed that users at the same level share their opinion of what is interesting or not.
   e. Personal relevance preferences of a user can be weighted together with other factors, e.g. the relevance preferences of all users at the same level, when the resulting interestingness of a rule is determined.
   f. Can click-through analysis be used to further improve the filtering mechanism? One could study which rules are clicked on and which are not, in what order rules are explored, and what click sequences look like.

Another way of dealing with a large number of rules is to present the rules graphically in a way that helps to visually identify the most relevant rules at a glance. An additional possible advantage of a graphical representation is that it may be more effective in involving the user in rule analysis.

3. Suggest a method for displaying a set of association rules.
   a. Ideally, the solution should be interactive in a way that encourages further exploration of rules.
   b. If possible, a test on users should be conducted to see whether the suggested visualization method is effective enough.

All of the above should be matched by a corresponding architectural solution.

4. An architectural design capable of accommodating the above features should be suggested. Among other factors, the following considerations should be taken into account:
   a. It should be possible to schedule mining. For example, it should be possible to configure the system to periodically mine data for a specific business.
   b. The solution should scale with the number of users and rules.
   c. The system should be adjustable to different businesses. A set of appropriate control parameters should be suggested.
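The rule-identity mechanism suggested under requirement 1.i above can be sketched as follows. The inclusion of the satisfaction attribute in the identity key, as well as all field names and figures, are our own illustrative assumptions:

```python
def rule_key(rule):
    """Identity of a rule across mining runs: the client segment, the
    satisfaction attribute, and the sign of the deviation from the
    expected satisfaction value."""
    sign = 1 if rule["actual"] > rule["expected"] else -1
    return (frozenset(rule["segment"]), rule["attribute"], sign)

def carry_over(old_list, new_rules):
    """Keep trashed/watched status for rules that re-appear after the
    rule base has been rebuilt."""
    keys = {rule_key(r) for r in old_list}
    return [r for r in new_rules if rule_key(r) in keys]

old_watch = [{"segment": {"age=18-25"}, "attribute": "service",
              "expected": 4.0, "actual": 3.2}]
new_rules = [{"segment": {"age=18-25"}, "attribute": "service",
              "expected": 4.1, "actual": 3.5},   # same rule, new figures
             {"segment": {"age=18-25"}, "attribute": "service",
              "expected": 4.1, "actual": 4.6}]   # deviation flipped sides
print(len(carry_over(old_watch, new_rules)))  # 1
```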
Examples of real-life figures (number of rules, number of users, etc.) that the target system should be able to handle are given for reference in Table 1.

Table 1. Reference parameters of the operational environment of the future data mining system.

Parameter                                 Value
Number of rules                           10^7
Number of users                           10^4
Number of business units                  10^4
Number of unit levels                     4
Number of large customers (10^4 users)    20
Number of small customers (10 users)      1000

3.4 Perceived complexity and workflow

This section discusses the complexity of a data mining tool as perceived by users. We discuss ways of reducing the perceived complexity, in particular by creating a workflow that users can easily relate to and where the inherent complexity of association rule mining becomes irrelevant. When it comes to workflow and user features, this section serves as a complement to Section 3.3. In most user-operated software, and especially in a data mining application targeted at non-expert users, it is of crucial importance to make the working environment intuitive and comfortable for the user. As we have experienced, people often perceive data mining as complex and intimidating. In the context of our application, we want to prompt a different view. We do not want people to think of data mining as something frightening, but rather as a smart technique made for them, with the sole purpose of helping them solve their everyday concerns. As mentioned earlier, our application will in the first place be used by people who are by no means experts in data mining or advanced computer technology. An electronics retail chain is an example of a target client. Individual stores correspond to business units in Figure 8. Store managers are the low-level users of the system in the sense that they only work with data concerning their particular store.
Impact rules will help store managers uncover unexpected tendencies and target their improvement efforts, allowing timely and effective micro-management. Higher-level managers will work with data from combinations of individual stores and will be able not only to study the situation at each store, but also to analyze their unit group as a whole.
At a high level, company management often understands the need for this type of tool. Their main concern is to make sure that the technology is comprehensible and actually used by people at lower levels. At the low level, we will often find people busy with everyday issues who have their own worked-through routines, and it may not be straightforward to convince them that learning yet another technology will be worth the effort.
In turn, the developers of the system need to understand that introducing this type of product into a business is likely to change the way people work and even think about their work. It will create new routines and outdate some of the existing ones. It will push people forward. Hopefully, it will also open new possibilities to improve the service.
To make this transition possible (to begin with) and smooth (ideally), it is necessary to make the users comfortable with the system from the very start. Basically, we want to hide the inner complexity of the system behind an appealing and easy-to-use surface. Although the product can potentially be used for advanced analytics, delivering a research-friendly, data-intensive interface with a multitude of adjustable controls is not the way to go. As mentioned under System requirements, this should rather be implemented as an optional alternative interface. Instead, a straightforward, simple way to view the system and interact with it should be sought.
One dimension of this simplification problem is the actual workflow.
We want to support the user in everyday activities and make the results of data mining useful. The suggested general workflow consists of the following steps:
· Set up data mining parameters.
· Iteratively:
o Mine data.
o Explore results, act when necessary, and follow up in subsequent mining runs.
Setting up data mining is thought of as a one-time task carried out by an administrator. In practice, the setup will likely have to be tuned over several iterations as a deeper understanding of the underlying data and the needs of the users is accumulated. The mining step itself should ideally happen behind the scenes and be transparent to the user. From the user's perspective, the set of rules changes automatically as new data comes in. Whether this is done continuously or periodically is not essential. Rather, it is the last step that is crucial, i.e. what the users can do with the rules and how the system supports them in this.
Consider a familiar analogy. When faced with the task of picking out useful insights from a potentially interesting article, one would work through it, mark the most relevant or promising ideas, think about these more closely, and maybe discuss them with co-workers. When the useful ideas have been selected, one might test them out in whatever setting they need to be applied in and see how it goes. It might prove necessary to divide the identified tasks between several people. Similarly, when a manager is faced with a number of rules discovered during data mining, he should be able to navigate through these, analyze those that seem potentially useful more closely, mark the most crucial ones, and hide those that are irrelevant. Some of the rules might need to be discussed with colleagues, either because they may be relevant for others, or in order to gather ideas.
Further, something can be done about the rules that were found crucial enough (for example, taking a series of improvement measures if a satisfaction parameter for a certain customer group appears to be too low). It should be possible to track such rules to see the effect of the applied actions. When the situation is back to normal, the rule is no longer tracked. Thus, to facilitate working with rules, the following should be available:
· Rule list, in which the user can navigate, sort and filter the available rules.
· Rule explorer, to which the user is taken upon focusing on a rule (for example, via a mouse-click in the rule list). Here, more information about the rule is provided, including graphs visualizing the rule and a list of related rules that might be interesting to investigate in conjunction with the explored rule. Data records behind the rule and other available context information should also be reachable from this view (for example, if survey data is mined, the surveys that the rule is based on, and especially text comments from these, can provide a deeper insight into the nature of the rule).
· Watched rules list, to which users can move those rules from the main rule list that they find highly interesting and want to follow up. Rules in this list remain here until they are explicitly moved back to the main rule list, which can be done at any time. The detailed overview and the graphs for such rules should take the latest available information into account, so that these rules are always up-to-date and reflect the latest changes in data. In this way, the effect of taken measures can be tracked closely. The list of watched rules is user-specific.
· Another possibility worth exploring is to leave watched rules in the main rule list but indicate that they are watched (for example, by marking them with a star; compare with the popular concept of "favourites").
This would let the users compare watched rules with the rest of the rules in whatever way the GUI makes possible. For example, sorting the rules by a specific parameter makes it easy to compare starred rules with the rest.
· Trash list, personal for each user, to which users can move those rules from the main rule list that they do not want to see again. For example, rules involving meaningless combinations of attributes and rules describing well-known phenomena can be moved to the trash list. Such rules will never be shown in the main rule list again.
· It should be possible to leave comments on rules, and these should be available from the rule explorer. It can be practical to let the main rule list, on demand, give a summary of the latest comment updates for each of the shown rules.
· Users at a higher level should be able to assign watched rules to users at lower levels. For example, if a region manager notices an important rule concerning a certain business unit, it should be possible to put the rule on the watched list of the manager responsible for that business unit.
It is believed that by providing the user with the abovementioned tools we induce a workflow that the user can accept and associate with, that puts the unfamiliar concept of impact rules into an understandable and actionable context, and that transforms the complex data mining machinery into an instrument of everyday use.
Another way to reduce the perceived complexity of the system is to critically review the vocabulary of the user interface and replace technology-related terms with suitable, easy-to-understand metaphors. In particular, it might be a good idea to abstain from using the term "data mining" altogether and use a more universal language.
For example, if the system mines unexpected patterns (rules), one might put it as finding new insights rather than searching through a set of categorical-to-quantitative association rules mined from the latest available data. Figure 23 shows an example summary of the most relevant rules labelled as "insights".

3.5 System architecture

A general architectural design for the rule mining system is described in this section. The subsections give details about the main components of the system. In view of the requirements listed previously, and in order to strengthen logical separation of the different stages of the rule mining process, it is suggested to split the rule mining system into four main components:
· Application server runs operations specific to the host application. In particular, the rule handling GUI is implemented here.
· Data mining engine (DME) is responsible for mining satisfaction data.
· Rule server (RS) is the provider of mined rules and the host of the rule ranking module.
· Control module coordinates mining of data for different businesses.
The suggested architectural design is visualized in Figure 9. The architectural diagram is explained component by component in the subsections that follow. The general flow of events in the diagram is as follows:
· The administrator sets up mining preferences for each customer (i.e. business) via the control module, as well as settings and scheduling for MiningDataDistributor for each customer on the application side.
· MiningDataDistributor sends changes in underlying data to the Data fetcher on the corresponding DME server.
· Based on scheduling preferences and available log data, Controller asks Miner macro to start mining when the time is right.
· Miner macro lets Miner micro mine rules for each unit group. At any time, this process may be interrupted by the user via Controller's control interface.
· When rules for all unit groups have been mined, Miner macro reports success to Controller.
· Controller notifies Rule agent of available rule data at the DME server, and Rule agent asks Data fetcher to fetch the rules.
· Data fetcher receives the mined data (together with a classification of unit groups in a form useful for the ranking module; this indicates which unit groups share feedback that is relevant for interestingness) and updates the internal databases.
· Changes in Watch and Trash are continuously reported by the main application to the Rule agent. Ranking is updated accordingly.
· Occasionally, the main application also reports rule usage statistics to the rule server.
· Rule server answers rule queries received from the main application.

Figure 9. Architectural diagram of the rule mining system.

3.5.1 Application server

Application server is responsible for running the main application in a way defined by the specification of the host system.
Ideally, in order to separate the mining module from the application using it, no modifications should be made on the application side when integrating the new module. In reality, however, this is difficult to achieve for at least two reasons. Firstly, additional independent components increase overall system complexity. Also, the host application does not necessarily provide an API allowing external access to its data; in that case, the host system needs to be modified anyway. The following support is required on the application side.
· According to the schedule specified by the administrator, the application delivers data necessary for mining to the corresponding DME server via MiningDataDistributor.
· Depending on data mining settings, the amount of data to be mined may vary. In order to reduce the load on the application, it is suggested to send data to the DME incrementally, thus eliminating the need to send over all data each time mining is scheduled.
· MiningDataDistributor is responsible for converting the data to the format required by the data miner. This format should be independent of the application's data model. Support for the following activities is specifically necessary:
o Separate demographic/behavioural attributes from attitudinal attributes and free text data.
o Create custom attributes. For example, if the attribute "Check-in date" exists, but not "Check-in day-of-week", it will be necessary to create the latter in order to be able to discover rules concerning specific weekdays.
o Some quantitative behavioural attributes may need to be partitioned to make the mining more efficient. For example, the attribute "Price paid" for a service can take many values, and rather than splitting the mining on every possible value, it is advised to partition such a question into 5-10 groups, e.g. $0-15, $15-40, $40-60, etc.
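The partitioning of a quantitative attribute such as "Price paid" can be sketched as follows; the function name and the bin edges are illustrative, not fixed by the design:

```python
def partition_value(value, edges):
    """Map a quantitative attribute value to a coarse group label.
    `edges` are ascending bin boundaries; e.g. [15, 40, 60] yields
    the groups $0-15, $15-40, $40-60 and $60+."""
    lower = 0
    for upper in edges:
        if value < upper:
            return f"${lower:g}-{upper:g}"
        lower = upper
    return f"${lower:g}+"

# e.g. partition_value(27.5, [15, 40, 60]) -> "$15-40"
```

In a real deployment, the edges would be chosen per attribute by the administrator or derived from the data distribution.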
· There should be a GUI allowing authorized users to adjust scheduling and other settings concerning MiningDataDistributor that are independent of the control module.
· Watched rules, comments on rules, rule assignments (to specific users), as well as information about sharing of rules can all be kept primarily on the application side rather than on the rule server.
o In this way, this data will be available instantly to the application, and there will be no need to query the rule server. It might be desirable to display assigned and watched rules to the user often, and no ranking of these rules is necessary as they should be few.
o When any of the above are changed, the rule server should be notified. This should be done instantly in the case of changes to watched rules. When the rest is changed, it may be more appropriate to bundle change notifications with click-through data and occasionally send this information to the rule server in the form of rule usage update blocks in order to optimize system performance.
GUI for working with rules should be integrated into the application to allow for a logical and natural workflow from the user's point of view. This will also result in a clearer separation of responsibilities between the different components of the system.

3.5.2 Data mining engine

The primary task of the data mining engine is to mine data. However, additional components are suggested to ensure compatibility with the rest of the framework.
· Satisfaction data is stored to allow incremental data transfer from the main application to the DME. Thus, data can be mined at any time without placing additional workload on the main application.
· A dedicated component receives incremental updates of the satisfaction data from application servers.
· When requested to do so by the control module, the DME runs data mining on available data for the specified client, saves the mined rules and reports back to the control module when the job is done.
· As mining itself can be time-consuming, mined association rules are stored so that they can be re-sent if errors occur during communication with the rule server.
· The DME returns the latest set of mined rules for the given client upon request from the rule server.
As mentioned under System requirements, each satisfaction record concerns a business unit. Candidate rules concerning different business units can be mined separately, so it is reasonable to use several independent threads to mine different units. However, the candidate rules must then be compared to see whether the described phenomena are indeed exceptional. This can be done once candidate rules for all units have been found.
Another consideration is that business units can be arbitrarily grouped into unit groups. Arbitrary unit groups can be mined by analogy: satisfaction records for all units within the given unit group can be marked as belonging to this group, and then the normal mining procedure can be used. Clearly, information about which unit groups are comparable must be sent from the main application, so that cross-group analysis can determine whether the discovered candidate rules are exceptional or not.

3.5.3 Rule server

Rule server is a dedicated subsystem that provides rules to the main application upon request. Keeping in mind the potential number of users, clients and rules (Table 1), it is believed that this design is more appropriate than keeping the rules on the DME side. Another important task of the rule server is to rank-order rules based on rule usage data. This can be resource-intensive. Due to this, and due to the aspiration to keep the mining components and the application mutually independent, it was decided to keep this functionality separate from the main application. Below is a general summary of what the rule server should be able to do.
· Retrieve mined results from the corresponding DME server when prompted by the control module.
· Data received from the DME server consists of new rules and a classification of unit groups. Each rule is associated with exactly one unit group, and each unit group belongs to one class. In this way, users' feedback can be analyzed separately for each class.
· Receive changes in the lists of watched and trashed rules from the application and update the database accordingly.
· For every user it shall be possible to see which rules are watched and which are trashed. These lists should not become outdated when new rules arrive from the DME server.
· Receive click-through and comments-assignments-sharing statistics from the application and update the database accordingly.
o Comments-assignments-sharing statistics is information about how many times a specific rule has been commented upon, assigned to users, and shared between users.
o This information can be classified according to unit group class. It is also possible to store it on a per-user basis. That would allow more accurate personal preference rankings, but would also complicate the interestingness model.
· Together, data from Filtering DB (a rule filtering database based on watched and trashed rules) and Usage DB (a database with click-through and comments-assignments-sharing statistics) can be used to determine collective and personal interestingness of rules.
· In Filtering DB, the watched and trashed rules should be stored together with their unit group, structure and the corresponding user id. In this way, it will be possible both to filter rules independently for each user and to derive collective intelligence values per unit group class.
· Answer rule queries from the main application. A typical query would be "show Inbox for user X, for unit groups {Xi}".
It is plain to see from the mining algorithm that when data is mined, rules are calculated from scratch, in the sense that no information about existing rules is used.
This means that when a new set of rules arrives at the rule server, the new rules bear no direct connection to the existing rules. Still, information about watched and trashed rules has to be preserved and remain useful when new rules arrive. Thus, it is necessary to define a rule filter equivalence criterion. We suggest that the antecedent and the consequent, together with the sign of the difference from the expected value, be used for this purpose. Thus, if a rule is put on the trash list, it will be assumed that the user means that any rule with this antecedent and this consequent that lies on the same side of the expected value should be hidden. By looking at the sign of the difference from the expected response value, we separate positive and negative phenomena.
However, this method is crude. Consider the situation where the user trashes a rule with a low difference, and the difference then grows remarkably in a subsequent mining run. The rule will still be hidden although it might be of interest. To solve this, the users can be given the opportunity to specify whether they trash a rule permanently, or whether the system should still display new rules that differ remarkably in statistical measures from their trashed counterpart.
For users with access to higher unit groups, the question arises whether the unit group should be taken into account when filtering out trashed rules. We feel it should, and choose not to hide corresponding rules at other nodes. Interestingness estimation mechanisms can further ensure that rules whose counterparts are trashed at another unit group receive a lower rank.
Trend rules require separate consideration. It is necessary to decide how the rule "unit A, floor B, last 14 days → water pressure+" will compare to an identical one three months later, or to a similar rule concerning the last 7 days. One solution is to look at the presence or absence of the trend attribute in the rule rather than its value.
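The equivalence criterion just described — antecedent, consequent and the sign of the difference, with the trend attribute reduced to its mere presence — can be sketched as a key function. Attribute encoding and names are illustrative assumptions:

```python
def filter_key(antecedent, consequent, diff, trend_attr="last_days"):
    """Equivalence key for matching watched/trashed rules against newly
    mined rules: (reduced antecedent, consequent, sign of diff).
    Items are assumed encoded as "attribute=value" strings; a trend item
    such as "last_days=14" is reduced to just "last_days", so 7-day and
    14-day variants of the same rule are treated as equivalent."""
    reduced = frozenset(
        item.split("=")[0] if item.startswith(trend_attr) else item
        for item in antecedent
    )
    sign = (diff > 0) - (diff < 0)  # -1, 0 or +1
    return (reduced, consequent, sign)
```

Two rules compare equal under this key exactly when the filtering rules above say one should hide the other.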
This also goes along the same line as the idea to allow a single trend rule per itemset. Note that in another setting, an alternative scheme may need to be chosen.
The above is consistent with the way rules on the watched list have to be treated. Each watched rule is tied to a specific unit group, and watched trend rules are followed up nearly continuously, so the exact number of days over which the most extreme results have been observed becomes irrelevant compared to the number of days the rule has been watched.
As such, updated counterparts of the watched and trashed rules are not necessarily present in new rule sets. While this is perfectly acceptable for trashed rules, watched rules need to be kept up-to-date. The mining algorithm should therefore not prune rule candidates corresponding to the rules on the watched list.
Another consideration regarding replacing the rule set is that this process should be completely transparent to the users of the data mining module in the host application. All the necessary updates of the internal databases and memory structures should not interrupt normal operation of the system as seen from the user's point of view. To achieve this, the new structures can first be prepared "offline", and when they are ready, the old structures of the rule server can be replaced in a single fast step. The update can also be incremental, i.e. the structures can be updated in an order that does not affect operation of the system. In practice, a combination of the two techniques may be most appropriate.

3.5.4 Control module

The task of the control module is to coordinate mining of rules for different clients. Creating a separate component for this task is motivated by the potentially large number of clients. In this setting, it should be possible to specify the frequency of mining for each client, and let the system take care of the rest of the scheduling.
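The scheduling described in this section — run the due clients, and among jobs competing for the same DME server prefer the one with the shortest recorded past execution time — can be sketched minimally. The job representation is a hypothetical assumption:

```python
import time

def pick_next_job(jobs):
    """Select the next mining job for a DME server: among clients whose
    scheduled time has come, prefer the one with the shortest recorded
    past execution time. Each job is an assumed tuple:
    (next_run_timestamp, avg_past_seconds, client_id)."""
    now = time.time()
    due = [(avg, client) for next_run, avg, client in jobs if next_run <= now]
    return min(due)[1] if due else None
```

A production scheduler would also persist run statistics in the Scheduling DB and handle failures, which are omitted here.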
The control module is supposed to do the following:
· For each customer, hold a schedule of data mining runs.
· For each customer, log statistics about past runs. In particular, execution time should be recorded. This information can be used for smarter scheduling. For example, when two jobs on the same DME server need to be executed in sequence, it might be reasonable to prioritize the job that has previously taken less time.
· Start DME runs according to the schedule. When notified of success by a DME server, notify the corresponding rule server of the availability of fresh data mining results. This opens the possibility of scheduling rule server updates separately.
· Provide a software interface for controlling the schedule and ongoing runs of the DME. This interface can be utilized by an external GUI allowing the user to monitor and influence DME executions. It might also prove helpful to be able to steer the control module from the main application.

3.5.5 Discussion

Here we list some of the design considerations that were not covered explicitly in the previous sections.
· Transfer of data for mining (between the application and the DME) and runs of the DME are independent, so that scheduling can be optimized according to the use of the application and the use of the DME, respectively. For example, if an application instance is most heavily used during daytime, data transfer may be scheduled every night. Runs of the DME, on the other hand, can be scheduled according to the number of clients that need to be mined by the current DME instance.
· When devising the architectural design, several ways of distributing responsibility between the main application and the rule server were considered. They are illustrated in Figure 10.

Figure 10.
Three ways of dividing functionality between the main application and the rule server.
o One option was to move all rule-related functionality to the rule server. However, this could result in too much communication between the application and the rule server, in view of the potential need to access lists of watched rules often.
o Another possibility was to integrate all rule server functionality into the application. In this case, there would be no need for external communication, and the whole design would be simpler. However, as it was unclear how resource-intensive rule ranking would be in practice, such a design could result in a system too slow for practical application.
o Finally, different ways of sharing the responsibilities were considered. With the current design, all ranking is done by the rule server and does not burden the main application; yet the most immediate data (in particular, the lists of watched rules) is available on the application side.
· The control-status connection between Controller and Miner macro allows Controller not only to start data mining, but also to ask for status, and to pause, resume and cancel the mining process.
· The control module and all application servers must know which DME server and which rule server they should contact. To avoid duplication of settings, the control module should notify the corresponding application of any changes in the two addresses.
· Security issues associated with communication of satisfaction data and mined rules between the components of the system are not examined in further detail, as it is implied that the components reside in an internal network of the application host. The setup of this network should guarantee that the individual components, except for the main application, are not reachable from the outside.
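The control-status connection between Controller and Miner macro can be sketched as a small state machine; the state names and the transition table are illustrative assumptions, not part of the specification:

```python
class MinerControl:
    """Minimal sketch of the control-status interface: start, pause,
    resume and cancel commands, with the current state doubling as the
    status reply. Unknown or invalid commands leave the state unchanged."""
    _transitions = {
        ("idle", "start"): "running",
        ("running", "pause"): "paused",
        ("paused", "resume"): "running",
        ("running", "cancel"): "idle",
        ("paused", "cancel"): "idle",
    }

    def __init__(self):
        self.state = "idle"

    def command(self, cmd):
        self.state = self._transitions.get((self.state, cmd), self.state)
        return self.state
```

In the real system, the commands would travel over the control connection and the status reply would additionally carry progress information.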
The suggested design is scalable in the sense that the number of DME servers, rule servers and application servers may vary independently, from one upwards. However, the control module exists in a single instance. In theory, the components can physically reside on the same machine. It is recommended, though, to run DME servers on separate physical machines, since data mining can be highly resource-intensive. Also, each DME server should carry out one mining task at a time. Like an application server, a rule server may be configured to work with several clients.
It is understood that the design outlined in this chapter is relatively complex. Our belief is that the complexity of this design is justified by the fact that it covers a whole range of practical use cases. When applied in a specific setting, the design can be adapted and simplified.

4 Interestingness models

This chapter introduces and motivates two models for rule interestingness calculation: impact, which is based on diff and support, and hotness, which combines statistical significance of a rule with the collective intelligence of all users and the personal preferences of the current user.
As suggested by the variety of existing approaches to rule interestingness estimation (some of which are described in Section 2.3), it is scarcely realistic to find a single universal interestingness measure that captures all the factors relevant to rule interestingness in a perfectly balanced way. In this chapter, we will outline the desired characteristics of the sought-for interestingness model and suggest how these can be fit into two interestingness values.
For each mined rule, the expected satisfaction value calculated in accordance with Bennedich's restrictive model (Section 2.5) is known. Also, the rule's support (the number of responses the rule is based on) and p-value are available. Let diff denote the difference between the actual mean value and the expected satisfaction value.
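The per-rule quantities just defined can be bundled into a small record type for the examples that follow; the field names are assumptions made for illustration:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MinedRule:
    """Quantities available for each mined rule: diff is the observed
    mean minus the expected satisfaction value from the restrictive
    model; support is the number of responses behind the rule."""
    antecedent: frozenset
    consequent: str
    diff: float
    support: int
    p_value: float
```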
4.1 Impact

Recall that unexpectedness of rules is one of the most popular interestingness factors (Section 2.3.1) and the one explicitly named under General vision. Further, the idea of deriving interestingness by means of a user-defined evaluator (for example, as the difference between the cost incurred upon applying the association rule and the profit gained when the rule holds) has been suggested before (see Section 2.3.2). Rule impact, defined below, is a measure combining the unexpectedness of a rule with its value in the business context.
Diff clearly gives a measure of the level of unexpectedness of a rule, especially considering the fact that the calculation scheme of the restrictive model can by design understate this value. Thus, a high absolute value of diff potentially indicates a highly exceptional rule. However, high diff values are typically achieved by rules with low support, i.e. rules concerning few respondents. Moreover, on a business scale, it is more interesting to study rules that are representative for the whole population, i.e. rules that are based on many responses. To account for the above, the impact of a rule can be defined as:

impact = |diff| × support

In order to separate the cases of positive and negative rules, it is recommended to use a simpler definition instead:

impact = diff × support

As simple as it is, this calculation is rather powerful. It captures a business concept and gives a concrete value that is clear and simple to understand and interpret. It is easy to compare rules by impact and to choose which to act on. Moreover, the concept of impact makes it easier to come up with an estimate of how much effort the business is ready to put into dealing with the phenomenon the rule describes. Depending on the absolute value of impact, a target budget can be estimated.
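The signed impact is straightforward to compute and to rank by; a minimal sketch (the example rules are hypothetical):

```python
def impact(diff, support):
    """Signed impact of a rule: the deviation from the expected value
    weighted by the number of responses behind the rule."""
    return diff * support

# Rank hypothetical rules by impact magnitude; the sign still tells
# positive phenomena apart from negative ones.
rules = [("room 12 -> cleanliness-", -0.9, 40),
         ("Monday -> check-in satisfaction+", 0.2, 500)]
ranked = sorted(rules, key=lambda r: abs(impact(r[1], r[2])), reverse=True)
```

Note how the second rule, unremarkable per respondent, outranks the first once support is taken into account — exactly the population-level emphasis described above.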
To account for this, another calculation step can be added on top of the impact:

cost = f(impact)

Furthermore, business units can be compared by the total sum of the costs of their rules, or alternatively by the total impact of their corresponding rules.
However, it is recognized that impact is a crude measure. It only gives a general picture and clearly takes the focus away from local rules, i.e. rules concerning few respondents. At the level of a single business unit, it can be desirable to notice and react to local changes in satisfaction. For example, if there is a problem with a certain room at a certain hotel, it is preferable to find out about it after it has been reported by two guests rather than two hundred. In addition, it is not apparent that the interaction between diff and support used in the impact formula is adequate.
Also, impact is a static measure. It does not change with users' preferences and does not adapt to users' behaviour. On the one hand, this stability can be considered positive: in this sense, impact is a reliable measure. On the other hand, if a rule with a high impact represents a well-known fact and is not perceived as unexpected, it will still be placed high in the ordered rule list.

4.2 Pruning trend rules

The concept of impact is simple and effective. It also proved useful for the problem of selecting a single rule to report from a set of trend rule candidates. In Section 3.2, we mentioned that even when fixed values for minimum support, minimum diff and maximum p-value are specified, several trend rule candidates can be found for the same antecedent. We chose to report only one of these to the user, in order to avoid the confusion anticipated if all trend rules satisfying the threshold criteria were shown.
We rely on the intuition that, in a series of rules with the same left-hand side and the same quantitative attribute that date back to different points in time, the most relevant one will have a sufficiently large impact and a sufficiently small p-value. For a trend rule, let its effect denote the following:

effect = (diff × support) / p

In our experiments with various selection heuristics (rules were randomly chosen among the trend rules found by mining our test database; for each of the test instances, all found candidates were presented to the user), we saw that the trend rule candidate with the highest effect almost always coincided with the rule that would be chosen by a human expert.

Notice that using effect as an explicit interestingness measure available in a GUI is a worse option than using impact. This is because impact is easy to understand for an average user, while effect involves dividing by the p-value, which makes the result difficult to interpret.

4.3 Diff vs. p-value

Let us consider an alternative way of assessing rule relevance through objective values. This time we will concentrate on the diff and the p-value. Recall that the p-value as reported by the restrictive model is the maximum of the p-values calculated with the null hypothesis that the (possibly shifted) mean satisfaction value of the itemset in the rule is equal to that of its father in the itemset lattice built according to the mining scheme by Aumann and Lindell (consult Sections 2.2.3 and 2.5). Somewhat simplified, the p-value is the probability of getting a subgroup from the parent population as extreme as in the observed rule, provided the rule is actually valid. The following can be said about the p-value and the diff:

· Rules with a significant diff should be considered relevant, as well as rules with a low p-value.

· The two factors should be able to compete: a large diff should be able to overshadow a higher p-value, and a low p-value should outdo a small diff.
· The two factors must be balanced properly, as the same absolute difference in one of them does not directly correspond to a difference in the other, and the scales are different (the p-value is a probability varying between 0 and 1, while |diff| can vary between 0 and the maximal allowed satisfaction value).

The above suggests that the resulting significance value could be calculated as:

relevance = f(diff) × g(p)

To elaborate the model further, assume that we could map diff and p separately to two relevance values between 0 and 1, where 0 means “totally irrelevant”, 0.5 means “indefinite”, and 1 means “most relevant”. Then the product of the two mappings would again be a value in the same range with the same interpretation. When comparing rules with different p-values, the difference in orders of magnitude is typically what matters [6], while diff compares well on a linear scale. Moreover, a cut-off point after which all smaller p-values are considered equally good can be used for simplification.

In the case of p-values, the desired mapping could look like [1 ... 10^-3 ... 10^-6] → [0 ... 0.5 ... 1], or in the generalized form [1 ... mid_p ... min_p] → [0 ... mid ... 1]. Thus, in order to map the p-value to relevance, the log function can be applied first, and then the power function can be used to fit the target mapping to the control points (1, 0), (mid_p, mid) and (min_p, 1). We arrive at the following mapping:

rel_p(x) = 1 for x < min_p, and rel_p(x) = (log_min_p x)^a for x ≥ min_p, where a = log_(log_min_p mid_p) mid

This mapping is illustrated in Figure 11a. The figure corresponds to the parameter values min_p = 10^-6, mid_p = 10^-3 and mid = 0.5.

Figure 11. Converting p-value and diff to separate relevance values: (a) mapping from p-value to relevance; (b) mapping from diff to relevance.

For diff, the situation is different. In the context of the target application, a mapping of the form [0 ... x_0 ... x_1 ... x_M] → [0 ... y_0 ... y_1 ... 1] was found reasonable. After a certain point (x_1, y_1), the absolute change in difference has a relatively small impact on the relevance value, as the diff has already become large enough. On the interval [0 ... x_0 ... x_1] → [0 ... y_0 ... y_1], the power function is used to fit the curve to the control point (x_0, y_0), analogously to the p-value mapping described above. The resulting formula is:

rel_d(x) = y_1 (x/x_1)^b for x < x_1, where b = log_(x_0/x_1) (y_0/y_1), and rel_d(x) = y_1 + (1 − y_1) lg(1 + 9(x − x_1)/(x_M − x_1)) for x ≥ x_1

Figure 11b shows the mapping of diff to relevance corresponding to the control parameters (x_0, y_0) = (0.1, 0.5), (x_1, y_1) = (1.0, 0.9) and x_M = 10.0. Finally, the total relevance value can be calculated as:

relevance = rel_d(|diff|) × rel_p(p)

An obvious disadvantage of the suggested formulae is that the control points must be chosen explicitly. However, once these points have been chosen, the resulting relevance value reacts in an intuitive way, boosting the interestingness of rules with a low p-value and a high diff independently, and vice versa. Again, this scheme is not universal either, but it serves as a unified statistical significance value that is based on two important statistical parameters and helps to simplify the overview analysis of the mined rules carried out by the user.

4.4 Collective interestingness

As was emphasized in Section 2.1.7, collective intelligence is an increasingly popular concept in application construction. Applications can be adjusted and personalized by learning from their interactions with users. Information about subjective interestingness of rules can be extracted in various ways from interactions between the users and the system.
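Returning to the relevance mappings of Section 4.3, they can be sketched as follows (Python; the control-point defaults are the example values from the text, and the function names are ours):

```python
import math

def rel_p(p, min_p=1e-6, mid_p=1e-3, mid=0.5):
    """Map a p-value to relevance in [0, 1] through the control points
    (1, 0), (mid_p, mid) and (min_p, 1)."""
    if p < min_p:
        return 1.0
    a = math.log(mid) / math.log(math.log(mid_p, min_p))
    return math.log(p, min_p) ** a

def rel_d(x, x0=0.1, y0=0.5, x1=1.0, y1=0.9, x_max=10.0):
    """Map |diff| to relevance in [0, 1]: a power curve fitted to
    (x0, y0) below x1, then a slowly saturating log10 tail up to
    (x_max, 1)."""
    if x < x1:
        b = math.log(y0 / y1) / math.log(x0 / x1)
        return y1 * (x / x1) ** b
    return y1 + (1 - y1) * math.log10(1 + 9 * (x - x1) / (x_max - x1))

def statistical_relevance(diff, p):
    # boost rules with a high |diff| and a low p-value independently
    return rel_d(abs(diff)) * rel_p(p)
```

By construction, rel_p passes through (1, 0), (10^-3, 0.5) and (10^-6, 1), and rel_d through (0.1, 0.5), (1.0, 0.9) and (10.0, 1.0).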
However, explicitly asking the users to rank-order sample lists of rules, as suggested in [37] (see Section 2.3.3 for a summary), is undesirable in our case. On the one hand, the aspiration is to make the ranking scheme transparent for the user. On the other hand, asking users to specify an absolute ordering can be unreliable [14]. Instead, there are other sources of preference information that are less explicit. For example, click-through is clearly a source of implicit preference data (see Section 2.3.5). In this section, we will concentrate on how interestingness can be deduced from users moving rules to the watched and trash lists. The two actions can be seen as classifying rules as interesting and irrelevant, respectively. Thus, each such event assigns a rule to one of the two categories. If a distance between two rules is known, a nearest-neighbour approach can be used to deduce the interestingness of all rules.

Two difficulties arise, however. First, the number of rules can be large (Table 1), and nearest-neighbour techniques as such become slow as the number of nodes grows (Section 2.1.5). Secondly, and most importantly, when data mining is re-run, the old rules are replaced with new ones, making the old rating information useless. To deal with the first difficulty, rules could be clustered in a way that makes it possible to quickly identify clusters of rules that are too far away from the target rule and thus need not be processed. The second problem, however, suggests that rules should not be involved in the analysis directly. Instead, a more generic structure that persists between mining runs should be utilized. The antecedent of the rule together with the quantitative attribute can be such a structure. However, rules with positive and negative diff are likely to be perceived as principally different. Compare the case when a satisfaction value is found to be higher than expected to the case when it is lower than expected.
The second case will probably be seen as more urgent, as clients may be lost. The first situation, on the other hand, is in the worst case a sign of an unnecessary cost, which is potentially not as harmful. Thus, we may conclude that it is reasonable to include the sign of diff in the rule structure.

The next question is whether to include the values of the categorical attributes constituting the left-hand side of the rule in the rule structure. Including categorical values enables fine-grained interestingness judgements. At the same time, it often adds unnecessary complexity to relevance analysis. Should similar rules concerning different rooms be treated differently? Should rules concerning men and women be seen as principally different? Also, such separation calls for a way of quantifying the difference between different categorical values. How different should people with an annual income of 0 to X, X to Y and Y to Z be considered? One way of dealing with this additional complexity would be to add a fixed penalty for a difference in categorical values. Instead, we choose to define the structure of a rule as the categorical attributes constituting the left-hand side of the rule, the quantitative attribute on the right-hand side, and the sign of diff. This definition of rule structure is used in the remainder of this thesis and is called the rule type.

The concept of rule types solves the two difficulties mentioned above. The number of rule types is in practice significantly lower than the number of rules.
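This difference is easy to make concrete (a Python sketch; the example attribute counts are invented, and the combinatorics assume each categorical attribute is either absent or present, with or without a concrete value, plus the sign of diff):

```python
from math import prod

def n_rule_structures(cat_value_counts, n_q):
    # with values: each categorical attribute is absent or takes one of
    # its c_i values; times n_q quantitative attributes, times 2 signs
    return 2 * n_q * prod(c + 1 for c in cat_value_counts)

def n_rule_types(n_c, n_q):
    # without values: each categorical attribute is merely present or absent
    return n_q * 2 ** (n_c + 1)

# e.g. 5 categorical attributes with 4 possible values each, 10 questions
assert n_rule_structures([4] * 5, 10) == 62500
assert n_rule_types(5, 10) == 640
```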
In particular, note that if the numbers of categorical and quantitative attributes are n_c and n_q respectively, and c_i is the number of possible values of categorical attribute i, then the number of rule structures with categorical attribute-value pairs, a quantitative attribute and the sign of diff (N_RS) and the number of rule types without categorical values (N_RT) are:

N_RS = 2 n_q ∏_{i=1..n_c} (c_i + 1)

N_RT = n_q 2^(n_c + 1)

Also, old rule types remain unchanged when new rules are mined, so the collected feedback is still useful with new sets of rules. The solution presented in this chapter can be adapted to the case when the rule type includes values of categorical attributes, if necessary.

In [6], Bennedich studies an identical setting and advocates memory-based reasoning as a good choice for subjective interestingness estimation, outperforming a naive Bayes classifier and an artificial neural network. He includes categorical and quantitative rule attributes in the rule type and elaborates a function that returns the similarity between a rated rule and an arbitrary rule based on the corresponding rule types. This definition is easy to extend to the case of signed rule types by treating rule types with different signs as completely dissimilar and leaving the calculation unchanged for rule types having the same sign of diff. In his experiments, however, Bennedich assumes that absolute ratings for specific rules are available. In our setting, different users can implicitly leave ratings of either 0 (irrelevant) or 1 (relevant) for potentially the same rules. In what follows, we will try to adapt our setting so that the k-nearest neighbours algorithm can be used for collective interestingness estimation. Here are the core properties desirable for the calculation scheme:

· Raw rating of a rule type (rating without impact from neighbours) should be based on the number of times the rule type appears on the watched and trash lists.
In principle, other factors could be included here as well. We choose to leave such extensions for later.

· The more feedback there is for a rule type, the more it should affect its neighbours. When a rule type has two neighbours, one of which was rated by ten users and the other by two, the first score is probably more reliable and should have a larger effect than the second rule type.

· The weighting should be optimistic in the sense that moving a rule to the watched list should weigh more than moving it to trash. The basic motivation for this idea is that if somebody thinks that a rule is highly relevant and a few more users think it is useless, chances are it in fact is interesting. However, if users consistently trash a particular rule type, it really is likely to be useless. On the whole, trashing an interesting rule is worse than recommending an uninteresting one.

Recall the formula for the weighted k-NN from Section 2.1.5:

(Σ_i x_i × w(d_i)) / (Σ_i w(d_i))

Here, the weight function w(d) decreases with distance. In our setting, the algorithm should average ratings and weigh in the number of votes. Let w_n(n) be the weight function of the number of responses, w_d(d) the weight function of distance, and n(t) the number of responses corresponding to the rule type t. Further, let d(t′, t) be the distance between a rated rule type t′ and the target rule type t, let rating_raw(t) be the raw rating of the rule type t, and knn(t) the k nearest rated neighbour rule types of t. Finally, let rating(t) be the total rating of t, including contributions from neighbour rule types. The following formula is obtained:

rating(t) = Σ_{t′ ∈ knn(t)} rating_raw(t′) × weight(n(t′), d(t′, t)) / Σ_{t′ ∈ knn(t)} weight(n(t′), d(t′, t)), where weight(n, d) = w_n(n) × w_d(d)

The weight function of distance w_d(d) should assign significantly higher weights to nearby rule types. The current definition is taken from [6].
Here, ε avoids division by zero and determines the weight obtained at zero distance:

w_d(x) = ln(1 / (x(1 − ε) + ε))

To weigh in the number of responses, the following function can be used:

w_n(n) = c_1 w_n(1) for n = 0, and w_n(n) = 0.1 + lg(c_2 n) for n > 0

The weight grows with the number of responses, but the rate of growth is logarithmic. The parameter 0 < c_1 < 1 determines how the weight of a rule type with no corresponding responses compares to the weight of a rule type with a single response. The curve can be compressed or expanded along the y-axis by adjusting the value of c_2. Figure 12 shows w_n(n) for c_1 = 0.5, c_2 = 1 and 0 ≤ n ≤ 15.

Figure 12. Weight as a function of the number of responses.

When determining the raw rating of a given rule type, as mentioned before, we would like to boost the impact of positive feedback. Introduce a parameter c that determines how much more watched is worth than trash. The following function can be used, with n = c·n_W + n_T:

rating_raw(t) = 0.5 for n = 0, and rating_raw(t) = c·n_W / n for n > 0

Here, n_W and n_T are the numbers of times the given rule type appears on the watched and trash lists, respectively. Note that rating_raw(t) is a value between 0 (uninteresting) and 1 (interesting). Thus, rating(t) is also a value in this interval. Further, to determine the distance between two rule types, an approach similar to that described in [6] can be used.
If s_c is the similarity between the sets of categorical attributes of two rule types, s_q is the similarity between the corresponding quantitative attributes, and sgn(t) is 1 for rule types with positive diff and −1 for rule types with negative diff, then the distance between two signed rule types can be defined by:

d(t_1, t_2) = 1 if sgn(t_1) ≠ sgn(t_2), and d(t_1, t_2) = 1 − s_c(t_1, t_2) × s_q(t_1, t_2) if sgn(t_1) = sgn(t_2)

If the similarity between individual categorical attributes and between quantitative attributes is known, the question remains how to calculate the similarity between two sets of categorical attributes. Recall the calculation scheme outlined in Section 2.5. The distance there depends on the rating of the neighbour rule type. To overcome this instability, it is suggested to use an in-between variant of that scheme, as if the rating were consistently neutral, i.e. the following formula is used:

s_c(t_1, t_2) = 0.5 × (∏_{q ∈ C(t_1)} max(s(q, r) | r ∈ C(t_2)) + ∏_{r ∈ C(t_2)} max(s(q, r) | q ∈ C(t_1)))

Please note that d(t_1, t_2) as defined above is not necessarily a distance function in the strict mathematical sense. It is non-negative and symmetric (with the updated formula for the similarity of sets of categorical attributes), but the triangle inequality is not necessarily satisfied (depending on how similarity between single attributes is defined). This, however, does not make it less useful in a nearest-neighbour setting. Finally, to reduce the complexity of the model, the number of neighbours k used in the calculations can be fixed.

Given the definitions above, a nearest-neighbour solution for rating rule types given user feedback can be implemented. As such, k-nearest neighbours is a convenient technique to use. The underlying idea is simple and easy to understand, which is especially important in a setting with inquisitive users. A result obtained with a nearest-neighbour algorithm is plain to motivate, as it is just a weighted mean of similar neighbours.
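Putting the pieces together, here is a compact sketch of the whole collective rating calculation (Python; the rule-type representation, parameter defaults and single-attribute similarity placeholder are our assumptions, not the exact implementation):

```python
import math
from collections import namedtuple

# assumed rule-type structure: left-hand-side categorical attributes,
# the quantitative attribute, and the sign of diff
RuleType = namedtuple("RuleType", "cats quant sign")

def s_attr(a, b):
    # placeholder similarity between two single attributes
    return 1.0 if a == b else 0.0

def s_c(t1, t2):
    # rating-neutral similarity between the sets of categorical attributes
    def one_way(a, b):
        p = 1.0
        for q in a:
            p *= max(s_attr(q, r) for r in b)
        return p
    return 0.5 * (one_way(t1.cats, t2.cats) + one_way(t2.cats, t1.cats))

def distance(t1, t2):
    # rule types with opposite signs of diff are completely dissimilar
    if t1.sign != t2.sign:
        return 1.0
    return 1.0 - s_c(t1, t2) * s_attr(t1.quant, t2.quant)

def w_d(d, eps=0.01):
    # high weight at zero distance (ln(1/eps)), zero weight at distance 1
    return math.log(1.0 / (d * (1.0 - eps) + eps))

def w_n(n, c1=0.5, c2=1.0):
    # weight of the number of responses: logarithmic growth
    return c1 * w_n(1, c1, c2) if n == 0 else 0.1 + math.log10(c2 * n)

def rating_raw(n_watched, n_trashed, c=2.0):
    # optimistic raw rating in [0, 1]: "watched" counts c times "trash"
    n = c * n_watched + n_trashed
    return 0.5 if n == 0 else c * n_watched / n

def rating(target, feedback, k=5):
    """Weighted k-NN rating of `target`; `feedback` maps rated rule
    types to (n_watched, n_trashed) counts."""
    nearest = sorted(feedback, key=lambda t: distance(target, t))[:k]
    num = den = 0.0
    for t in nearest:
        nw, nt = feedback[t]
        w = w_n(nw + nt) * w_d(distance(target, t))
        num += rating_raw(nw, nt) * w
        den += w
    return num / den if den > 1e-9 else 0.5  # neutral without evidence
```

With a single exact-match feedback of three "watched" and one "trash" and c = 2, the rating equals the raw rating 2·3/(2·3 + 1) ≈ 0.86.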
Moreover, it is an online technique in the sense that new data can be incorporated into the model at once (with a relatively straightforward implementation, the complexity of adding a feedback to the system is O(k·n_RT), where k is the maximal number of neighbours and n_RT is the number of rule types), and the system does not need to be retrained from scratch, as opposed to, for example, support vector machines. On the other hand, as the discussion above suggests, in the given setting the overall level of complexity of the model is relatively high. It is necessary to choose a function for calculating the raw rating of a rule type, to define a distance, and to build a weight value from distance and the number of responses. When all of this adds up, the choice of the nearest-neighbour approach becomes less obvious. Still, as we will see in the following section, once it is set up, well-tuned and working, it can be effectively used for personal interestingness estimation as well.

4.5 Personal interestingness

Apart from collective user preferences, as mentioned in Section 2.1.7, individual interactions and contributions of a single user are important when a collective intelligence approach is applied in a product. Personalization is increasingly gaining importance in data mining applications [26]. In the previous section, it was shown how feedback from many users can be used to predict the interestingness of rule types. Here, we will focus on predicting the interest of a single user. The nearest-neighbour model described above receives feedback and produces interestingness predictions. In a completely analogous way, this model can predict the interest of an individual user if the only incoming feedback is the feedback from this user. In practice, however, there is a hitch. To predict interestingness by using feedback from all users, only one such structure is necessary.
To personalize the results, a separate model for each individual user is necessary. In the case we are studying, each feedback event classifies a rule as either watched or trashed. For a single user, the number of feedback events will be relatively low. If the number of feedback events available for the current user is n_F, the number of known rule types is n_RT, and the maximal number of neighbours is k, the cost of processing all the user's feedback is O(k·n_RT·n_F). Furthermore, if the set of rule types that need to be rated can be narrowed down to a small number, the procedure becomes fast, as n_RT is replaced by a smaller value (in the best case, the training procedure becomes linear in the number of feedback events).

However, before applying individual feedback, the initial neighbour structure needs to be set up. Especially in the case when little feedback is available, it is important to take unrated rule types into account. For example, if a rule type has nine close neighbours without associated feedback and one far-away positive feedback, the result should probably be closer to neutral. To account for this, the neighbour structure should be built up with neutral ratings (i.e. 0.5) for each rule type. Then, when new feedback arrives, it complements the model and replaces the corresponding neutral nodes. In this way, the predictions made by the model take neutral nodes into account. Training a neutral model requires O(k·n_RT²) operations. Thus, the total training time increases to O(k·n_RT·(n_RT + n_F)), or to O(k·n*_RT·(n_RT + n_F)) if the number of rule types that need to be rated, n*_RT, differs from the total number of rule types. This can easily be avoided if the neutral model is cached. As the set of rule types is constant between data mining runs, a neutral neighbour structure can be calculated once and saved. Then, a personal model for a single user can be built on the fly, as the number of personal feedback events is low.
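A sketch of this caching scheme (Python; rule types are abstract hashable objects here, and the unweighted averaging is a simplification of the weighted scheme described earlier):

```python
def neutral_model(rule_types, k, distance):
    """Computed once per mining run: for every rule type, its k nearest
    neighbours, each carrying a neutral rating of 0.5."""
    model = {}
    for t in rule_types:
        nbrs = sorted((u for u in rule_types if u != t),
                      key=lambda u: distance(t, u))[:k]
        model[t] = {u: 0.5 for u in nbrs}
    return model

def personal_model(neutral, feedback):
    """Built on the fly for one user: copy the cached neutral structure
    and overwrite neutral ratings with the user's own feedback (0 or 1)."""
    model = {t: dict(nbrs) for t, nbrs in neutral.items()}
    for nbrs in model.values():
        for u in nbrs:
            if u in feedback:
                nbrs[u] = feedback[u]
    return model

def predict(model, t):
    # personal interest of rule type t: mean over its neighbour ratings
    nbrs = model[t]
    return sum(nbrs.values()) / len(nbrs) if nbrs else 0.5
```

As a toy example, with rule types 0..4 placed on a line and distance |a − b|, one positive feedback on type 1 pulls the prediction for type 0 from the neutral 0.5 up to 0.75, while the cached neutral model remains untouched.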
4.6 Combined relevance

None of the three relevance values discussed in the previous sections (statistical significance, collective interestingness and personal interestingness) is alone sufficient to describe the interestingness of rules. Statistical significance is based on diff and the p-value and does not take users' preferences into account. Collective and personal ratings, on the other hand, do not rely on objective values at all. Collective interestingness does not directly convey the personal preferences of the current user, while personal interestingness relies only on the user's personal feedback. Moreover, even when collective and personal interestingness models work exceptionally well, objective measures have to be taken into account anyway, as rules with the same rule type will have different levels of severity indicated by diff and the p-value. A combination of the three values would cover all of these aspects. As such, the problem of combining objective and subjective rule interestingness measures is not studied enough [26], and we will try to come up with a reasonable combination suitable for our specific setting.

Statistical significance is a relatively objective value. Users' preferences should be able to compete with statistical significance, but should not overshadow it: statistically strong rules ought to come through in any case. Further, personal preferences should make an impact, but a strong collective opinion should not be overlooked. There are many ways to combine these three values. The approach taken here is illustrated in Figure 13.

Figure 13. Principal scheme for calculating hotness.

Statistical significance is first combined with the collective rating, and the result is further combined with the personal rating in a similar manner.
Note that all three values used here are in the range between 0 and 1, so the result is also a value in this range. The values c_1 and c_2 are combination parameters chosen depending on how the three relevance values are intended to be balanced:

collective score = (c_1 × statistical significance + collective rating) / (c_1 + 1)

hotness = (c_2 × collective score + personal rating) / (c_2 + 1)

It is easy to show that this is equivalent to defining the result as a weighted linear combination of statistical significance, collective rating and personal rating. However, we prefer the notation above. The advantage of this scheme is that it is relatively easy to reason about. The dependency between the input variables and the output value is simple, and manual parameter tuning in accordance with intuitive expectations is possible.

Furthermore, the notation used in the above formula suggests a natural practical scheme for incorporating the personal rating into the calculation. The number of users of the system can be quite large (see Table 1), and keeping an updated personal interestingness model for each user in memory is hardly feasible. If, on the other hand, the model is rebuilt during a rule-fetching query, each rule needs to be associated with the corresponding personal rating, the resulting relevance score computed, and the rule list sorted by relevance. Instead, a combination of statistical significance and collective rating can be used to fetch the candidate rules, with the personal score used to rate the resulting set in the last stage, similarly to what Figure 13 shows. Of course, this produces less exact results than applying the personal rating before fetching the rules. However, in practice the relative weight of the personal rating will be small compared to statistical significance and collective rating.
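A sketch of the combination and the staged fetching just described (Python; c_1 = c_2 = 3 and the over-fetch factor are arbitrary example values):

```python
def hotness(stat_sig, collective, personal, c1=3.0, c2=3.0):
    # two-stage weighted average of values in [0, 1]
    collective_score = (c1 * stat_sig + collective) / (c1 + 1)
    return (c2 * collective_score + personal) / (c2 + 1)

def top_rules(rules, n, base_score, personal_rating, c2=3.0, overfetch=3):
    # stage 1: fetch candidates by statistical significance combined
    # with the collective rating only
    candidates = sorted(rules, key=base_score, reverse=True)[:n * overfetch]
    # stage 2: fold the personal rating into the candidate set only
    def final(r):
        return (c2 * base_score(r) + personal_rating(r)) / (c2 + 1)
    return sorted(candidates, key=final, reverse=True)[:n]
```

With these weights, a rule the user personally rates highly can overtake a slightly stronger candidate in the final ordering, while statistically strong rules still dominate the candidate set.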
Furthermore, more rules than necessary can be fetched in the first stage, the total score computed over these, and the necessary number of top relevant rules of the resulting list returned. For statistical significance defined as in Section 4.3, its value for each rule is fixed and needs to be computed only once after the rules are mined. The collective interestingness of rule types will be updated each time a new feedback arrives. The personal rating for a single user, as described above, will be calculated on the fly and only for a limited number of rule types. It should be noted that other schemes for calculating total interestingness may work as well; more reasoning and experiments are necessary to study this. In our case, the results given by the combination formula discussed in this section were found satisfactory.

4.7 Hotness

Although it is fairly simple to explain the main idea behind the combination formula from Section 4.6, the resulting interestingness value is a combination of several relevance measures that is not self-explanatory, especially considering all the complexity of the inputs. The obtained interestingness is a value between 0 and 1. Showing it to users who are not familiar with the details of the interestingness calculation mechanism is not only of little use; it may even be counter-productive, as it would undoubtedly increase the perceived complexity of the main system and add to the overall confusion caused by the abundance of data mining terminology that users need to deal with anyway. Choosing a display name for the interestingness measure is important. Labelling it “Rank”, “Rating”, “Interestingness” or “Significance” leaves room for misinterpretation and gives no hint of what the value is based on.
“Collective rank” or “Statistical significance” would be misleading, as the value is influenced by other factors as well, and separating the value into components would defeat the main purpose of having a combined interestingness measure in the first place. “Relevance” is acceptable: it can be a combination of several factors and describes the purpose of the value well. However, it does sound technical and somewhat unexciting. Furthermore, displaying exact values calls for a way of interpreting the values in their own right, as well as the differences between them. If the user sees two rules marked with 0.657836 and 0.642398, a natural question arises of how significant the difference between the two values is. That is why it was found important to come up with an alternative, synonymous way to convey interestingness values to the users of the system.

The sought-for metaphor was found in the name “Hotness”, with a graphical representation in the form of coloured versus black-and-white chilli peppers. The logo was filled with colour to the extent corresponding to the total interestingness value, and the rest was left uncoloured. Hotness is a fun name. It catches the user's attention, associates well with the peppers, and suggests a plain interpretation: the hotter the rule, the better. It also makes the rule exploration process sound more like a game and does not call for much further explanation. Practice shows that choosing a suitable name and form of presentation plays an important role in the acceptance of a concept, be it a function, a value, a technique or a system; it may be as important as coming up with the technical solution.
In the context of the major user acceptance models, these two factors fit into Attitude toward behaviour in the Theory of reasoned action and the Theory of planned behaviour, Perceived ease of use in the Technology acceptance model, Affect towards use in the Model of PC utilization, Ease of use in the Innovation diffusion theory, as well as Affect in the Social cognitive theory (see [34] for further details). During an internal evaluation, hotness of rules on the chilli scale was found much more appealing, exciting and easier to accept than the other alternatives.

5 Rule visualization

Section 2.4.4 presents two views of the role of data visualization tools. In our application, we decided to use visualization to support rule exploration and analysis, not as part of the rule discovery process. In this chapter, we study two problem areas. The first is visualizing a set of rules. Here, we look for a way to present relevant rules in a compact and yet informative way that can help the user get a general picture of the most important issues found during data mining. The ambition is to come up with a visualization scheme that helps to build such a picture faster and more easily than by going through an ordered list of rules in text format. Secondly, we look at how standard data visualization techniques can be used to aid users in analyzing individual rules.

5.1 Displaying multiple rules

In the case of the quantitative association rules that we focus on in the present work, the rule antecedent consists of categorical attributes and their values, and the consequent is a quantitative attribute with an associated mean value. The two standard approaches to visualizing sets of association rules (the two-dimensional association matrix and the directed graph) are not directly applicable in this case.
In a two-dimensional matrix (Section 2.4.1), as the antecedents consist of categorical attributes and values, it would be necessary to use these attribute-value pairs instead of antecedent items. Furthermore, as combinations of these are allowed, an extension for several items has to be used. This makes an individual antecedent complex, leading to a complicated matrix even when few rules are included. The second axis, unlike in the standard case, would contain quantitative attributes. For a directed graph (Section 2.4.2), again, categorical attributes with values would either result in many nodes (a node per attribute-value pair) or, if the categorical values were encoded in arcs, make the arcs more difficult to interpret. End-nodes representing the quantitative attributes would have to represent their values as well. When added up, these complexity factors can potentially make the graph pointless, as the idea of having the graph in the first place is to make understanding the rules easier, not to complicate it further. The idea of iterative focusing by Blanchard et al., summarized in Section 2.4.3, although interesting, is not suitable in this context either. We seek a more direct way of displaying a number of rules to the user; the user should be able to get a general picture without (or before) doing further exploration. A form of iterative focusing will, though, be used in another context, namely when rules related to the one the user chose to focus on are presented in the GUI.

The idea by Wong et al. about illustrating the rule-to-item relationship [36] may, on the other hand, be more appropriate. In their association matrix, the rows represent items and the columns correspond to item associations, i.e. rules. For each rule, antecedent items and the consequent are represented by equally high blocks of two fixed colours. Rule statistics (for example, confidence and support) are shown as bars for each column at the top of the matrix.
We take this approach and the extended two-dimensional matrix (where combinations of items are allowed) as the starting points, and combine and adjust them to work in our setting. The most immediate adjustment is to stick to the item-to-rule idea, but to separate the categorical and the quantitative attributes. The consequents occupy the top rows of the matrix, and the antecedents are listed below, each being a categorical attribute-value pair. Each column is a rule: the corresponding categorical pairs are marked, and the corresponding quantitative attribute is marked with an icon that represents relevant rule statistics. In Figure 14, categorical pairs are marked with crosses and quantitative attributes are marked with circles. The size of a circle is proportional to the rule's support, and the colour indicates the p-value (transitions from yellow to red correspond to p-values going from higher to lower). Diff is shown as text under each circle.

Figure 14. Rule-to-item visualization of impact rules.

In a practical implementation, the graph can be interactive to facilitate further analysis. For example, when the user points at a column, the corresponding categories can be highlighted. While this representation gives an overview of a set of rules, it only allows a single rule per column and, more importantly, the number of categorical attribute-value pairs is likely to grow with the number of rules and thus complicate the graph.

An alternative approach is to take a segment-to-item perspective. Instead of looking at rules versus items, we can look at antecedents (in our setting, an antecedent represents a customer segment) versus consequents, as in the classical two-dimensional matrix case.
This at once allows displaying more rules in the same graph (when there are several rules with the same antecedent). Unlike the popular visualization scheme that uses three-dimensional bars (see Figure 6), this graph is genuinely two-dimensional, which means that rules do not mask each other. Further, the crosses in Figure 14 convey no information other than the incidence of attribute-value pairs in specific rules. Instead, if a row corresponds to a categorical attribute, the crosses can be replaced by values; this automatically indicates incidence and solves the problem of the exploding number of pairs. The suggested changes are illustrated in Figure 15.

Figure 15. Segment-to-item visualization of impact rules.

These changes make the graph more compact, but also help uncover structural information in the displayed rule set. By changing column correspondence from rules to segments we make it obvious which rules relate to the same group of responses. By keeping a single row for each categorical attribute we make it plain to see how rules relate to attributes (instead of showing how rules relate to attribute-value pairs). This can be facilitated by another interactive feature: when the user points at an attribute, the corresponding rules are highlighted. Also, column labels can be removed, as they do not convey any relevant information. Note that although the rules within a column correspond to the same segment of responses, their support need not be the same, as it is not guaranteed that every mined data record has all the quantitative attributes filled with non-null values. In theory, this graph is able to show a lot of information in a compact format and make it easy for users to get an overview of the existing rules.
In practice, the number of columns is limited. This means that the number of rules that can be shown is also limited, to s∙q, where s is the number of segments that fit into the graph and q is the corresponding number of quantitative attributes. Furthermore, in the worst case, if each rule adds a previously unused categorical attribute to the graph (for example, when the left-hand side of each rule consists of a single unique categorical attribute) and concerns a previously unused quantitative attribute, the number of displayed rules will be only min(s, q). Notice, though, that this is not necessarily a bad result. The two-dimensional matrix and the rule-to-item approach by Wong et al. suffer from a similar problem. The segment-to-item approach is in any case no worse than rule-to-item. It is guaranteed to be better (i.e. able to show more rules) when most categorical attributes belong to several rules, which is the typical situation in practice. Further, to make the most of the graph, the following populating strategy is suggested:

· Fill the graph until the number of distinct segments in the graph reaches n = min(s, q).
· While there are more rules and free slots left in the graph, add the next rule to the graph if it fits in an already existing segment and the corresponding question slot is not occupied.

Naturally, it is also necessary to check that the total number of attributes does not exceed what can be displayed. Rules that add attributes that no longer fit should be skipped. If the rules are fed into the above algorithm in order of relevance, the graph will in the worst case show min(s, q) relevant rules, and the rest will be a selection of less relevant rules. The presented segment-to-item approach and the corresponding strategy for populating the graph can give a more useful overview of multiple quantitative association rules than a simple rule list ordered by relevance.
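As an illustration, the populating strategy above can be sketched as a greedy pass over the relevance-ranked rule list. The structures below (a rule as a (segment, question, attributes) tuple) are illustrative and not taken from the implementation:

```python
def populate_graph(rules, max_segments, max_questions, max_attributes):
    """Greedily fill a segment-to-item grid with rules given in relevance order.

    Each rule is assumed to be a tuple (segment, question, attributes), where
    `segment` identifies the left-hand side, `question` the quantitative
    attribute, and `attributes` the categorical attributes the segment uses.
    """
    n = min(max_segments, max_questions)
    segments = []        # distinct segments admitted so far, in order
    questions = set()    # distinct quantitative attributes in the graph
    attributes = set()   # categorical attributes currently displayed
    occupied = set()     # (segment, question) slots already used
    shown = []

    for seg, q, attrs in rules:
        new_seg = seg not in segments
        # Phase 1 admits new segments until n distinct segments are present;
        # after that, only rules fitting an existing segment column are taken.
        if new_seg and len(segments) >= n:
            continue
        # The question axis is limited by the graph size as well.
        if q not in questions and len(questions) >= max_questions:
            continue
        if (seg, q) in occupied:
            continue  # this question slot is already taken for the segment
        # Skip rules whose categorical attributes no longer fit in the graph.
        if len(attributes | set(attrs)) > max_attributes:
            continue
        if new_seg:
            segments.append(seg)
        questions.add(q)
        attributes.update(attrs)
        occupied.add((seg, q))
        shown.append((seg, q))
    return shown
```

With rules fed in relevance order, the first n rules establish the segments, and later rules are only shown if they reuse one of those segments, matching the two bullet points above.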
Rule-to-item has the advantage that the individual rules are made more explicit, as each column corresponds to a single rule. The segment-to-item approach elicits more principal dependencies in the structure of the rules. We believe that by studying the correspondence of segments to quantitative attributes in rules, deeper insights into rule structure can be gained. However, individual rules are less explicit, and it is recommended to implement additional interactive visual clues (in particular, highlighting of appropriate attributes) to make it easier for users to study the graph. The presented graph cannot give an overview of all mined rules. Although in the worst case the graph will present relatively few relevant rules and the rest will be filled out with less relevant results, in practice the graph will give a fair overview of the top relevant rules. This is well consistent with what we set out to achieve. Elements of the focusing approach from Section 2.4.3 can be incorporated into the graph by letting the user specify the set of quantitative attributes to focus on (and possibly the sign of diff as well). The rule graph will then present relevant rules with the corresponding right-hand side. By looking at the general overview first, the user may identify an interesting subset of questions and focus on these in particular. This makes even more sense if the questions are logically grouped. Then, if a subset of such a group is represented in the general overview, focusing on the corresponding group will reveal other relevant rules that in the original graph were masked by results more relevant in another sense (for example, statistically). A separate matter worth looking at is how to group attributes and segments to make the results easier to grasp. One possibility is to arrange the attributes in alphabetical order.
The obvious advantage with this approach is that it makes it easy to locate attributes of interest. Attributes can be logically grouped, and then it is appropriate to sort by group and by name within each group. The groups themselves should be indicated on the graph as well. Another strategy would be to display the attributes and segments in the order of appearance in the rule list. This will leave the most relevant results in the top left area, or along the diagonal. In any case, visually the more relevant results will be somewhat grouped together. Segments consisting of several categorical attributes add complexity to this matter, as it is preferable to have the attributes within a segment together in the list. The segments are easier to read this way. Luckily, though, in practice most rules have a single categorical attribute (apart from the unit group attribute and the trend attribute), and the necessity of such grouping is not immediate. Note that the suggested visualization scheme is as such only applicable to rules within a single unit group. For a single business unit, it is likely that the capacity of the graph will be enough to give a good representation of the existing rules. If rules concerning several unit groups need to be presented together, a modification of the graph is necessary. One solution is not to consider the rule’s unit group when filling the graph and then label the rule icon with the corresponding unit group in the graph. Another possibility is to allow several rule icons (corresponding to several unit groups) in the same graph slot. Further, it might be interesting to visualize other dependencies. For example, visualizing the correspondence between sets of categorical attributes and quantitative attributes in rules within a single unit group or several groups will give another structural view of the mined rules. Moreover, more rules can be fit into such a metasegment-to-item graph and give an even broader view of the rule set. 
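The two ordering strategies discussed above can be sketched as follows (illustrative helper functions, not part of the implemented system):

```python
def order_attributes(attributes):
    """Sort (group, name) pairs by logical group, then alphabetically within
    each group, so that related attributes appear together in the graph."""
    return sorted(attributes)


def order_by_appearance(rule_rows):
    """Return attributes in the order they first appear in the relevance-ranked
    rule list, clustering the most relevant results in the top-left area."""
    seen = []
    for row in rule_rows:
        for attr in row:
            if attr not in seen:
                seen.append(attr)
    return seen
```

The same two functions apply unchanged to ordering segments, since a segment is itself a set of attribute-value pairs.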
5.2 Data visualization for individual rules

In the target application, it is useful not only to visualize a whole set of rules, but also to aid users in studying single rules. A simple way to achieve this is to use traditional data visualization techniques to show the data behind a rule. More specifically, we chose to accompany each rule with distribution and trend graphs.

Figure 16. Response distribution graph.

Bar graphs show the distribution of survey responses for each categorical attribute of the rule. Each bar on such a graph corresponds to a response alternative. For each alternative, the confidence interval is displayed as a thin vertical line; it emphasizes the interval where responses are expected to be found with 95% confidence. The bar corresponding to the attribute value in the rule is highlighted. The visual clue is simple: if the top of the bar is outside the confidence interval, the deviation from the expected response value is significant. The number of responses available for each alternative is indicated. Figure 16 shows an example of a distribution graph. When looking at the distribution graph, it may be helpful to know how many null responses there are in the group. If such responses exist, we show them in a separate bar. Further user tests are necessary to find out whether showing uncategorized responses is helpful or just confusing. When the left-hand side of the rule contains several categorical attributes, a bar graph is displayed for each attribute. The graph is built from responses where all the categorical values except for the plotted one are as in the rule. Also, a trend graph is displayed for each rule. For trend rules, this graph contains a single curve that follows the average response to the rule's question over time, and the time period the rule concerns is marked.
For non-trend rules two curves are shown: one follows the responses behind the rule, and the second follows all responses in the rule's unit group. For example, the curve for the rule "unit X, attribute Y → satisfaction" will be accompanied by the curve based on all responses to satisfaction at unit X. The curves are fitted to the response points using weighted linear least squares regression. Such curves are known as Lowess curves [18]. Individual survey responses are marked with crosses. They are spread out before plotting to make the graph more readable and to give the user a better idea of the sample size, which is why the vertical position of the crosses may visually deviate from the allowed discrete response values. For rules from the watched rules list, an arrow at the bottom of the graph shows since when the rule has been watched. Figure 17 shows the corresponding trend graph for such a rule.

Figure 17. Trend graph for a watched trend rule.

A future experiment is to add more curves to compare against, especially in the case of rules with multiple categorical attributes. These curves can include responses from the whole available population (when the categorical attributes on the left-hand side are not unit group-specific), and curves for the other attributes and their combinations. For a rule of the form {a1, a2, a3} →X q, it might be interesting to see the curves for a1 →X q, a2 →X q, a3 →X q, as well as {ai, aj} →X q and {a1, a2, a3} → q. The symbol →X stands for "in unit group X". Another important task is to evaluate the selected curve fitting method. Lowess is computationally intensive and sensitive to outliers. So far, it has shown fully acceptable results. However, if problems arise, other possibilities should be considered.

6 Results and discussion

This chapter describes the solution that was implemented. We start by discussing the architecture and the implemented user interface features.
Further, some of the identified issues and possible solutions are examined, including calculating the similarity between attributes, the need for comments mining and a further simplification of the workflow, as well as the choice of the difference value to show to the users. The use of interestingness measures is discussed, and ideas about how to further improve the implemented visualization tools are presented. The chapter concludes with general remarks about the implemented system. From the very start, it was decided to build the system iteratively, as building a perfect product in one go is scarcely realistic. The first phase, however, was bound to be extensive. The goal was to have the most important system components in place to an extent that would make it possible to start testing the system with real clients soon and involve them in the development of the product in later phases.

6.1 Simplified architecture

Several major simplifications were made to the system to facilitate the first phase of the project, in line with what was accentuated in Section 3.5.5. The simplified system architecture used for implementation is shown in Figure 18.

Figure 18. Simplified system architecture. (The diagram shows the main application with its mining front-end and MiningDataDistributor, the data mining engine with its data fetcher and data miner, and the rule server with its database, watch/trash lists, filters, ranker and rule agent; survey data and mining preferences flow from the application to the engine, and mined rules flow via the rule server back to the application.)

One major difference from the general architecture is that the control module is missing. Although the control module is an important part of the system (see Section 3.5.4), it is fully possible to start using the system without it. Therefore it was decided to postpone its implementation to a later project phase. In the first version of the system, mining needs to be started manually from the host application.
Another simplification is the lack of support for incremental updates of the satisfaction database on the data mining engine side. In fact, no database is used by the DME. Instead, all data that is necessary for mining is sent over from the main application each time mining is due. Also, the mined rules are not backed up on the DME after mining but are sent directly to the rule server. Furthermore, mining is done only on the business unit level in the first project phase. This reduces the applicability of the system but also simplifies the implementation, especially when it comes to GUI functionality. On the data mining side, the upgrade that would allow mining arbitrary unit groups is quite simple and is described in some more detail in Section 3.5.2. However, access to rules is still governed by which business units the user has access to via the unit group the user is assigned to (the user is allowed to browse the rules corresponding to all leaves reachable from the user's node in the unit group structure). Figure 19 shows the simplified data mining engine. The data link between the application and the DME, as well as between the DME and the rule server, was implemented via RMI (remote method invocation), a technique that allows calling methods on remote interfaces. Input and output data structures are serialized in a binary format. This is particularly suitable when large amounts of data are transferred, which is expected to happen in both cases. Potentially millions of records are sent from the application to the DME, and hundreds of thousands of rules (or more) can be generated.

Figure 19. Simplified data mining engine. (Survey data and mining preferences arrive from the application via RMI at the DataFetcher; the IntegratedMiner runs several MinerThreads, and the resulting rules are sent to the rule server via RMI.)
Watched and trashed rules are crucial to the workflow outlined in Section 3.4 and were implemented in the first phase. However, watched rules are not stored or cached on the application side as was suggested in Section 3.5.1. The main idea with storing watched rules on the application side was to boost performance by eliminating unnecessary calls to the rule server. Not doing so is not critical and is fully transparent to the users of the system. Following the discussion in Section 3.5.3, watched rules are replaced with their newly mined counterparts when a new rule set arrives at the rule server. To simplify the initial implementation, watched rules for which no counterparts can be found in the new rule set are left intact. Thus, no changes are necessary on the DME side. This is acceptable for the most part, because focusing on the rule will reveal an up-to-date trend graph that shows the development of the studied mean response value over time, making updates to the formal diff of the rule less relevant. Naturally, the trend rules are marked with absolute dates to keep relative values from going stale: if a rule concerns the last 10 days, tomorrow it will refer to the last 11 days, so it should be made clear what data the rule is based on, i.e. whether the rule is based on the most current data or on data up to a certain date in the past. Although potentially useful, classification of unit groups has been left out. All unit groups are assumed to belong to the same class. This implies that no distinction between preferences of users at different levels will be made, i.e. collective interestingness will be shared among all users of the system (for the same client), in spite of requirement 2.d from Section 3.3, which states that interestingness of rules can be perceived differently by users on different levels.
It is believed that complying with this requirement would not noticeably improve the users' experience, especially considering that only business unit level rules are supported in the first project phase. Distinguishing between collective preferences of users on different levels makes much more sense when there are rules on all levels. Thus, this simplification is justified. Moreover, technically the upgrade does not pose any principal difficulties. Several workflow features, namely leaving comments on rules, sharing of rules between users and the rule assignment mechanism, were also left out of the first phase. These features are important to the workflow, but their principal part is application-specific and includes a lot of GUI engineering, so they can be implemented separately once the rest is in place and the system has been tested on real clients. This also gives the opportunity to start out with a simpler interestingness model that does not rely on this type of feedback. In addition, clients' feedback on the core of the system functionality may suggest reworking parts of the system. It is thus logical to postpone the extra functionality until the core has been tried out and polished. Finally, click-through information is not gathered in the first phase of the project. To begin with, it is not incorporated into any of the interestingness models from Chapter 4. However, implementing click-through data collection is important and should be done in one of the subsequent phases of the project. As suggested in Section 2.3.5, click-through can not only be used for rule interestingness estimation, but also to evaluate existing interestingness models. This is further discussed in Section 6.7. The simplified rule server is illustrated in Figure 20.
The data interface towards the main application was implemented as XML services in order to make the communication more generic and to ensure consistency with the communication schemes used between the application and other components. This can only become a problem when sets of rules are sent back to the application often. However, a significant part of the information about the rules is packed into a string structure before being externalized to XML, partly to avoid the overhead of specifying the various fields that is part of both XML and RMI serialization. Thus, sending rules over in another standard format should not make a principal difference in performance.

Figure 20. Simplified rule server. (The RuleAgent receives rules from the DME over the RMI interface and stores them via the RuleDbAdapter; the RuleCompany answers watch/trash, filter and rule queries from the application through XML services, returning rules ranked by the Ranker and NeutralRanker.)

The rule ranking mechanism was implemented in accordance with Section 4.6. The rule server can handle multiple clients. There is only one interestingness level; therefore, for each client a single collective ranker is trained with user feedback according to the collective interestingness model from Section 4.4. In addition, a neutral ranker is built each time the rule set is replaced (and thus new rule types can emerge). It is later used for personal rating calculation, as was suggested in Section 4.5.

6.2 User interface

The GUI components implemented on the main application side in the first phase of the project include the rule list, allowing the users to browse mined rules; the rule explorer, where rules can be studied in substantially more detail; the watched rules list, where watched rules can be browsed (it was decided not to show the trash list to the users); the mining setup interface, where the administrator can specify mining parameters and initiate mining; as well as the segment-to-item rule overview graph from Section 5.1.
The implemented rule list roughly corresponds to something in between the simple and the advanced viewing modes from requirement 1.c under System requirements. We chose to postpone implementing the truly advanced rule browsing mode, but allowed filtering and sorting rules in the simpler interface anyway. The rule list is illustrated in Figure 21. For each rule, its hotness on the chilli scale is indicated (in the form of coloured chilli peppers, as suggested in Section 4.7), a textual description is given, and the average response value is shown together with the difference from the expected response and the rule's impact (see Section 4.1). Trend rules are marked with a timeline icon (a downward arrow for negative rules and an upward arrow for positive rules). Users can filter the rules by node in the unit group structure, by question category (all quantitative attributes are grouped in the application) and by the sign of diff (positive or negative). As already mentioned in Section 3.2, trend rules have been found to be of special importance, which is why it is possible to filter by the trend attribute as well (present or absent). Sorting by hotness, average response, diff and impact is allowed. The available combinations of filtering and sorting settings allow carrying out focused analysis of the rule set effectively. The watched rules list has a similar format, except that no filtering is necessary.

Figure 21. Rule list.

To let the user quickly work through a list of rules, the textual description is short by default and only contains the right-hand side attribute and the unit group the rule concerns. If a rule seems interesting, the user can see the full textual representation in the rule list as well, as illustrated in Figure 22. The text generation scheme is simple.
Basically, the following formula is used:

The <support> [guests] with <left-hand side attribute-value pairs> [at] <unit group> average <average response> on the [survey question] <quantitative attribute>, which is <diff> [above/below] the expected value.

Rule components fill out the fields marked with <>. Some of the words in the sentence are marked with []. They have to be replaced appropriately depending on the business context and the rule the text is generated for. This includes business-specific terms (e.g. "guest", "survey question") and context-dependent words (e.g. "at" might need to be changed depending on the type of the unit group, whereas the choice between "above" and "below" depends on the sign of the diff). Handling of business-specific terms was in our case already implemented as part of the internationalization API in the host application.

Figure 22. Full textual description of a rule in the rule list.

More details about a rule can be found in the rule explorer. To begin with, the graphs from Section 5.2 were implemented. For each categorical attribute of the rule, the distribution of responses is shown in a bar graph (values for the rest of the attributes are fixed as in the rule). The trend graph shows the responses behind the rule over time, together with the average response at the given unit group for the whole time period to compare against. Following the guidelines from Section 3.4, a selection of individual responses that the rule is based on is made available, together with the corresponding text comments left by the respondents. For negative rules, the lowest responses with filled text fields are shown. In many cases this makes it possible to find the source of the problem. For positive rules, the comments corresponding to the highest responses are shown. In addition, a selection of related rules is given. An obvious idea is to show other rules with the same left-hand side, i.e. rules concerning the same group of responses.
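The text-generation template described earlier in this section can be sketched in code roughly as follows. The function and parameter names are illustrative, and the real system resolves the bracketed words through the host application's internationalization API rather than through keyword arguments:

```python
def rule_text(support, lhs_pairs, unit_group, average, diff, question,
              record_noun="guests", question_noun="survey question",
              preposition="at"):
    """Fill the rule-description template. The bracketed template words
    ([guests], [at], [above/below]) are passed in or derived from the rule."""
    direction = "above" if diff > 0 else "below"
    lhs = ", ".join(f"{attr}: {value}" for attr, value in lhs_pairs)
    return (f"The {support} {record_noun} with {lhs} {preposition} {unit_group} "
            f"average {average} on the {question_noun} {question}, "
            f"which is {abs(diff)} {direction} the expected value.")
```

The choice between "above" and "below" falls out of the sign of diff, while the business-specific nouns are supplied by the caller.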
The importance of other related rules is less obvious, and they are positioned separately on the rule explorer page. These include rules with the same left-hand side attributes at the given unit group. For example, for a rule stating that females are less satisfied with overall service at a specific store, other gender-related rules at the same store are shown. Similarly, rules with the same quantitative attribute at the given unit group are shown. In the example above, these would be other rules concerning overall service at this store. Also, a selection of these two categories at all properties the user has access to is shown. In total, we present four groups of related rules apart from the rules concerning the same group of responses. Since the control module was left out in the initial phase of the project, a relatively complex data mining setup interface had to be implemented in the main application. To begin with, it is necessary to specify the categorical and quantitative attributes to build rules from. When mining for logically advanced rules, it is important to be able to derive values from existing fields and mine the derived values. For example, it might be interesting to find rules concerning the day of the week on which a store is visited. The day of week of the visit can be derived from the date of visit. However, semantically this should be done before data is sent over to the data mining engine, as this type of transformation has nothing to do with the mining process itself. This feature was implemented in the setup interface. When the DME mines unexpected relations, generated rule candidates are tested against other unit groups. If a relation is common, it will not be reported as an unexpected rule and will in our case be identifiable by other modules in the main application. Some of the categorical attributes are unit group-specific, and such rules should not be compared against other groups.
For example, a rule concerning bathroom quality at a specific room in a hotel should be reported directly because room number is an attribute specific to the hotel in question and it makes little sense to compare it with rooms at other hotels. Unit group-specific attributes must be specified in the setup interface. Note that the implemented solution only mines rules at business units. If generic mining is allowed, unit group-specific attributes may need to be specified for each level. Other mining settings include the trend attribute (of all available date fields, a single one will be used for trend rule mining), trend period (the maximum time span to allow for trend rules, e.g. do not look for trends more than 3 months ago), trend granularity (time unit to use for trend mining, i.e. track changes in satisfaction by day, week, month, etc.) and the mining period (only responses in the specified time span will be mined). To allow early pruning of rule candidates, the minimum allowed sample size (the minimum number of responses that a rule can be based on), the minimum diff (the deviation of the observed value from the expected value should be at least this high) and the maximum allowed p-value need to be specified as well. Another important mining-related setting is the number of categorical attributes allowed in a rule. In the implemented data mining engine, the number of attributes can technically be arbitrary. In our experiments, though, we have seen that allowing more than two categorical attributes in the same rule is unlikely to result in useful rules. Rules with more than two attributes are uncommon because first of all the support becomes lower as more segments are combined, and this often either results in lower statistical significance or support below the minimum allowed sample size. Also, useful rules usually have a simpler underlying reason that can be described with few attributes. Hence, we allow rules with only two categorical attributes by default. 
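The early pruning implied by these thresholds can be sketched as a simple filter over rule candidates; the dictionary keys used below are illustrative, not the DME's actual data structures:

```python
def prune_candidates(candidates, min_sample, min_diff, max_p):
    """Keep only rule candidates that satisfy the configured mining thresholds:
    enough responses behind the rule (minimum allowed sample size), a large
    enough deviation from the expected value (minimum diff), and sufficient
    statistical significance (maximum allowed p-value)."""
    return [c for c in candidates
            if c["support"] >= min_sample
            and abs(c["diff"]) >= min_diff
            and c["p_value"] <= max_p]
```

Applying the filter early keeps weak candidates from reaching the more expensive stages of rule generation.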
Note that the unit group attribute and the trend attribute are treated separately. Thus, rules like "The 10 guests with Gender: Female, Waiter: X, and Visit date < 15 days ago at Restaurant Y average…" can be discovered, giving three segments apart from the unit group. In addition, a number of settings that determine how rules are displayed in the GUI are available in the setup. These include the text fields to show in the user comments section in the rule explorer and the fields that will accompany each text comment (for example, the name of the respondent, the date of the visit to the store, etc. can provide better insight into the responses behind the rule). Also, it may be practical to be able to control the number of related rules to show and the maximum number of text comments in the rule explorer, the maximum number of rules in the rule list, the number of rules per page, etc.

6.3 Similarity between rule attributes

Recall from Section 4.4 that it is necessary to know the similarity coefficients between pairs of attributes (separately for categorical and quantitative attributes) in order to determine the distance between rule types, which is in turn necessary for collective interestingness estimation. Recall further from Section 2.5 that Bennedich solves this by automatically calculating correlation coefficients for the two sets of attributes. For quantitative attributes, he computes pairwise Pearson correlation coefficients for the set of available responses. For categorical attributes, the identified rules are examined, and the similarity between attributes is defined by how many rules with each quantitative attribute the different categorical attributes are part of. In Bennedich's experiments, this approach gave good results. In our experiments with the sample databases we had, the coefficients calculated according to this scheme were found less adequate.
Most discouraging was the fact that the obtained coefficients did not correspond to our intuitive expectations about which attributes should correlate strongly. The categorical coefficient matrix was the biggest problem, and several modifications to Bennedich's calculation scheme were tried out. First of all, in the original algorithm a round of smoothing was run on the obtained correlation matrix. The idea was to find "illogical" cases where rik ∙ rkj > rij for some attributes i, j and k, and smooth out such cases by reducing rik and rkj slightly and increasing rij so that rik ∙ rkj = rij. On our datasets this smoothing procedure moved the coefficients further from what was expected, so it was decided to skip the smoothing step. Another idea was to include the quantitative correlation coefficients obtained in the first stage of the coefficient calculation process in the calculation of the categorical coefficients. Instead of simply looking at how many rules there are with each quantitative attribute for a given categorical attribute, one could take the similarity between the quantitative attributes into account. Integrating this into the calculation significantly improved the perceived quality of the obtained categorical correlation matrix. However, the coefficients still did not fully correspond to the intuitively expected results. From the user's perspective, the statistical co-occurrence of attributes in rules is of little significance. What is important is that the system reacts as the user would expect. Intuitively, bathroom-related questions are perceived as similar regardless of what the set of responses or the set of rules suggests. Moving a rule to the watched list or the trash list should affect other rules that are perceived as similar, not necessarily those that are statistically similar.
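The "illogical case" test underlying the original smoothing step can be sketched as follows; only the detection is shown, not the adjustment of the coefficients, and the adjustment magnitudes in Bennedich's scheme are not reproduced here:

```python
def illogical_triples(r):
    """Find attribute triples (i, k, j) where r[i][k] * r[k][j] > r[i][j],
    i.e. two strong indirect links imply a stronger direct similarity than
    the matrix records. `r` is a symmetric similarity matrix given as a
    list of lists with 1.0 on the diagonal."""
    n = len(r)
    bad = []
    for i in range(n):
        for j in range(n):
            for k in range(n):
                if len({i, j, k}) < 3:
                    continue  # i, j, k must be three distinct attributes
                if r[i][k] * r[k][j] > r[i][j]:
                    bad.append((i, k, j))
    return bad
```

Smoothing would then lower r[i][k] and r[k][j] slightly and raise r[i][j] for each reported triple until the product equals the direct coefficient.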
Moreover, the automatic coefficient calculation scheme above is sensitive to the input data, which is not guaranteed to be consistent each time it arrives at the DME. Quantitative coefficients will vary with the available response data, while categorical coefficients will change even when mining thresholds such as minimum support and diff change. To avoid this instability and to make the coefficients consistent with users’ intuition, it was decided to give up the idea of unguided coefficient deduction.

Instead of calculating the coefficients automatically, the coefficients have to be specified in the setup interface. It would of course be tedious to fill in the coefficients by hand, so the idea is to group the various attributes into logical groups. Different attributes within the same group have a fixed similarity coefficient that is less than 1, while the correlation between attributes from different groups is 0. The procedure could be simplified further by using the fact that the attributes were already grouped in the host application for other purposes. Thus, only two correlation coefficients needed to be specified: the correlation between two different attributes in the same group of categorical attributes, and the corresponding value for quantitative attributes.

When data that should be mined is sent over to the DME, a set of mining preferences is sent as well. In particular, the two correlation matrices calculated as described above are sent as part of the preferences data structure. If a more fine-grained coefficient calculation scheme is implemented later, only the corresponding module in the application will have to be updated; the DME and the rule server are independent of the coefficient calculation scheme.

6.4 Comments mining

Text comments from the responses behind a rule can provide valuable insights into the underlying factors that can help explain an unexpected satisfaction mean.
In practice, the suggested scheme for selecting text comments does not give consistently satisfactory results. A typical survey has many satisfaction questions but few comment fields, so the comments can cover a whole range of issues, or a specific issue that does not necessarily coincide with the studied satisfaction parameter. With the implemented comment selection mechanism it is not uncommon that most of the text comments shown in the rule explorer are of no particular interest in the context of the rule the user is focused on, or that a comment is quite long and only a small part of it concerns the question of the rule. What is more, people tend not to comment on positive experiences and mostly write about negative incidents. This is why showing text comments for positive rules rarely works at all, as most of these comments concern other factors that people were not content with.

Clearly, a more rigorous technique for handling text fields is called for. Our early reviews of the product made it clear that the clients realize the high value of text comments in rule exploration, including the case of positive rules. Text mining can be employed to identify the topics raised in the comments as well as the emotional charge of the opinions (positive or negative). With this type of classification engine in place, only comments that actually have something to do with the question of the rule will be shown, and only those that have the right tone, i.e. positive for rules with a positive diff and negative for rules with a negative diff. Text mining was not studied as part of this project, but rather in a separate parallel project. In future phases of the data mining project, the text mining module will be employed in the rule explorer for automatic selection of relevant text comments.
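Once such a classification engine is available, the selection logic itself is straightforward. In the sketch below, `topic_of` and `sentiment_of` are hypothetical stand-ins for the future text mining module, not existing functions:

```python
def select_comments(comments, rule_question, rule_diff, topic_of, sentiment_of,
                    limit=5):
    """Keep only comments whose detected topic matches the question of the
    rule and whose sentiment sign matches the sign of the rule's diff.
    topic_of(c) -> topic label; sentiment_of(c) -> value in [-1, 1].
    Both are placeholders for the (future) text mining module."""
    wanted_sign = 1 if rule_diff > 0 else -1
    relevant = [c for c in comments
                if topic_of(c) == rule_question
                and sentiment_of(c) * wanted_sign > 0]
    return relevant[:limit]
```

For a rule with a negative diff on, say, a cleanliness question, only negative cleanliness-related comments would then reach the rule explorer.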
The following text mining literature can be recommended for the interested reader willing to implement a text mining solution in a setting similar to ours: [40], [41], [42] and [43].

6.5 Streamlined workflow

An idea that emerged during one of the product evaluation sessions is to streamline the rule list further. The rule list as presented in Section 6.2 is still relatively complex and takes time and energy to work through. A user like a business unit manager is typically a busy person, and an easier, more immediate and less overwhelming presentation would in many cases be appreciated, especially when it comes to the number of rules shown. Knowing right away that there are many rules to go through may discourage a busy user from working with the module. A possible solution to this problem is to implement a lightweight version of the rule list that only presents a brief set of rules (the top rules from the rule list, a selection of rules from the watched list, or possibly a combination of both, as suggested in Section 3.4) and lets the user take action on them or dive in to see the details by linking to the rule explorer, similarly to the Google Reader home page module, which lets the user mark items, see what has been read, get tool tips and dive deeper. This lightweight rule list can be shown on the user’s start page in the application, together with other relevant summaries. This automatically brings the idea of working with rules into the everyday set of activities; the user does not necessarily need to actively go to a separate page devoted to data mining. On the other hand, users who work with rules more actively and want to study existing rules closer might prefer the standard rule list. A possible visual design for the streamlined rule list is illustrated in Figure 23. Also, the watched rules list needs to be developed further.
Presently, it simply lists the watched rules, and it is left to the user to track any changes in them. A smarter solution should be worked out. For example, the user could get an automatic notification if a significant change in the average response value of a watched rule has been detected, or if the diff of a rule has changed from negative to positive. Another possibility for making the system easier to work with is to replace all technical terms with suitable counterparts from commonly used language; this is discussed closer in Section 3.4.

Figure 23. Short summary of the top relevant rules (insights).

6.6 Difference value in the GUI

A relevant issue that was identified is which diff value to show to users. Currently, the diff calculated with the restrictive model described in Section 2.5 is used both internally (in particular, to calculate the hotness of rules) and externally (it is shown to the user in the GUI). For rules with a single categorical attribute, diff is the difference between the average response in the given segment and the average for the whole unit group. However, for rules with several categorical attributes the calculation is more complex. Consider the following example:

· unit A → satisfaction = 8.0
· unit A, attribute1 → satisfaction = 7.1
· unit A, attribute2 → satisfaction = 7.5
· unit A, attribute1, attribute2 → satisfaction = 2.0

The last relation is clearly interesting. Let us look at the corresponding diff value. The data mining engine will calculate

mexp = 8.0 - (8.0 - 7.1) - (8.0 - 7.5) = 8.0 - 0.9 - 0.5 = 6.6,

identify the expected interval to be [6.6, 7.5], and finally find a diff of 4.6; only if this diff is statistically significant will it report the rule. This diff, however, is incomprehensible for a regular user of the host application. On the contrary, comparing with 8.0 would be easy to understand (and consistent with the single-category rules).
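The computation in the example above can be reproduced in a few lines. This is only a sketch of the restrictive model as far as the example reveals it; in particular, taking the largest parent average as the upper end of the expected interval matches the figures above but is an assumption about the general rule:

```python
def expected_mean(base_avg, parent_avgs):
    """Start from the unit-group average and subtract the drop contributed
    by each single-attribute parent rule (restrictive model, reconstructed
    from the worked example only)."""
    return base_avg - sum(base_avg - p for p in parent_avgs)

base = 8.0               # unit A
parents = [7.1, 7.5]     # unit A + attribute1, unit A + attribute2
observed = 2.0           # unit A + attribute1 + attribute2

m_exp = expected_mean(base, parents)        # 8.0 - 0.9 - 0.5 = 6.6
expected_interval = (m_exp, max(parents))   # [6.6, 7.5] (assumed upper bound)
internal_diff = m_exp - observed            # 4.6, used by the mining engine
user_diff = base - observed                 # 6.0, easier for a regular user
```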
This could be presented roughly as “The average for the clients at unit A with attribute1 and attribute2 is 2.0, which is 6.0 points below the average for unit A”. In theory, it may also be possible to understand a comparison with the direct fathers of the itemset of the rule. This would lead to the statement “The average for the clients at unit A with attribute1 and attribute2 is 2.0, which is 5.1 points below the average for clients with attribute1 at unit A”, or “The average for the clients at unit A with attribute1 and attribute2 is 2.0, which is 5.5 points below the average for clients with attribute2 at unit A”. However, it is not clear which of the two statements should be shown to the user, and either is a more complex statement than simply comparing with the average response value at the unit as a whole. All in all, when showing rules to the user, it might be better to show the difference from the unit group average. The question then remains which difference to use when calculating hotness and impact. Using the internal diff may be misleading, and it makes impact hard to relate to. On the other hand, in our experiments rules with several attributes were significantly fewer than rules with a single attribute, so the side effects of changing the diff are likely to be limited. Further experiments are necessary to see whether this solution is good enough.

6.7 Interestingness

The individual strengths and weaknesses of impact and hotness have already been discussed in detail in Chapter 4, where these interestingness measures were introduced and described. When combined, we noticed that the two interestingness values shown in the rule list may be somewhat confusing for average users. Our idea with providing these two measures was that, as they are so different, they can be used together to give a better picture of the relevance of rules.
Real users wonder instead which of the two parameters is most important for judging rule relevance. Rather unexpectedly, some users looked at the diff more often, as they found it easier to relate to. One possibility is to hide the hotness column but order rules by hotness anyway (this is especially suitable for the lightweight rule list described in Section 6.5). The problem with not having that column is that if the user then sorts on one of the other columns, it will not be possible to get back to the original ordering again. Then perhaps sorting on other columns should not be allowed: hotness is not shown, but the rules are sorted by the hotness value. For business unit users this may be suitable because the number of rules is low enough that it is possible to go through all of them without needing to sort by different parameters. On the other hand, this could be a more severe limitation for higher-level users.

More importantly, however, hotness cannot yet be relied upon as the one appropriate rule relevance measure. Although the implementation behaves as expected in accordance with the models from Sections 4.3–4.6, no sufficiently strict formal justification of why these models are appropriate was given, apart from an explanation relying on intuitive expectations. Of course, as was pointed out earlier in this text, interestingness as such is a subjective term, and there is no single right answer to the question of which rules are truly interesting. If there were a way to know the “right” interestingness values for at least a small example set of rules, and the hotness model were believed to be fully appropriate, optimization techniques could be used to automatically find the model parameters that were so far chosen by hand. Unfortunately, no such training set is available. In an attempt to find a solution to this problem, it can be worth experimenting with the ideas by Joachims summarized in Section 2.3.5.
The simplest starting point is to try out his scheme for evaluating the relative quality of two retrieval functions, by comparing sorting by hotness and by impact, or by hotness and by diff, or by diff and randomly, etc. Further, as shown in [22], click-through data can be used to build effective interestingness models. The question is what features should identify a rule in this setting and be used as input variables in the corresponding optimization problem. Presence or absence of categorical and quantitative attributes could be used, but then a combination measure like the one suggested in Section 4.6 would still have to be employed. Even if that works out in practice, such a solution would not be as responsive as the hotness model: in order to take new feedback into account, the optimization problem needs to be solved again, so this would have to be done periodically. Further, depending on the performance of the SVM and the amount of available click-through data, calculating the personal rating with this method can become infeasible. In any case, implementation of a mechanism for collecting click-through data should be prioritized, to make sure that the necessary feedback is accumulated while further research and experiments are carried out. For now, it is recommended to leave both hotness and impact visible in the rule list and gather more feedback from the users.

6.8 Visualization

When it comes to visualization of rules, the three graphs from Chapter 5 were implemented. The segment-to-item multiple rule graph was implemented in Flash in its simplest form, which can only handle rules within a single business unit. The distribution and trend graphs for individual rules are, on the other hand, generated on each load of the rule explorer as non-interactive static images. The graphs, and especially the segment-to-item visualization, are extensively discussed one by one in Chapter 5.
Some of the concerns and considerations for the future, apart from what has already been said, are presented below. An important implementation consideration is how to select data for building the distribution and trend graphs. If a rule has high support, plotting all individual responses on the trend graph will make it unreadable and take a long time. Moreover, even when the support of the rule is low, the average response curve for the whole business unit over the whole mining period is displayed, and the number of responses behind it can be unnecessarily large. When the number of available responses is large, a random selection of a fixed number of responses should be used instead. To make the graph stable when the rule explorer page is reloaded, a fixed seed value should be used. Another obvious idea is to reuse graphs that have already been rendered.

However, ideally a different implementation should be used for the distribution and trend graphs. To make them more useful and to involve the user in data exploration, the graphs should be made interactive. Among other things, the individual responses in the trend graph should be made clickable and link to a page with detailed information about each response. Also, focusing and zooming on a part of the graph should be implemented. If the number of available responses is large, this will allow studying the individual responses in a given area that are not shown otherwise. For the distribution graph, it should be possible to focus on a bar and explore the responses behind it. In the future, other visualization techniques should be looked at as well. As already suggested in Section 3.2, decision trees can be useful for getting a better understanding of data.
At least two possibilities should be experimented with: building a tree based on the unit group and the quantitative attribute of the rule (although constructing trees with quantitative target attributes is not straightforward), and building trees freely as an additional tool for data exploration. Also, it may be worthwhile to experiment with more advanced graphs than the relatively simple distribution and trend graphs. In particular, visualizing data in a 3D graph as was done in [25] can be both helpful and stimulating for the user. For example, for rules with two categorical attributes a 3D graph can be built showing the two categories on the x-y plane and the quantitative attribute on the z-axis. Note that it is also possible to use the segment-to-item graph for a similar type of visualization by looking at it from the category-to-category perspective. The somewhat over-ambitious rummaging approach presented in [9] might be interesting in the context of data exploration rather than rule exploration. If a comprehensive visual data exploration module is integrated, the user can be given the possibility to explore the whole available data set interactively by focusing on various attributes and attribute groups.

6.9 General remarks

The general system architecture presented in Section 3.5 has been found adequate so far. Based on it, a real functioning rule mining system was implemented and integrated into a live product. Although the architecture and the initially suggested functionality were substantially simplified, as described in detail in Section 6.1, most of the simplified and postponed features are planned to be implemented at fuller scale in later phases of the project. Overall system performance as known at the time of writing is fully satisfactory. In spite of the initial performance concerns, the system actually performed better than expected.
Mining a hundred thousand surveys with dozens of categorical and quantitative attributes (tested on hospitality, electronics retail, weather and economy data sets) takes a few minutes on a regular portable computer, from initiating a mining run in the setup interface until the new rule set can be browsed in the rule list, including data preparation and communication between the components of the system. In conjunction with this, the possibility to couple the data miner and the rule server with the main application, so that they can be started and stopped synchronously, was requested as a separate feature. The system’s response to user actions is smooth; on the GUI side, most of the time is spent dealing with technicalities imposed by the host system. More details will be available once a framework for unit testing of GUI features is in place; it is under construction at the time of writing. The rule server handles multiple clients and users fast enough, as confirmed by stress testing with unit tests. Scalability and applicability of the architecture were discussed closer in Section 3.5.5.

Naturally, there are numerous possibilities for optimization, from improvements of a purely technical nature (for instance, optimizing serialization of communicated objects in order to speed up data transfer between the components of the system) to more substantial algorithmic changes. For example, if the system’s response to incoming user feedback is found to be too slow, one possibility is to employ clustering of rule types so that the most distant neighbours are skipped without being looked at when the collective rank of rule types is updated. However, such optimizations are left for the point in time when performance becomes an issue. In its current form, the data mining module will be most valuable for business unit managers.
Implementing support for rules at higher levels in particular, and arbitrary unit groups in general (both in terms of mining and of working with them via the GUI), will make this tool significantly more universal and applicable. As mentioned earlier, the mining part of this upgrade is relatively straightforward. The main question will be how to present such rules to make them truly useful: whether rules at the current unit group and its subgroups should be merged into a single list or summarized separately, and how exactly this should be done to make sure the user is not overwhelmed with too much information to consider. Further, when mining of rules at different levels is implemented, different mining parameters will have to be used. At the business unit level it might be interesting to see rules with a small support (for example, to be able to identify a problem in a single room at a hotel), while such rules are probably irrelevant at the region or country level. On the other hand, it might still be interesting for users at the middle level to see rules with a small support at individual properties, for example in order to inspect small problems as a means of controlling the quality of service at the associated business units. When rules at multiple levels become available, it will be necessary to find a proper balance between the flexibility of the system and the ease of using it.

Another high-priority update is to implement the control module, or at least parts of its functionality. The most urgent function is the ability to schedule mining runs. In the beginning, the simplest form of scheduling can be implemented via the setup interface of the main application, where the user specifies how often mining should be re-run and the system initiates updates automatically, irrespective of other tasks executed by the application or by the target DME.

7 Conclusions

This chapter wraps up the thesis.
We summarize what has been done and look at how it corresponds to the goals of the project. We also briefly reiterate the ideas about how this work can be extended and what areas should be studied further.

In the current paper, we developed a practical framework for quantitative association rule mining and gave recommendations on how to present the discovered rules to end-users. In particular, we accomplished the following:

· Designed a system for mining and working with quantitative association rules.
· Selected a suitable mining technique and suggested an alternative mechanism for generating rules from trend rule candidates that is believed to be more appropriate than the original scheme.
· Developed a workflow allowing users with limited technical background to make effective use of the data mining results to support their everyday work.
· Suggested a general architecture for the rule mining system that covers a range of practical applications, spreads logically separate functionality between several independent modules, and scales well with the number of clients served by the system.
· Reasoned about what constitutes a relevant rule in the setting studied in this project and built two interestingness models that together cover the factors relevant for rule interestingness reasonably well. Impact is a measure that is easy to understand and that connects well with the business context. Hotness is an experimental interestingness measure that gives a user-specific interestingness value for each rule; it combines the statistical significance of the rule with the collective intelligence of all users and the personal preferences of the current user.
· Suggested the segment-to-item technique for visualizing multiple rules in a compact and accessible format, as well as the metasegment-to-item approach, a further generalization that groups rules by structure. In addition, data visualization for illustrating individual rules was discussed.
· Implemented a working system based on these ideas and integrated it into an industrial software product, proving that the developed techniques and designs are feasible in a practical application.
· Listed concrete ideas for improvement and areas for future research, including:
  o Further simplification of the workflow. The lightweight rule list was introduced and the importance of term substitution was emphasized.
  o Evaluating the interestingness models suggested in this paper through user tests and by analyzing click-through data. In particular, the possibility of deriving an interestingness model from click-through data should be investigated closer. A possible approach to how this can be done was sketched out.
  o Further development of the visualization tools through improved interactivity and more advanced graphs.
  o Implementing a text mining module capable of extracting feature and opinion information from text comments in order to give better context information about the rules.

We therefore conclude that the goals of the project outlined in Section 1.2.1 were achieved.

It should be emphasized that this paper presents work in progress; the results presented here are not final. Although much has been done, a lot is left for subsequent project phases. In particular, a proper evaluation of the results has not been carried out yet. Many of the suggested ideas are based on intuition about what is necessary and appropriate, and more reliable conclusions can be drawn once the solution described here has gone through a few more development phases and has been exposed to real users for some time. It is believed that the ideas, requirements, comments and general reasoning presented in this paper will be of practical value for others developing a rule mining system, as it discusses a broad range of practical and theoretical details that need to be considered in a similar context.
Probably the most significant contribution of this work is that it shows how association rule mining can be brought closer to real users in all parts of the executive hierarchy, and how it can be transformed into a tool used on a daily basis to help people do their job better, rather than reserving the technique for expert users or using it to facilitate exclusive high-end consultant projects.

References

The books and articles referenced in the current thesis are listed below, ordered alphabetically by the names of the authors.

[1] Aggarwal, C.C. Towards effective and interpretable data mining by visual interaction. SIGKDD Explorations 3(2), pages 11-22, 2002.
[2] Agrawal, R., Imielinski, T., and Swami, A. Mining association rules between sets of items in large databases. Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, pages 207-216, New York, NY, USA, 1993. ACM Press.
[3] Agrawal, R. and Srikant, R. Fast algorithms for mining association rules. Proceedings of the 20th International Conference on Very Large Databases, pages 487-499, San Francisco, CA, USA, 1994. Morgan Kaufmann Publishers Inc.
[4] Alag, S. Collective intelligence in action. Manning, 2008. ISBN 1933988312.
[5] Aumann, Y. and Lindell, Y. A statistical theory for quantitative association rules. Journal of Intelligent Information Systems, 20(3): pages 255-283, 2003.
[6] Bennedich, M. Mining survey data. Technical report, Royal Institute of Technology, Department of Numerical Analysis and Computer Science, Sweden, 2008.
[7] Bennett, P. N. Assessing the calibration of naive Bayes’ posterior estimates. Technical report CMU-CS-00-155, Carnegie Mellon University, School of Computer Science, Pittsburgh, PA, 2000.
[8] Berry, M. and Linoff, G. Data mining techniques: for marketing, sales, and customer relationship management. 2nd edition. Wiley, 2004. ISBN 0471470643.
[9] Blanchard, J., Guillet, F., and Briand, H. Exploratory visualization for association rule rummaging. In MDM/KDD’03, Washington, DC, USA, 2003.
[10] Boser, B. E., Guyon, I. M., and Vapnik, V. A training algorithm for optimal margin classifiers. In 5th Annual Workshop on Computational Learning Theory, pages 144-152, Pittsburgh, 1992. ACM Press.
[11] Burges, C. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery 2, pages 121-167, 1998.
[12] Chen, Q. Mining exceptions and quantitative association rules in OLAP data cube. Technical report, Simon Fraser University, Canada, 1999.
[13] Cortes, C. and Vapnik, V. Support vector networks. Machine Learning, 20: pages 273-297, 1995.
[14] Frei, H. and Schäuble, P. Determining the effectiveness of retrieval algorithms. Information Processing and Management, 27(2/3): pages 153-164, 1991.
[15] Fukuda, T. and Morishita, S. A visualization method for association rules. Technical report DE95-6, Institute of Electronics, Information and Communication Engineers, pages 41-48, 1995.
[16] Hahsler, M. Annotated bibliography on association rule mining. Visited on 2008-09-17. http://michael.hahsler.net/research/bib/association_rules/
[17] Hand, D., Mannila, H., and Smyth, P. Principles of data mining. MIT Press, Cambridge, MA, 2001. ISBN 0-262-08290-X.
[18] Hutcheson, M. C. Trimmed resistant weighted scatterplot smooth. Technical report, Cornell University, Ithaca, NY, 1995.
[19] Jain, A. and Dubes, R. Algorithms for clustering data. Prentice Hall, 1988.
[20] Jaroszewicz, S. and Scheffer, T. Fast discovery of unexpected patterns in data, relative to a Bayesian network. Proceedings of the 2005 ACM International Conference on Knowledge Discovery in Databases, pages 118-127, 2005.
[21] Joachims, T. Evaluating retrieval performance using clickthrough data. Proceedings of the SIGIR Workshop on Mathematical/Formal Methods in Information Retrieval, 2002.
[22] Joachims, T. Optimizing search engines using clickthrough data. Proceedings of the ACM Conference on Knowledge Discovery and Data Mining (KDD), 2002.
[23] Keim, D.A. Information visualization and visual data mining. IEEE Transactions on Visualization and Computer Graphics 8(1), pages 1-8, 2002.
[24] Liu, B., Hsu, W., Chen, S., and Ma, Y. Analyzing the subjective interestingness of association rules. IEEE Intelligent Systems, 2000.
[25] Nagel, H.R., Granum, E., and Musaeus, P. Methods for visual mining of data in virtual reality. In PKDD International Workshop on Visual Data Mining, 2001.
[26] Natarajan, R. and Shekar, B. Interestingness of association rules in data mining: Issues relevant to e-commerce. Sadhana Vol. 30, Parts 2 & 3, pages 291-309, 2005.
[27] Obata, Y. and Yasuda, S. Data mining apparatus for discovering association rules existing between attributes of data. US patent US006272478B1, 2001.
[28] Pearl, J. Bayesian networks: A model of self-activated memory for evidential reasoning. In Proceedings of the 7th Conference of the Cognitive Science Society, University of California, Irvine, CA, pages 329-334, 1985.
[29] Säfström, M. Analysis of a support vector machine for visual classification. Technical report TRITA-NA-E05166, Royal Institute of Technology, Department of Numerical Analysis and Computer Science, Sweden, 2005.
[30] Segaran, T. Programming collective intelligence. O’Reilly, 2007. ISBN 0596529325.
[31] Shen, X. and Zhai, C. Active feedback in ad hoc information retrieval. Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 59-66, 2005.
[32] Srikant, R. and Agrawal, R. Mining quantitative association rules in large relational tables. Proceedings of the ACM SIGMOD Conference on Management of Data, 1996.
[33] Tan, P., Kumar, V., and Srivastava, J. Selecting the right interestingness measure for association patterns. Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM Press, pages 32-41, 2002.
[34] Venkatesh, V., Morris, M., Davis, G., and Davis, F. User acceptance of information technology: toward a unified view. MIS Quarterly Vol. 27 No. 3, pages 425-478, 2003.
[35] Webb, G.I. Discovering associations with numeric variables. Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 383-388, New York, NY, USA, 2001. ACM Press.
[36] Wong, P.C., Whitney, P., and Thomas, J. Visualizing association rules for text mining. Proceedings of the IEEE Symposium on Information Visualization, pages 120-123, 1999.
[37] Xin, D., Shen, X., Mei, Q., and Han, J. Discovering interesting patterns through user’s interactive feedback. Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 773-778, 2006.
[38] Zhang, H., Jiang, L., and Su, J. Augmenting naive Bayes for ranking. Proceedings of the 22nd International Conference on Machine Learning, Bonn, Germany, 2005.
[39] Zhang, Z., Lu, Y., and Zhang, B. An effective partitioning-combining algorithm for discovering quantitative association rules. Proceedings of the First Pacific-Asia Conference on Knowledge Discovery and Data Mining, 1997.

Several relevant sources within text mining not directly used in this thesis are listed below for the benefit of the interested reader.

[40] Hu, M. and Liu, B. Mining opinion features in customer reviews. AAAI-2004, San Jose, USA, July 2004.
[41] Hu, M. and Liu, B. Mining and summarizing customer reviews. KDD’04, Seattle, Washington, USA, August 2004.
[42] Liu, B. Web data mining. Springer, 2007. ISBN 978-3-540-37881-5.
[43] Spangler, S. and Kreulen, J. Mining the talk: unlocking the business value in unstructured information. IBM Press, 2007. ISBN 978-0-13-233953-7.