Visualizing interestingness

Fabio Di Fiore
Research & Development Dept., Information Inventor srl, Italy
[email protected]

Abstract

Many recent techniques try to evaluate the interestingness of patterns automatically extracted by data mining algorithms. In this paper we propose an interestingness measure, suitable for visual analysis, that looks at the concept of unexpectedness from a data-driven perspective, addressing both objective and subjective issues. The proposed approach shifts the focus from patterns to the data underlying them, considered as the primary source of information: data can be visualized more effectively than any artificial, however compact, representation of it. Starting from the well-known star schema model, we propose a new operational definition of patterns, based on the visual evaluation of the distributions of the involved dimensions. We then suggest a suitable scheme for evaluating the interestingness of patterns based on this representation: we show how to choose the best ordering of the categories of a dimension, which depends strictly on the measure involved. To cope with the subjective nature of interestingness, we propose a supervised learning approach, based on the same visually driven perspective, that customizes interestingness with respect to users' expectations. The last section of the paper supports our claims by outlining results we have obtained from our system at work.

1 Introduction

Whatever business managers have to administer, they nowadays have to deal with a huge amount of data, typically collected through past years of activity: the more detailed the information, the more abundant the data. The basic objective of data mining techniques is to extract useful patterns (i.e. actionable knowledge) from these large databases. State-of-the-art systems, however, still discover too many redundant patterns. In the past few years many attempts [1], [2], [3], [4] have been made to identify the best criteria for evaluating the interestingness of patterns, as only a few of them are really relevant to the decision makers. So far, two main classes of approaches to the evaluation of interestingness have been identified: the techniques belonging to the first group are based on objective measures of interestingness, the others on subjective ones. Objective measures, such as confidence, support, strength and simplicity [5], [6], [7], [8], are focused on the statistical strength of a pattern, while subjective measures [2], [9] are based on the assumption that the interestingness of a pattern strongly depends on the personal expectations of the analyst. In this paper our main concern is to define an effective interestingness measure that lends itself to visualization, so that the final users of the system can easily understand it. Following [1], [10], [11], we consider as interesting patterns those exceptions found in subsets of a reference population, and the measure we use to evaluate the interestingness of those exceptions is based on a comparison between values in the reference population and values found in the given subsets. At the same time, we suggest a way to subjectivize the given visual interestingness by means of a supervised learning approach.

2 Pattern representation

Agrawal et al. [8] catalog data mining tasks in three main categories: classification, association, and sequences. The most popular techniques for classification tasks are neural networks and decision trees.
Neural networks encode knowledge as weights of neurons, thus making it inaccessible: in this case there is no explicit representation of a pattern. In the second approach the classification rule is encoded in the decision tree itself (which is, by definition, a set of rules); other situations exist where the task of data mining is to discover association rules in the form of logical implications, in which case a pattern is itself a rule. This is also the most usual case; indeed, a pattern is often considered (implicitly or explicitly) a rule [4], [9]. Here we apply these concepts to the star schema framework, in order to exploit its descriptive power. In the context of multidimensional data analysis, also known as OLAP (On-Line Analytical Processing), a model has been developed which is able to capture the business analyst's point of view. This model, named the star schema, given its power and utility, is now the de facto standard for representing any context for data analysis purposes. The star schema represents the context under examination in terms of dimensions and measures (nominal and numerical variables), and the strong connection between this approach and the search for patterns in databases is quite evident, as already emphasized by [12]. In fact, [12] suggests that in the context of OLAP the most useful analyses look at the values of the measures which correspond to specific given values of the dimensions, and puts this information in parallel with that conveyed by patterns. The search for patterns can thus be defined as the search for a conjunction of conditions (i.e. constraints) able to select subsets of data in which interesting (i.e. unexpected) values of the measures are found. The approach just described enables a classical representation of patterns. An example of a pattern is the following:

IF Age = "15-20" AND Sex = "F" THEN Favourite_Magazine = "FASHION",
CONF = 60%, SUPP = 3% (Annual_Expense)    (1)

where:
- Age, Sex and Favourite_Magazine are dimensions;
- "15-20", "F" and "FASHION" are values for the respective dimensions;
- CONF is the confidence and SUPP is the support, both computed on the given measure (Annual_Expense).

It is worth noting that (1) shows a pattern with a condition expressed over a dimension in its then branch; even if this is not the only possible case, we concentrate on this type of pattern for the rest of the paper. The given representation can also be generalized to the case in which there is more than one measure and conditions are also defined over measures. From the star schema model we borrow its strict separation between dimensions and measures and its conception of interesting analyses as the search for specific combinations of conditions defined over the dimensions. Specifically, we exploit the fact that every pattern of the type shown in (1) is to be evaluated on a distribution of a dimension with respect to a measure (in the example, Favourite_Magazine vs. Annual_Expense). Our main purpose is to find a representation of patterns that simplifies the understanding of interestingness concepts for non-technical people. So in the next sections we describe a representation that can be easily visualized. In fact, our first step is to shift attention from the rule representation to the data underlying it.
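To make this representation concrete, the following minimal Python sketch models a pattern of the form (1) over a flat fact table held as a list of records carrying dimensions and one measure. All names (Pattern, select_subset, confidence_and_support) are our own illustrative choices, not part of the system described in this paper, and computing CONF and SUPP on the measure rather than on raw row counts reflects our reading of (1).

from dataclasses import dataclass

@dataclass
class Pattern:
    # A pattern in the sense of (1): an if branch of dimension
    # constraints and a then branch predicting a dimension value.
    if_conditions: dict   # e.g. {"Age": "15-20", "Sex": "F"}
    then_dimension: str   # e.g. "Favourite_Magazine"
    then_value: str       # e.g. "FASHION"
    measure: str          # e.g. "Annual_Expense"

def select_subset(rows, conditions):
    # The if branch acts as a filter selecting a subset of the data.
    return [r for r in rows if all(r[d] == v for d, v in conditions.items())]

def confidence_and_support(rows, p):
    # CONF and SUPP of (1), here computed on the given measure rather
    # than on row counts (each row is weighted by its measure value).
    total = sum(r[p.measure] for r in rows)
    subset = select_subset(rows, p.if_conditions)
    covered = sum(r[p.measure] for r in subset)
    hit = sum(r[p.measure] for r in subset if r[p.then_dimension] == p.then_value)
    conf = hit / covered if covered else 0.0
    supp = hit / total if total else 0.0
    return conf, supp

On a table of rows matching example (1), confidence_and_support would return (0.60, 0.03) for that pattern.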
3 From patterns to data

The central concept we want to capture in a pattern representation is its regularity, in the sense that the constancy found in the data (and expressed by the rule) can be applied to new, unseen data, obtaining the same results. But the representation of a pattern as a simple if-then rule, with conditions in both the if and then branches, is not able to capture most of this regularity. We also believe that capturing this regularity is crucial for the definition of an effective interestingness measure. Our first step, in fact, is to take into account the whole distribution of the dimension involved in the then branch of the pattern, as we argue that it is better able to capture the effect of the variable on the phenomenon under analysis. To clarify this concept, let us consider the simple pattern already shown in (1). This pattern can be plotted with the variable appearing in its then branch on the horizontal axis, and the measure upon which confidence and support have been computed on the vertical one (Figure 1). This distribution is subject to the conditions expressed by the if branch of the pattern, which act as a filter.

[Figure 1: Representation of a pattern in terms of a frequency distribution. Favourite_Magazine (MUSIC, FASHION, SPORTS, FRIENDS) on the horizontal axis, Annual_Expense on the vertical axis, filtered by the if branch of pattern (1).]

The given representation opens the door to all the statistical techniques based on the comparison of distributions and tests of hypotheses. As a matter of fact, the suggested change of perspective forces us to see the pattern through the comparison of two distributions (Figure 2), rather than just one.

[Figure 2: Representation of a pattern through the comparison of distributions. Left: the unfiltered distribution of Favourite_Magazine vs. Annual_Expense; right: the same distribution under the if conditions of pattern (1), as in Figure 1.]

With reference to Figure 2, the first chart is obtained by simply plotting the variable in the then branch of the pattern (without the if filter), while the second is that already shown in Figure 1. With this representation in mind, it is plausible that a significant difference in the shape of these distributions should be taken into account in evaluating the interestingness of the given pattern. The visual approach described still leaves a number of open questions. With reference to Figure 2, it is worth noting that without the consideration of the whole distribution there would be no information on the following issues:

1. the dispersal of the subset of the population that, while satisfying the conditions in the if branch, does not fall into the predicted value of the then branch (this population is shown as crossed bars in Figure 2);
2. whether there are any other significant peaks, given the same conditions in the if branch;
3. the relative importance of the peak signaled by the then branch with respect to the others;
4. the relative change in the importance of the signaled peak obtained by imposing the conditions in the if branch of the pattern.

These questions are especially important in dealing with weak patterns, i.e. those patterns having low confidence, which are typically more interesting than others [1], [10].
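As a concrete illustration of this change of perspective, the following sketch (reusing Pattern and select_subset from the previous sketch; the function names are again our own) builds the two distributions of Figure 2: the reference distribution over the whole population and the filtered one under the if conditions. Normalizing both makes their shapes directly comparable.

from collections import defaultdict

def distribution(rows, dimension, measure):
    # Frequency distribution of a dimension with respect to a measure:
    # for each category, the total of the measure over the matching rows.
    dist = defaultdict(float)
    for r in rows:
        dist[r[dimension]] += r[measure]
    return dict(dist)

def normalize(dist):
    # Scale the distribution so its values sum to 1, making the shapes
    # of the reference and the local pattern directly comparable.
    total = sum(dist.values())
    return {c: v / total for c, v in dist.items()} if total else dist

def reference_and_local(rows, p):
    # Reference pattern: the whole population (no if filter).
    # Local pattern: the same variable under the if branch conditions.
    ref = normalize(distribution(rows, p.then_dimension, p.measure))
    loc = normalize(distribution(select_subset(rows, p.if_conditions),
                                 p.then_dimension, p.measure))
    return ref, loc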
In light of what has been discussed, in the next section we show an approach able to utilize all this information, taking advantage of the suggested representation in order to best evaluate the interestingness of patterns.

4 Interestingness evaluation

From the discussion of the previous section we can conclude that a pattern is interesting if it is different with respect to a known distribution; or rather: the more different the two distributions, the more interesting the pattern. It should be noted that interestingness depends on the whole shape of the distributions, not simply on single statistics such as the mean or median. With this conclusion in mind, it is easy to see that our definition of pattern is suitable for every task: it suffices to always consider a pattern with respect to an expected distribution (henceforth a reference pattern). So, if the reference pattern is the distribution of a variable over the whole data, we call local pattern the distribution of the same variable obtained by selecting fewer data from the database, i.e. by imposing stronger constraints. Analogously, if the reference pattern is a pattern found some time before, the distribution of the same variable under the same constraints is defined as a temporal pattern. Here we concentrate on the questions raised in the previous section, suggesting a way to use those insights for an effective interestingness evaluation and visualization methodology. Our claim is that a crucial factor to be taken into consideration when comparing distributions is the relative ordering (in terms of frequency) of the categories. Let us consider two frequency distributions: a reference pattern and, say, a local pattern. The unexpectedness of the local pattern should be somewhat related to the dissimilarity between its shape and that of the reference pattern. However, both shapes strongly depend on the ordering of the x-axis values, which is completely arbitrary because the values are the categories of a nominal attribute (dimension) and thus intrinsically unordered. In Figure 3 we show how it is possible to use the information of the relative ordering in order to best capture a pattern's novelty or unexpectedness. To clarify this point, suppose that the most relevant peak in the reference population has become, by imposing the pattern's conditions, the least relevant one; and, vice versa, an insignificant peak in the reference population is now the main factor. In these situations it would be harmful not to consider the information pertaining to the relative orderings!

[Figure 3: Influence of relative ordering on interestingness evaluation. Both charts plot Favourite_Magazine with the categories ordered (MUSIC, SPORTS, FRIENDS, FASHION) so that the reference pattern is monotonic; left: the reference pattern, right: pattern (1).]

Therefore, we suggest imposing on the dimension axis an ordering induced by the measure itself: with reference to Figure 3, the best ordering is obtained when the reference pattern is monotonic. In other words, our proposal is to define an ordering that can be considered "natural", so that all patterns are compared on the imposed ordering. In fact, the relative ordering of a data distribution, under the conditions imposed by the if branch of the pattern, represents a crucial side of the intrinsic regularity of the phenomenon. Also, by comparing this ordering with the original relative ordering (obtained without imposing the if conditions), it is possible to capitalize on the information stored within the noise (the data not following the rule).
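The measure-induced ordering can be computed directly from the reference distribution. The sketch below (again with hypothetical names, building on the earlier helpers, and assuming both distributions share the same category set) derives the "natural" ordering and measures how far each category's rank moves once the if conditions are imposed; a full reversal, as in Figure 3, yields the largest shifts.

def natural_ordering(reference):
    # Order the categories so that the reference pattern is monotonic
    # (ascending frequency): the "natural" ordering against which all
    # patterns are compared.
    return sorted(reference, key=reference.get)

def rank_shifts(reference, local):
    # How far each category moves when the if conditions are imposed:
    # its rank in the natural ordering vs. its rank in the ordering
    # induced by the local pattern's own frequencies.
    ref_rank = {c: i for i, c in enumerate(natural_ordering(reference))}
    loc_rank = {c: i for i, c in enumerate(
        sorted(reference, key=lambda cat: local.get(cat, 0.0)))}
    return {c: loc_rank[c] - ref_rank[c] for c in reference}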
4.1 Analytical discussion

In this subsection we suggest some possible analytical expressions based on the given insights. Our approach can be seen as the result of two main contributions. First, for each category, the difference between the y-values is to be taken into account, in order not to neglect the whole shape of the distributions. Many analytical formulas could be used, e.g. the chi-squared test or the area of the intersection between the two (normalized) distributions. The final decision on which technique to use depends on the chosen overall elaboration strategy. The second main contribution we consider is the difference in the relative orderings of the distributions. Given a dimension of cardinality m, there are m! possible permutations of that distribution, leading to different scenarios. To avoid heavy computations, we propose to calculate the difference between the medians of the two distributions, normalized so as to range between 0 (if no change has occurred) and 1 (if the ordering has completely reversed). The two contributions outlined can be combined to evaluate a pattern's interestingness from a visual (and objective) point of view:

I(P) = c_1 I_C + c_2 M_D    (2)

where:
- I is the interestingness of the pattern P;
- I_C is the complement of the intersection between the normalized distribution areas;
- M_D is the distance between the two medians;
- c_1, c_2 are normalizing coefficients.

Formula (2) takes values in the interval (0, 1). Values toward zero, denoting a completely expected pattern, are obtained when the distribution matches the reference pattern: in such a case, I_C = 0 and also M_D = 0. It is worth noting that, by adding the support as a further term in (2), it could be possible to better target different problems, as this contribution should balance (2)'s tendency to favor the discovery of very specific and deviating behaviors.
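One possible reading of (2), building on the earlier helpers: I_C as one minus the overlap of the two normalized distributions, and M_D as the distance between the median categories measured along the natural ordering. The paper leaves the exact normalization open, so the following is a sketch under those assumptions, not the system's actual formula.

def intersection_complement(ref, loc):
    # I_C: one minus the area shared by the two normalized
    # distributions; both sum to 1, so I_C lies in [0, 1].
    cats = set(ref) | set(loc)
    return 1.0 - sum(min(ref.get(c, 0.0), loc.get(c, 0.0)) for c in cats)

def median_distance(ref, loc):
    # M_D: distance between the median categories of the two
    # distributions, measured along the natural ordering and scaled
    # to [0, 1]: 0 if nothing changed, 1 if the ordering reversed.
    order = natural_ordering(ref)
    pos = {c: i for i, c in enumerate(order)}

    def median_pos(dist):
        total, acc = sum(dist.values()), 0.0
        for c in order:
            acc += dist.get(c, 0.0)
            if acc >= total / 2:
                return pos[c]
        return len(order) - 1

    span = len(order) - 1
    return abs(median_pos(ref) - median_pos(loc)) / span if span else 0.0

def interestingness(ref, loc, c1=0.5, c2=0.5):
    # Formula (2): I(P) = c1 * I_C + c2 * M_D, with c1 + c2 = 1
    # so that the result stays in (0, 1).
    return c1 * intersection_complement(ref, loc) + c2 * median_distance(ref, loc)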
5 Supervised learning of users' interestingness

In this section we outline an approach to the evaluation of the subjective traits of interestingness, based on a targeted supervised training of the system obtained by means of a visual evaluation of patterns. This technique has been adopted in our system, named INFORMATION INVENTOR, yielding significant results. The interestingness measure shown in (2) is based on an objective approach. However, as pointed out by [7], the concept of interestingness also has a subjective nature, and this fact should be taken into account. Being aware of that, our approach considers a further step in which the visual interestingness evaluation previously described is subjectivized, i.e. customized to users' specific needs. To obtain this behavior, the system uses a supervised self-learning approach, with interestingness itself as the estimated target function and the analyzed (and approved) patterns acting as the training set. Customization of the interestingness function is obtained through the following steps:

1) for each pattern, on the basis of the predefined criteria, the basic interestingness contribution I_o is computed:

   I_o(U_o, P) = Σ_{k=1}^{N} c_k^o F_k^o    (3)

   where:
   - I_o is the basic Interest Rate (IR);
   - U_o is the generic user;
   - P is the pattern;
   - F_k^o are the features taken into account (compare with (2));
   - c_k^o are normalization coefficients;

2) the scored patterns are shown to the user in the described manner, so that he can decide, for each of them, whether to leave the default IR unchanged or to modify it;

3) for each pattern the user selects for modification, some near misses (i.e. similar patterns [13]) are also shown for validation;

4) in this way a training set is obtained, which can be submitted to the supervised component (in our system a back-propagation neural network), which estimates the function I_i(U_i) by learning its coefficients:

   I_i(U_i, P) = Σ_{k=1}^{N} c_k^i F_k^i    (4)

   where:
   - I_i is the customized Interest Rate (for user U_i);
   - U_i is the current user;
   - P is the pattern;
   - F_k^i are the features taken into account (compare with (2), (3));
   - c_k^i are the coefficients estimated by the supervised component;

5) in each successive rating, the system uses a compound interestingness function (sketched in code below):

   I(U_i, P) = c_1 I_o(U_o, P) + c_2 I_i(U_i, P)    (5)

   where the coefficients c_1 and c_2 can be used to emphasize one of the two contributions.
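A minimal sketch of steps (3)-(5). The paper specifies a back-propagation neural network as the supervised component; here a least-squares linear fit stands in for it, purely to keep the example self-contained, and all function names are hypothetical.

import numpy as np

def basic_ir(features, c_o):
    # (3): weighted sum of the pattern's features with fixed,
    # predefined normalization coefficients c_k^o.
    return float(np.dot(c_o, features))

def fit_user_coefficients(training_features, user_ratings):
    # (4): estimate the per-user coefficients c_k^i from the patterns
    # the user rated (the training set, including the near misses).
    X = np.asarray(training_features, dtype=float)  # one row per rated pattern
    y = np.asarray(user_ratings, dtype=float)       # the user's corrected IRs
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coeffs

def compound_ir(features, c_o, c_i, c1=0.5, c2=0.5):
    # (5): blend of the objective IR (3) and the customized IR (4);
    # c1 and c2 emphasize one contribution or the other.
    return c1 * basic_ir(features, c_o) + c2 * float(np.dot(c_i, features))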
6 System at work

This section reports results we have obtained from our system at work in a telecommunication fraud detection project. Effective fraud management approaches require the right combination of supervised and unsupervised techniques. In a project developed jointly with a leading Italian telecommunication provider, our system was used to detect suspected fraudulent behaviors to be further analyzed by human experts. In fact, our main concern was to quickly find new kinds of fraudulent behaviors, not recognizable by means of a supervised learning approach. Before our involvement started, OLAP tools and trigger facilities were used to discover deviating user behaviors. The availability of such tools and the presence of a ready-to-use data mart greatly facilitated our work. In order to find potentially fraudulent behaviors, the analysis was done at the user profile level: given some dimensions (e.g. CALLING_DISTRICT) and some measures (e.g. CALL_DURATION), the data was grouped by the CALLER_ID field. This way, each signaled pattern would emphasize a significant change in a user's behavior. Of course, the reference patterns were not only global behaviors, but also well-known typical fraud patterns. Due to data privacy policy, we cannot give full details about the system architecture and the obtained results; it suffices to outline one of the performance improvements obtained by means of the described approach, with regard to the subsystem we were involved in. The main improvement to be ascribed to the use of our system is without doubt the reduction of the Mean Fraud Detection Time (MFDT) for those behaviors properly triggered as potentially fraudulent: MFDT was reduced by 50%. As the total number of CDRs (Call Data Records) processed by our subsystem was up to 5 × 10^6 per day, this result had a significant impact on the overall system performance.

7 Conclusion

In this paper we proposed an approach to evaluating interestingness based on visually driven insights. From the OLAP world we borrowed its basic conception of data analysis as the visual inspection of graphical reports representing the distribution of dimensions with respect to measures, under given constraints. Paralleling this approach with that of pattern discovery, we addressed the problem of interestingness evaluation by exploiting the information carried by the relative ordering of a frequency distribution. In addition, to deal with the subjective traits of interestingness, we showed an innovative approach based on a targeted supervised training of the system. This technique has been adopted in our system INFORMATION INVENTOR, yielding significant results, outlined in the last section of the paper.

References

[1] Piatetsky-Shapiro, G. & Matheus, C., The interestingness of deviations. KDD-94, 1994.
[2] Klemettinen, M., Mannila, H., Ronkainen, P., Toivonen, H. & Verkamo, A.I., Finding interesting rules from large sets of discovered association rules. CIKM-94, pp. 401-407, 1994.
[3] Silberschatz, A. & Tuzhilin, A., What makes patterns interesting in knowledge discovery systems. IEEE Trans. on Knowledge and Data Engineering 8(6), 1996.
[4] Freitas, A.A., On rule interestingness measures. Knowledge-Based Systems 12, pp. 309-315, 1999.
[5] Piatetsky-Shapiro, G., Discovery, analysis, and presentation of strong rules. Knowledge Discovery in Databases, eds. G. Piatetsky-Shapiro and W. Frawley, AAAI/MIT Press, 1991.
[6] Frawley, W., Piatetsky-Shapiro, G. & Matheus, C., Knowledge discovery in databases: an overview. Knowledge Discovery in Databases, eds. G. Piatetsky-Shapiro and W. Frawley, AAAI/MIT Press, 1991.
[7] Silberschatz, A. & Tuzhilin, A., On subjective measures of interestingness in knowledge discovery. Proc. of the 1st Int. Conf. on Knowledge Discovery and Data Mining, Montreal, pp. 275-281, 1995.
[8] Agrawal, R., Imielinski, T. & Swami, A., Mining association rules between sets of items in large databases. Proc. of the ACM SIGMOD Conference on Management of Data, pp. 207-216, 1993.
[9] Liu, B., Hsu, W. & Chen, S., Using general impressions to analyze discovered classification rules. Proc. of the 3rd International Conference on Knowledge Discovery and Data Mining, Newport Beach, California, USA, pp. 31-36, 1997.
[10] Liu, H., Lu, H., Feng, L. & Hussain, F., Efficient search of reliable exceptions. Proc. of the 3rd Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD-99), eds. N. Zhong and L. Zhou, Beijing, China, pp. 194-203, 1999.
[11] Kloesgen, W., Deviation analysis. Handbook of Data Mining and Knowledge Discovery, eds. W. Kloesgen and J. Zytkow, Oxford University Press, New York, 1999.
[12] Imielinski, T., Khachiyan, L. & Abdulghani, A., Cubegrades: generalizing association rules. Tech. Rep., Dept. of Computer Science, Rutgers Univ., Aug. 2000.
[13] Winston, P.H., Learning structural descriptions from examples. The Psychology of Computer Vision, ed. P.H. Winston, McGraw-Hill: New York, pp. 157-209, 1975.