Visualizing interestingness
Fabio Di Fiore
Research & Development Dept.
Information Inventor srl, Italy.
[email protected]
Abstract
Many recent techniques try to evaluate the interestingness of patterns
automatically extracted by data mining algorithms. In this paper we propose an
interestingness measure, suitable for visual analysis, which looks at the concept
of unexpectedness from a data-driven perspective, addressing both objective and
subjective issues.
The proposed approach shifts the focus from patterns to the data underlying
them, considered as the primary source of information: data can be more
functionally visualized than any artificial yet compact representation of it.
Starting from the well-known star schema model, we propose a new operational
definition of patterns, based on the visual evaluation of the distributions of
involved dimensions. Consequently we suggest a suitable scheme for evaluating
the interestingness of the patterns based on the given representation: we show
how to choose the best ordering of the categories of a dimension, strictly
dependent on the measure involved.
To cope with the subjective nature of interestingness, we propose a
supervised learning approach based on the same visually driven perspective, able
to customize interestingness with respect to users’ expectations. The last
section of the paper confirms our claims by outlining effective results we have
obtained from our system at work.
1 Introduction
Whatever business managers have to administer, nowadays they have to deal
with a huge amount of data, typically collected through past years of activity:
the more detailed the information, the more abundant the data. The basic
objective of data mining techniques is to extract useful patterns (i.e. actionable
knowledge) from these large databases.
State of the art systems, however, still discover too many redundant patterns. In
the past few years many attempts [1], [2], [3], [4] have been made to identify the
best criteria for evaluating the interestingness of patterns, as only few among
them are really relevant to the decision makers.
So far, two main classes of approaches to the evaluation of interestingness have
been identified: the techniques belonging to the first group are based on
objective measures of interestingness, the others on subjective ones.
Objective measures, such as confidence, support, strength, simplicity [5], [6],
[7], [8] are focused on the statistical strength of a pattern, while subjective
measures [2], [9], are based on the assumption that the interestingness of a
pattern strongly depends on the personal expectations of the analyst.
In this paper our main concern is to define an effective interestingness measure
suitable for visualization, so that the end users of the system can easily understand it.
Following [1], [10], [11], we consider interesting patterns those exceptions
found in subsets of a reference population, and the measure we use to evaluate
the interestingness of those exceptions is based on a comparison between values
in the reference population and values found in the given subsets. At the same
time, we’ll suggest a way to subjectivize the given visual interestingness, by
means of a supervised learning approach.
2 Pattern representation
Agrawal et al. [8] catalog Data Mining tasks in three main categories –
classification, association, and sequences. The most popular techniques for
classification tasks are neural networks and decision trees. Neural networks
encode knowledge as the weights of neurons, thus making it inaccessible: in this
case there is no explicit representation of a pattern. In the second approach the
classification rule is encoded in the decision tree itself (which is by definition a
set of rules); other situations exist where the task of data mining is to discover
association rules in the form of logical implications; then a pattern is itself a
rule. This is also the most usual case; indeed a pattern is often considered
(implicitly or explicitly) a rule [4], [9].
Here we’ll apply these concepts to the star schema framework, in order to
exploit its descriptive power. In the context of multidimensional data analysis,
also known as OLAP (On Line Analytical Processing), a model has been
developed which is able to capture the business analyst’s point of view. This
model, named the star schema, given its power and utility, is now the de facto
standard for the representation of any context for data analysis purposes.
The star schema represents the context under examination in terms of
dimensions and measures (nominal and numerical variables) and the strong
connection between this approach and the search for patterns in databases is
quite evident, as already emphasized by [12].
In fact, [12] suggest that in the context of OLAP the most useful analyses take
note of the values of the measures which correspond to specific given values for
the dimensions; and they put in parallel this information with that conveyed by
the patterns.
The search for patterns can be defined as the search for a conjunction of
conditions (i.e. constraints) able to select subsets of data in which interesting
(i.e. unexpected) values for measures are found.
The approach just described enables a classical representation of patterns. An
example of pattern is the following:
IF Age = “15-20” AND Sex = “F”
THEN Favourite_Magazine = “FASHION”,
CONF = 60%, SUPP = 3% (Annual_Expense)    (1)

where:
- Age, Sex and Favourite_Magazine are dimensions;
- “15-20”, “F” and “FASHION” are values for the respective dimensions;
- CONF is the confidence and SUPP is the support, both computed on the given measure (Annual_Expense).
It is worth noting that (1) shows a pattern with a condition expressed over a
dimension in its then branch; even if this is not the only possible case,
we’ll concentrate on this type of pattern for the rest of the paper.
In fact, the given representation can also be generalized to the case in which
measures are more than just one and conditions are also defined over measures.
From the star schema model we’ll borrow its strict separation between
dimensions and measures and the conception of interesting analyses as the
search for specific combinations of conditions defined over the dimensions.
Specifically, we’ll exploit the fact that every pattern of the type shown in (1) is
to be evaluated on a distribution of a dimension with respect to a measure (in
the example Favourite_Magazine vs. Annual_Expense).
Our main purpose is to find a representation of patterns that could simplify the
understanding of interestingness concepts by non-technical people. So in the
next sections we’ll describe a representation able to be easily visualized. In fact,
our first step will be to shift attention from the rule representation to the data
underlying it.
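To make pattern (1) concrete, the following sketch (ours, not from the paper; all field names and figures are illustrative) computes the confidence and support of such a rule on a measure, since CONF and SUPP in (1) are computed on Annual_Expense rather than on record counts:

```python
# Hypothetical sketch: CONF and SUPP of a rule like (1), computed on a measure
# (Annual_Expense) rather than on record counts. Field names are illustrative.

def rule_strength(records, if_cond, then_cond, measure):
    """Return (confidence, support) of IF if_cond THEN then_cond on `measure`."""
    total = sum(r[measure] for r in records)
    covered = [r for r in records if if_cond(r)]          # rows passing the if branch
    covered_sum = sum(r[measure] for r in covered)
    matched_sum = sum(r[measure] for r in covered if then_cond(r))
    conf = matched_sum / covered_sum if covered_sum else 0.0
    supp = matched_sum / total if total else 0.0
    return conf, supp

records = [
    {"Age": "15-20", "Sex": "F", "Favourite_Magazine": "FASHION", "Annual_Expense": 60},
    {"Age": "15-20", "Sex": "F", "Favourite_Magazine": "MUSIC",   "Annual_Expense": 40},
    {"Age": "21-30", "Sex": "M", "Favourite_Magazine": "SPORTS",  "Annual_Expense": 100},
]

conf, supp = rule_strength(
    records,
    if_cond=lambda r: r["Age"] == "15-20" and r["Sex"] == "F",
    then_cond=lambda r: r["Favourite_Magazine"] == "FASHION",
    measure="Annual_Expense",
)
# conf = 60/100 = 0.6, supp = 60/200 = 0.3
```

Weighting by the measure, instead of counting rows, is what ties the rule to the star schema: the same code counts rows if the measure column holds a constant 1.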
3 From patterns to data
The central concept we want to capture in a pattern representation is its
regularity, in the sense that the constancy found in data (and expressed by the
rule) can be applied to new, unseen data, obtaining the same results.
But the representation of a pattern in terms of a simple if-then rule, with
conditions in both if and then branches, isn’t able to capture most of this
regularity. We also believe that capturing this regularity is crucial for the
definition of an effective interestingness measure. In fact, our first step is to take
into account the whole distribution of the dimension involved in the then branch
of the pattern, as we argue that it’s able to better capture the effect of the
variable on the phenomenon under analysis.
To clarify this concept, let’s consider the simple pattern already shown in (1).
This pattern can be plotted with the variable appearing in its then branch on the
horizontal axis, and the measure upon which confidence and support have been
computed on the vertical one (Figure 1). This distribution is subject to the
conditions expressed by the if branch of the pattern, which act as a filter.
[Figure 1 here: bar chart of Annual_Expense (y-axis, 0–100) by Favourite_Magazine (x-axis: MUSIC, FASHION, SPORTS, FRIENDS), under the pattern IF Age = “15-20” AND Sex = “F” THEN Favourite_Magazine = “FASHION”, CONF = 60%, SUPP = 3%.]
Figure 1: Representation of a pattern in terms of frequency distribution.
The given representation opens the door to all the statistical techniques based
on the comparison of distributions and tests of hypotheses. As a matter of fact,
the suggested change of perspective forces us to see the pattern through the
comparison of two distributions (Figure 2), rather than just one.
[Figure 2 here: two bar charts over the same categories (SPORTS, FASHION, MUSIC, FRIENDS); the left chart shows the unconditioned distribution, the right chart the distribution under IF Age = “15-20” AND Sex = “F” THEN Favourite_Magazine = “FASHION”, CONF = 60%, SUPP = 3%.]
Figure 2: Representation of a pattern through the comparison of distributions.
With reference to Figure 2, the first chart is obtained by simply plotting the
variable in the then branch of the pattern (without the if filter), while the second
is that already shown in Figure 1. With this representation in mind, it’s
plausible that a significant difference in the shape of these distributions should
be taken into account for the evaluation of the interestingness of the given
pattern.
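The two charts of Figure 2 come straight from the data: one pass over the records without the if filter yields the reference distribution, one pass with it yields the conditioned one. A minimal sketch (ours; records and field names are illustrative):

```python
from collections import defaultdict

def distribution(records, dim, measure, cond=None):
    """Total of `measure` per category of `dim`, optionally under an if-filter."""
    dist = defaultdict(float)
    for r in records:
        if cond is None or cond(r):
            dist[r[dim]] += r[measure]
    return dict(dist)

records = [
    {"Age": "15-20", "Sex": "F", "Favourite_Magazine": "FASHION", "Annual_Expense": 60},
    {"Age": "15-20", "Sex": "F", "Favourite_Magazine": "MUSIC",   "Annual_Expense": 40},
    {"Age": "21-30", "Sex": "M", "Favourite_Magazine": "SPORTS",  "Annual_Expense": 100},
]

# First chart: distribution of the then-branch variable, no if filter
reference = distribution(records, "Favourite_Magazine", "Annual_Expense")

# Second chart: same distribution under the if branch of the pattern
local = distribution(records, "Favourite_Magazine", "Annual_Expense",
                     cond=lambda r: r["Age"] == "15-20" and r["Sex"] == "F")
```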
The visual approach described still leaves a number of open questions. With
reference to Figure 2, it is worth noting that without the consideration of the
whole distribution there would be no information on the following issues:
1. the dispersal of the subset of the population that, still applying the conditions in the if branch, does not fall into the predicted value of the then branch (this population is given in crossed bars in Fig. 2);
2. if there are any other significant peaks, given the same conditions in the if branch;
3. the relative importance of the peak signaled by the then branch with respect to the others;
4. the relative change in the importance of the signaled peak obtained imposing the conditions in the if branch of the pattern.
Those questions are especially important in dealing with weak patterns, i.e.
those patterns having low confidence, typically more interesting than others [1],
[10].
In light of what has been discussed, in the next section we’ll show an approach
able to utilize all this information, taking advantage of the suggested
representation in order to best evaluate the interestingness of the patterns.
4 Interestingness evaluation
From the discussion of the previous section we can conclude that a pattern is
interesting if it is different with respect to a known distribution; or rather: the
more different the two distributions, the more interesting the pattern. It should
be noted that interestingness depends on the whole shape of the distributions,
not simply on single statistics, such as mean, median, etc.
With this conclusion in mind, it is easy to see that our definition of
pattern can be applied to every task: it suffices to always consider a
pattern with respect to an expected distribution (henceforth a reference pattern).
So, if the reference pattern is the distribution of a variable over the whole data,
we call local pattern the distribution of the same variable obtained selecting
fewer data from the database, i.e. imposing stronger constraints. Analogously, if
the reference pattern is a pattern found some time before, the distribution of
the same variable with the same constraints is defined as a temporal pattern.
Here we want to concentrate on the questions raised in the previous section,
suggesting a way to use those insights for an effective interestingness evaluation
and visualization methodology. Our guess is that a crucial factor to be taken
into consideration when dealing with the comparison of distributions is the
relative ordering (in terms of frequency) of the categories.
Let us consider two frequency distributions – a reference pattern and, say, a
local pattern. The unexpectedness of the local pattern should be somewhat
related to the dissimilarity between its shape and that of the reference pattern.
However, both shapes strongly depend on the ordering of the x-axis values,
completely arbitrary because values are the categories of a nominal attribute
(dimension), so intrinsically unordered.
In Figure 3 we show how it is possible to use the information of the relative
ordering in order to best capture a pattern’s novelty or unexpectedness. To
clarify this point, suppose that the most relevant peak in the reference
population has become, by imposing pattern’s conditions, the least relevant one;
and vice versa, an insignificant peak in the reference population is now the
main factor. In these situations it would be harmful not to consider the
information pertaining to the relative orderings!
[Figure 3 here: two bar charts over the same categories (MUSIC, SPORTS, FRIENDS, FASHION), comparing the reference pattern with the distribution under IF Age = “15-20” AND Sex = “F” THEN Favourite_Magazine = “FASHION”, CONF = 60%, SUPP = 3%; the relative ordering of the peaks changes between the two.]
Figure 3: Influence of relative ordering on interestingness evaluation.
Therefore, we suggest imposing on the dimension axis an ordering induced by
the measure itself: with reference to Figure 3, the best ordering is obtained
when the reference pattern is monotonic.
In other words, our proposal is to define an ordering that can be considered
“natural”, so that all the patterns should be compared on the imposed ordering.
In fact, the relative ordering of a data distribution, under the conditions imposed
by the if branch of the pattern, represents a crucial side of the intrinsic
regularity of the phenomenon. Also, by comparing this ordering with the
original relative ordering (without imposing the if conditions) it’s possible to
capitalize information stored within noise (data not following the rule).
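The ordering rule above can be sketched in a few lines: sort the categories by the reference pattern's values so that the reference distribution becomes monotonic, then replot every pattern on that imposed ordering (the distributions below are illustrative, not from the paper):

```python
def natural_order(reference):
    """Order categories so the reference distribution is monotonically decreasing."""
    return sorted(reference, key=reference.get, reverse=True)

# Illustrative reference and local patterns (category -> measure total)
reference = {"SPORTS": 80, "MUSIC": 60, "FRIENDS": 30, "FASHION": 10}
local     = {"SPORTS": 10, "MUSIC": 20, "FRIENDS": 15, "FASHION": 70}

order = natural_order(reference)
print(order)                      # → ['SPORTS', 'MUSIC', 'FRIENDS', 'FASHION']
print([local[c] for c in order])  # → [10, 20, 15, 70]: the main peak moved to the last position
```

On the imposed ordering, the reference pattern is monotonic by construction, so any departure from monotonicity in the local pattern is immediately visible.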
4.1 Analytical discussion
In this subsection we want to suggest some possible analytical expressions based
on the given insights. Our approach can be seen as the result of two main
contributions. First, for each category, the difference between y-values is to be
taken into account, in order not to neglect the whole shape of the distributions.
Many analytical formulas could be used, e.g. the chi-squared test or the area of
the intersection between the two (normalized) distributions. The final decision
on which technique to use depends on the chosen overall elaboration strategy.
The second main contribution we consider is the difference in the relative
orderings of the distributions. Given a dimension of cardinality m, there are m!
possible orderings of its categories, each leading to a different scenario.
To avoid heavy computations, we propose to calculate the difference between
the medians of the two distributions, normalized so as to range between 0 (if
no change has occurred) and 1 (if the ordering has been completely reversed).
The two contributions outlined could be used to evaluate a pattern’s
interestingness, from a visual (and objective) point of view:
I(P) = c1·IC + c2·MD    (2)

where:
- I is the interestingness of the pattern P;
- IC is the complement of the intersection between the normalized distributions’ areas;
- MD is the distance between the two medians;
- c1, c2 are normalizing coefficients.
Formula (2) takes values in the interval (0,1). Values toward zero, denoting a
completely expected pattern, are obtained when the distribution perfectly matches
the reference pattern: in that case, IC = 0 and MD = 0.
It’s worth noting that by adding the support as a further term to (2), it
could be possible to better target different problems, as this contribution would
balance (2)’s tendency to favor the discovery of very specific and deviating
behaviors.
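Formula (2) can be sketched as follows. We use the intersection-area complement for IC and a rank-based distance between medians for MD; the exact normalizations (median category rank divided by m−1, equal weights c1 = c2 = 0.5) are our own illustrative assumptions, not choices stated in the paper:

```python
def normalize(dist):
    """Scale a distribution so its values sum to 1."""
    total = sum(dist.values())
    return {c: v / total for c, v in dist.items()}

def median_rank(norm_dist, order):
    """Rank (0..m-1) of the category at which cumulative frequency reaches 0.5."""
    cum = 0.0
    for i, c in enumerate(order):
        cum += norm_dist.get(c, 0.0)
        if cum >= 0.5:
            return i
    return len(order) - 1

def interestingness(reference, local, c1=0.5, c2=0.5):
    """I(P) = c1*IC + c2*MD, as in formula (2); weights are illustrative."""
    order = sorted(reference, key=reference.get, reverse=True)   # "natural" ordering
    ref_n, loc_n = normalize(reference), normalize(local)
    # IC: complement of the intersection area of the normalized distributions
    ic = 1.0 - sum(min(ref_n[c], loc_n.get(c, 0.0)) for c in order)
    # MD: median distance in ranks, normalized to [0, 1]
    md = abs(median_rank(ref_n, order) - median_rank(loc_n, order)) / (len(order) - 1)
    return c1 * ic + c2 * md

reference = {"SPORTS": 80, "MUSIC": 60, "FRIENDS": 30, "FASHION": 10}
local     = {"SPORTS": 10, "MUSIC": 20, "FRIENDS": 15, "FASHION": 70}
print(interestingness(reference, reference))  # ~0: a completely expected pattern
print(interestingness(reference, local))      # large: shape and ordering both differ
```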
5 Supervised learning of users’ interestingness
In this section we’ll outline an approach to the evaluation of the subjective traits
of interestingness, based on a targeted supervised training of the system
obtained by means of a visual evaluation of patterns. This technique has been
adopted in our system, named INFORMATION INVENTOR, yielding
significant results.
The interestingness measure shown in (2) is based on an objective approach.
However, as pointed out by [7], the concept of interestingness has also a
subjective nature, and this fact should be taken into account.
Aware of that, our approach adds a further step in which the visual
interestingness evaluation previously described is subjectivized, i.e. customized to
users’ specific needs.
To obtain this behavior, the system uses a supervised self-learning approach,
with the estimated target function being interestingness itself, and analyzed
(and approved) patterns acting as the training set.
Customization of interestingness function is obtained through the following
steps:
1) for each pattern, on the basis of the predefined criteria, the basic
interestingness contribution Io is computed:

Io(Uo, P) = Σ(k=1..N) cko · Fko    (3)

where:
- Io is the basic Interest Rate (IR);
- Uo is the generic user;
- P is the pattern;
- Fko are the features taken into account (compare to (2));
- cko are normalization coefficients;
2) the scored patterns are shown to the user in the described manner, so
that he can decide, for each of them, whether to leave the default IR
unchanged or to modify it;
3) for each pattern selected by the user to be modified, some near misses
(i.e. similar patterns [13]) are also shown to be validated;
4) this way a training set has been obtained, and can be submitted to the
supervised component (in our system a back-propagation neural
network), which will estimate the function Ii(Ui) by obtaining its
coefficients:

Ii(Ui, P) = Σ(k=1..N) cki · Fki    (4)

where:
- Ii is the customized Interest Rate (for user Ui);
- Ui is the current user;
- P is the pattern;
- Fki are the features taken into account (compare to (2), (3));
- cki are coefficients estimated by the supervised component;
5) in each successive rating, the system will use a compound
interestingness function:

I(Ui, P) = c1 · Io(Uo, P) + c2 · Ii(Ui, P)    (5)

where the coefficients c1 and c2 can be used to emphasize one of the
contributions.
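The five steps can be sketched end to end. Here a plain least-mean-squares (Widrow–Hoff) fit stands in for the back-propagation network of step 4, and the feature values, predefined coefficients, and user ratings are all illustrative:

```python
# Step 1: basic interestingness Io as a weighted sum of features, as in (3).
def score(features, coeffs):
    return sum(c * f for c, f in zip(coeffs, features))

# Features per pattern, e.g. [IC, MD, support] (compare to (2)); values illustrative.
patterns = [[0.8, 0.9, 0.02],
            [0.1, 0.0, 0.30],
            [0.6, 0.3, 0.05]]
c_o = [0.5, 0.4, 0.1]                            # predefined coefficients cko

default_ir = [score(p, c_o) for p in patterns]   # step 2: default IRs shown to the user
user_ir = [0.95, 0.05, 0.50]                     # steps 2-3: user-corrected ratings

# Step 4: estimate user-specific coefficients cki from the approved training set
# (LMS gradient descent, a simple stand-in for the back-propagation network).
def fit(X, y, lr=0.5, epochs=10000):
    c = [0.0] * len(X[0])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            err = score(xi, c) - yi
            for k, xk in enumerate(xi):
                c[k] -= lr * err * xk
    return c

c_i = fit(patterns, user_ir)

# Step 5: compound interestingness, as in (5).
def compound(features, c1=0.5, c2=0.5):
    return c1 * score(features, c_o) + c2 * score(features, c_i)
```

Near misses (step 3) would simply add rows to `patterns` and `user_ir` before fitting; any supervised regressor can replace the LMS loop.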
6 System at work
This section reports effective results we have obtained from our system at work
in a telecommunication fraud detection project.
Effective fraud management requires the right combination of supervised and
unsupervised techniques. In a project developed jointly with a
leading Italian telecommunication provider, our system was used to detect
suspected fraudulent behaviors to be further analyzed by human experts. In fact,
our main concern was to quickly find new kinds of fraudulent behaviors, not
recognizable by means of a supervised learning approach.
Before our involvement started, OLAP tools and trigger facilities were used in
order to discover deviating users’ behaviors. The availability of such tools and
the presence of a ready-to-use data mart greatly facilitated our work.
In order to find potentially fraudulent behaviors the analysis was done at user
profile level: given some dimensions (e.g. CALLING_DISTRICT) and some
measures (e.g. CALL_DURATION), the data was grouped by the CALLER_ID
field. This way each signaled pattern would have emphasized a significant
change in a user’s behavior. Of course the reference patterns were not only
global behaviors, but also well-known typical fraud patterns.
Due to data privacy policies, we cannot give full details about the system
architecture and the obtained results. It will suffice to outline one of the
performance improvements obtained by means of the described approach, with regard
to the subsystem we were involved in. The main improvement to be ascribed to the
use of our system is without doubt the reduction of the Mean Fraud Detection Time
(MFDT) for those behaviors properly triggered as potentially fraudulent (MFDT was
reduced by 50%). As the total number of CDRs (Call Data Records) processed by our
subsystem was up to 5×10^6 per day, this result had a significant impact on
overall system performance.
7 Conclusion
In this paper we proposed an approach for evaluating interestingness based on
visually driven insights. From the OLAP world we borrowed its basic
conception of data analysis as the visual inspection of graphical reports
representing the distribution of dimensions with respect to measures, under
given constraints. Paralleling this approach with that of pattern discovery, we
addressed the problem of interestingness evaluation by exploiting the
information carried by the relative ordering of a frequency distribution. In
addition, to deal with the subjective traits of interestingness, we showed an
innovative approach based on a targeted supervised training of the system. This
technique has been adopted in our system INFORMATION INVENTOR,
yielding significant results, outlined in the last section of the paper.
References
[1] Piatetsky-Shapiro, G. & Matheus, C., The interestingness of deviations.
KDD-94, 1994.
[2] Klemettinen, M., Mannila, H., Ronkainen, P., Toivonen, H. & Verkamo,
A.I., Finding interesting rules from large sets of discovered association
rules. CIKM-94, pp. 401-407, 1994.
[3] Silberschatz, A. & Tuzhilin, A., What makes patterns interesting in
knowledge discovery systems. IEEE Trans. on Know. and Data Eng. 8(6),
1996.
[4] Freitas, A.A., On rule interestingness measures. Knowledge-Based
Systems 12, pp. 309–315, 1999.
[5] Piatetsky-Shapiro, G., Discovery, analysis, and presentation of strong
rules. Knowledge Discovery in Databases, eds. G. Piatetsky-Shapiro and
W. Frawley, AAAI/MIT Press, 1991.
[6] Frawley, W., Piatetsky-Shapiro, G. & Matheus, C., Knowledge Discovery
in Databases: an overview. Knowledge Discovery in Databases, eds. G.
Piatetsky-Shapiro and W. Frawley, AAAI/MIT Press, 1991.
[7] Silberschatz, A. & Tuzhilin, A., On subjective measures of interestingness
in knowledge discovery. Proc. of the 1st Int. Conf. On Knowledge
Discovery and Data Mining, Montreal, pp. 275-281, 1995.
[8] Agrawal, R., Imielinski, T. & Swami, A., Mining Association Rules
Between Sets of Items in Large Databases. Proc. of the ACM SIGMOD
Conference on Management of Data, pp. 207-216, 1993.
[9] Liu, B., Hsu, W. & Chen, S., Using general impression to analyze
discovered classification rules. Proc. of the 3rd International Conference
on Knowledge Discovery and Data Mining, Newport Beach, California,
USA, pp. 31-36, 1997.
[10] Liu, H., Lu, H., Feng, L. & Hussain, F., Efficient search of reliable
exceptions. Proc of 3rd Pacific-Asia Conference on Knowledge Discovery
and Data Mining (PAKDD'99), eds. N. Zhong and L. Zhou, Beijing,
China, pp. 194-203, 1999.
[11] Kloesgen, W., Deviation analysis. eds. W. Kloesgen and J. Zytkow,
Handbook of Data Mining and Knowledge Discovery, Oxford University
Press, New York, 1999.
[12] Imielinski, T., Khachiyan, L. & Abdulghani, A., Cubegrades:
Generalizing association rules. Tech. Rep., Dept. Computer Science,
Rutgers Univ., Aug. 2000.
[13] Winston, P. H., Learning Structural Descriptions from Examples. The
Psychology of Computer Vision, ed. P. H. Winston, McGraw-Hill Book
Company: New York, pp. 157-209, 1975.