Mining Quantitative Association Rules in Practice
Building a working data mining system: architecture, workflow, and presentation of data mining results
DMITRY PALAGIN
Master of Science Thesis
Stockholm, Sweden 2009
Master’s Thesis in Computer Science (30 ECTS credits)
at the School of Computer Science and Engineering
Royal Institute of Technology year 2009
Supervisor at CSC was Stefan Nilsson
Examiner was Stefan Arnborg
TRITA-CSC-E 2009:005
ISRN-KTH/CSC/E--09/005--SE
ISSN-1653-5715
Royal Institute of Technology
School of Computer Science and Communication
KTH CSC
SE-100 44 Stockholm, Sweden
URL: www.csc.kth.se
Abstract
In this thesis, we present a practical framework for mining quantitative association rules and
give recommendations regarding presentation of the discovered rules.
A system for mining and working with quantitative association rules was designed. We suggest a general system architecture that covers a range of practical applications, distributes logically separate functionality across several independent modules, and scales well with the number of clients served by the system. Further, we motivate the choice of a suitable mining technique and present an alternative mechanism for generating rules from trend rule candidates.
We describe a workflow that allows users without a previous background in data mining to make effective use of mined rules in their everyday work.
Two models for assessing interestingness of rules are suggested. Impact is an interestingness
measure based on the difference between the observed average and the expected average as well
as on the number of data records behind the rule. Hotness is a user-specific interestingness value
combining statistical significance of the rule with collective intelligence of all users and personal preferences of the current user.
The segment-to-item technique for visualizing multiple rules in a compact and accessible format
was developed, as well as the metasegment-to-item approach that is a further generalization of
this idea. In addition, ways of using data visualization to facilitate analysis of individual rules
are described.
A working system based on these ideas was implemented and integrated into an industrial software product, proving that the developed techniques and designs are feasible in a practical application.
Sammanfattning (Swedish abstract, translated)
Title: Quantitative association rules in practice. Building a working data mining system: architecture, workflow, and presentation of results.
In this thesis we present a practical framework for mining quantitative association rules and give recommendations on how the discovered rules can be presented.
A system for mining and working with quantitative association rules has been developed. We propose a general system architecture that covers several practical areas of application, distributes logically separate functions across several independent modules, and scales with the number of clients served by the system. We then motivate the choice of a suitable mining technique and present an alternative mechanism for generating rules from trend rule candidates.
Further, we describe a workflow that enables users without prior background in data mining to make effective use of the mining results in their daily work.
Two models for assessing the relevance of rules are proposed. Impact is a relevance measure based on the difference between the observed and the expected average, as well as on the number of data records behind the rule. Hotness is a user-specific relevance measure that combines the statistical significance of the rule with the aggregated knowledge of all users and the individual preferences of the current user.
The segment-to-item technique, which can be used to visualize a number of rules in a compact and accessible format, has been developed, as has the metasegment-to-item technique, a more general extension of this idea. In addition, we describe various ways of using visualization as a means of analysing individual rules.
A working system based on these ideas has been implemented and integrated into a commercial software product, thereby demonstrating the feasibility of the presented techniques and models in practical applications.
Preface
I would like to express my gratitude to everybody who helped me directly or indirectly in the
course of the work presented in this paper. In particular, I want to thank (in alphabetical order):
· Al for keeping my email box filled up.
· Amit for making me feel welcome at the Red Cottage Inn.
· David for giving up his office so easily and for the wonderful stickers on his laptop.
· Den for knowing everything about chemistry, and for his research on the electronic structure and geometry of anionic copper clusters in particular.
· Elise for teaching me to drink coffee without milk and sugar, for letting me use the municipal grand piano, and for repeatedly picking me up from work late at night.
· Erik for patiently answering my countless configuration questions and for giving me shelter during my time as homeless in Oslo.
· Erin for showing me around.
· Erling for nearly magically patching a running application on a remote server before a workshop with an important client.
· Georg and Guro for giving me reasons to stay late at work and come to the office on weekends.
· Juan Pablo for the discussion regarding career.
· Lars Christian for lending me his guidebook to San Francisco.
· Lise for being a delightful shopping mate.
· Max for consistently beating me in table tennis and for cleaning out the office kitchen pipes with me on a beautiful summer Saturday (and for helping me out with this project as well).
· Michael for helping me out with the abstract in the nick of time.
· My family for all the inspiration and support.
· Øyvind for always having a solution to any problem that I thought would doom the whole project, as well as for telling me the story about the tourists and a hot dog.
· Rostik for taking me and Max out swimming.
· Rune for the brainstorming sessions and for his signature Chinese serve.
· Sarah for her energy and sprightly spirit.
· Snorre for teaching me the Trondheim dialect.
· Stefan for eagerly agreeing to be the academic supervisor of the project.
· Stephen and Hugh for “Jeeves and Wooster”.
· Vidar for saying “The data mining module is up and running... Cough...”
· West Records for being so selflessly inadequate.
Contents
1 Introduction
  1.1 Background
    1.1.1 Data mining
    1.1.2 Association rules
  1.2 Problem statement
    1.2.1 Purpose
    1.2.2 Limitations
  1.3 Report outline
2 Related work
  2.1 Data mining concepts
    2.1.1 Decision trees
    2.1.2 Clustering
    2.1.3 Naive Bayes classifier
    2.1.4 Artificial neural networks
    2.1.5 Nearest-neighbour techniques
    2.1.6 Support vector machines
    2.1.7 Collective intelligence
  2.2 Rule mining
    2.2.1 Association rules
    2.2.2 Apriori
    2.2.3 Quantitative association rules
  2.3 Rule interestingness
    2.3.1 Evaluating unexpectedness
    2.3.2 User-defined interestingness evaluator
    2.3.3 Using users’ interactive feedback
    2.3.4 Selecting an interestingness measure
    2.3.5 Using click-through data
  2.4 Visualization
    2.4.1 Two-dimensional matrix
    2.4.2 Directed graph
    2.4.3 Other techniques
    2.4.4 Data visualization
  2.5 Mining survey data
3 Design of the system
  3.1 General vision
  3.2 Choice of a mining technique
  3.3 System requirements
  3.4 Perceived complexity and workflow
  3.5 System architecture
    3.5.1 Application server
    3.5.2 Data mining engine
    3.5.3 Rule server
    3.5.4 Control module
    3.5.5 Discussion
4 Interestingness models
  4.1 Impact
  4.2 Pruning trend rules
  4.3 Diff vs. p-value
  4.4 Collective interestingness
  4.5 Personal interestingness
  4.6 Combined relevance
  4.7 Hotness
5 Rule visualization
  5.1 Displaying multiple rules
  5.2 Data visualization for individual rules
6 Results and discussion
  6.1 Simplified architecture
  6.2 User interface
  6.3 Similarity between rule attributes
  6.4 Comments mining
  6.5 Streamlined workflow
  6.6 Difference value in the GUI
  6.7 Interestingness
  6.8 Visualization
  6.9 General remarks
7 Conclusions
References
1 Introduction
The problem of analyzing large amounts of data to find hidden and previously unknown information is commonly referred to as data mining. Association rule mining is an unsupervised
form of data mining that deals with locating relations between items in a relational database.
There are numerous areas where this type of mining can be utilized. In this work, however, we
will focus on business applications of association rule mining, and specifically on mining customer satisfaction data.
Data warehousing in one form or another is today employed in most businesses, and the idea to
go beyond using the data for purely operational purposes and start learning from it is not only
logical, but also realistic as the amounts of gathered data increase. On the one hand, it is always
useful to gain new general insights into customer behaviour. On the other hand, profit can be
gained from being able to notice and react to specific changes in customer satisfaction early on.
The remainder of this chapter provides the necessary background on data mining, explains the
purpose of the current project, and gives an overview of the rest of the thesis.
1.1 Background
This section gives a short introduction to data mining in general and association rules
in particular in the extent necessary to understand the problem statement.
1.1.1 Data mining
Data mining is the exploration and analysis of large quantities of data in order to discover relevant information. The term “knowledge discovery in databases” is also used, and data mining in
this context is defined as “the science of extracting useful information from large data sets or
databases” [17]. Broad as this definition may sound, data mining employs a wide range of techniques and has a variety of practical applications.
In [8], Berry and Linoff suggest that many practical problems can be phrased in terms of the
following six major tasks that can be solved with data mining:
· Classification: examining the features of a presented object and assigning it to a class from a pre-defined set. The input of a classification task is a set of classes and a set of pre-classified objects (the so-called training set).
· Estimation: assigning objects values for some variable that is typically continuous. Examples include estimating a response probability, price, income, or the number of children in a family.
· Prediction: resembles classification and estimation, the major difference being that the objects are classified or estimated according to some predicted future behaviour or value.
· Affinity grouping: determining which things go together, for example which goods are bought together in a supermarket.
· Clustering: segmenting a heterogeneous input population into a number of more homogeneous subgroups. Clustering resembles classification, but it does not require predefined classes or training examples.
· Profiling: describing the data in a way that increases our understanding of the factors or processes behind it. This is useful because a good description of a phenomenon may often suggest an explanation for it as well.
Directed data mining attempts to explain, categorize or estimate a particular target field,
whereas undirected data mining works without the use of a specified target field or a predefined set of classes. Classification, estimation and prediction are examples of directed data
mining, while affinity grouping and clustering are undirected data mining tasks. Profiling can be
directed or undirected.
The terms supervised and unsupervised learning are related to directed and undirected mining.
Supervised learning uses example inputs and outputs to learn how to make predictions, while
the purpose of unsupervised learning techniques is to find structure within a set of data where
no one piece of data is the answer [30].
1.1.2 Association rules
A prototypical example of affinity grouping is market basket analysis in the retail business.
Market basket analysis is used for finding patterns in the purchase behaviour of customers by
looking at which products are purchased together. For example, this can be used to optimize
selling efforts by arranging such products together physically in a store or by presenting a selection of popular related items to the user who buys a product in an electronic store.
A typical output of market basket analysis can be stated in the form of a rule, for example: “People who buy product A also buy product B”. Such rules are called association rules and were introduced by Agrawal, Imielinski and Swami in [2]. In [3], Agrawal and Srikant developed the Apriori algorithm for mining this type of rule.
In [5], Aumann and Lindell introduced the concept of a quantitative association rule later labelled as impact rule in [35]. In an impact rule, the rule antecedent describes a subset of the
population, and the consequent consists of one or more quantitative attributes that have an unexpected mean value for this subset. For example:
Gender = Female => Mean wage = $7.90 (Overall mean wage = $9.02)
An extension of the Apriori algorithm capable of finding impact rules was suggested in [5].
For a more detailed survey of research within association rule mining, consult [6] and [16], as
well as sections 2.2 and 2.5.
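To make the impact-rule idea concrete, the following sketch shows how such a rule can be evaluated: the mean of a quantitative attribute within a segment is compared against the overall mean. The `impact_rule` helper and the toy data are our own illustrations, loosely mirroring the wage example above, not code from the thesis:

```python
# Hedged sketch: a hypothetical helper that evaluates an impact rule by
# comparing a segment's mean for a quantitative attribute against the
# overall mean over all records.
def impact_rule(records, condition, attribute):
    """Return (segment mean, overall mean, segment size)."""
    overall = [r[attribute] for r in records]
    segment = [r[attribute] for r in records if condition(r)]
    return (sum(segment) / len(segment),
            sum(overall) / len(overall),
            len(segment))

# Toy data loosely mirroring the example rule from the text.
people = [
    {"gender": "F", "wage": 7.50}, {"gender": "F", "wage": 8.30},
    {"gender": "M", "wage": 9.80}, {"gender": "M", "wage": 10.40},
]
seg_mean, all_mean, n = impact_rule(people, lambda r: r["gender"] == "F", "wage")
print(f"Gender = Female => mean wage = {seg_mean:.2f} "
      f"(overall mean wage = {all_mean:.2f}, based on {n} records)")
```

A rule generated this way is only interesting if the gap between the two means is large and the segment is big enough, which is exactly what the impact measure in Chapter 4 quantifies.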
1.2 Problem statement
This section outlines the purpose and scope of the current thesis. The focal points and
limitations of the project are listed in the corresponding subsections.
Data mining is generally regarded as an interactive process, requiring an analyst to repeatedly calibrate and run the mining software and then analyze and process the results. While association rule discovery is a relatively well-studied area, making the results useful is much less so. As a result, the considerable benefits of data mining are not directly available to end-users.
1.2.1 Purpose
The objective of this thesis is to develop a practical framework for mining quantitative association rules, and to give concrete recommendations on how to present the discovered rules in a useful and comprehensible way to end-users without a technical background in data mining. The ideas suggested in this thesis should be complete enough to enable the implementation of a functional data mining system.
The setting is assumed to be an application with a large and continuously growing database with
millions of records and a set of users regularly using the application on a daily or weekly basis.
More specifically, the database contains demographic, behavioural and satisfaction data about
clients, while users of the application represent the business on different levels. Target association rules are supposed to link client segments with satisfaction values.
We are interested in using the results obtained from data mining over such a database in the
application. Topics to study and the most immediate considerations are listed below.
1. Development of a sensible workflow in the abovementioned setting. Frequent users
could be given the ability to mark, hide and track rules. In particular, effort needs to be
put into preventing the same rules from being shown time after time to such users.
a. Provide a design for the recommended solution. The solution should be scalable
with both the number of users and the number of rules.
b. For a better presentation of results, a scheme for converting a rule to a regular
sentence should be suggested and suitable graphs used to visualize each rule.
2. System architecture for the whole association rule mining system. The suggested architecture should provide a solid foundation for a real implementation.
3. Whether and how explicit and implicit user feedback can be used to make results more interesting, for instance by hiding, emphasizing or reordering the rules that are shown.
4. Methods to display a set of association rules. A list of hundreds of rules in text format
may deter many a user. Perhaps a tree or some other graphical format can be used to
convey the same information, yet in a more user-friendly way.
1.2.2 Limitations
No matter how accurate a data mining system is and how intuitive the results are, they are of
little use unless the organization that uses the system is determined to continuously work with
these results and act accordingly. Methodology of adjusting the target organization to make
effective use of data mining is outside the scope of this thesis.
Data collection methods are not discussed here. We assume that the available data is suitable for
mining, is not significantly biased and represents the underlying data set well enough.
Privacy and ethical concerns associated with data mining are not taken into consideration in this
thesis. It is assumed that data for mining is collected with respect to applicable legislation.
Throughout the thesis, the discussion is kept on a general level wherever possible. However, the solution as a whole is primarily intended to be used in the setting described in the section Purpose. In particular, this thesis concerns quantitative association rules specifically, not all types of association rules.
1.3 Report outline
The rest of the document is organized as follows.
Chapter 2 introduces relevant existing work that influenced the design decisions in the current
project. It includes a summary of data mining concepts, techniques for association rule mining,
approaches to estimating interestingness of rules and known rule visualization techniques.
A system design for the suggested rule mining solution is presented in Chapter 3. A suitable
mining technique is chosen, important requirements on the target system are listed, and a workflow suitable in this setting is described.
Chapter 4 introduces and motivates two models for rule interestingness calculation. For a given rule, its impact is based on the difference between the observed average satisfaction and the expected satisfaction value, as well as on the number of responses the rule is based on. The so-called hotness is a user-specific value combining statistical significance of the rule with collective intelligence of all users and personal preferences of the current user.
In Chapter 5, the segment-to-item visualization scheme for presenting multiple rules in a compact and informative graphical format, as an alternative to a list of rules in text form, is introduced and analyzed. Also, data visualization for individual rules is discussed.
The solution implemented in practice is described in Chapter 6. Identified issues and possible
remedies are discussed. Among other topics, the use of the suggested interestingness measures
is discussed and ways to further improve the implemented visualization tools are presented.
Chapter 7 wraps up the thesis, compares the achieved results with the goals of the project and
points out directions for future work.
2 Related work
In the following sections, relevant existing work that influenced the design decisions in
the project is summarized.
The summary starts with a brief explanation of popular data mining concepts, such as decision
trees, clustering, naive Bayes classifier, nearest-neighbour methods, and support vector machines. The concept of collective intelligence is introduced and a model for applying it is outlined. Further, association rules, quantitative association rules and ways of mining them are
discussed, followed by a review of ways of measuring interestingness of association rules and a
survey of known rule visualization techniques. The use of click-through data for evaluating and learning retrieval functions is also reviewed. A separate subsection is dedicated to research on mining categorical-to-quantitative impact rules from survey data.
2.1 Data mining concepts
2.1.1 Decision trees
A decision tree is a structure that divides up a collection of observations into successively
smaller sets by applying simple decision rules. With each division, the sets become more homogeneous with respect to a predefined target variable. Figure 1 shows a simple decision tree that
classifies fruit type by colour, shape and other parameters. Thus, fruit type is the target variable
in this example.
[Figure 1 omitted in this transcript: a decision tree that classifies fruits (apple, grape, cherry, banana, lemon, grapefruit, watermelon) by successive yes/no tests on colour, shape, diameter and the presence of a stone.]
Figure 1. Example of a decision tree.
The target variable is usually categorical, and the tree is used either to calculate the probability that a given record belongs to a specific class, or to classify the record by assigning it to the most likely class. It is also possible to use decision trees to estimate values of continuous variables, although this is less common.
Decision trees need to be trained. To build a decision tree with respect to a given target variable
from a set of pre-classified examples, a recursive best-split approach is used. In each step, the
set is divided into two subsets such that a single class predominates in each group. When
searching for a binary split on a numeric input variable, each value that the variable takes in the training set can be treated as a candidate split value, and the final split takes the form X < a. When splitting on a categorical variable, it is typical either to split on a belongs-to-a-specific-class-or-not basis, or to group together classes that predict similar outcomes. Creating a new branch for each class is usually avoided, as it makes the tree more complicated and, since fewer training records are available at each node on the lower levels of the tree, often makes further splits less reliable.
There are several ways of choosing the best split. The Gini criterion is based on the probability
that two items chosen at random from the same population belong to the same class. The score
of a split is the sum of the scores of each child node multiplied by the proportion of records
belonging to it. Another popular criterion is entropy reduction produced by the split.
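By way of illustration, the two split criteria just described can be sketched as follows. This is a minimal sketch; the function names and toy labels are ours, not taken from the thesis or from any particular tree library:

```python
import math

def gini(labels):
    # Probability that two items drawn at random (with replacement) from
    # this node belong to the same class: sum of squared class proportions.
    n = len(labels)
    return sum((labels.count(c) / n) ** 2 for c in set(labels))

def entropy(labels):
    # Shannon entropy of the class distribution at a node.
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def split_score(left, right, purity=gini):
    # Score of a split: the purity of each child node weighted by the
    # proportion of records belonging to it, as described in the text.
    n = len(left) + len(right)
    return (len(left) / n) * purity(left) + (len(right) / n) * purity(right)

# A perfectly separating split scores higher than a non-informative one.
print(split_score(["a", "a"], ["b", "b"]))  # 1.0
print(split_score(["a", "b"], ["a", "b"]))  # 0.5
```

For the entropy criterion one would instead choose the split that most reduces the weighted child entropy relative to the parent node.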
A decision tree keeps growing as long as new splits can be found that separate the training set into increasingly pure subsets. The resulting tree can not only become very complex but also overfit the training set, which can increase the number of errors when new data sets are classified. To address this problem, the final tree is usually simplified through pruning, a process that eliminates the least stable splits in the tree. The pruning techniques used in CART and C5 are the most popular [8].
The strongest advantage of decision trees is that they are easy to read and understand, and it is
straightforward to motivate their predictions. Further, decision trees require relatively little data
preparation. There is no need to normalize data or to remove null values. Decision trees have
been used in a wide variety of applications, such as customer profiling, financial risk analysis,
assisted diagnosis, and traffic prediction to name a few. In practice, an automatically generated
decision tree is often used by domain experts to better understand the key factors behind the
studied phenomenon, and helps to further refine research focus [30]. Because decision trees are
a combination of data exploration and modelling, they can be used as a first step in the modelling process even when the final model will be built using other techniques.
See Berry and Linoff [8] for a deeper review of research on decision trees.
2.1.2 Clustering
Clustering essentially means partitioning data into groups of items that are similar to each other
in some way. To be able to cluster a data set it is necessary to have a similarity measure, or a
distance function returning a numerical distance value between two records.
Hierarchical clustering, also known as agglomerative clustering, is a simple clustering method.
It works by continuously merging the two most similar groups. Each of these groups starts as a
single item, and the closest groups are merged together until there is only one group left.
One of the most commonly used clustering algorithms is k-means clustering. The algorithm starts by randomly placing k centroids, i.e. points representing the centre of each cluster, and assigns each record to the nearest centroid. After that, each centroid is replaced with the average of all the points assigned to it, and the assignments are redone. This process repeats until the assignments stop changing.
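The k-means procedure described above can be sketched as follows. This is a simplified, deterministic illustration: a real implementation would choose random initial centroids (here we simply take the first k points) and handle empty clusters and restarts more carefully:

```python
def kmeans(points, k, iterations=100):
    """Minimal k-means sketch; points are tuples of floats."""
    centroids = list(points[:k])  # simplification: first k points as seeds
    for _ in range(iterations):
        # 1. Assign each point to the nearest centroid (squared distance).
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2
                                            for a, b in zip(p, centroids[i])))
            clusters[nearest].append(p)
        # 2. Replace each centroid with the mean of its assigned points.
        new_centroids = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:  # assignments stopped changing
            break
        centroids = new_centroids
    return centroids, clusters

# Two well-separated groups of points converge to two clusters.
points = [(1.0, 1.0), (1.2, 0.8), (8.0, 8.0), (8.2, 7.9)]
centroids, clusters = kmeans(points, k=2)
```

Note that the result depends on the initial centroids; production implementations typically run the algorithm several times with different seeds and keep the best clustering.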
Like decision trees, clustering can be used in order to gain an understanding of a given data set.
Unlike decision trees, clustering does not explain why data records belong together other than
that the distance between them is small enough. Also, decision trees maximize the leaf purity of
a target variable, while clustering only uses distance between data records.
A popular application of clustering is customer segmentation [8]. Other uses include similarity
search, pattern recognition, trend analysis and classification [1].
2.1.3 Naive Bayes classifier
The naive Bayes classifier is based on Bayes’ theorem and assumes that, given the class, the presence or absence of a feature is unrelated to the presence or absence of any other feature. More precisely, the probability that the input belongs to a specific class is calculated according to the formula:
p(C | F1, …, Fn) = p(C) × ∏_{i=1}^{n} p(Fi | C) / p(F1, …, Fn)
The classifier uses supervised learning. Among the advantages of the model is that it is simple
to understand, training is straightforward and can be incremental, and that the classification
process is fast. The downside is the assumption that all features are independent.
The naive Bayes classifier is easy to construct and it reportedly has surprisingly good performance in text classification [38], even though the conditional independence assumption is rarely
true in real-world applications. On the other hand, naive Bayes has been found to produce poor
probability estimates [7].
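The classification rule can be sketched directly from the formula. Since the denominator p(F1, …, Fn) is the same for every class, it can be ignored when comparing classes. The training-data layout and function names below are our own; real implementations add smoothing for unseen feature values.

```python
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """examples: list of (feature_tuple, class_label). Returns count tables."""
    class_counts = Counter(label for _, label in examples)
    feature_counts = defaultdict(Counter)      # (position, class) -> value counts
    for features, label in examples:
        for pos, value in enumerate(features):
            feature_counts[(pos, label)][value] += 1
    return class_counts, feature_counts, len(examples)

def classify(model, features):
    """Pick the class maximizing p(C) * prod_i p(Fi | C)."""
    class_counts, feature_counts, total = model
    best_label, best_score = None, -1.0
    for label, count in class_counts.items():
        score = count / total                  # p(C)
        for pos, value in enumerate(features):
            score *= feature_counts[(pos, label)][value] / count   # p(Fi | C)
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```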
2.1.4 Artificial neural networks
An artificial neural network is a computational model based on the biological model of how a
brain works. It consists of artificial neurons that are connected to each other by synapses. Typically, neural networks are adaptive, i.e. they change their internal structure based on the examples fed into the network during the learning phase.
There exist various kinds of artificial neural networks. The general idea is well illustrated by the
feed-forward multi-layer perceptron network. The basic structure of such a network is shown in
Figure 2. The network maps a layer of input variables to a layer of output variables using hidden
neuron layers. Each synapse has an associated weight. The outputs from one set of neurons are
fed to the next layer through the synapses. The higher the weight of the synapse leading from
one neuron to the next, the more influence it will have on the output of that neuron. Outputs
from the different synapses are typically combined using a sigmoid-type function. For example,
the combined input can be defined by the formula tanh(∑ xi × wi), where xi are the values at the
input synapses, and wi are the corresponding weights.
Figure 2. Basic structure of an artificial neural network: three input neurons feed a layer of four hidden neurons, which in turn feed two output neurons.
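A single forward pass through such a network, with tanh as the combination function, can be sketched as follows (the weight-matrix layout is an assumption of this sketch, and bias terms are omitted):

```python
import math

def forward(inputs, weights_hidden, weights_output):
    """One forward pass through a feed-forward network with one hidden layer.
    weights_hidden[j][i] is the weight of the synapse from input i to hidden
    neuron j; each neuron combines its inputs with tanh(sum(x_i * w_i))."""
    hidden = [math.tanh(sum(x * w for x, w in zip(inputs, row)))
              for row in weights_hidden]
    return [math.tanh(sum(h * w for h, w in zip(hidden, row)))
            for row in weights_output]
```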
Neural networks can be used as both a directed and undirected learning method [6]. With supervised learning, the network must be trained on a number of examples. This can for example be
done using back-propagation [30]. The input values are fed to the network, the outputs are compared with the output values from the example, and the weights are adjusted by going back in
the network to make the result close to the right output. It is important to understand that the
network will only be as good as the training set used to generate it, and that it needs to be retrained in order to keep it up-to-date and useful [8]. In the unsupervised setting, the network
adapts itself to the input data. This is used for clustering and dimensionality reduction. Tasks
that can be solved with supervised learning include classification and prediction problems, pattern recognition and function approximation [6].
Many examples of successful application of artificial neural networks are known [8], and the
concept is popular. One major issue with this technology is that it is difficult to explain the reasoning that happens inside the network. A trained neural network can be seen as a black box or
an oracle giving answers without explanation. In many business situations this may be considered a problem.
2.1.5 Nearest-neighbour techniques
Nearest-neighbour techniques are based on the concept of similarity. Another closely related
concept is memory-based reasoning, which is a nearest neighbour technique that uses analogous
examples from the past to make predictions about new situations, similarly to how humans approach new problems by using experiences and memories from the past. Memory-based reasoning can be used to solve various estimation and classification problems [8].
To be able to employ memory-based reasoning, it is necessary to have a distance function capable of determining distance between two records, and a combination function capable of bringing together results from several neighbours to produce a result for the given query.
For example, if a database of prices for various wine sorts, a way of calculating a distance between two wine sorts, and a way of combining prices of several wine sorts are available, it is
possible to predict prices for arbitrary sorts of wine. The quality of prediction will of course
depend on how well the database represents the space of the possible wine sorts as well as on
the quality of the two functions.
Memory-based reasoning is a highly adaptive technique. There is usually no need to explicitly
teach the system; simply incorporating new data into the historical database is enough. The
downside is that classifying new records can be resource-intensive as it typically requires processing all available historical data. Also, finding good distance and combination functions can
be difficult [30]. An important strength of memory-based reasoning is that the results it suggests
are easy to understand and interpret.
k-Nearest neighbours (frequently denoted as k-NN) is an example of such a technique. To make
a prediction for the input record, k nearest records from the historical database are found and the
average of their corresponding values is returned. Clearly, such a simple calculation can result
in low quality predictions when the nearest neighbours are far away from each other. Instead of averaging the values directly, they can be weighted using a weight function that is inversely proportional to the distance, leading to a calculation of the form:

( ∑ xi × w(di) ) / ( ∑ w(di) )

In the formula above, i iterates over the k nearest neighbours of the input record, xi is the value of the corresponding neighbour, and di is the distance from a neighbour to the input record. The
technique is illustrated in Figure 3.
Figure 3. Using 4-nearest neighbours to determine a target value: of six records x1, …, x6 at distances d1, …, d6 from the input record, the four nearest contribute to the prediction.
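The weighted prediction can be sketched as follows (the small epsilon guarding against division by a zero distance is our addition; the distance function is supplied by the caller):

```python
def knn_predict(history, query, k, distance):
    """history: list of (record, value) pairs. Returns the distance-weighted
    average of the values of the k nearest neighbours of the query."""
    neighbours = sorted(history, key=lambda rv: distance(rv[0], query))[:k]
    weights = [1.0 / (distance(r, query) + 1e-9) for r, _ in neighbours]
    return (sum(v * w for (_, v), w in zip(neighbours, weights))
            / sum(weights))
```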
2.1.6 Support vector machines
Support vector machines, or SVMs, are a popular technique for solving classification tasks. In
its basic form, an SVM takes a set of n-dimensional vectors, each of which belongs to one of
two predefined classes. Further, a hyperplane separating the vectors of the two classes is con-
8
2.1. DATA MINING CONCEPTS
structed, such that the margin between the two subsets is maximized. When that is achieved,
new vectors can be classified according to which side of the hyperplane they belong to. The
mathematics behind this is relatively simple and employs basic linear algebra and multi-variable
calculus (in particular, the Lagrange method for locating extreme points of a function is used).
It can be shown that the arising Lagrange optimization problem can be reformulated (the so-called Wolfe dual), and in the equivalent formulation it becomes clear that only the training
points that are closest to the optimal hyperplane are actually necessary, while the rest of the
points can be left out of the training set [11]. These points are called support vectors (see Figure
4), hence the term “support vector machine”.
Figure 4. The optimal hyperplane dividing two classes of vectors; the support vectors lie on the margin on either side.
The idea above applies to the simplest case – a linear support vector machine and separable data
(training set for which the optimal hyperplane can be found). In the case of non-separable training data, it is possible to relax the constraints of the optimization problem by introducing positive slack variables that account for the possible training errors [13].
In most practical situations, the decision function (the function deciding which of the two
classes an input vector belongs to) is not linear, i.e. the training data cannot be separated by a
simple hyperplane. In this case, it may be possible to map the training data to another space
where the two classes can be separated linearly.
The question remains how to find such a mapping. Another difficulty is that computations in a higher-dimensional space are more expensive. The observation that calculations in an SVM only involve dot products of the input vectors suggests that if a function of the
form K(xi, xj) = Φ(xi) ∙ Φ(xj) can be found, where Φ(x) is the mapping from the original space to
the new one, then the original dot products can be replaced by K(xi, xj) and the new SVM will
work roughly as fast as the original one without explicitly knowing what Φ(x) is. This is the
essence of the so-called kernel trick [10]. Finding an appropriate kernel function is difficult. In
practice, one of the well-studied kernel functions is typically chosen.
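The kernel trick is easy to verify on a small example. For two-dimensional vectors, the polynomial kernel K(x, y) = (x · y)² corresponds to the explicit mapping Φ(x) = (x1², √2·x1·x2, x2²), so the dot product in the three-dimensional space can be computed without ever constructing Φ(x). (The helper names below are ours.)

```python
import math

def phi(x):
    """Explicit mapping for the degree-2 polynomial kernel in two dimensions."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def kernel(x, y):
    """K(x, y) = (x . y)^2, computed entirely in the original space."""
    return (x[0] * y[0] + x[1] * y[1]) ** 2

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))
```

Expanding phi(x) · phi(y) gives x1²y1² + 2·x1x2y1y2 + x2²y2² = (x1y1 + x2y2)², which is exactly K(x, y).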
In the case when it is necessary to distinguish between more than two separate classes, several
approaches can be used. Firstly, an SVM for each class can be trained that determines whether
the input vector belongs to the corresponding class or not. Secondly, an SVM can be trained for
each pair of classes, and all outputs compared. Finally, it is possible to modify the SVM to allow classifying multiple classes at once [29].
More detailed information about support vector machines can be found in the excellent tutorial
on the basic ideas behind SVMs by Burges [11].
2.1.7 Collective intelligence
In [4], Alag emphasises the growing importance of the user-centric approach in application construction. Harnessing collective intelligence, one of the seven principles of Web 2.0, is considered by many to be the heart of this approach.
In essence, the collective intelligence of users is the intelligence that can be extracted from the collective set of interactions and contributions made by the users, and, even more importantly, the use of this intelligence as a filter for what is valuable in the application for a user. This filter takes a user’s preferences and interactions into account to provide relevant information to the user.
Alag suggests the following model for applying collective intelligence:
· Learn about each user through individual interactions and contributions.
· Learn about all users in the aggregate through their interactions and contributions.
· Build models that can recommend relevant content to a user using the information learnt about the user and the users in aggregate.
2.2 Rule mining
Here, several types of association rules are discussed together with mining algorithms. The section starts with categorical association rules and a summary of the Apriori algorithm, one of the
most widely known rule mining algorithms. Further, quantitative association rules are introduced and a corresponding three-stage mining process is explained. A closely related issue,
namely the application of quantitative association rules to mining survey data, is discussed separately in Section 2.5. Among other issues, ways of generating time-related and trend rules are
highlighted.
2.2.1 Association rules
Association rules were introduced by Agrawal, Imielinski and Swami in [2]. Let I be a set of
items and D a set of transactions where each transaction T ⊆ I. An association rule is an implication of the form X ⇒ Y, where X ⊂ I, Y ⊂ I, and X ∩ Y = ∅.
The rule X ⇒ Y holds in the transaction set D with confidence c if c % of transactions in D
that contain X also contain Y. The rule X ⇒ Y has support s in the transaction set D if s %
of transactions in D contain X ∪ Y.
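Under these definitions, support and confidence can be computed directly by counting transactions. A sketch (each transaction is represented as a Python set):

```python
def support_and_confidence(transactions, X, Y):
    """Support and confidence of the rule X => Y over a list of transactions."""
    n_x = sum(1 for t in transactions if X <= t)          # transactions containing X
    n_xy = sum(1 for t in transactions if (X | Y) <= t)   # transactions containing X u Y
    support = n_xy / len(transactions)
    confidence = n_xy / n_x if n_x else 0.0
    return support, confidence
```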
In [2], the process of discovering association rules is decomposed into two steps:
· Firstly, all sets of items with transaction support above a certain minimum support threshold are located. These sets are referred to as large itemsets. All other itemsets are called small itemsets.
· Secondly, the generated large itemsets and a user-specified minimum confidence level are used to create a list of association rules.
2.2.2 Apriori
In [3], Agrawal and Srikant published the Apriori algorithm for discovering large itemsets.
Given a database of sales transactions where each item has a binary state (present or absent), the
algorithm makes multiple passes over the data. In the first pass, the support of individual items
is calculated and all small single-item itemsets are dropped. In each subsequent pass, a list of
potentially large itemsets (called candidate itemsets) is generated from the itemsets found to be
large in the previous pass, and the actual support for these candidate itemsets is calculated. At
the end of the pass, it is determined which of the candidate itemsets are actually large, and they
become the seed for the next pass. The process continues until no new large itemsets are found.
To count candidate itemsets efficiently, a hash tree is used.
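The pass structure can be sketched as follows. This illustration recounts supports with plain scans instead of a hash tree, and generates candidates simply as unions of large itemsets that differ by one item; the full algorithm also prunes candidates that have a small subset.

```python
def apriori(transactions, min_support):
    """Sketch of Apriori: repeatedly extend large itemsets by one item and
    keep only the candidates whose support reaches the threshold."""
    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / len(transactions)

    items = {item for t in transactions for item in t}
    large = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]
    result = list(large)
    while large:
        # candidate itemsets: unions of large itemsets differing by one item
        candidates = {a | b for a in large for b in large if len(a | b) == len(a) + 1}
        large = [c for c in candidates if support(c) >= min_support]
        result += large
    return result
```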
2.2.3 Quantitative association rules
A quantitative association rule, or an impact rule, is a rule in which the left-hand side describes
a subset of the population with a set of categorical variables, and the right-hand side consists of
a quantitative attribute that has an unexpected mean value for this subset. Impact rules were
introduced in [5] by Aumann and Lindell together with the corresponding rule generation algorithm. Aumann and Lindell point out that impact rules are easily understood and interpreted,
even when they describe complex relations.
In practice, information in most databases is not limited to categorical attributes but also contains much quantitative data. In [32], Srikant and Agrawal extended the categorical definition of
association rules to include quantitative data. Their idea is to build categorical events from the
quantitative data by considering intervals of the numeric values. They also provide an algorithm
which approximately finds all rules by employing a discretization technique. Clustering methods can be used to improve the partitioning of the quantitative attributes in the algorithm [39].
In [5], Aumann and Lindell generalize the categorical definition further and suggest a new definition of quantitative association rules based on the distribution of values of the quantitative attributes. The authors note that, generally speaking, an association rule consists of a left-hand side that is a description of a subset of the population, and a right-hand side that is a description of an interesting behaviour particular to the population described on the left-hand side. Thus, the general structure of a rule is “population subset → interesting behaviour”. Further, for categorical attributes, behaviour is naturally described by a list of items and the probability of their appearance. The authors argue that for a set of quantitative values, the best description of its behaviour is its distribution, and therefore choose to describe the behaviour of a
set of quantitative values by calculating their mean and variance. A subset of the population
displaying a distribution significantly different from that of its complement, either in terms of
the mean or the variance, is recognized as interesting.
Two types of rules are introduced. A (mean-based) categorical-to-quantitative association rule
is of the form X ⇒ MeanJ(TX), where X is a profile of categorical attributes (a set of attribute-value pairs), TX is the set of transactions with profile X, and J is a set of quantitative attributes.
In quantitative-to-quantitative rules, instead of categorical attribute-value pairs on the left-hand
side there are triplets (e, r1, r2) consisting of a quantitative attribute and two real values denoting
the interval of allowed values for the attribute.
Aumann and Lindell suggest using a three-stage process to find impact rules. In the first stage,
all large itemsets are found, for example using the Apriori algorithm. In the second stage, the
mean value of each quantitative attribute is calculated for each large itemset. In the third and
most important stage, a lattice structure is built, linking together each itemset with all its fathers
(immediate subsets) and sons (immediate supersets). X is a father to Y if and only if X ⊂ Y and
|Y| = |X| + 1. The lattice is then traversed, and cases where the mean value between a set and one
of its subsets is significantly different (such sets are known as exceptional groups) are reported
as impact rules. To confirm the validity of a rule, the Z-test was employed. The authors’ model
of an exceptional group is known as the separate effects model.
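The significance test at the heart of the third stage can be sketched as a two-sample Z statistic comparing the mean of the group described by the rule with the mean of its complement (the use of sample variances and the function name are our own simplification):

```python
import math

def impact_z_score(subset, complement):
    """Z statistic for the difference between the subset mean and the mean of
    its complement; a large |z| suggests an exceptional group."""
    def mean(xs):
        return sum(xs) / len(xs)
    def var(xs):                                  # sample variance
        m = mean(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    se = math.sqrt(var(subset) / len(subset) + var(complement) / len(complement))
    return (mean(subset) - mean(complement)) / se
```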
2.3 Rule interestingness
Association rules represent important regularities in databases. They are found to be useful in
practical applications. However, association rule mining algorithms (for example, the Apriori
algorithm described in Section 2.2.2) tend to produce large numbers of rules most of which are
of little interest to the user. Some of the rules have little business significance, others represent
already known facts. Due to the large number of generated rules, it is difficult for the user to
analyze them manually in order to identify those truly interesting ones. Typically, statistical
measures are used as criteria for rule selection.
In this section, several existing approaches to assessing interestingness of association rules are
described. All of them try to identify the most significant association rules from a set of rules
generated by a mining algorithm.
2.3.1 Evaluating unexpectedness
The approach taken by Liu et al. in [24] to finding interesting rules in a set of discovered association rules is to analyze the unexpectedness of the rules. Rules are considered interesting if they are unknown to the user or contradict the user’s existing knowledge (or expectations).
The proposed technique is characterized by analyzing the discovered association rules using the
user’s existing knowledge about the domain and then ranking the discovered rules according to
various interestingness criteria, e.g. conformity and various types of unexpectedness.
The basic idea of this technique is as follows: the user specifies existing knowledge about the
domain and the system then analyzes the discovered rules for unexpectedness based on the
specified knowledge. To register the user’s existing knowledge about the domain, a special
specification language is used. The language is able to represent facts of three degrees of preciseness: general impressions, reasonably precise concepts, and precise knowledge. The interestingness analysis system analyzes the discovered association rules using the user’s specifications and classifies the set into conforming rules (rules that conform to the existing knowledge),
unexpected consequent rules, unexpected condition rules, and both-side unexpected rules. A
visualization system is proposed as well.
The discussion is held in terms of generalized association rules that are different from the original association rule model given in [2] in that they allow associations not only between nominally separate items, but rather between nodes in a taxonomy.
Research by Jaroszewicz and Scheffer [20] is another example of assessing interestingness via
unexpectedness with respect to a user-specified model. In their model, background knowledge
is available in terms of a Bayesian network [28]. Bayesian networks can represent intrinsic dependencies between the included variables. Another advantage of specifying background
knowledge as a Bayesian network is that such networks are relatively easy to understand when
visualized. The algorithm presented in [20] uses sampling-based approximate inference in the
Bayesian network to find the (approximately) most interesting attribute sets.
2.3.2 User-defined interestingness evaluator
In [27], Obata and Yasuda describe a data mining apparatus for finding an interesting association rule among a large number of association rules discovered through data mining, by setting evaluation criteria on the association rules that differ depending on the user’s purpose.
This involves creating a user-defined association rule evaluator that calculates an evaluation
value for each rule. The rules are then sorted by this value and the re-arranged and limited rules
are presented to the user.
More specifically, the idea is to evaluate usefulness of rules. The authors observe that the value
of association rules varies depending on how the rules are intended to be used. They suggest
that the evaluation criterion should be based on the cost incurred upon applying the association
rule, the profit gained when the association rule holds, as well as on the confidence and support
of the rule.
2.3.3 Using users’ interactive feedback
Xin et al. suggest a different approach to evaluating rule interestingness in [37], where the problem of discovering interesting rules through a user’s interactive feedback is studied. The discussion is kept generic and is held in terms of patterns.
The authors claim that in many cases other popular interestingness models, such as minimum
support constraints or rule unexpectedness with respect to a user-specified model, fail to model
the interestingness measure specified by the user really well: the minimum support constraint is
often too general to catch the prior knowledge of the user, whereas the typical unexpectedness
model requires users to construct a reasonably precise background knowledge explicitly, which
is found to be difficult in many real applications.
An alternative idea is to discover interesting patterns through interactive feedback from the user.
Instead of requiring the user to explicitly construct the prior knowledge precisely beforehand,
the user is asked to rank a small set of sample patterns according to his interest. From that feedback, a model of the user’s prior knowledge is created. Another small set of patterns is selected
thereafter, and the model is refined using new feedback. This idea is illustrated in Figure 5.
Figure 5. Model for discovering interesting patterns from interactive feedback by Xin et al.: sample patterns drawn from the frequent patterns are ranked by the user, the feedback builds a model of the user’s prior knowledge, and the model re-ranks the patterns before the next sample is selected.
Two tasks arise in conjunction with the proposed approach. Firstly, the model that the system uses
to build a user’s prior knowledge should be defined. Secondly, it is desirable to find a balance
between the number of sample patterns that need to be ranked by the user and the amount of
information learned from the iterative learning process.
Xin et al. discuss two models to represent a user’s prior knowledge. One is the log-linear model
that works for itemset patterns only, and the other is the more generic biased belief model that is
also applicable to sequential and structural patterns. In both cases, interestingness of a rule is
calculated with the help of a weight vector of unknown variables and a set of constraints derived
from the user’s rankings. It is further suggested that the resulting optimization problem can be
solved either by using a support vector machine or by using one of the existing mathematical
programming tools.
The system should collaborate with the user in the whole interactive process to improve the
ranking accuracy and reduce the number of interactions. The authors present a two-stage approach, progressive shrinking and clustering, to select sample patterns for feedback.
The authors start with two fundamental observations. Firstly, since similar patterns naturally rank close to each other, presenting similar patterns for feedback does not maximize the learning benefit and increases user overhead. Secondly, since a user generally has a preference for higher ranked patterns, the relative ranking among uninteresting patterns is not important. Based on these observations, the authors elaborate an approach similar to that suggested earlier in [31].
Their idea is to divide the most interesting N patterns into k clusters, and to add the centre of each cluster to the sample feedback set. The number N initially equals the number of discovered patterns, and decreases at a constant rate with each iteration. Jaccard distance [19] is used
for clustering.
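The Jaccard distance between two itemset patterns is straightforward to compute (a sketch; patterns are represented as sets of items):

```python
def jaccard_distance(a, b):
    """Jaccard distance: one minus the size of the intersection over the union."""
    return 1.0 - len(a & b) / len(a | b)
```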
2.3.4 Selecting an interestingness measure
Numerous metrics are used to determine the interestingness of association patterns. However,
many such measures provide conflicting information about the interestingness of a pattern, and
the best metric to use for a given application domain is rarely known. All measures have different properties that make them useful for some application domains, but not for others.
In [33], Tan et al. focus on classical association rules and interestingness measures defined in
terms of frequency counts given as contingency tables. They present an overview of various
measures proposed in the statistics, machine learning and data mining literature. The authors
describe several key properties one should examine in order to select the right measure for a
given application domain. A comparative study of these properties is made using more than
twenty of the existing measures. It is shown, inter alia, that most of the named measures agree
with each other when support-based pruning is used in the process of rule mining.
2.3.5 Using click-through data
In [21], Joachims proposes a method for evaluating the quality of retrieval functions that is
based entirely on click-through data, unlike traditional methods that require relevance judgements by experts or explicit user feedback. In particular, the author focuses on comparing retrieval functions used in web search engines. The key idea is to design the user interface so that
the resulting click-through data conveys meaningful information about the relative quality of
two retrieval functions.
With Joachims’ approach, the user is not required to answer any questions. Instead, the system
observes the user’s behaviour and infers implicit preference information automatically. This is
considered to be a key advantage, since click-through data can be collected easily, at low cost
and without overhead for the user.
Joachims recognizes that user feedback can provide powerful information for analyzing and
optimizing the performance of information retrieval systems. However, in practice users are
rarely willing to give explicit feedback. Moreover, especially for large and dynamic document
collections, it becomes difficult to get accurate relevance estimates, since they require relevance
judgements for the full document collection. Joachims claims that schemes for evaluating retrieval functions that only use statistics about the document collection and do not require any
human judgements can only give approximate solutions and may fail to capture the preferences
of the users.
Joachims is inspired by Frei and Schäuble, who in [14] argue that humans are more consistent at
giving relative relevance statements than absolute relevance statements. They recognize that
relevance assessments are dependent on the user and context, so that relevance judgements by
experts are not necessarily a good standard to compare against. Therefore, their method relies
on relative preference statements from users. Given two sets of retrieved documents for the
same query, the user is asked to judge the relative usefulness for pairs of documents in each set.
These user preferences are then compared against the orderings imposed by the two retrieval
functions and the respective number of violations is used as a score.
While this technique eliminates the need for relevance judgements for the whole document collection, it still relies on manual relevance feedback from the user. Furthermore, Joachims argues
that although the resulting scores can be approximately the same, the perceived quality of the
result sets can be quite different. Users would click on the relatively most promising links in the
top of the list, independent of their absolute relevance. A list can have proper ordering while the
overall quality of the results it presents may be low. Other statistics, like the number of links the
user clicked on, are difficult to interpret as well. It is not clear if more clicks are due to the fact
that the user found more relevant documents, i.e. indicate a better ranking, or because the user
had to look at more documents to fulfil the information need, i.e. indicate a worse ranking.
Joachims suggests the following experiment setup for eliciting unbiased data. The user types a
query into a unified interface. The query is sent to search engines A and B. The returned rankings are mixed so that at any point the top links of the combined ranking contain almost the
same number of links from rankings A and B (the two numbers may differ by 1). The combined
ranking is presented to the user and the ranks of the links the user clicked on are recorded.
If one assumes that users scan the combined ranking from top to bottom without skipping links,
this setup ensures that at any point during the scan the user has observed almost equally many
links from the top of ranking A as from ranking B. In this way, the combined ranking gives
equal presentation bias to both search engines.
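A simplified sketch of building such a combined ranking alternates between the two result lists and skips links that have already been included (equal-length rankings are assumed here; Joachims’ full procedure also tracks which ranking each click is credited to):

```python
def interleave(ranking_a, ranking_b):
    """Mix two rankings so any top-k prefix contains almost equally many
    links from the top of ranking A and of ranking B."""
    combined = []
    for link_a, link_b in zip(ranking_a, ranking_b):
        for link in (link_a, link_b):
            if link not in combined:    # the same page may appear in both rankings
                combined.append(link)
    return combined
```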
The above experiment setup is a blind test, transparent to the user, in which clicks demonstrate
the user’s relative preferences in an unbiased way. Furthermore, in the worst case, if one result set
is perfect and the other is useless, the user needs to scan twice as many links as for the better
individual ranking, so the usability impact is relatively low.
Joachims analyzes the statistical properties of the click-through data generated according to his
experiment setup. He makes two assumptions. The first is that users click on a relevant link
more frequently than on a non-relevant link. The second assumption is that the only reason a
user clicks on a particular link is its relevance, and not other influence factors connected with a particular retrieval function. This is reasonable if the abstract for
each link provides enough information to judge relevance better than random.
Under these two assumptions, Joachims shows that A retrieves more relevant links than B if and
only if the click-through for A is higher than click-through for B (and vice versa), i.e. if the user
clicks on the links from A more often than on the links from B.
In [22], Joachims takes this idea further and presents a method for learning retrieval functions
using click-through data.
Click-through data in search engines can be thought of as triplets (q, r, c) consisting of the query
q, the ranking r presented to the user, and the set c of links the user clicked on. As such, click-through data does not convey absolute relevance judgements. However, partial relative relevance judgements for the links the user browsed can be extracted from such data. For example,
if a user after performing a search query is presented with a list of 10 links, and subsequently
clicks on links 1, 3 and 7, it is logical to conclude that link 3 is more relevant to the user than
link 2, and that link 7 is more relevant than links 2, 4, 5 and 6.
This idea is formalized in the following algorithm for extracting preferences from click-through:
for a ranking (link1, link2, …, linkn) and a set C containing the ranks of the clicked-on links,
extract a preference example linki <r* linkj for all pairs (i, j) such that 1 ≤ j < i, i ∈ C and j ∉ C.
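The extraction step translates directly into code (positions are 1-based ranks, as in the text):

```python
def extract_preferences(clicked):
    """Given the set of clicked 1-based ranks, return pairs (i, j) meaning
    'the link at rank i is preferred over the link at rank j': every clicked
    link is preferred over each unclicked link ranked above it."""
    return [(i, j) for i in sorted(clicked)
                   for j in range(1, i) if j not in clicked]
```

For the example in the text — clicks on links 1, 3 and 7 — this yields the preferences 3 over 2, and 7 over each of 2, 4, 5 and 6.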
As this type of feedback is not suitable for standard machine learning algorithms, Joachims
derives a new learning algorithm that can be trained with this type of relative feedback.
For a query q and a document collection D, the optimal retrieval system should return a ranking
r* that orders the documents in D according to their relevance to the query. Joachims uses
Kendall’s τ to compare the observed ordering and the ideal one. In total, he arrives at a convex
optimization problem of the kind that support vector machines are designed to solve.
In practice, the ideal ordering r* is not observable, but it can be inferred from the observed
preferences. Joachims adapts his optimization problem and uses a support vector machine to
learn a ranking function. Joachims’ experimental results show that the algorithm performs well
in practice, successfully adapting the retrieval function of a meta-search engine to the preferences of a group of users. In particular, in his experiments the ranking function trained on
slightly over a hundred observations outperformed Google.
2.4 Visualization
Several ways of visualizing association rules are presented in this section, together with
a brief account of data visualization and visual data mining.
A lot of research has been put into visualizing classical association rules of the type {Xi}→Y,
where Xi are antecedent items and Y is the consequent item. Typically, at least the following
five parameters are visualized: sets of antecedent items, consequent items, associations between
antecedents and consequents, as well as the support and the confidence of the displayed rules.
The two prevailing approaches to visualizing association rules are the two-dimensional matrix
and the directed graph [36].
15
CHAPTER 2. RELATED WORK
2.4.1 Two-dimensional matrix
In a two-dimensional association matrix, the antecedent and the consequent items are positioned
along the two axes of the graph, and an image in a cell depicts an association rule that links an
antecedent with the corresponding consequent. Various attributes of the image correspond to
various rule parameters, such as the support and the confidence. In Figure 6a, the association
rule B→C is shown, and the support and confidence values are shown as columns positioned on
top of each other. The two values are mapped to the height of the column segments.
To visualize rules with several antecedents, antecedent sets instead of single items can be put on
the one axis, as is done in the commercial data mining product SGI Mineset. The situation is
illustrated in Figure 6b. It is easy to see that this graph becomes more cumbersome as the size
of the antecedent sets grows.
2.4.2 Directed graph
When visualizing association rules in a directed graph, antecedent and consequent items are
depicted as nodes, and the associations are represented as edges. For rules with multiple antecedent items special types of edge arcs are used. See Figure 6c for an illustration. The main difficulty with using directed graphs for rule visualization is that the graphs can become complicated
even with a small number of nodes and edges [15].
[Figure: (a) B→C in a 2D matrix; (b) A+B→C in a 2D matrix; (c) A→C, B→C and A'+B'→C in a directed graph]
Figure 6. Conventional ways of visualizing association rules.
2.4.3 Other techniques
Instead of using the tiles of a two-dimensional matrix to show item-to-item association rules,
Wong et al. in [36] use a matrix to depict the rule-to-item relationship. The rows of their matrix
represent items and the columns represent item associations. Antecedent items and the consequent for each column (i.e. rule) are represented by blocks of two colours. The confidence and
support levels of the rules are given by bar charts in different scales at the top of the matrix. An
illustration is given in Figure 7.
The main advantage of this approach is that it allows rules with many antecedent items, while
the identity of individual items within an antecedent group is clear. Also, no screen swapping,
animation, or human interaction is required. All the metadata appear as separate columns at the
edge of the matrix in a way that makes the rule clearly visible (in contrast with the standard
two-dimensional matrix technique where rule columns can easily overshadow each other).
In [9], Blanchard et al. argue that most known representations suffer from a lack of interactivity,
meaning that knowledge discovery should be considered not from the point of view of a mining
algorithm but from that of the user. Inspired by research on users’ behaviour in a knowledge
discovery process and by cognitive principles of information processing in the context of decision models, the authors suggest that when faced with a large amount of information, a decision-maker focuses attention on a limited subset of potentially useful data and shifts that subset until a decision is reached. Although based on this potentially interesting insight, the visualization scheme suggested by Blanchard et al. appears too experimental to be
useful in practice.
Figure 7. Visualizing rule-to-item relationships as suggested by Wong et al.
2.4.4 Data visualization
Advances in technology stimulate research on richer visualization environments. Nagel et al. also recognize the importance of interactivity and present a system for visual mining of
data in virtual reality in [25]. The system includes several data exploration tools for visualizing
four to five data set variables in a three-dimensional space.
There are two principal views of the role of data visualization tools. In [1], Aggarwal suggests
that such tools should be an integrated part of the data mining process and that they should be
used for leveraging human visual perceptions on intermediate data mining results. According to
the other perspective, data mining can use information visualization technology for an improved
data analysis [23].
2.5 Mining survey data
In [6], Bennedich sets out to find an efficient automatic method of finding interesting trends and
relations in a survey database, a large data set containing demographic, behavioural and attitudinal data. Attitudinal values are responses to Likert-type questions (questions that allow the
respondents to state their opinions on a scale) and are in effect ordinal data. Each record in
the database corresponds to a specific business unit (a single hotel or an individual store for
example). He concludes that Aumann and Lindell’s impact rules are an appropriate concept for
this setting, as mining such rules requires relatively little involvement from the user, no initial
idea of what to look for is necessary, and it yields results that can be readily understood and
interpreted by non-technical users.
Itemsets with small support
Bennedich proposes several important modifications to the mining method by Aumann and
Lindell [5]. In the context of Bennedich’s work, the term itemset corresponds to a customer
group and the support of an itemset is defined as the number of customers in that group. It was
desirable to be able to find very specific rules, and the minimum support level was thus suggested to be as low as two responses. To make this realizable in practice, such small groups are
looked for only within individual business units. In the first stage of the algorithm, it is suggested to use the Apriori algorithm to find large itemsets within each unit.
Restrictive model for rule unexpectedness
When all large itemsets have been found and the statistical measures for them have been collected, they will be used to form impact rules. A lattice structure is built that links together each
itemset with its fathers and sons. However, Bennedich argues that the separate effects model for
finding exceptional groups (see Section 2.2.3) is inadequate, as when a group consists of more
than one category and no category is completely dominant, the group mean is in general not
expected to equal the mean of any of its fathers.
Bennedich looks at the combined effects model by Chen [12], where the expected value μG for
the cell G is calculated according to the following formula:
λ_∅ = x̄
μ_G = ∑_{H ⊂ G} λ_H
λ_G = x̄_G − μ_G
Here, x̄ is the overall mean, x̄_G is the observed mean of cell G, and the sum runs over all proper subsets H of G (including the empty set).
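As an illustration, the combined effects model can be evaluated bottom-up over the subset lattice. The sketch below is ours, not code from [12]; the dictionary-based cell representation is an assumption:

```python
from itertools import combinations

def combined_effects(cell_means, overall_mean):
    """Combined effects model: the effect of the empty set is the
    overall mean, the expected mean of a cell G is the sum of the
    effects of all proper subsets of G, and the effect of G is its
    observed mean minus that expectation.

    `cell_means` maps frozensets of category values to observed means
    and must contain every non-empty subset of each cell.
    """
    effects = {frozenset(): overall_mean}
    expected = {}
    for g in sorted(cell_means, key=len):       # fathers before sons
        proper_subsets = [frozenset(h) for r in range(len(g))
                          for h in combinations(sorted(g), r)]
        expected[g] = sum(effects[h] for h in proper_subsets)
        effects[g] = cell_means[g] - expected[g]
    return expected, effects
```

For example, with an overall mean of 3.0 and observed means 4.0 for {A}, 5.0 for {B} and 7.0 for {A, B}, the separate effects are 1.0 and 2.0, so the expected mean of {A, B} is 3.0 + 1.0 + 2.0 = 6.0, leaving a combined effect of 1.0.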
However, this model fails to calculate an adequate μG if one category dominates another, if two
categories express the same thing and lack a combined effect, or if there is a combination of
these events. Bennedich suggests his own restrictive model, in which an interval of values for
the expected group mean is looked for rather than a single value. This interval is defined as
[μ_min, μ_max], where:
μ_min = min(μ_exp, μ_1, …, μ_n)
μ_max = max(μ_exp, μ_1, …, μ_n)
Here, μexp is the expected group mean with the combined effects model and μi are the means of
the fathers. Let μact be the actual group mean. If μact < μmin or μact > μmax, the group is considered
exceptional.
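The restrictive model thus reduces to a simple interval test, as the following minimal sketch (ours) shows:

```python
def is_exceptional(actual_mean, expected_mean, father_means):
    """Restrictive model: a group is flagged as exceptional only if
    its actual mean falls outside the interval spanned by the
    combined-effects expectation and the means of all its fathers.
    """
    mu_min = min(expected_mean, *father_means)
    mu_max = max(expected_mean, *father_means)
    return actual_mean < mu_min or actual_mean > mu_max
```

With an expected mean of 4.0 and father means 3.5 and 4.5, an actual mean of 4.9 is exceptional while 4.2 is not, even though both differ from the expected value.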
This model is more restrictive than both the separate effects model and the combined effects
model. Bennedich underscores that although his model will commit more type II errors (failure
to detect existing relations) than the two models it is based on, there is also a guaranteed decrease in type I errors (finding spurious relations).
Non-parametric statistical test
When an exceptional group has been found, a statistical test is performed to verify that the observed difference is statistically significant. Neither the Z-test nor the t-test is suitable because
of the potentially small sample sizes (both tests require the population to be distributed normally, a requirement that can be relaxed if the sample size is sufficiently large).
Instead, it is recommended in [6] to calculate the exact p-value, i.e. the probability that the difference in mean values between the subgroup and the population is due to chance. This is possible because the
quantitative attributes appearing in the consequent of impact rules can only assume a finite set
of integer values. In this case, an efficient method exists for calculating the p-value exactly. It
has a run time proportional to the sample size.
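The exact method of [6] is not reproduced here. As a simpler (and slower) illustration of how such a p-value can be computed exactly when answers lie on a finite integer scale, the distribution of the sum of n answers under the null distribution can be built by repeated convolution; the sketch below is our own and makes no claim to the efficiency of the original algorithm:

```python
def exact_mean_pvalue(sample, scale, null_probs):
    """Exact two-sided p-value for H0: the sample is drawn from the
    population distribution `null_probs` over the integer answer
    `scale`. The distribution of the sum of n answers is built by
    repeated convolution (dynamic programming); the p-value is the
    probability of a sum at least as far from the null expectation
    as the observed one. Illustrative only; less efficient than [6].
    """
    n = len(sample)
    observed = sum(sample)
    expectation = n * sum(v * p for v, p in zip(scale, null_probs))
    dist = {0: 1.0}                   # P(sum of answers so far = s)
    for _ in range(n):
        nxt = {}
        for s, p in dist.items():
            for v, q in zip(scale, null_probs):
                nxt[s + v] = nxt.get(s + v, 0.0) + p * q
        dist = nxt
    deviation = abs(observed - expectation)
    return sum(p for s, p in dist.items()
               if abs(s - expectation) >= deviation - 1e-12)
```

For two answers of 5 on a uniform 1–5 scale, the expected sum is 6, and only the sums 2 and 10 deviate by at least 4, each with probability 0.04, giving a p-value of 0.08.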
When calculating the p-value, the null hypothesis is not that the means of two populations are
equal, but that the mean of a population equals a given expected value. To accommodate for
that, it is suggested to shift the exceptional group so that its mean becomes equal to the mean of
the parent group. In the case when the exceptional group has several fathers, it should be tested
against all of its fathers, and the final p-value is chosen as the maximum of all the computed p-values.
Trend rules
Various types of time-related rules can be found by defining corresponding categorical attributes. For example, if survey data describes a customer’s visit to a store, the date of visit will
probably be available. From this date, the categorical attribute Day of week (of visit) can be
constructed and corresponding rules found by the methods described above.
Another type of time-related rules is trend rules. Such rules concern contiguous time periods
from some point in time until the most recent data (contiguous time periods up to a point in the
past are found less relevant in [6]).
Bennedich adds a trend category to the data set, which can take any non-negative integer value
and shows the number of days between the relevant time parameter of the survey and the current point in time. An efficient method for finding rules concerning the last d days using the
trend category and dynamic programming is described in [6]. In a nutshell, trends are analyzed
by stepping back one day at a time.
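The stepping-back principle can be sketched as follows (our illustration, not the exact algorithm from [6]): when the window is extended by one day, only that day's responses need to be added to the running totals, so each window mean costs constant amortized time per response.

```python
def trend_means(responses_by_age, max_age):
    """Mean response over the last d days, for d = 1 .. max_age.

    `responses_by_age` maps the trend attribute (number of days
    between the survey and now; 0 = today) to the list of response
    values recorded that day. Stepping back one day at a time, each
    window mean is obtained incrementally from running totals.
    """
    total, count, means = 0.0, 0, {}
    for age in range(max_age):
        for value in responses_by_age.get(age, []):
            total += value
            count += 1
        if count:
            means[age + 1] = total / count  # mean over last age+1 days
    return means
```

For example, with responses [4, 4] today and [2] yesterday, the mean over the last day is 4.0 and over the last two days 10/3 ≈ 3.33.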
Furthermore, a method of selecting rules from a list of trend rule candidates is discussed. If a
rule concerning data up to 14 days ago has been found, it is likely that similar rules will also be
found for 13 and 12 as well as 15 and 16 days ago, etc. Bennedich chooses to report all trend
rules that have the largest mean difference of all the rules with a p-value less than or equal to its
own. Mean difference here is the difference between the mean response for the trend period and
the mean response for the whole period.
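Bennedich's selection criterion can be written down directly; the sketch below is ours, and the use of absolute mean differences is our assumption:

```python
def select_trend_rules(candidates):
    """Report every candidate whose (absolute) mean difference is
    the largest among all candidates with a p-value less than or
    equal to its own. `candidates` is a list of
    (p_value, mean_difference) pairs for the same itemset.
    """
    selected = []
    for p, diff in candidates:
        best = max(abs(d) for q, d in candidates if q <= p)
        if abs(diff) >= best:
            selected.append((p, diff))
    return selected
```

Note that several candidates can pass the test: given candidates (0.01, 0.5), (0.02, 0.8) and (0.03, 0.4), the first two are reported, each having the largest difference among the rules at least as significant as itself.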
Distance between rules
In [6], Bennedich also looks for a method of measuring distance between rules that would resemble the perceived distance. For the case of impact rules, standard distance measures based
on the number of underlying data records (such as the Jaccard distance mentioned earlier in this
text) are found to be of little use. Instead, the idea of calculating the distance between the rules from
the attributes constituting them is explored. The suggested definition for distance is D = 1 − s,
where s is the similarity function defined as s = c · d. In the latter formula, c is the similarity
between the categorical attributes of the two rules, and d is the similarity between the quantitative attributes.
Bennedich wants to minimize user involvement and settles for a fully automatic solution. Measuring similarity between two quantitative survey questions is relatively simple. Pairwise Pearson correlation coefficients between the survey questions are computed for the entire set of survey responses, and their absolute values are used as similarity values (they are all between 0 and
1). However, it is noted that if two questions correlate well it does not necessarily imply that
they are perceived as being similar.
For categorical attributes, a different method is necessary. When observing generated rule sets,
Bennedich notices that the categorical attributes that are intuitively similar to each other tend to
occur with the same set of survey questions in the resulting rule sets. Inspired by this insight, he
uses Hebbian learning to build a profile for each categorical attribute that describes how
strongly it is connected to each question. To convert two such profiles to a measure of similarity, the connection weights are converted to ranks, and the Kendall τ rank correlation coefficient
is computed to compare the order in which questions are ranked. Negative coefficients are replaced by 0, and the final values are all between 0 and 1.
Rule distance and interestingness of rules
The case of several categorical attributes is not studied in general. Instead, the discussion is
confined to the setting of memory-based reasoning. In that setting, it is necessary to be able to
compare a query rule Rq with a rated rule Rr. Let 0 ≤ t ≤ 1 be the interestingness rating of Rr, let
Cr be the categorical attributes of Rr, and let Cq be the set of categorical attributes of Rq.
Bennedich makes the assumption that if a set of attributes is considered uninteresting, it would
likely remain uninteresting if more attributes were added. On the other hand, if a set was rated
to be interesting, then each of its subsets, specifically each individual attribute, is likely to also
be considered interesting. It is emphasized that although not always true, this assumption helps
to create more accurate interestingness estimates as it is expected to hold in most cases.
Then, if Cr is considered interesting, and if a categorical attribute q of the rule Rq is similar to
any of the attributes of Cr, we would like to say that q is similar to Cr in order to allow Rr to
have a larger effect on the interestingness of Rq. Hence, for each q, its most similar attribute in
Cr is found, and the total similarity sq between Cq and Cr is computed as the product of all of
these similarities.
On the contrary, if Cr is considered uninteresting, it is difficult to say anything about subsets of
Cr, since it is possible that some subsets are in fact interesting, but the combination of all attributes is not. Thus, in order for Cr to be similar to Cq and thereby for Rr to have a larger effect on
the interestingness of Rq, it is necessary for each r ∈ Cr to be similar to some attribute of Cq.
To sum up all of the above, if s(a, b) is the similarity between two attributes, the distance D
between Rq and Rr is defined in [6] by the following formula:
D = 1 − c · d
d = similarity between the quantitative attributes
c = t · s_q + (1 − t) · s_r
s_q = ∏_{q ∈ Cq} max{ s(q, r) : r ∈ Cr }
s_r = ∏_{r ∈ Cr} max{ s(q, r) : q ∈ Cq }
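Putting the pieces together, the distance between a query rule and a rated rule can be computed as in the following sketch (ours), where `sim` stands for the attribute similarity function s(·, ·):

```python
def rule_distance(cq, cr, t, sim, d):
    """Distance D = 1 - c*d between a query rule with categorical
    attributes `cq` and a rated rule with attributes `cr` and
    interestingness rating 0 <= t <= 1; `d` is the similarity
    between the quantitative attributes.
    """
    s_q = 1.0
    for q in cq:                      # each query attribute against
        s_q *= max(sim(q, r) for r in cr)   # its most similar match
    s_r = 1.0
    for r in cr:                      # and vice versa
        s_r *= max(sim(q, r) for q in cq)
    c = t * s_q + (1 - t) * s_r
    return 1 - c * d
```

Note the asymmetry: for t = 1 (an interesting rated rule) only the query attributes must find similar counterparts, whereas for t = 0 every attribute of the rated rule must be matched.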
Bennedich points out that as more user ratings become available, it should be possible to refine
the similarity measures used between attributes by comparing the predictions of the model to
the accumulated user rankings and adjusting the similarity values accordingly.
Memory-based reasoning vs. naive Bayes vs. neural network
Further, Bennedich compares a naive Bayes classifier, a multi-layer perceptron neural network
and the memory-based reasoning model based on the distance formulas above in terms of their
ability to predict interestingness of rules from sample user rankings. He defines a synthetic user
model that assigns interestingness ranks to rules based on how many other rules with the same
attributes there are in the target database, and uses this model to see how well the three estimation techniques can learn it. In his experiments with this setting, the memory-based reasoning
model proved superior to the other two methods both in terms of prediction accuracy and
stability.
3 Design of the system
In this chapter, the choice of the data mining technique is made, the requirements on the
target rule mining system are listed and a workflow suitable for working with the system
is outlined. Finally, we describe and motivate an architectural design for the target system in the context of these requirements and the suggested workflow.
3.1 General vision
A general summary of the target system: what are we trying to achieve?
Given a data warehouse storing customer satisfaction data, it is desirable to find rules describing
exceptional, unexpected phenomena in the data. As suggested by the business context, the rules
should link customer segments with satisfaction levels.
It is necessary to develop a logical and intuitive workflow in this setting that will allow non-expert users to work with the discovered rules. Among other things, an approach for estimating
interestingness of rules is sought. To make rules easier to understand, some form of rule
visualization can be used.
The goal is to come up with a comprehensive solution stretching from setting up data mining
parameters to making the rules available for the end-user.
3.2 Choice of a mining technique
Here, we motivate the choice of the core data mining technique. In essence, we choose
to use an extension of Aumann and Lindell’s mechanism for mining categorical-to-quantitative association rules.
Aumann and Lindell’s categorical-to-quantitative association rules (Section 2.2.3) clearly constitute a suitable and direct solution to our problem, and their three-stage mining process can be
used as is for rule mining.
Furthermore, the context of Bennedich’s work (Section 2.5) is in fact identical to ours. We too
want to discover rules with small support as well as rules describing a large population. The
non-parametric statistical test can be used to support this.
The reasoning behind the restrictive model for assessing unexpectedness of rule candidates is
convincing. One difficulty with this method is that it is hard to explain to the users what the
expected response value is: it is not trivial to understand why it is an interval of values of the
form [μ_min, μ_max]. This issue is discussed in more detail in Section 6.6.
The concept of trend rules is highly useful. It was suggested by the project managers of the target application that this type of rules might be most interesting for the clients. However, we will
use a different method of selecting rules from a list of trend rule candidates. Bennedich reports
all trend rules that have the largest mean difference of all the rules with a p-value less than or
equal to its own. This means that several trend rules for the same itemset (excluding the trend
attribute) can be found.
Doing so has several disadvantages. To begin with, more rules are generated, which places additional load on the system. More importantly, the user needs to analyze more rules, and this is
directly incompatible with the general idea of keeping things simple for the user. Instead, we
suggest that a single trend rule should be reported per itemset. The heuristic we use for selecting
the one rule to report from a list of trend rule candidates is described in Section 4.2.
On the other hand, reporting several trend instances of the same rule increases the chance that
the rule will not go unnoticed. For example, one of the pruned instances could have a large
mean difference and would pop up once rules were ordered by difference. As such, selecting a
single rule from a series of candidates will inevitably reduce the amount of information available about the trend phenomenon. However, if this becomes an issue, the possibility of keeping
the candidate rules in the system and using them behind the scenes while still showing a single
instance to the user can be investigated.
All in all, we settle on using Bennedich’s mining scheme with an updated trend rule selection
procedure.
Theoretically, decision trees (Section 2.1.1) could also be employed to search for rules, as each
path from the root to a leaf makes up a rule. However, to build a tree, it is necessary to specify a
target variable. Decision trees are most suitable for categorical target variables, while we are
interested in rules with quantitative attributes in the consequent.
Typically, decision trees are used more in a research context, for example when an analyst expects a connection between certain factors and wants to test whether it is supported by real data.
Often it is necessary to change input parameters iteratively and study the trees over and over
again to find something interesting. This data mining project has a different focus. We want to
make rule detection an automatic process and let it find unexpected rules rather than prove or
refute an existing hypothesis.
Further, in practice decision trees are often technical in their presentation and are likely to discourage users who lack technical background, especially keeping in mind that the trees tend to
get complicated when it comes to larger data sets with many parameters. A typical tree is
“bushy” even after pruning, which makes the advertised simplicity and comprehensibility of decision trees look less convincing. In this sense, quantitative association rules are a more predictable alternative, as they are consistently easy to understand.
Still, using decision trees in the target application can prove useful. Studying the associated tree
can give new insights into the nature of a rule. Also, trees can be used as an alternative way of
looking at the data. However, we choose not to focus on decision trees further in this work.
3.3 System requirements
This section lists more specific requirements on the target system from both the technical perspective and the user’s point of view. The focus is on what the system should be
able to do and how the user is supposed to interact with the system.
Below we present the initial requirements on the system that were found during requirements
elicitation and the background study carried out in an early project phase, and explain why they
were chosen as stated. Several topics to study are outlined. Emphasis lies on building a solid
architecture to base the implementation upon and on finding a method of effectively determining the relevance of the discovered rules, ideally with as little explicit input from the user as
possible.
[Figure: a tree with root node All, intermediate nodes Region 1 and Brand 1, and leaf nodes Business unit 1–4]
Figure 8. Grouping of business units in several levels.
1. Development of a sensible workflow in the setting described earlier. Considerations and
desired features:
a. Without loss of generality, let each satisfaction record concern a specific business unit.
Business units can be grouped in various ways. Figure 8 illustrates grouping of
business units in several levels. Following the design of the target application, let the
grouping be hierarchical up to (and excluding) the lowest level, where single units
can be included in several groups. The nodes are grouped in a way appropriate for
the business.
Further, let each user of the system have access to a single node in this structure.
Note that this does not limit the allowed combination space as new complex group
nodes can be introduced.
b. Grouping of business units should be supported.
i. Rule sets will be separate for the different levels. Following the example in Figure 8, it should be possible to find rules concerning Business
unit 1 as well as rules concerning Region 1 as a whole.
ii. Different rules will be shown to different users, depending on which
nodes in the group structure they have access to. In particular, users directly responsible for one or several business units shall be able to work
with rules associated with these units.
iii. Users at higher levels should be able to explore lower-level rules. For
example, a user who has access to Region 1 should be able to browse
rules for Business unit 1 and Business unit 2.
iv. It should also be possible to configure the display so that a summary of
the most important rules on the next lower level is shown (by nodes, or
in a combined list).
The system will be targeted at non-expert users. However, having a functioning
data mining tool opens possibilities for research, thus it is desirable to prepare for
more advanced usage scenarios.
c. Two rule viewing modes should be supported – simple and advanced.
i. In simple mode, default settings will be used to display rules.
ii. In advanced mode, the user will be able to sort, logically filter the rules
as well as adjust the settings that determine which rules are to be displayed. For example, the user should be able to choose to see only rules
concerning a certain satisfaction parameter at a specific unit.
d. To let users quickly understand the rule, its basic characteristics should be made
explicit early on.
i. For each rule, the expected and the actual satisfaction level should be
indicated.
ii. Strength (for example, the corresponding p-value) of each rule should
be indicated. In simple mode, the scale should be simplified (for example, it can be graded as “weak”, “medium” and “strong”). In advanced
mode, a numerical representation can be used.
e. To enable in-depth analysis of rules, the following features should be available:
i. Textual representation of the rule, as a means of reducing the complexity of the notion of association rules and making them more intuitive.
ii. Appropriate diagrams that help visualize the rule.
iii. It should be possible to explore individual satisfaction responses behind
the rule. In particular, it might prove useful to study the text comments
corresponding to the satisfaction records, if such are available. For a
negative rule, for example, the responses without comments should
be filtered away, and the rest ordered by increasing satisfaction value.
The idea here is that when studying the background of a low mean response, it helps to know what the people who were unhappy with the
selected parameter had to say.
iv. A list of other interesting rules should be suggested. It might be other
rules relevant in the context of the given rule, rules related to the given
rule in some way, for example concerning the same customer segment,
a similar customer segment, etc.
Depending on the nature of the underlying data, the number of possible categories,
the number of satisfaction attributes and the significance thresholds used for mining, the number of discovered association rules can vary from few to thousands. To
help cope with a potentially large number of rules, users can be given the possibility
to hide uninteresting rules and to mark specifically interesting ones.
f. It should be possible to mark rules as uninteresting. This can be implemented as
a trash list where users can move irrelevant rules.
g. Each user should be able to form a personal list containing rules that are highly
relevant, so that these rules can be accessed at any time in order to catch up
with latest changes in watched phenomena.
i. Improvement and degradation of satisfaction levels should be indicated.
A graphical representation is suitable. A graph showing development of
the tracked value over time, coloured red or green depending on the
current trend.
ii. The point in time since when the rule is being watched should be
clearly indicated.
h. Too many rules should not be shown. However, it should be possible to see
more rules upon demand.
i. Trashed and watched lists should not disappear the next time rules are mined. A
mechanism for distinguishing new rules from the old ones after the rule base
has been rebuilt should be developed. For example, two rules with the same client segment and with the satisfaction value on the same side of the expected
value could be considered the same.
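One possible realisation of such a rule identity is a key combining segment, satisfaction attribute and the sign of the deviation; the sketch and its field names are hypothetical, not part of the target system:

```python
def rule_identity(rule):
    """Identity key for matching rules across re-mining runs: two
    rules are considered the same if they concern the same customer
    segment and the same satisfaction attribute, with the actual
    value on the same side of the expected value. Field names are
    illustrative.
    """
    side = 'above' if rule['actual'] > rule['expected'] else 'below'
    return (frozenset(rule['segment']), rule['attribute'], side)
```

Two rules that differ only in the magnitude of the deviation then map to the same key, so trash and watch status can be carried over.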
Actionable rules are of high practical value. It is important to help users act on discovered rules more efficiently by providing basic tools for sharing and discussing
rules.
j. It should be possible to share rules with other users/categories of users. (It remains to be decided how sharing will work between users and groups of users.)
k. It should be possible to leave comments on rules.
i. The comments will be visible to anyone who has access to the rule.
ii. It should also be possible to assign a user to be responsible for tracing a
certain rule.
When deciding on the layout of rule lists, a typical email-client folder structure can be
used as a starting point:
· Inbox: Lists most relevant rules. When in advanced mode, the user should be
able to sort and filter this list.
· Trash: Rules that the user does not want to see again are placed here.
· Watch: All watched rules are displayed in this list independently of their current
difference from the expected value.
· Shared: Shared rules.
· Assigned: Rules assigned personally to the current user.
As it was mentioned earlier, the number of rules can be large. Displaying all rules to the user
may not be appropriate. In any case, it is desirable to find ways to automatically select rules that
are potentially most interesting for the given user.
2. Develop an automatic filtering and ordering mechanism for displaying rules.
a. Most relevant/interesting rules should be displayed first.
b. Statistical significance of rules (the p-value, the difference from the expected
satisfaction level, potentially other factors) should be taken into account.
c. Can information about trashed and watched rules be utilized to deduce interestingness of such rules and rules related to them?
d. Interestingness of rules on a certain level need not be the same for users at different levels. However, it is assumed that users at the same level share their
opinion of what is interesting or not.
e. Personal relevance preferences of a user can be weighted in together with other
factors when resulting interestingness of a rule is being determined, e.g. with
relevance preferences of all users at the same level.
f. Can click-through analysis be used to further improve the filtering mechanism?
One could study which rules are clicked on, and which are not, in what order
rules are explored, and what click sequences look like.
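One simple way to combine these factors is a weighted sum; the sketch below is illustrative, and the weights are placeholders rather than tuned values:

```python
def interestingness(significance, personal_pref, group_pref,
                    w_sig=0.5, w_personal=0.3, w_group=0.2):
    """Combine statistical significance (e.g. derived from the
    p-value and the difference from the expected level), the current
    user's own relevance preference and the aggregated preference of
    users at the same level into one score. All inputs are assumed
    normalized to [0, 1]; the default weights are arbitrary.
    """
    return (w_sig * significance
            + w_personal * personal_pref
            + w_group * group_pref)
```

A rule with maximal statistical significance but no preference signal would score 0.5 under the default weights, leaving room for user feedback to promote or demote it.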
Another way of dealing with a large number of rules is to present the rules graphically in a way
that would help to visually identify the most relevant rules at a glance. An additional possible
advantage of a graphical representation is that it may be more effective in involving the user in
rule analysis.
3. Suggest a method for displaying a set of association rules.
a. Ideally, the solution should be interactive in a way that encourages further exploration of rules.
b. If possible, a test on users should be conducted to see whether the suggested
visualization method is effective enough.
All of the above should be matched by a corresponding architectural solution.
4. An architectural design capable of accommodating the above features should be suggested. Among other factors, the following considerations should be taken into account:
a. It should be possible to schedule mining. For example, it should be possible to
configure the system to periodically mine data for a specific business.
b. The solution should be scalable with the number of users and rules.
c. The system should be adjustable to different businesses. A set of appropriate
control parameters should be suggested.
Examples of real-life figures (number of rules, number of users, etc.) that the target system
should be able to handle are given for reference in Table 1.
CHAPTER 3. DESIGN OF THE SYSTEM
Table 1. Reference parameters of the operational environment of the future data mining system.

Parameter                                   Value
Number of rules                             10^7
Number of users                             10^4
Number of business units                    10^4
Number of unit levels                       4
Number of large customers (10^4 users)      20
Number of small customers (10 users)        1000
3.4 Perceived complexity and workflow
This section discusses the complexity of a data mining tool as perceived by users. We
discuss ways of reducing the perceived complexity, in particular by creating a workflow
that users can easily associate with and where the inherent complexity of association
rule mining becomes irrelevant. When it comes to workflow and user features, this section serves as a complement to Section 3.3.
In most user-operated software, and especially in a data mining application targeted at non-expert users, it is of crucial importance to make the working environment intuitive and comfortable for the user. As we have experienced, people often perceive data mining as complex and intimidating. In the context of our application, we want to prompt a different view. We do not want people to think of data mining as something frightening, but rather as a smart technique made for them, with the sole purpose of helping them solve their everyday concerns.
As mentioned earlier, our application will primarily be used by people who are by no means experts in data mining or advanced computer technology. An electronics retail chain is an example of a target client. Individual stores correspond to business units in Figure 8. Store managers are the low-level users of the system in the sense that they only work with data concerning their particular store. Impact rules will help store managers uncover unexpected tendencies and target their improvement efforts, allowing timely and effective micro-management.
Higher level managers will work with data from combinations of individual stores and will be
able not only to study the situation at each store, but also to analyze their unit group as a whole.
At a high level, company management often understands the need for this type of tool. Their main concern is to make sure that the technology is comprehensible to and used by people at lower levels. At the low level, we will often find people busy with everyday issues who have their own well-established routines, and it may not be straightforward to convince them that learning yet another technology will be worth the effort.
For their part, the developers of the system need to understand that introducing this type of product into a business is likely to change the way people work and even think about their work. It will create new routines and make some of the existing ones obsolete. It will push people forward. Hopefully, it will also open new possibilities to improve the service. To make this transition possible (to begin with) and smooth (ideally), it is necessary to make the users comfortable with the system from the very start.
Basically, we want to hide the inner complexity of the system behind an appealing and easy-to-use surface. Although the product can potentially be used for advanced analytics, delivering a research-friendly, data-intensive interface with a multitude of adjustable controls is not the way to go. As mentioned under System requirements, this should rather be implemented as an optional alternative interface. Instead, a straightforward, simple way to view the system and interact with it should be sought.
One dimension in this simplification problem is the actual workflow. We want to support the user in everyday activities and make the results of data mining useful. The suggested general workflow consists of the following steps:
· Set up data mining parameters.
· Iteratively:
o Mine data.
o Explore results, act when necessary, and follow up in subsequent mining runs.
Setting up data mining is thought of as a one-time task carried out by an administrator. In practice, the setup will likely have to be tuned over several iterations as a deeper understanding of the underlying data and the needs of the users is accumulated.
The mining step itself should ideally happen behind the scenes and be transparent for the user.
From the user’s perspective, the set of rules changes automatically as new data comes in.
Whether it is done continuously or periodically is not essential. Rather, it is the last step that is
crucial, i.e. what the users can do with the rules and how the system supports them in this.
Consider a familiar analogy. When faced with the task of picking out useful insights from a
potentially interesting article, one would work through it, mark the most relevant or promising
ideas, think about these closer, maybe discuss with co-workers. When the useful ideas have
been selected, one might test them out in whatever setting they need to be applied in and see
how it goes. It might prove necessary to divide the identified tasks between several people.
Similarly, when a manager is faced with a number of rules discovered during data mining, he should be able to navigate through them, analyze more closely those that seem potentially useful, mark the most crucial ones, and hide those that are irrelevant. Some of the rules might need
to be discussed with colleagues, either because they may be relevant for others, or in order to
gather ideas. Further, something can be done about the rules that were found crucial enough (for
example, by taking a series of improvement measures if a satisfaction parameter for a certain
customer group appears to be too low). It should be possible to track such rules to see the effect
of the applied actions. When the situation is back to normal, the rule is not tracked any longer.
Thus, to facilitate working with rules, the following should be available:
· Rule list, in which the user can navigate, sort and filter the available rules.
· Rule explorer, where the user is taken upon focusing on a rule (for example, via a mouse-click in the rule list). Here, more information about the rule is provided, including graphs visualizing the rule and a list of related rules that might be interesting to investigate in conjunction with the explored rule. Data records behind the rule and other available context information should also be reachable from this view (for example, if survey data is mined, the surveys that the rule is based on, and especially text comments from these, can provide a deeper insight into the nature of the rule).
· Watched rules list, where users can move rules from the main rule list that they find highly interesting and want to follow up. Rules in this list remain here until they are explicitly moved back to the main rule list, which can be done at any time. The detailed overview and the graphs for such rules should take the latest available information into account, so that these rules are always up to date and reflect the latest changes in data. In this way, the effect of taken measures can be tracked closely. The list of watched rules is user-specific.
· Another possibility worth exploring is to leave watched rules in the main rule list but indicate that they are watched (for example, by marking them with a star – compare with the popular concept of “favourites”). This would let the users compare watched rules with the rest of the rules in whatever way the GUI makes possible. For example, sorting the rules by a specific parameter makes it easy to compare starred rules with the rest.
· Trash list, personal for each user, where users can move rules from the main rule list that they do not want to see again. For example, rules involving meaningless combinations of attributes and rules describing well-known phenomena can be moved to the trash list. Such rules will never be shown in the main rule list again.
· It should be possible to leave comments on rules, and these should be available from the rule explorer. It can be practical to let the main rule list, on demand, give a summary of the latest comment updates for each of the shown rules.
· Users at a higher level should be able to assign watched rules to users at lower levels. For example, if a region manager notices an important rule concerning a certain business unit, it should be possible to put the rule on the watched list of the manager responsible for that business unit.
It is believed that by providing the user with the abovementioned tools we induce a workflow
that the user can accept and associate with, that puts the unfamiliar concept of impact rules in an
understandable and actionable context, and that transforms the complex data mining machinery
into an instrument of everyday use.
Another way to reduce the perceived complexity of the system is to critically review the vocabulary of the user interface and replace technology-related terms with suitable, easy-to-understand metaphors. In particular, it might be a good idea to abstain from using the term “data mining” altogether and use a more universal language. For example, if the system mines unexpected patterns (rules), one might speak of finding new insights rather than of searching through a set of categorical-to-quantitative association rules mined from the latest available data. Figure 23 shows an example summary of the most relevant rules labelled as “insights”.
3.5 System architecture
A general architectural design for the rule mining system is described in this section.
The subsections give details about the main components of the system.
In view of the requirements listed previously, and in order to strengthen logical separation of the
different stages of the rule mining process, it is suggested to split the rule mining system into
four main components:
· Application server runs operations specific to the host application. In particular, the rule handling GUI is implemented here.
· Data mining engine (DME) is responsible for mining satisfaction data.
· Rule server (RS) is the provider of mined rules and the host of the rule ranking module.
· Control module coordinates mining of data for different businesses.
The suggested architectural design is visualized in Figure 9. The architectural diagram is explained component by component in the subsections that follow.
The general flow of events in the diagram is as follows:
· The administrator sets up mining preferences for each customer (i.e. business) via the control module, as well as settings and scheduling for MiningDataDistributor for each customer on the application side.
· MiningDataDistributor sends changes in underlying data to the Data fetcher on the corresponding DME server.
· Based on scheduling preferences and available log data, Controller asks Miner macro to start mining when the time is right.
· Miner macro lets Miner micro mine rules for each unit group. At any time this process may be interrupted by the user via Controller’s control interface.
· When rules for all unit groups have been mined, Miner macro reports success to Controller.
· Controller notifies Rule agent of available rule data at the DME server, and Rule agent asks Data fetcher to fetch the rules.
· Data fetcher receives the mined data (together with a classification of unit groups in a form useful for the ranking module; this indicates which unit groups share feedback that is relevant for interestingness) and updates the internal databases.
· Changes in Watch and Trash are continuously reported by the main application to the Rule agent. Ranking is updated accordingly.
· Occasionally, the main application also reports rule usage statistics to the rule server.
· The rule server answers rule queries received from the main application.
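The core of this flow can be condensed into a small orchestration sketch. The class and method names mirror the diagram labels, but their signatures are assumptions made for illustration; the fake components stand in for the real servers.

```python
class Controller:
    """Illustrative coordinator for one mining cycle. Component names
    follow the architectural diagram; all signatures are assumptions."""

    def __init__(self, miner_macro, rule_agent):
        self.miner_macro = miner_macro
        self.rule_agent = rule_agent

    def run_cycle(self, customer):
        # Miner macro mines rules for every unit group of the customer
        # and reports success when all groups are done.
        self.miner_macro.mine_all_unit_groups(customer)
        # Controller then notifies the rule agent, which asks its data
        # fetcher to pull the fresh rules from the DME server.
        self.rule_agent.fetch_rules(customer)

# Minimal stand-ins so the cycle can be exercised without real servers.
class FakeMinerMacro:
    def __init__(self):
        self.mined = []
    def mine_all_unit_groups(self, customer):
        self.mined.append(customer)

class FakeRuleAgent:
    def __init__(self):
        self.fetched = []
    def fetch_rules(self, customer):
        self.fetched.append(customer)

miner, agent = FakeMinerMacro(), FakeRuleAgent()
Controller(miner, agent).run_cycle("customer-1")
```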
[Figure 9 is an architectural diagram showing the four components and their connections: the Application server (Application, Application DB, Log DB, schedule DB, Watch, Comments, Assignments and Sharing data, MiningDataDistributor, Receptionist, Control GUI), the Data mining engine (Satisfaction DB, Data fetcher, Miner macro, Miner micro, Rule DB), the Rule server (Data fetcher, Rule DB, Usage DB, Filtering DB, Ranker, Rule agent) and the Control module (Controller, Scheduling DB), together with the control, status, data and query flows between them.]
Figure 9. Architectural diagram of the rule mining system.
3.5.1 Application server
Application server is responsible for running the main application in a way defined by the specification of the host system.
Ideally, in order to separate the mining module from the application using it, no modifications should be made on the application side when integrating the new module. In reality, however, this is difficult to achieve for at least two reasons. First, additional independent components increase overall system complexity. Second, the host application does not necessarily provide an API allowing external access to its data. In the latter case, the host system needs to be modified anyway.
The following support is required on the application side.
· According to the schedule specified by the administrator, the application delivers the data necessary for mining to the corresponding DME server via MiningDataDistributor.
· Depending on data mining settings, the amount of data to be mined may vary. In order to reduce the load on the application, it is suggested to send data to the DME incrementally, thus eliminating the need to send over all data each time mining is scheduled.
· MiningDataDistributor is responsible for converting the data to the format required by the data miner. This format should be independent of the application’s data model. Support for the following activities is specifically necessary:
o Separate demographic/behavioural attributes from attitudinal attributes and free text data.
o Create custom attributes. For example, if the attribute “Check-in date” exists, but not “Check-in day-of-week”, it will be necessary to create the latter in order to be able to discover rules concerning specific weekdays.
o Some quantitative behavioural attributes may need to be partitioned to make the mining more efficient. For example, the attribute “Price paid” for a service can take many values, and rather than splitting the mining on every possible value, it is advised to partition such an attribute into 5-10 groups, e.g. $0-15, $15-40, $40-60, etc.
· There should be a GUI allowing authorized users to adjust scheduling and other settings concerning MiningDataDistributor that are independent of the control module.
· Watched rules, comments on rules, rule assignments (to specific users), as well as information about sharing of rules can all be kept primarily on the application side rather than on the rule server.
o In this way, this data will be available instantly to the application, and there will be no need to query the rule server. It might be desirable to display assigned and watched rules to the user often, and no ranking of these rules is necessary as they should be few.
o When any of the above are changed, the rule server should be notified. This should be done instantly in the case of changes to watched rules. When the rest is changed, it may be more appropriate to bundle change notifications with click-through data and occasionally send this information to the rule server in the form of rule usage update blocks in order to optimize system performance.
· GUI for working with rules should be integrated into the application to allow for a logical and natural workflow from the user’s point of view. This will also result in a clearer separation of responsibilities between the different components of the system.
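The attribute conversions listed above (deriving “Check-in day-of-week” from “Check-in date” and partitioning “Price paid” into a handful of groups) could look like the following sketch. The attribute names come from the examples above, but the bin edges and labels are illustrative assumptions.

```python
import bisect
from datetime import date

def day_of_week(check_in):
    # Derive a "Check-in day-of-week" attribute from "Check-in date".
    return check_in.strftime("%A")

def partition_price(price, edges=(15, 40, 60, 100)):
    # Partition the quantitative attribute "Price paid" into a handful of
    # groups, e.g. $0-15, $15-40, $40-60, ... (bin edges are illustrative).
    labels = ["$0-15", "$15-40", "$40-60", "$60-100", "$100+"]
    return labels[bisect.bisect_left(edges, price)]

record = {"check_in": date(2009, 1, 5), "price": 37.5}
record["check_in_dow"] = day_of_week(record["check_in"])
record["price_group"] = partition_price(record["price"])
```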
3.5.2 Data mining engine
The primary task of the data mining engine is to mine data. However, additional components
were suggested to ensure compatibility with the rest of the framework.
· Satisfaction data is stored to allow incremental data transfer from the main application to the DME. Thus, data can be mined at any time without placing additional workload on the main application.
· A dedicated component receives incremental updates of the satisfaction data from application servers.
· When requested to do so by the control module, the DME runs data mining on available data for the specified client, saves mined rules and reports back to the control module when the job is done.
· As mining itself can be time-consuming, mined association rules are stored so that they can be re-sent if errors occur during communication with the rule server.
· The DME returns the latest set of mined rules for the given client upon request from the rule server.
As mentioned under System requirements, each satisfaction record concerns a business unit. Candidate rules concerning different business units can be mined separately. Thus, it is reasonable to use several independent threads to mine different units. However, candidate rules must then be compared to see whether the described phenomena are indeed exceptional. This can be done once candidate rules for all units have been found.
Another consideration is that business units can be arbitrarily grouped into unit groups. Arbitrary unit groups can be mined by analogy. Satisfaction records for all units within the given
unit group can be marked as belonging to this group, and then the normal mining procedure can
be used. Clearly, information about which unit groups are comparable must be sent from the
main application to allow cross-group analysis to determine whether the discovered candidate
rules are exceptional or not.
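The two-phase procedure above (mine candidates per unit group in independent threads, then compare across groups) might be sketched as follows. The per-group miner and the exceptionality test are placeholders; real candidate generation and cross-group comparison would be considerably richer.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean

def mine_candidates(unit_group, records):
    # Placeholder per-group miner: real mining would generate candidate
    # rules; here we only compute the group's average response.
    return {"group": unit_group, "avg": mean(records)}

def mine_all(groups, threshold=1.0):
    # Phase 1: mine candidates for each unit group in independent threads.
    with ThreadPoolExecutor() as pool:
        candidates = list(pool.map(lambda item: mine_candidates(*item),
                                   groups.items()))
    # Phase 2: once all groups are done, compare candidates across groups
    # to decide which phenomena are truly exceptional.
    overall = mean(c["avg"] for c in candidates)
    return [c for c in candidates if abs(c["avg"] - overall) > threshold]

groups = {"store-1": [4.1, 4.3, 4.2],
          "store-2": [2.0, 2.2, 2.1],
          "store-3": [4.0, 4.2, 4.1]}
exceptional = mine_all(groups)
```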
3.5.3 Rule server
Rule server is a dedicated subsystem that provides rules to the main application upon request.
Keeping in mind the potential number of users, clients and rules (Table 1), it is believed that this
design is more appropriate than keeping the rules on the DME side. Another important task of
the rule server is to rank-order rules based on rule usage data. This can be resource intensive.
Due to this and the aspiration to keep the mining components and the application mutually independent, it was decided to keep this functionality separate from the main application.
Below is a general summary of what the rule server should be able to do.
· Retrieve mined results from the corresponding DME server when prompted by the control module.
· Data received from the DME server consists of new rules and a classification of unit groups. Each rule is associated with exactly one unit group, and each unit group belongs to one class. In this way, users’ feedback can be analyzed separately for each class.
· Receive changes in the lists of watched and trashed rules from the application and update the database accordingly.
· For every user it shall be possible to see which rules are watched and which are trashed. These lists should not become outdated when new rules arrive from the DME server.
· Receive click-through and comments-assignments-sharing statistics from the application and update the database accordingly.
o Comments-assignments-sharing statistics is information about how many times a specific rule has been commented upon, assigned to users, and shared between users.
o This information can be classified according to unit group class. It is also possible to store this information on a per-user basis. This way, we would be able to make more accurate personal preference rankings, but it would also complicate the interestingness model.
· Together, data from Filtering DB (the rule filtering database based on watched and trashed rules) and Usage DB (a database with click-through and comments-assignments-sharing statistics) can be used to determine collective and personal interestingness of rules.
· In Filtering DB, the watched and trashed rules should be stored together with their unit group, structure and the corresponding user id. In this way, it will be possible both to filter rules independently for each user and to deduce collective intelligence values per unit group class.
· Answer rule queries from the main application. A typical query would be “show Inbox for user X, for unit groups {Xi}”.
It is plain to see from the mining algorithm that when data is mined, rules are calculated from scratch, in the sense that no information about existing rules is used. This means that when a new set of rules arrives at the rule server, the new rules bear no direct connection to the existing rules. Still, information about watched and trashed rules has to be preserved and remain useful when new rules arrive. Thus, it is necessary to define a rule filter equivalence criterion.
We suggest that the antecedent and the consequent, together with the sign of the difference from the expected value, should be used for this purpose. Thus, if a rule is put in the trash list, it will be assumed that the user means that any rule with this antecedent and this consequent that lies on the same side of the expected value should be hidden. By looking at the sign of the difference from the expected response value, we separate the positive and negative phenomena. However, this method is crude. Consider the situation where the user trashes a rule with a low difference and the difference then grows remarkably in a subsequent mining run. The rule will still be hidden although it might be of interest. To solve this, the users can be given the opportunity to specify whether they trash a rule permanently or whether the system should still display new rules that differ remarkably in statistical measures from their trashed counterpart.
For users with access to higher unit groups, the question arises whether the unit group should be
taken into account when filtering out trashed rules. We feel it should and choose not to hide
corresponding rules at other nodes. Interestingness estimation mechanisms can further ensure
that rules whose counterparts are trashed at another unit group receive a lower rank.
Trend rules require separate consideration. It is necessary to decide how the rule “unit A, floor B, last 14 days → water pressure+” will compare to an identical one mined three months later, or to a similar rule concerning the last 7 days. One solution is to look at the presence or absence of the trend attribute in the rule rather than its value. This is also in line with the idea of allowing a single trend rule per itemset. Note that in another setting an alternative scheme may need to be chosen.
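The resulting equivalence criterion (antecedent plus consequent plus the sign of diff, with a trend attribute counted only by its presence) can be sketched as a key function. The rule representation below is an assumption made for illustration.

```python
def filter_key(rule):
    # Antecedent items, with a trend attribute reduced to mere presence:
    # "last 14 days" and "last 7 days" map to the same key component.
    antecedent = frozenset(
        "trend" if name == "trend" else (name, value)
        for name, value in rule["antecedent"].items()
    )
    # The sign of diff separates positive from negative phenomena.
    sign = 1 if rule["diff"] > 0 else -1
    return (antecedent, rule["consequent"], sign)

r1 = {"antecedent": {"unit": "A", "floor": "B", "trend": "last 14 days"},
      "consequent": "water pressure", "diff": 0.7}
r2 = {"antecedent": {"unit": "A", "floor": "B", "trend": "last 7 days"},
      "consequent": "water pressure", "diff": 0.4}
r3 = dict(r2, diff=-0.4)
```

Under this key, trashing r1 also hides r2 in later mining runs, while r3 (the same phenomenon on the other side of the expected value) stays visible.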
The above is consistent with the way rules on the watched list have to be treated. Each watched
rule is tied to a specific unit group, and watched trend rules are followed up nearly continuously, so the exact number of days when the most extreme results have been observed becomes
irrelevant as compared to the number of days the rule has been watched.
As such, updated counterparts of the watched and trashed rules are not necessarily present in new rule sets. While this is perfectly acceptable for trashed rules, watched rules need to be kept up to date. The mining algorithm should not prune rule candidates corresponding to the rules on the watched list.
Another consideration regarding replacing the rule set is that this process should be completely
transparent to the users of the data mining module in the host application. All the necessary
updates in the internal databases and memory structures should not interrupt normal operation
of the system as seen from the user’s point of view.
To achieve this, the new structures can be prepared “offline” first, and when they are ready, the old structures of the rule server can be replaced in a single fast step. The update can also be incremental, i.e. the structures can be updated in an order that does not affect the operation of the system. In practice, a combination of the two techniques may be most appropriate.
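The “prepare offline, swap in one fast step” technique can be sketched as follows: queries are always served from an immutable snapshot, and replacing the rule set amounts to a single reference assignment. This is a minimal sketch; a production rule server would swap several databases and indexes together.

```python
import threading

class RuleStore:
    """Serve queries from an immutable snapshot; replace the whole
    snapshot in one step so readers are never interrupted."""

    def __init__(self, rules):
        self._lock = threading.Lock()
        self._snapshot = tuple(rules)

    def query(self):
        # Readers always see a complete, consistent rule set.
        return self._snapshot

    def replace(self, new_rules):
        prepared = tuple(new_rules)   # build the new structures "offline"
        with self._lock:              # then swap in a single fast step
            self._snapshot = prepared

store = RuleStore(["old rule"])
store.replace(["new rule 1", "new rule 2"])
```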
3.5.4 Control module
The task of the control module is to coordinate the mining of rules for different clients. Creating a separate component for handling this task is motivated by the potentially large number of clients. In this setting, it should be possible to specify the frequency of mining for each client and let the system take care of the rest of the scheduling. The control module is supposed to do the following:
· For each customer, hold a schedule of data mining runs.
· For each customer, log statistics about past runs. In particular, execution time should be recorded. This information can be used for smarter scheduling. For example, when two jobs on the same DME server need to be executed in sequence, it might be reasonable to prioritize the job that has previously taken less time.
· Start DME runs according to the schedule. When notified of success by a DME server, notify the corresponding rule server of the availability of fresh data mining results. This opens the possibility of scheduling rule server updates separately.
· Provide a software interface for controlling the schedule and ongoing runs of the DME. This interface can be utilized by an external GUI allowing the user to monitor and influence DME executions. It might also prove helpful to be able to steer the control module from the main application.
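The execution-time log could feed a simple prioritization like the sketch below: among jobs queued for the same DME server, run the historically faster one first (a shortest-job-first heuristic). The history format and the tie-breaking for unknown jobs are assumptions.

```python
def order_jobs(queued, history):
    # Order queued mining jobs for one DME server by their average past
    # execution time; jobs with no history run first so they get measured.
    def avg_time(customer):
        runs = history.get(customer, [])
        return sum(runs) / len(runs) if runs else 0.0
    return sorted(queued, key=avg_time)

# Past execution times in seconds (illustrative figures).
history = {"big-chain": [3600, 4100], "small-shop": [120, 90]}
plan = order_jobs(["big-chain", "small-shop", "new-customer"], history)
```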
3.5.5 Discussion
Here we list some of the design considerations that were not covered explicitly in the previous
sections.
· Transfer of data for mining (between the application and the DME) and runs of the DME are independent, so that scheduling can be optimized according to the use of the application and the use of the DME, respectively. For example, if an application instance is most heavily used during the daytime, data transfer may be scheduled every night. Runs of the DME, on the other hand, can be scheduled according to the number of clients that need to be mined by the current DME instance.
· When devising the architectural design, several ways of distributing responsibility between the main application and the rule server were considered. They are illustrated in Figure 10.
[Figure 10 depicts three alternatives for dividing rule handling between the application server and the rule server: a fully separate rule server hosting the Rule DB and Rule agent, full integration of the Rule DB into the application server, and an intermediate split where a rule service and Rule DB sit alongside the main application.]
Figure 10. Three ways of dividing functionality between the main application and the rule server.
o One option was to move all rule-related functionality to the rule server. However, it could potentially result in too much communication between the application and the rule server, in view of a potential necessity to access lists of watched rules often.
o Another possibility was to integrate all rule server functionality into the application. In this case, there would be no need for external communication, and the whole design would be simpler. However, as it was unclear how resource-intensive rule ranking would be in practice, such a design could potentially result in a system too slow for practical application.
o Finally, different ways of sharing the responsibilities were considered. With the current design, all ranking is done by the rule server and does not burden the main application; yet the most immediate data (in particular, the lists of watched rules) is available on the application side.
· The control-status connection between Controller and Miner macro allows Controller not only to start data mining, but also to ask for status, and to pause, resume and cancel the mining process.
· The control module and all application servers must know which DME server and which rule server they should contact. To avoid duplication of settings, the control module notifies the corresponding application of any changes in the two addresses.
· Security issues associated with the communication of satisfaction data and mined rules between the components of the system are not examined in any further detail, as it is implied that the components reside in an internal network of the application host. The setup of this network should guarantee that the individual components, except for the main application, are not reachable from the outside.
The suggested design is scalable in the sense that the number of DME servers, rule servers and
application servers may vary independently from one and up. However, the control module will
exist in a single instance. In theory, the components can physically reside on the same machine.
It is recommended though to run DME servers on separate physical machines since data mining
can be highly resource-intensive. Also, each DME server should carry out one mining task at a
time. Like an application server, a rule server may be configured to work with several clients.
It is understood that the design outlined in this chapter is relatively complex. Our belief is that
the complexity of this design is justified by the fact that it covers a whole range of practical use
cases. When applied in a specific setting, the design can be adapted and simplified.
4 Interestingness models
This chapter introduces and motivates two models for rule interestingness calculation: impact, which is based on diff and support, and hotness, which combines the statistical significance of a rule with the collective intelligence of all users and the personal preferences of the current user.
As suggested by the variety of existing approaches to rule interestingness estimation (some of
which are described in Section 2.3), it is scarcely realistic to find a single universal interestingness measure that captures all the factors relevant to rule interestingness in a perfectly balanced
way. In this chapter, we will outline the desired characteristics of the sought-for interestingness
model and suggest how these can be fit into two interestingness values.
For each mined rule, the expected satisfaction value calculated in accordance with Bennedich’s
restrictive model (Section 2.5) is known. Also, the rule’s support (the number of responses the
rule is based on) and p-value are available. Let diff denote the difference between the actual
mean value and the expected satisfaction value.
4.1 Impact
Recall that unexpectedness of rules is one of the most popular interestingness factors (Section
2.3.1) and the one explicitly named under General vision. Further, the idea of deriving interestingness by means of a user-defined evaluator (for example, as the difference between the cost
incurred upon applying the association rule and the profit gained when the rule holds) was suggested before (see Section 2.3.2). Rule impact defined below is a measure combining unexpectedness of the rule with its value in the business context.
Diff clearly gives a measure of the level of unexpectedness of a rule, especially considering the
fact that the calculation scheme of the restrictive model by design can understate this value.
Thus, a high absolute value of diff potentially indicates a highly exceptional rule.
However, high diff values are typically achieved in rules with low support, i.e. in rules concerning few respondents. Moreover, on a business scale, it is more interesting to study rules that are
representative for the whole population, i.e. rules that are based on many responses.
To account for the above, the impact of a rule can be defined as:

impact = |diff| × support

In order to separate the cases of positive and negative rules, it is recommended to use a simpler definition instead:

impact = diff × support
As simple as it is, this calculation is rather powerful. It captures a business concept and gives a
concrete value that is clear and simple to understand and interpret. It is easy to compare rules by
impact and choose which to act on. Moreover, the concept of impact makes it easier to come up
with an estimation of how much effort the business is ready to put into dealing with the phenomenon the rule describes. Depending on the absolute value of impact, a target budget can be
estimated. To account for this, another calculation step can be added on top of the impact:
cost = f (impact)
Furthermore, business units can be compared by the total sum of the costs of their rules, alternatively by the total impact of their corresponding rules.
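As a minimal sketch, the impact and cost calculations can be expressed as follows. The linear cost function and its budget factor are illustrative assumptions, since the text only specifies cost = f(impact):

```python
def impact(diff, support):
    # Signed difference between observed and expected mean satisfaction,
    # weighted by the number of responses the rule is based on.
    return diff * support

def cost(impact_value, budget_factor=100.0):
    # Hypothetical cost function f(impact): a budget proportional to the
    # absolute impact. The linear form and the factor are assumptions
    # made for illustration only.
    return abs(impact_value) * budget_factor

# A moderate problem reported by many guests outweighs a large
# deviation reported by only two:
print(impact(-0.8, 50))  # -40.0
print(impact(-1.5, 2))   # -3.0
```

Business units could then be compared by summing the cost (or the impact) over their respective rules.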
CHAPTER 4. INTERESTINGNESS MODELS
However, it is recognized that impact is a crude measure. It only gives a general picture and
clearly takes away the focus from local rules, i.e. rules concerning few respondents. On a single
business unit level, it can be desirable to notice and react to local changes in satisfaction. For
example, if there is a problem at a certain room at a certain hotel, it is preferable to find out
about it after it has been reported by two guests rather than two hundred. In addition, it is not
apparent that the interaction between diff and support used in the impact formula is adequate.
Also, impact is a static measure. It does not change with users’ preferences and does not adapt to users’ behaviour. On the one hand, this stability can be considered positive. In this sense, impact is a reliable measure. On the other hand, if a rule with a high impact represents a well-known fact and is not perceived to be unexpected, it will still be put high in the ordered rule list.
4.2 Pruning trend rules
The concept of impact is simple and effective. It also proved to be useful for the problem of
selecting a single rule to report from a set of trend rule candidates. In Section 3.2, we mentioned
that even when fixed values for minimum support, minimum diff and maximum p-value are
specified, several trend rule candidates can be found for the same antecedent. We chose to report only one of these to the user in order to avoid the confusion anticipated if all trend rules
satisfying the threshold criteria were shown.
We rely on the intuition that in a series of rules with the same left-hand side and the same quantitative attribute that date back to different points in time, the most relevant one will have a sufficiently large impact and a sufficiently small p-value. For a trend rule, let its effect denote the
following:
effect = (diff × support) / p
In our experiments with various selection heuristics (rules were randomly chosen among the
trend rules found by mining our test database; for each of the test instances, all found candidates
were presented to the user), we saw that the trend rule candidate with the highest effect almost
always coincided with the rule that would be chosen by a human expert.
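The selection heuristic can be sketched as follows, assuming each candidate is represented as a (diff, support, p_value) tuple with positive diff (a hypothetical representation chosen for this sketch):

```python
def effect(diff, support, p_value):
    # effect = diff × support / p: combines impact with statistical strength.
    return diff * support / p_value

def select_trend_rule(candidates):
    # Report the single candidate with the highest effect among trend rule
    # candidates sharing the same antecedent and quantitative attribute.
    return max(candidates, key=lambda c: effect(*c))

candidates = [
    (0.4, 120, 0.01),    # effect = 4800
    (0.9, 30, 0.04),     # effect = 675
    (0.3, 200, 0.005),   # effect = 12000
]
print(select_trend_rule(candidates))  # (0.3, 200, 0.005)
```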
Notice that using effect as an explicit interestingness measure available in a GUI is a worse
option than using impact. This is due to the fact that impact is easy to understand for an average
user, while effect includes dividing by the p-value which makes the result difficult to interpret.
4.3 Diff vs. p-value
Let us consider an alternative way of assessing rule relevance through objective values. This
time we will concentrate on the diff and the p-value.
Recall that the p-value as reported by the restrictive model is the maximum of the p-values calculated with the null hypothesis that the (possibly shifted) mean satisfaction value of the itemset
in the rule is equal to that of its father in the itemset lattice built according to the mining scheme
by Aumann and Lindell (consult Sections 2.2.3 and 2.5). Somewhat simplified, the p-value is the probability of obtaining a subgroup from the parent population as extreme as in the observed rule, given that the null hypothesis holds, i.e. that the rule does not reflect a real difference.
The following can be said about the p-value and the diff:
·  Rules with a significant diff should be considered relevant, as well as rules with a low p-value.
·  The two factors should be able to compete: a large diff should be able to overshadow a higher p-value, and a low p-value should outdo a small diff.
·  The two factors must be balanced properly, as the same absolute difference in one of them does not directly correspond to a difference in the second one, and the scales are different (the p-value is a probability and varies between 0 and 1, while |diff| can vary between 0 and the maximal allowed satisfaction value).
The above suggests that the resulting significance value could be calculated by:

relevance = f(diff) / g(p)
To elaborate the model further, assume that we could map diff and p separately to two relevance values between 0 and 1, where 0 means “totally irrelevant”, 0.5 means “indefinite”, and 1 means “most relevant”. The product of the two mappings would then again be a value between 0 and 1 with the same interpretation.
When comparing rules with different p-values, the difference in the orders of magnitude is typically interesting [6], while diff compares well on a linear scale. Moreover, a cut-off point after
which all smaller p-values are considered equally good can be used for simplification.
In the case of p-values, the desired mapping could look like [1 ... 10^-3 ... 10^-6] → [0 ... 0.5 ... 1], or in the generalized form: [1 ... mid_p ... min_p] → [0 ... mid ... 1]. Thus, in order to map the p-value to relevance, the log function can be used first, and then the power function can be used to fit the target mapping into the control points (1, 0), (mid_p, mid) and (min_p, 1).
We arrive at the following mapping:

rel_p(x) = 1                   , x < min_p
rel_p(x) = (log_min_p(x))^a    , x ≥ min_p

where a is chosen so that the mapping passes through (mid_p, mid), i.e. a = log(mid) / log(log_min_p(mid_p)).
This mapping is illustrated in Figure 11a. The figure corresponds to the parameter values min_p = 10^-6, mid_p = 10^-3 and mid = 0.5.
[Two plots: (a) relevance as a function of the p-value on a logarithmic scale from 10^-6 to 10^0; (b) relevance as a function of diff on a linear scale from 0 to 2.]

Figure 11. Converting p-value and diff to separate relevance values.
For diff, the situation is different. In the context of the target application, a mapping of the form [0 ... x0 ... x1 ... xM] → [0 ... y0 ... y1 ... 1] was found reasonable. After a certain point (x1, y1) the absolute change in difference has a relatively small impact on the relevance value, as the diff has already become large enough. On the interval [0 ... x0 ... x1] → [0 ... y0 ... y1], the power function is used to fit the curve to the control point (x0, y0), analogously to the p-value mapping described above.
The resulting formula is:

rel_d(x) = y1 × (x / x1)^b                               , x < x1
rel_d(x) = y1 + (1 - y1) × lg(1 + 9(x - x1) / (xM - x1)) , x ≥ x1

where b = log(y0 / y1) / log(x0 / x1), so that the first branch passes through the control point (x0, y0).
Figure 11b shows the mapping of diff to relevance corresponding to the control parameters (x0, y0) = (0.1, 0.5), (x1, y1) = (1.0, 0.9) and xM = 10.0.
Finally, the total relevance value can be calculated as:

relevance = rel_d(|diff|) × rel_p(p)
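The two mappings and their product can be sketched directly from the formulas above, using the control points quoted for Figure 11:

```python
import math

def rel_p(p, min_p=1e-6, mid_p=1e-3, mid=0.5):
    # Relevance of a p-value: fitted through (1, 0), (mid_p, mid) and
    # (min_p, 1) on a logarithmic scale; p-values below min_p are
    # considered equally good.
    if p < min_p:
        return 1.0
    a = math.log(mid) / math.log(math.log(mid_p, min_p))
    return math.log(p, min_p) ** a

def rel_d(d, x0=0.1, y0=0.5, x1=1.0, y1=0.9, xM=10.0):
    # Relevance of |diff|: a power curve through (x0, y0) and (x1, y1),
    # then a slowly saturating log curve reaching 1 at xM.
    if d < x1:
        b = math.log(y0 / y1) / math.log(x0 / x1)
        return y1 * (d / x1) ** b
    return y1 + (1 - y1) * math.log10(1 + 9 * (d - x1) / (xM - x1))

def relevance(diff, p):
    return rel_d(abs(diff)) * rel_p(p)

# The control points are reproduced exactly:
print(round(rel_p(1e-3), 3), round(rel_d(0.1), 3), round(rel_d(10.0), 3))
```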
An obvious disadvantage of the suggested formulae is that it is necessary to explicitly choose
the control points. However, once these points have been chosen, the resulting relevance value
reacts in an intuitive way, boosting interestingness of rules with a low p-value and a high diff
independently, and vice-versa.
Again, this scheme is not universal either, but it serves as a unified statistical significance value
that is based on two important statistical parameters and helps to simplify the overview analysis
of the mined rules carried out by the user.
4.4 Collective interestingness
As it was emphasized in Section 2.1.7, collective intelligence is an increasingly popular concept
in application construction. Applications can be adjusted and personalized by learning from
their interactions with users.
Information about subjective interestingness of rules can be extracted in various ways from
interactions between the users and the system. However, explicitly asking the users to rank-order sample lists of rules, as was suggested in [37] (see Section 2.3.3 for a summary), is undesirable in our case. On the one hand, the aspiration is to make the ranking scheme transparent
for the user. On the other hand, asking users to specify an absolute ordering can be unreliable
[14]. Instead, there are other sources of preference information that are less explicit. For example, click-through is clearly a source of implicit preference data (see Section 2.3.5).
In this section, we will concentrate on how interestingness can be deduced from users’ moving
rules to the watched and trash lists. The two actions can be seen as classifying rules as interesting and irrelevant, respectively. Thus, each such event assigns a rule into one of the two categories. If a distance between two rules is known, a nearest-neighbour approach can be used to
deduce interestingness of all rules.
Two difficulties arise, however. First, the number of rules can be large (Table 1), and nearest
neighbour techniques are as such slow when the number of nodes grows (Section 2.1.5). Secondly, and most importantly, when data mining is re-run, the old rules are replaced with new
ones, making the old rating information useless.
To deal with the first difficulty, rules could be clustered in a way that makes it possible to quickly identify clusters of rules that are too far away from the target rule and thus need not be processed. The second problem, however, suggests that rules should not be involved in the analysis directly. Instead, a more generic structure that persists between mining runs should be utilized.
The antecedent of the rule together with the quantitative attribute can be such a structure. However, rules with positive and negative diff are likely to be perceived as principally different.
Compare the cases when a satisfaction value is found to be too high to the case when it is lower
than expected. The second case will probably be seen as more urgent, as the clients may be lost.
The first situation, on the other hand, is in the worst case a sign of an unnecessary cost, which is
potentially not as harmful. Thus, we may conclude that it is reasonable to include the sign of
diff into the rule structure.
The next question is whether to include the values of the categorical attributes constituting the
left-hand side of the rule into the rule structure. Including categorical values enables fine-grain
interestingness judgements. At the same time, it often adds unnecessary complexity into relevance analysis. Should similar rules considering different rooms be treated differently? Should
rules concerning men and women be seen as principally different? Also, such separation calls
for a way of quantifying the difference between different categorical values. How different
should people with an annual income of 0 to X, X to Y and Y to Z be considered?
One way of dealing with this additional complexity would be to add a fixed penalty for a difference in categorical values. Instead, we choose to define the structure of rules as the categorical
attributes constituting the left-hand side of the rule, the quantitative attribute on the right-hand
side, and the sign of diff. This definition of rule structure is used further in the current paper and
is called the rule type.
The conception of rule types solves the two difficulties mentioned above. The number of rule
types is in practice significantly lower than the number of rules. In particular, note that if the
number of categorical and quantitative attributes is nc and nq respectively, and ci is the number
of possible values of the categorical attribute i, then the number of rule structures with categorical attribute-value pairs, quantitative attribute and the sign of diff NRS and the number of rule
types without categorical values NRT are:
N_RS = 2 × n_q × ∏_{i=1..n_c} (c_i + 1)

N_RT = n_q × 2^(n_c + 1)
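These counts can be verified with a few lines of code; the attribute numbers in the example are made up:

```python
from math import prod

def n_rule_structures(value_counts, n_q):
    # N_RS: each categorical attribute is either absent or takes one of
    # its c_i values, times n_q quantitative attributes, times 2 signs of diff.
    return 2 * n_q * prod(c_i + 1 for c_i in value_counts)

def n_rule_types(n_c, n_q):
    # N_RT: any subset of the n_c categorical attributes, times n_q
    # quantitative attributes, times 2 signs of diff.
    return n_q * 2 ** (n_c + 1)

# Three categorical attributes with 4, 2 and 6 possible values,
# five quantitative attributes:
print(n_rule_structures([4, 2, 6], 5))  # 1050
print(n_rule_types(3, 5))               # 80
```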
Also, old rule types remain unchanged when new rules are mined, so the collected feedback is
still useful with new sets of rules. The solution presented in this chapter can be adapted to the
case when the rule type includes values of categorical attributes if necessary.
In [6], Bennedich studies an identical setting and advocates that memory-based reasoning is a good choice for subjective interestingness estimation, outperforming a naive Bayes classifier and
an artificial neural network. He includes categorical and quantitative rule attributes into the rule
type and elaborates a function that returns similarity between a rated rule and an arbitrary rule
based on the corresponding rule types. This definition is easy to extend to the case of signed
rule types by seeing rule types with different signs as completely dissimilar and leaving the calculation unchanged for rule types having the same sign of diff.
In his experiments, however, Bennedich assumes that absolute ratings for specific rules are
available. In our setting different users can implicitly leave ratings of either 0 (irrelevant) or 1
(relevant) to potentially the same rules. In what follows, we will try to adapt our setting so that
the k-nearest neighbours algorithm can be used for collective interestingness estimation.
Here are the core properties desirable for the future calculation scheme:
·  Raw rating of a rule type (rating without impact from neighbours) should be based on the number of times the rule type appears on the watched and trash lists. In principle, other factors could be included here as well. We choose to leave such extensions for later.
·  The more feedback there is for a rule type, the more it should affect its neighbours. When a rule type has two neighbours, one of which was rated by ten users and the other one by two, the first score is probably more reliable and should have a larger effect than the second rule type.
·  The weighting should be optimistic in the sense that moving to the watched list should weigh more than moving to trash. The basic motivation for this idea is that if somebody thinks that a rule is highly relevant and a few more users think it is useless, chances are it in fact is interesting. However, if users consistently trash a particular rule type, it really is likely to be useless. On the whole, trashing an interesting rule is worse than recommending an uninteresting one.
Recall the formula for the weighted k-NN from Section 2.1.5:
( Σ_i x_i × w(d_i) ) / ( Σ_i w(d_i) )
Here, the weight function w(d) decreases with distance.
In our setting, the algorithm should average ratings and weight in the number of votes. Let wn(n)
be the weight function of the number of responses, wd(d) be the weight function of distance, and
n(t) be the number of responses corresponding to the rule type t. Further, let d(t′, t) be the distance between the rated rule type t′ and the target rule type t, let rating_raw(t) be the raw rating of
the rule type t, and knn(t) be the k nearest rated neighbour rule types of t. Finally, let rating(t) be
the total rating of t that includes contributions from neighbour rule types. The following formula
is obtained:
rating(t) = ( Σ_{t′ ∈ knn(t)} rating_raw(t′) × weight(n(t′), d(t′, t)) ) / ( Σ_{t′ ∈ knn(t)} weight(n(t′), d(t′, t)) )

where weight(n, d) = w_n(n) × w_d(d)
The weight function of distance wd(d) should assign significantly higher weights to nearby rule
types. The current definition is taken from [6]. Here, ε avoids division by zero and determines
the weight obtained at zero distance.
w_d(x) = ln( 1 / (x(1 - ε) + ε) )
To factor in the number of responses, the following function can be used:

w_n(n) = c1 × w_n(1)       , n = 0
w_n(n) = 0.1 + lg(c2 × n)  , n > 0
The weight grows with the number of responses, but the rate of growth is logarithmic. The parameter 0 < c1 < 1 determines how the weight of a rule type with no corresponding responses compares to the weight of a rule type with a single response. The curve can be compressed or expanded along the y-axis by adjusting the value of c2. Figure 12 shows w_n(n) for c1 = 0.5, c2 = 1 and 0 ≤ n ≤ 15.
[Plot of w_n(n) for 0 ≤ n ≤ 15: the weight rises steeply for small n and then flattens out logarithmically.]

Figure 12. Weight as a function of the number of responses.
When determining the raw rating of a given rule type, as it was mentioned before, we would
like to boost the impact of positive feedbacks. Introduce a parameter c that determines how much more a move to the watched list is worth than a move to trash. The following function can be used:
rating_raw(t) = 0.5          , n = 0
rating_raw(t) = c × n_W / n  , n > 0

where n = c × n_W + n_T.
Here, n_W and n_T are the numbers of times the given rule type appears on the watched and trash lists, respectively. Note that rating_raw(t) is a value between 0 (uninteresting) and 1 (interesting). Thus,
rating(t) is also a value in this interval.
Further, to determine the distance between two rule types, an approach similar to that described
in [6] can be used. If sc is the similarity between the sets of categorical attributes of two rule
types, sq is the similarity between the corresponding quantitative attributes, and sgn(t) is set to 1
for rule types with positive diff and -1 for rule types with negative diff, then the distance between two signed rule types can be defined by:
d(t1, t2) = 1                              , sgn(t1) ≠ sgn(t2)
d(t1, t2) = 1 - s_c(t1, t2) × s_q(t1, t2)  , sgn(t1) = sgn(t2)
If the similarity between categorical attributes and quantitative attributes is known, the question
remains how to calculate the similarity between two sets of categorical attributes. Recall the
calculation scheme outlined in Section 2.5. The distance there depends on the rating of the
neighbour rule type. To overcome this instability, it is suggested to use an in-between variant of
this scheme, as if the rating were consistently neutral, i.e. the following formula is used:
s_c(t1, t2) = 0.5 × ( ∏_{q ∈ C(t1)} max{ s(q, r) : r ∈ C(t2) } + ∏_{r ∈ C(t2)} max{ s(q, r) : q ∈ C(t1) } )
Please note that d(t1, t2) as defined above is not necessarily a distance function in the strict
mathematical sense. It is non-negative and symmetric (with the updated formula for similarity
of sets of categorical attributes), but the triangle inequality is not necessarily satisfied (depending on how similarity between single attributes is defined). This, however, does not make it less
useful in a nearest-neighbour setting.
Finally, to reduce the complexity of the model, the number of neighbours k used in the calculations can be fixed.
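Putting the pieces together, the rating calculation can be sketched as follows. The tuple representation of rated neighbours, the parameter values (ε, c1, c2 and c) and the example distances are assumptions of this sketch:

```python
import math

def w_d(d, eps=0.01):
    # Distance weight: largest at d = 0 (weight ln(1/eps)), zero at d = 1.
    return math.log(1.0 / (d * (1.0 - eps) + eps))

def w_n(n, c1=0.5, c2=1.0):
    # Response-count weight: logarithmic growth; an unrated rule type
    # gets a fraction c1 of the single-response weight.
    return c1 * w_n(1, c1, c2) if n == 0 else 0.1 + math.log10(c2 * n)

def rating_raw(n_watched, n_trashed, c=2.0):
    # Optimistic raw rating: one 'watched' vote is worth c 'trash' votes.
    n = c * n_watched + n_trashed
    return 0.5 if n == 0 else c * n_watched / n

def rating(target, rated_neighbours, distance):
    # Weighted k-NN over the k nearest rated rule types; each neighbour
    # is represented as a (rule_type, n_watched, n_trashed) tuple.
    num = den = 0.0
    for t, nw, nt in rated_neighbours:
        w = w_n(nw + nt) * w_d(distance(t, target))
        num += rating_raw(nw, nt) * w
        den += w
    return num / den if den > 0 else 0.5

# A nearby, consistently watched rule type pulls the rating above neutral:
neighbours = [("near", 3, 0), ("far", 0, 4)]
dist = lambda t, target: 0.2 if t == "near" else 0.6
print(rating("t", neighbours, dist))
```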
Given the definitions above, a nearest-neighbour solution that rates rule types based on user feedback can be implemented. As such, k-nearest neighbours is a convenient technique to use. The underlying idea is simple and easy to understand, which is especially important in a setting with inquisitive users. A result obtained with a nearest-neighbour algorithm is straightforward to motivate, as it is just a weighted mean of similar neighbours. Moreover, it is an online technique in the sense that
new data can be incorporated into the model at once (with a relatively straightforward implementation, the complexity of adding a feedback to the system is O(knRT), where k is the maximal
number of neighbours and nRT is the number of rule types), and the system does not need to be
retrained anew, as opposed for example to support vector machines.
On the other hand, as the discussion above suggests, in the given setting the overall level of
complexity of the model is relatively high. It is necessary to choose a function for calculating
raw rating of a rule type, to define a distance, and to build a weight value from distance and the
number of responses. When all of this adds up, the choice of the nearest-neighbour approach
becomes less obvious. Still, as we will see in the following section, once it is set up, well-tuned
and working, it can be effectively used for personal interestingness estimation as well.
4.5 Personal interestingness
Apart from collective users’ preferences, as it was mentioned in Section 2.1.7, individual interactions and contributions of a single user are important when a collective intelligence approach
is applied in a product. Personalization is increasingly gaining importance in data mining applications [26].
In the previous section, it was shown how feedback from many users can be used to predict
interestingness of rule types. Here, we will focus on predicting interest of a single user.
The nearest-neighbour model described above receives feedbacks and produces interestingness
predictions. In a completely analogous way, this model can predict interest of an individual user
if the only incoming feedback is the feedback from this user. In practice, however, there is a
hitch. To predict interestingness by using feedback from all users, only one such structure is
necessary. To personalize the results, a separate model for each individual user is necessary.
In the case we are studying, each feedback concerns a rule being classified as either watched or
trashed. For a single user, the number of feedbacks will be relatively low. If the number of
feedbacks available for the current user is nF , the number of known rule types is nRT , and the
maximal number of neighbours is k , the cost of processing all user’s feedbacks is O (knRT nF ) .
Furthermore, if the set of rule types that need to be rated can be narrowed down to a small number, the procedure becomes fast, as nRT is replaced by a smaller value (in the best case, the
training procedure becomes linear in the number of feedbacks).
However, before applying individual feedbacks, the initial neighbour structure needs to be set
up. Especially in the case when few feedbacks are available, it is important to take unrated rule
types into account. For example, if a rule type has nine close neighbours that do not have associated feedbacks, and one far-away positive feedback, the result should probably be closer to
neutral. To account for this, the neighbour structure should be built up with neutral ratings (i.e.
0.5) for each rule type. Then, when new feedbacks arrive, they complement the model and replace the corresponding neutral nodes. In this way, the predictions made by this model will take
neutral nodes into account.
Training a neutral model requires O(k·n_RT²) operations. Thus, the total training time increases to O(k·n_RT·(n_RT + n_F)), or to O(k·n*_RT·(n_RT + n_F)) if the number of rule types that need to be rated is different from the total number of rule types. This can easily be avoided if the neutral model is cached. As the set of rule types is constant between data mining runs, a neutral neighbour structure can be calculated once and saved. Then, a personal model for a single user can be built on the fly, as the number of personal feedbacks is low.
4.6 Combined relevance
None of the three relevance values discussed in the previous sections (statistical significance,
collective interestingness and personal interestingness) is alone sufficient to describe interestingness of rules. Statistical significance is based on diff and the p-value and does not take into
account users’ preferences. Collective and personal ratings, on the other hand, do not rely on
objective values at all. Collective interestingness does not directly convey the personal preferences of the current user, while personal interestingness only relies on the user’s personal feedback. Moreover, even when collective and personal interestingness models work exceptionally
well, objective measures have to be taken into account anyway, as rules with the same rule type
will have different severity level indicated by diff and the p-value.
A combination of the three values would cover all of these aspects. As such, the problem of
combining objective and subjective rule interestingness measures has not been studied enough [26],
and we will try to come up with a reasonable combination suitable for our specific setting.
Statistical significance is a relatively objective value. Users’ preferences should be able to compete with statistical significance, but should not overshadow it: statistically strong rules ought to
come through in any case. Further, personal preferences should make an impact, but a strong
collective opinion should not be overlooked.
There are many ways to combine these three values. The approach taken here is illustrated in
Figure 13.
[Diagram: diff and p are first combined (×) into a statistical significance value; this value is combined with the collective rating into a collective score, which in turn is combined with the personal rating to produce the hotness.]

Figure 13. Principal scheme for calculating hotness.
Statistical significance is first combined with the collective rating, and the result is further
combined with the personal rating in a similar manner. Note that all three values used here are
in the range between 0 and 1, so the result is also a value in this range. The values c1 and c2 are
combination parameters chosen depending on how the three relevance values are intended to be
balanced together.
collective score = (c1 × (statistical significance) + collective rating) / (c1 + 1)

hotness = (c2 × (collective score) + personal rating) / (c2 + 1)
It is easy to show that this is equivalent to defining the result as a weighted linear combination
of statistical significance, collective and personal rating. However, we prefer the notation above.
The advantage of this scheme is that it is relatively easy to reason about. The dependency between the input variables and the output value is simple, and manual parameter tuning in accordance with intuitive expectations is possible.
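With all three inputs in [0, 1], the combination can be sketched as follows; the parameter values c1 = 2 and c2 = 3 are arbitrary examples, not values prescribed by this work:

```python
def collective_score(stat_significance, collective_rating, c1=2.0):
    return (c1 * stat_significance + collective_rating) / (c1 + 1.0)

def hotness(stat_significance, collective_rating, personal_rating,
            c1=2.0, c2=3.0):
    # Two-stage weighted average; equivalent to a linear combination of
    # the three inputs, and the result stays in [0, 1].
    cs = collective_score(stat_significance, collective_rating, c1)
    return (c2 * cs + personal_rating) / (c2 + 1.0)

# A statistically strong rule keeps a high hotness even with neutral
# collective and personal ratings:
print(hotness(0.9, 0.5, 0.5))
```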
Furthermore, the notation used in the above formula suggests a natural practical scheme for
incorporating personal rating into the calculation. The number of users of the system can be
quite large (see Table 1), and keeping an updated personal interestingness model for each user
in memory is hardly feasible. If, on the other hand, the model is rebuilt during a rule fetching
query, each rule needs to be associated with the corresponding personal rating, the resulting
relevance score computed, and the rule list sorted by relevance.
Instead, a combination of statistical significance and collective rating can be used to fetch the
candidate rules, and personal score used to rate the resulting set in the last stage, similarly to
what Figure 13 shows. Of course, this produces less exact results than applying personal rating
before fetching the rules. However, in practice the relative weight of personal rating will be
small compared to statistical significance and collective rating. Furthermore, more rules than
necessary can be fetched in the first stage, the total score computed over these and the necessary
number of top relevant rules of the resulting list returned.
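The two-stage scheme can be sketched as follows, with rules represented as hypothetical (rule, collective_score, personal_rating) tuples:

```python
def top_rules(rules, k, overfetch=3, c2=3.0):
    # Stage 1: pre-select k × overfetch candidates by the user-independent
    # collective score (this part could be done in the fetching query).
    candidates = sorted(rules, key=lambda r: r[1], reverse=True)[:k * overfetch]
    # Stage 2: compute the full hotness, including the personal rating,
    # only for the candidates, and return the k hottest rules.
    hot = lambda r: (c2 * r[1] + r[2]) / (c2 + 1.0)
    return sorted(candidates, key=hot, reverse=True)[:k]

rules = [("a", 0.9, 0.1), ("b", 0.8, 0.9), ("c", 0.2, 1.0), ("d", 0.5, 0.5)]
print([r[0] for r in top_rules(rules, 2)])  # ['b', 'a']
```

The overfetch factor trades exactness for speed: the larger it is, the less likely a rule boosted mainly by the personal rating is missed in the first stage.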
For the case of statistical significance defined as in Section 4.3, its value for each rule is fixed and needs to be computed only once after rules are mined. Collective interestingness of rule types
will be updated each time a new feedback arrives. Personal rating for a single user, as described
above, will be calculated on the fly and only for a limited number of rule types.
It should be noted that other schemes for calculating total interestingness may work as well.
More reasoning and experiments are necessary to study this. In our case the results given by the
combination formula discussed in this section were found satisfactory.
4.7 Hotness
Although it is fairly simple to explain the main idea behind the combination formula from Section 4.6, the resulting interestingness value is a combination of several relevance measures that
is not self-explanatory, especially considering all the complexity of the inputs.
The obtained interestingness is a value between 0 and 1. Showing it to users that are not familiar
with the details of the interestingness calculation mechanism is not only of little use; it may
even be counter-productive, as it would undoubtedly increase the perceived complexity of the
main system and add to the overall confusion caused by the abundance of various data mining
terminology that users need to deal with anyway.
Choosing a display name for the interestingness measure is important. Labelling it “Rank”,
“Rating”, “Interestingness” or “Significance” leaves room for misinterpretation and gives no
hint of what the value is based on. “Collective rank” or “Statistical significance” is misleading,
as the value is influenced by other factors as well, and separating the value into components
defeats the main purpose of this combined interestingness measure in the first place. “Relevance” is acceptable. It can be a combination of several factors and describes the purpose of the
value well. However, it does sound technical and somewhat unexciting.
Furthermore, displaying exact values calls for a way of interpreting the values in their own right
as well as differences between them. If the user sees two rules marked with 0.657836 and
0.642398, a natural question arises of how significant the difference between the two values is.
That is why it was found important to come up with an alternative, synonymous way to convey
interestingness values to the users of the system.
The sought-for metaphor was found in the name “Hotness” and a graphical representation in the
form of coloured versus black-and-white chilli peppers. The logo was filled with colour to the
extent corresponding to the total interestingness value, and the rest was left uncoloured. Hotness
is a fun name, it catches the user’s attention, associates well with the peppers, and suggests a
plain interpretation: the hotter the rule the better. It also makes the rule exploration process
sound more like a game and does not call for much more explanation.
Practice shows that choosing a suitable name and form of presentation plays an important role
for acceptance of a concept, be it a function, a value, a technique or a system; it may be as important as coming up with the technical solution. In the context of the major user acceptance
models, the two factors fit into Attitude toward behaviour in the Theory of reasoned action and
the Theory of planned behaviour, Perceived ease of use in the Technology acceptance model,
Affect towards use in the Model of PC utilization, Ease of use in the Innovation diffusion theory, as well as Affect in the Social cognitive theory (see [34] for further details). During an internal evaluation, hotness of rules on the chilli scale was found much more appealing, exciting
and easier to accept than the other alternatives.
5 Rule visualization
Section 2.4.4 presents two views of the role of data visualization tools. In our application we
decided to use visualization to support rule exploration and analysis, and not as part of the rule
discovery process.
In this chapter, we study two problem areas. The first one is visualizing a set of rules. Here, we
look for a way to present relevant rules in a compact and yet informative way that can help the
user to get a general picture of the most important issues found during data mining. The ambition is to come up with a visualization scheme that helps to build such a picture faster and more easily than by going through an ordered list of rules in text format.
Secondly, we look at how standard data visualization techniques can be used to aid users in
analyzing individual rules.
5.1 Displaying multiple rules
In the case of quantitative association rules that we focus on in the present work, the rule antecedent consists of categorical attributes and their values, and the consequent is a quantitative
attribute with an associated mean value. The two standard approaches (two-dimensional association matrix and directed graph) to visualizing sets of association rules are not directly applicable in this case.
In a two-dimensional matrix (Section 2.4.1), as the antecedents consist of categorical attributes
and values, it would be necessary to use these attribute-value pairs instead of antecedent items.
Furthermore, as combinations of these are allowed, an extension for several items has to be
used. This makes an individual antecedent complex, leading to a complicated matrix even when few rules are included. The second axis, unlike in the standard case, would contain
quantitative attributes.
For a directed graph (Section 2.4.2), again, categorical attributes with values would either result
in a lot of nodes (a node per attribute-value pair) or, if the categorical values were encoded in
arcs, make the arcs more difficult to interpret. End-nodes representing the quantitative attributes
would have to represent their values as well. Taken together, these complexity factors can make the graph pointless, as the whole idea of having a graph in the first place is to make the rules easier to understand, not harder.
The idea about iterative focusing by Blanchard et al. summarized in Section 2.4.3, although
interesting, is not suitable in this context either. We seek a more direct way of displaying a
number of rules to the user; the user should be able to get a general picture without (or before)
doing further exploration. A form of iterative focusing will, however, be used in another context,
namely when rules related to the one the user chose to focus on are presented in the GUI.
The idea by Wong et al. about illustrating the rule-to-item relationship [36] may on the other
hand be more appropriate. In their association matrix, the rows represent items and the columns
correspond to item associations, i.e. rules. For each rule, antecedent items and the consequent
are represented by equally high blocks of two fixed colours. Rule statistics (for example, confidence and support) are shown as bars for each column at the top of the matrix.
We take this approach and the extended two-dimensional matrix (where combinations of items
are allowed) as the starting points, combine and adjust them further to work in our setting.
The most immediate adjustment would be to stick to the rule-to-item idea, but to separate the
categorical and the quantitative attributes. The consequents occupy the top rows of the matrix,
and the antecedents are listed below, each being a categorical attribute-value pair. Each column
is a rule, the corresponding categorical pairs are marked, and the corresponding quantitative
attribute is marked with an icon that represents relevant rule statistics. In Figure 14, categorical
pairs are marked with crosses and quantitative attributes are marked with circles. The size of the
circles is proportional to the rule’s support and the colour indicates the p-value (transitions from
yellow to red correspond to differences between higher and lower p-values). Diff is shown as
text under each circle.
[Figure: a matrix with rules (Rule 1–3) as columns; quantitative attributes (Question A–C) as the top rows, marked with circles carrying diff values (−1.8, −4.4, −3.4); and categorical pairs (Attribute A = v1, Attribute B = v2, Attribute B = v3, Attribute C = v4) as the bottom rows, marked with crosses.]
Figure 14. Rule-to-item visualization of impact rules.
In a practical implementation, the graph can be interactive to facilitate further analysis. For example, when the user points at a column, the corresponding categories can be highlighted.
While this representation gives an overview of a set of rules, it only allows a single rule per
column and, more importantly, the number of categorical attribute-value pairs is likely to grow
with the number of rules and thus complicate the graph.
An alternative approach is to take a segment-to-item perspective. Instead of looking at rules
versus items, we can look at antecedents (in our setting, an antecedent represents a customer
segment) versus consequents, like in the classical two-dimensional matrix case. This at once
allows displaying more rules in the same graph (when there are several rules with the same antecedent). Unlike the popular visualization scheme that uses three-dimensional bars (see Figure
6), this graph is truly two-dimensional, which means that rules do not mask each other.
Further, the crosses in Figure 14 convey no information other than the incidence of attribute-value pairs in specific rules. Instead, if a row corresponds to a categorical attribute, the crosses can be replaced by values, which will indicate incidence automatically as well as solve the problem of the exploding number of pairs.
The suggested changes are illustrated in Figure 15.
[Figure: a matrix with segments (Segment 1–3) as columns; quantitative attributes (Question A–C) as the top rows, marked with circles carrying diff values (−1.8, −2.2, −2.7, −4.4, −3.4); and a single row per categorical attribute (Attribute A–C), where cells contain the attribute values (v1–v4).]
Figure 15. Segment-to-item visualization of impact rules.
These changes make the graph more compact, but also help uncover structural information in
the rule set that is displayed. By changing column correspondence from rules to segments we
make it obvious which rules relate to the same group of responses. By keeping a single row for
each categorical attribute we make it plain to see how rules relate to attributes (instead of showing how rules relate to attribute-value pairs). This can be facilitated by means of another interactive feature: when the user points at an attribute the corresponding rules are highlighted. Also,
column labels can be removed as they do not convey any relevant information.
Note that although rules within a column correspond to the same segment of responses, this does not necessarily mean that the support of these rules will be the same, as it is not guaranteed that every mined data record has all the quantitative attributes filled with non-null values.
In theory, this graph can show a lot of information in a compact format and make it easy for the users to get an overview of the existing rules. In practice, the number of columns is limited. This means that the number of rules that can be shown is limited to s·q, where s is the number of segments that can be fitted into the graph and q is the corresponding number of quantitative attributes. Furthermore, in the worst case, if each rule adds a previously unused categorical attribute to the graph (for example, when the left-hand side of each rule consists of a single unique categorical attribute) and concerns a previously unused quantitative attribute, the number of displayed rules will be only min(s, q).
Notice, though, that this is not necessarily a bad result. The two-dimensional matrix and the rule-to-item approach by Wong et al. suffer from a similar problem. The segment-to-item approach is in any case not worse than rule-to-item. It is guaranteed to be better (i.e. able to show more rules) when most categorical attributes belong to several rules, which is the typical situation in practice.
Further, to make the most use of the graph, the following graph populating strategy is suggested:
· Fill the graph until the number of distinct segments in the graph reaches n = min(s, q).
· While there are more rules and free slots left in the graph, add the next rule to the graph if it fits in an already existing segment and the corresponding question slot is not occupied.
Naturally, it is also necessary to check that the total number of attributes does not exceed what can be displayed. Rules whose attributes no longer fit should be skipped.
If the rules are fed into the above algorithm in the order of relevance, the graph in the worst case
will show min(s, q) relevant rules, and the rest will be a selection of less relevant rules.
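The populating strategy above can be sketched as follows. The representation is an illustrative assumption (segments as sets of categorical attribute-value pairs, rules as (segment, question) pairs) and not the implementation used in the system.

```python
def populate_graph(rules, s, q, max_attributes):
    """Select rules for the segment-to-item overview graph.

    `rules` must be sorted by relevance, most relevant first.
    `s` and `q` are the numbers of segment columns and question rows
    that fit into the graph. Returns a dict mapping (segment, question)
    slots to the rules shown there.
    """
    n = min(s, q)         # cap on distinct segments, as in the strategy
    slots = {}            # (segment, question) -> rule
    segments = []         # distinct segments (columns) added so far
    questions = set()     # distinct quantitative attributes (rows)
    attributes = set()    # distinct categorical attributes used so far

    for rule in rules:
        segment, question = rule
        new_attrs = {attr for attr, _ in segment} - attributes
        # Only open a new segment column while fewer than n are in use.
        if segment not in segments and len(segments) >= n:
            continue
        # Only open a new question row while fewer than q are in use.
        if question not in questions and len(questions) >= q:
            continue
        # Skip rules whose attributes no longer fit on the graph.
        if len(attributes) + len(new_attrs) > max_attributes:
            continue
        # Skip if the question slot for this segment is already occupied.
        if (segment, question) in slots:
            continue
        if segment not in segments:
            segments.append(segment)
        questions.add(question)
        attributes |= new_attrs
        slots[(segment, question)] = rule
    return slots
```

Feeding the ranked rule list into this function yields at least min(s, q) of the most relevant rules, with remaining slots filled by less relevant ones that happen to fit.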
The presented segment-to-item approach and the corresponding strategy for populating the
graph can present a more useful overview of multiple quantitative association rules than a simple rule list in the order of relevance. Rule-to-item has the advantage that the individual rules
are made more explicit, as each column corresponds to a single rule. The segment-to-item approach reveals more fundamental dependencies in the structure of rules. We believe that by studying
the correspondence of segments to quantitative attributes in rules, deeper insights into rule
structure can be gained. However, individual rules are less explicit, and it is recommended to implement additional interactive visual cues (in particular, highlighting of the appropriate attributes) in order to make it easier for the users to study the graph.
The presented graph cannot give an overview of all mined rules. Although in the worst case the graph will present relatively few relevant rules and the rest will be filled out with less relevant results, in practice the graph will give a fair overview of the top relevant rules. This is consistent with what we set out to achieve.
Elements of the focusing approach from Section 2.4.3 can be incorporated into the graph by
letting the user specify the set of quantitative attributes of focus (and possibly the sign of diff as
well). Then the rule graph will present relevant rules with the corresponding right-hand side. By
looking at the general overview first, the user may identify an interesting subset of questions,
and further focus on these in particular. This makes even more sense if the questions are logically grouped. Then, if a subset of such a group is represented in the general overview, focusing
on the corresponding group will reveal other relevant rules that in the original graph were
masked by results more relevant in another sense (for example, statistically).
A separate matter worth looking at is how to group attributes and segments to make the results
easier to grasp. One possibility is to arrange the attributes in alphabetical order. The obvious advantage of this approach is that it makes it easy to locate attributes of interest. Attributes can be logically grouped, and then it is appropriate to sort by group and by name within
each group. The groups themselves should be indicated on the graph as well. Another strategy
would be to display the attributes and segments in the order of appearance in the rule list. This
will leave the most relevant results in the top left area, or along the diagonal. In any case, visually the more relevant results will be somewhat grouped together. Segments consisting of several categorical attributes add complexity to this matter, as it is preferable to have the attributes
within a segment together in the list. The segments are easier to read this way. Luckily, though,
in practice most rules have a single categorical attribute (apart from the unit group attribute and
the trend attribute), and the necessity of such grouping is not immediate.
Note that the suggested visualization scheme is as such only applicable to rules within a single
unit group. For a single business unit, it is likely that the capacity of the graph will be enough to
give a good representation of the existing rules. If rules concerning several unit groups need to
be presented together, a modification of the graph is necessary. One solution is not to consider
the rule’s unit group when filling the graph and then label the rule icon with the corresponding
unit group in the graph. Another possibility is to allow several rule icons (corresponding to several unit groups) in the same graph slot.
Further, it might be interesting to visualize other dependencies. For example, visualizing the
correspondence between sets of categorical attributes and quantitative attributes in rules within
a single unit group or several groups will give another structural view of the mined rules. Moreover, more rules can be fit into such a metasegment-to-item graph and give an even broader
view of the rule set.
5.2 Data visualization for individual rules
In the target application, it is useful not only to visualize a whole set of rules, but also to aid
users in studying single rules. A simple way to achieve this is to use traditional data visualization techniques to show the data behind a rule. More specifically, we chose to accompany each
rule with distribution and trend graphs.
Figure 16. Response distribution graph.
Bar graphs show the distribution of survey responses for each categorical attribute of the rule.
Each bar on such a graph corresponds to a response alternative. For each alternative, a thin vertical line marks the interval in which responses are expected to fall with 95% confidence. The bar corresponding to the attribute value in the rule is highlighted. The visual cue is simple: if the top of the bar is outside the confidence
interval, the deviation from the expected response value is significant. The number of responses
available for each alternative is indicated. Figure 16 shows an example of a distribution graph.
When looking at the distribution graph, it may be helpful to know how many null responses
there are in the group. If such responses exist, we show them in a separate bar. Further user tests
are necessary to find out whether showing uncategorized responses is helpful or just confusing.
When the left-hand side of the rule contains several categorical attributes, a bar graph is displayed for each attribute. The graph is built from responses where all the categorical values
except for the plotted one are as in the rule.
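As a sketch of the significance cue in the distribution graph, the 95% interval for each alternative can be computed with a normal approximation to the binomial. This construction is an assumption for illustration; the text does not specify the exact interval used.

```python
import math

def proportion_ci(p_expected, n, z=1.96):
    """Interval in which the observed share of responses for one alternative
    is expected to fall with ~95% confidence, given the expected share
    p_expected and n responses (normal approximation to the binomial;
    an illustrative assumption, not necessarily the thesis formula)."""
    half = z * math.sqrt(p_expected * (1.0 - p_expected) / n)
    return max(0.0, p_expected - half), min(1.0, p_expected + half)

def bar_is_significant(p_observed, p_expected, n):
    """The visual cue from the text: the bar top lies outside the interval."""
    lo, hi = proportion_ci(p_expected, n)
    return not (lo <= p_observed <= hi)
```

For example, with an expected share of 0.5 and 100 responses, an observed share of 0.7 falls outside the interval and would be flagged as significant.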
Also, a trend graph is displayed for each rule. For trend rules, this graph contains a single curve
that follows the average response to the rule’s question over time, and the time period the rule
concerns is marked. For non-trend rules two curves are shown: one follows the responses behind the rule, and the second one follows all responses in the rule’s unit group. For example, the
curve for the rule “unit X, attribute Y → satisfaction” will be accompanied by the curve based
on all responses to satisfaction at unit X. The curves are fitted to the response points using locally weighted linear least-squares regression; such curves are known as Lowess curves [18].
Individual survey responses are marked with crosses. They are spread out before plotting to
make the graph more readable and to give the user a better idea of the sample size, which is why
the vertical position of the crosses may visually deviate from the allowed discrete response values. For rules from the watched rules list, an arrow at the bottom of the graph shows since when
the rule has been watched. Figure 17 shows the corresponding trend graph for such a rule.
Figure 17. Trend graph for a watched trend rule.
A future experiment is to add more curves to compare against, especially in the case of rules
with multiple categorical attributes. These curves can include responses from the whole available population (when the categorical attributes on the left-hand side are not unit group-specific), and curves for the other attributes and their combinations. For a rule of the form {a1, a2, a3} →X q, it might be interesting to see the curves for a1 →X q, a2 →X q, a3 →X q, as well as {ai, aj} →X q and {a1, a2, a3} → q. The symbol →X stands for “in unit group X”.
Another important task is to evaluate the selected curve fitting method. Lowess is computationally intensive and sensitive to outliers. So far, it has shown fully acceptable results. However,
if problems arise, other possibilities should be considered.
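For illustration, a minimal single-pass Lowess fit (tricube weights, locally weighted linear least squares, no robustifying iterations) might look as follows. This is a simplified sketch, not the implementation from [18] used in the system.

```python
def lowess(xs, ys, frac=0.5):
    """For each x, fit a weighted least-squares line through its nearest
    neighbours using tricube weights and evaluate it at x. Single-pass
    sketch: the robustifying iterations of full Lowess are omitted."""
    n = len(xs)
    k = max(2, int(frac * n))          # neighbourhood size
    fitted = []
    for x0 in xs:
        dists = sorted(abs(x - x0) for x in xs)
        h = dists[k - 1] or 1.0        # bandwidth: distance to k-th neighbour
        # Tricube weights: (1 - (d/h)^3)^3 for d < h, else 0.
        w = [(1 - min(1.0, abs(x - x0) / h) ** 3) ** 3 for x in xs]
        # Weighted least squares for the local line y = a + b*x.
        sw = sum(w)
        mx = sum(wi * x for wi, x in zip(w, xs)) / sw
        my = sum(wi * y for wi, y in zip(w, ys)) / sw
        cov = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
        var = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
        b = cov / var if var else 0.0
        fitted.append(my + b * (x0 - mx))
    return fitted
```

The cost is quadratic in the number of points, which illustrates why Lowess is computationally intensive; the smoothing fraction `frac` controls how local the fit is.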
6 Results and discussion
This chapter describes the solution that was implemented. We start by discussing the
architecture and the implemented user interface features. Further, some of the identified issues and possible solutions are examined, including calculating the similarity
between attributes, the need for comments mining and a further simplification of the
workflow, as well as the choice of the difference value to show to the users. The use of
interestingness measures is discussed, and ideas about how to further improve the implemented visualization tools are presented. The chapter is concluded with general remarks about the implemented system.
From the very start, it was decided to build the system iteratively, as building a perfect product in one go is hardly realistic. The first phase, however, was bound to be extensive. The goal was to have the most important system components in place to an extent that would make it possible to start testing the system with real clients soon and to involve them in the development of the product in subsequent phases.
6.1 Simplified architecture
Several major simplifications were made to the system to facilitate the first phase of the project,
in line with what was accentuated in Section 3.5.5. The simplified system architecture used for
implementation is shown in Figure 18.
[Figure: the application exchanges survey data, mining preferences and rules with the data mining engine (data miner, data fetcher, MiningDataDistributor, mining front-end) and the rule server (rules store, watch/trash, filters, ranker, rule agent), which answers rule queries with ranked rules; a database backs the rule server.]
Figure 18. Simplified system architecture.
One major difference from the general architecture is that the control module is missing. Although the control module is an important part of the system (see Section 3.5.4), it is fully possible to start using the system without it. Therefore it was decided to postpone its implementation to a later project phase. In the first version of the system, mining needs to be started manually from the host application.
Another simplification is the lack of support for incremental updates of the satisfaction database
on the data mining engine side. In fact, no database is used by the DME. Instead, all data that is
necessary for mining is sent over from the main application each time mining is due. Also, the
mined rules are not backed up on the DME after mining and are sent to the rule server directly.
Furthermore, mining is done only on the business unit level in the first project phase. This reduces the applicability of the system but also simplifies the implementation, especially when it
comes to GUI functionality. On the data mining side, the upgrade that would allow mining arbitrary unit groups is quite simple and is described in some more detail in Section 3.5.2. However,
access to rules is still governed by which business units the user can reach via the unit group the user is assigned to (the user is allowed to browse the rules corresponding to all leaves that are reachable from the user’s node in the unit group structure).
Figure 19 shows the simplified data mining engine. The data link between the application and
the DME, as well as between the DME and the rule server was implemented via RMI (remote
method invocation), a technique that allows calling methods on remote interfaces. Input and output data structures are serialized in a binary format. This is particularly suitable when large amounts of data are transferred, which is expected to happen in both cases. Potentially millions of
records are sent from the application to the DME, and hundreds of thousands of rules (and
more) can be generated.
[Figure: the application sends survey data and mining preferences via RMI to the data mining engine, where a DataFetcher feeds an IntegratedMiner running several MinerThreads; the resulting rules are sent via RMI to the rule server.]
Figure 19. Simplified data mining engine.
Watched and trashed rules are crucial to the workflow outlined in Section 3.4 and were implemented in the first phase. However, watched rules are not stored or cached on the application
side as was suggested in Section 3.5.1. The main idea with storing watched rules on the application side was to boost the performance by eliminating unnecessary calls to the rule server. Not
doing so is not critical and is fully transparent for the users of the system.
Following the discussion in Section 3.5.3, watched rules are replaced with their newly mined
counterparts when a new rule set arrives to the rule server. To simplify initial implementation,
watched rules for which no counterparts can be found in the new rule set are left intact. Thus, no
changes are necessary on the DME side. This is acceptable for the most part, because focusing
on the rule will reveal an up-to-date trend graph that shows the development of the studied
mean response value over time, thus making updates to the formal diff of the rule less relevant.
Naturally, the trend rules are marked with absolute dates so that relative values do not become outdated: if a rule concerns the last 10 days, tomorrow it would refer to the last 11 days. It should also be made clear what data the rule is based on, i.e. whether the rule is based on the most current data or on data up to a certain date in the past.
Although potentially useful, classification of unit groups has been left out. All unit groups are
assumed to belong to the same class. This implies that no distinction between preferences of
users at different levels will be made, i.e. collective interestingness will be shared among all
users of the system (for the same client), in spite of requirement 2.d from Section 3.3 stating
that interestingness of rules can be perceived differently by users on different levels. It is believed that complying with this requirement would not noticeably improve users’ experience,
especially considering that only business unit level rules are supported in the first project phase.
Distinguishing between collective preferences of users on different levels makes much more
sense when there are rules on all levels. Thus, this simplification is justified. Moreover, the upgrade poses no fundamental technical difficulties.
Several workflow features, namely leaving comments on rules, sharing of rules between users
and the rule assignment mechanism, were also left out of the first phase. These features are important to the workflow, but their principal part is application-specific and includes a lot of GUI
engineering, so they can be implemented separately when the rest is in place and the system has
been tested on real clients. This also gives the opportunity to start out with a simpler interestingness model that does not rely on this type of feedback. In addition, clients’ feedback on the
core of the system functionality may suggest reworking parts of the system. It is thus logical to
wait with the extra functionality until the core has been tried out and polished.
Finally, click-through information is not gathered in the first phase of the project. To begin
with, it is not incorporated into any of the interestingness models from Chapter 4. However,
implementing click-through data collection is important and should be done in one of the subsequent phases of the project. As suggested in Section 2.3.5, click-through can not only be used
for rule interestingness estimation, but also to evaluate existing interestingness models. This is
further discussed in Section 6.7.
The simplified rule server is illustrated in Figure 20. The data interface towards the main application was implemented as XML services in order to make the communication more generic
and to ensure consistency with the communication schemes used between the application and
other components. This can only become a problem if sets of rules are often sent back to the application. However, a significant part of the information about the rules is packed into a string structure before being externalized to XML, partly to avoid the overhead of specifying the various fields that is part of both XML and RMI serialization. Thus, sending rules over in another standard format should not make a significant difference in performance.
[Figure: the rule server receives rules from the DME over an RMI interface (RuleAgent) and stores them via a RuleDbAdapter in a database; it serves the application through XML services (RuleCompany), handling watch/trash updates, rule and filter queries, and returning ranked rules produced by the Ranker, with a NeutralRanker persisted on disk.]
Figure 20. Simplified rule server.
The rule ranking mechanism was implemented in accordance with Section 4.6. The rule server
can handle multiple clients. There is only one interestingness level, therefore for each client a
single collective ranker is trained with user feedback according to the collective interestingness
model from Section 4.4. In addition, a neutral ranker is built each time the rule set is replaced
(and thus new rule types can emerge). It is later used for personal rating calculation as was suggested in Section 4.5.
6.2 User interface
The GUI components implemented on the main application side in the first phase of the project
include the rule list allowing the users to browse mined rules, the rule explorer where rules can
be studied in substantially more detail, the watched rules list where watched rules can be
browsed (it was decided not to show the trash list to the users), the mining setup interface from which the administrator can specify mining parameters and initiate mining, as well as the segment-to-item rule overview graph from Section 5.1.
The implemented rule list roughly corresponds to something in between the simple and the advanced viewing modes from requirement 1.c under System requirements. We chose to postpone implementing the truly advanced rule browsing mode, but allowed filtering and sorting of rules in the simpler interface anyway. The rule list is illustrated in Figure 21.
For each rule, its hotness on the chilli scale is indicated (in the form of coloured chilli peppers
as suggested in Section 4.7), a textual description is given, and the average response value is
shown together with the difference from the expected response and the rule’s impact (see Section 4.1). Trend rules are marked with a timeline icon (a downward arrow for negative rules and
an upward arrow for positive rules). Users can filter the rules by node in the unit group structure, by question category (all quantitative attributes are grouped in the application) and by the
sign of diff (positive or negative). As it was already mentioned in Section 3.2, trend rules have
been found to be of special importance, which is why it is possible to filter by the trend attribute
as well (present or absent). Sorting by hotness, average response, diff and impact is allowed.
The available combinations of filtering and sorting settings allow carrying out focused analysis
of the rule set effectively.
The watched rules list has a similar format, except that no filtering is necessary.
Figure 21. Rule list.
To let the user quickly work through a list of rules, the textual description is short by default and
only contains the right-hand side attribute and the unit group the rule concerns. If a rule seems
interesting, the user can see the full textual representation in the rule list as well, as illustrated in
Figure 22. The text generation scheme is simple. Basically, the following formula is used:
The <support> [guests] with <left-hand side attribute-value pairs> [at] <unit group> average
<average response> on the [survey question] <quantitative attribute>, which is <diff>
[above/below] the expected value.
Rule components fill out the fields marked with <>. Some of the words in the sentence are
marked with []. They have to be replaced appropriately depending on the business context and
the rule the text is generated for. This includes business-specific terms (e.g. “guest”, “survey
question”) and context-dependent words (e.g. “at” might need to be changed depending on the
type of the unit group, whereas the choice between “above” and “below” depends on the sign of
the diff). Handling of business-specific terms was in our case already implemented as part of the
internationalization API in the host application.
Figure 22. Full textual description of a rule in the rule list.
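The text-generation scheme above can be sketched as a simple template fill. The rule dictionary and the `terms` dictionary of business-specific words are illustrative assumptions, not the actual API of the host application.

```python
def rule_to_text(rule, terms):
    """Render a rule as a sentence following the formula in the text.
    `terms` supplies the business-specific words marked with [] in the
    formula, e.g. {"respondent": "guests", "question": "survey question",
    "location": "at"} (hypothetical keys for illustration)."""
    pairs = ", ".join(f"{a}: {v}" for a, v in rule["antecedent"])
    direction = "above" if rule["diff"] >= 0 else "below"
    return (
        f"The {rule['support']} {terms['respondent']} with {pairs} "
        f"{terms['location']} {rule['unit_group']} average {rule['average']} "
        f"on the {terms['question']} {rule['question']}, "
        f"which is {abs(rule['diff'])} {direction} the expected value."
    )
```

In the real system the substitutions would come from the internationalization API, which already handles business-specific terms.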
More details about a rule can be found in the rule explorer. To begin with, the graphs from Section 5.2 were implemented. For each categorical attribute of the rule, the distribution of responses is shown in a bar graph (values for the rest of the attributes are fixed as in the rule). The
trend graph shows the responses behind the rule over time and the average response at the given
unit group for the whole time period to compare against.
Following the guidelines from Section 3.4, a selection of individual responses that the rule is
based on is made available, together with the corresponding text comments left by the respondents. For negative rules, the lowest responses with filled text fields are shown. In many cases
this makes it possible to find the source of the problem. For positive rules, the comments corresponding to
the highest responses are shown.
In addition, a selection of related rules is given. An obvious idea is to show other rules with the
same left-hand side, i.e. rules concerning the same group of responses. The importance of other
related rules is less obvious and they are positioned separately on the rule explorer page. These
include rules with the same left-hand side attributes at the given unit group. For example, for the
rule stating that females are less satisfied with overall service at a specific store, other gender-related rules at the same store are shown. Similarly, rules with the same quantitative attribute at
the given unit group are shown. In the example above, it would be other rules concerning overall service at this store. Also, a selection of these two categories at all properties the user has
access to is shown. In total, we present four groups of related rules apart from the rules concerning the same group of responses.
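The selection of the four related-rule groups can be sketched as follows, assuming each rule carries its antecedent, unit group and quantitative attribute; the field and group names are illustrative, not the implemented data model.

```python
def related_rules(rule, all_rules, accessible_units):
    """Collect the four groups of related rules for the rule explorer page."""
    attrs = {a for a, _ in rule["antecedent"]}   # attribute names only

    def same_attrs(r):
        return {a for a, _ in r["antecedent"]} == attrs

    def same_question(r):
        return r["question"] == rule["question"]

    same_unit = [r for r in all_rules
                 if r is not rule and r["unit"] == rule["unit"]]
    anywhere = [r for r in all_rules
                if r is not rule and r["unit"] in accessible_units]
    return {
        "same attributes at this unit": [r for r in same_unit if same_attrs(r)],
        "same question at this unit":   [r for r in same_unit if same_question(r)],
        "same attributes at all units": [r for r in anywhere if same_attrs(r)],
        "same question at all units":   [r for r in anywhere if same_question(r)],
    }
```

Rules with exactly the same left-hand side (the same group of responses) would be selected separately, as they are positioned apart on the rule explorer page.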
Since the control module was left out in the initial phase of the project, a relatively complex
data mining setup interface had to be implemented in the main application.
To begin with, it is necessary to specify the categorical and quantitative attributes to build rules
from. When mining for logically advanced rules, it is important to be able to derive values from
existing fields and mine the derived values. For example, it might be interesting to find rules
concerning the day of week of visiting a store. The day of week of visit can be derived from the
date of visit. However, semantically this should be done before data is sent over to the data mining engine, as this type of transformation has nothing to do with the mining process itself. This
feature was implemented in the setup interface.
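As a minimal sketch of such a derivation, the day of week can be computed from the date of visit before the records are handed over to the mining engine; the field names are hypothetical.

```python
from datetime import date

def derive_day_of_week(records, date_field="visit_date",
                       out_field="visit_weekday"):
    """Add a derived categorical attribute (day of week) to each record,
    so that it can be mined like any other categorical attribute."""
    names = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
    for rec in records:
        rec[out_field] = names[rec[date_field].weekday()]
    return records
```

The same pattern covers other derivations (e.g. binning a numeric field into categories) applied before the data is shipped to the DME.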
When the DME mines unexpected relations, generated rule candidates are tested against other
unit groups. If a relation is common, it will not be reported as an unexpected rule and will in our
case be identifiable by other modules in the main application. Some of the categorical attributes
are unit group-specific and such rules should not be compared against other groups. For example, a rule concerning bathroom quality at a specific room in a hotel should be reported directly
because room number is an attribute specific to the hotel in question and it makes little sense to
compare it with rooms at other hotels. Unit group-specific attributes must be specified in the
setup interface. Note that the implemented solution only mines rules at business units. If generic
mining is allowed, unit group-specific attributes may need to be specified for each level.
Other mining settings include the trend attribute (of all available date fields, a single one will be
used for trend rule mining), trend period (the maximum time span to allow for trend rules, e.g.
do not look for trends more than 3 months ago), trend granularity (time unit to use for trend
mining, i.e. track changes in satisfaction by day, week, month, etc.) and the mining period (only
responses in the specified time span will be mined). To allow early pruning of rule candidates,
the minimum allowed sample size (the minimum number of responses that a rule can be based
on), the minimum diff (the deviation of the observed value from the expected value should be at
least this high) and the maximum allowed p-value need to be specified as well.
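These pruning thresholds can be pictured as a simple settings object plus a candidate check. The names and default values below are illustrative only; the real values are chosen in the setup interface:

```python
from dataclasses import dataclass

@dataclass
class MiningSettings:
    min_sample_size: int = 30   # minimum number of responses behind a rule
    min_diff: float = 0.5       # minimum |observed - expected| average
    max_p_value: float = 0.05   # significance threshold

def passes_pruning(n, observed, expected, p_value, s):
    """Early pruning of a rule candidate against the configured thresholds."""
    return (n >= s.min_sample_size
            and abs(observed - expected) >= s.min_diff
            and p_value <= s.max_p_value)

s = MiningSettings()
print(passes_pruning(40, 6.2, 7.5, 0.01, s))  # → True
print(passes_pruning(10, 6.2, 7.5, 0.01, s))  # → False (too few responses)
```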
Another important mining-related setting is the number of categorical attributes allowed in a
rule. In the implemented data mining engine, the number of attributes can technically be arbitrary. In our experiments, though, we have seen that allowing more than two categorical attributes in the same rule is unlikely to result in useful rules. Rules with more than two attributes are
uncommon because first of all the support becomes lower as more segments are combined, and
this often either results in lower statistical significance or support below the minimum allowed
sample size. Also, useful rules usually have a simpler underlying reason that can be described
with few attributes. Hence, by default we allow at most two categorical attributes per rule. Note
that the unit group attribute and the trend attribute are treated separately. Thus, rules like “The
10 guests with Gender: Female, Waiter: X, and Visit date<15 days ago at Restaurant Y average…” can be discovered, giving three segments apart from the unit group.
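The limit on categorical attributes effectively caps the size of the candidate segments. A minimal sketch of such bounded candidate generation (the attribute names are invented):

```python
from itertools import combinations

def candidate_segments(categorical_attrs, max_attrs=2):
    """Generate attribute combinations up to the configured limit. Each extra
    attribute narrows the segment, so support quickly drops below the minimum
    sample size for larger combinations."""
    attrs = sorted(categorical_attrs)
    for k in range(1, max_attrs + 1):
        for combo in combinations(attrs, k):
            yield combo

cands = list(candidate_segments({"gender", "waiter", "weekday"}))
print(len(cands))  # → 6 (three singletons plus three pairs)
```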
In addition, a number of settings that determine how rules are displayed in the GUI are available
in the setup. These include the text fields to show in the user comments section in the rule explorer and the fields that will accompany each text comment (for example, the name of the respondent, date of visit to the store, etc. can provide better insight into the responses behind the
rule). Also, it may be practical to be able to control the number of related rules to show and the
maximum number of text comments in the rule explorer, the maximum number of rules in the
rule list, the number of rules per page, etc.
6.3 Similarity between rule attributes
Recall from Section 4.4 that it is necessary to know the similarity coefficients between pairs of
attributes (separately for categorical and quantitative attributes) in order to determine distance
between rule types, which is in turn necessary for collective interestingness estimation. Recall
further from Section 2.5 that Bennedich solves this by automatically calculating correlation
coefficients for the two sets of attributes. For quantitative attributes, he computes pairwise Pearson correlation coefficients for the set of available responses. For categorical attributes, the
identified rules are looked at and the similarity between attributes is defined by how many rules
with each quantitative attribute the different categorical attributes are part of. In Bennedich’s
experiments, this approach gave good results.
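For the quantitative side, the pairwise Pearson coefficients Bennedich computes can be obtained along these lines (the sample response values are made up):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# responses to two hypothetical satisfaction questions
food = [8, 7, 9, 6, 8]
service = [7, 6, 9, 5, 8]
print(round(pearson(food, service), 2))  # → 0.97
```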
In our experiments with the sample databases we had, the coefficients calculated according to
this scheme were found less adequate. Most discouraging was the fact that the obtained coefficients did not correspond to our intuitive expectations about which attributes should correlate
strongly.
The categorical coefficient matrix was the biggest problem, and several modifications to
Bennedich's calculation scheme were tried out. First of all, in the original algorithm a round of
smoothing was run on the obtained correlation matrix. The idea was to find “illogical” cases
where r_ik ∙ r_kj > r_ij for some attributes i, j and k, and smooth out such cases by reducing r_ik and r_kj
slightly and increasing r_ij so that r_ik ∙ r_kj = r_ij. On our datasets this smoothing procedure made the
coefficients more distant from what was expected, so it was decided to skip the smoothing step.
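The thesis does not spell out the exact adjustment formula. One way to restore r_ik ∙ r_kj = r_ij with small symmetric changes is a cube-root split of the correction, sketched below as our own interpretation rather than the original algorithm:

```python
def smooth(r):
    """One smoothing pass over a symmetric similarity matrix r with values in
    (0, 1]. Wherever r[i][k] * r[k][j] > r[i][j], scale the three entries so
    that the product equals r[i][j] again (cube-root split of the correction)."""
    n = len(r)
    for i in range(n):
        for j in range(n):
            for k in range(n):
                if len({i, j, k}) < 3:
                    continue
                if r[i][k] * r[k][j] > r[i][j]:
                    f = (r[i][j] / (r[i][k] * r[k][j])) ** (1.0 / 3.0)
                    r[i][k] *= f; r[k][i] = r[i][k]   # reduce slightly
                    r[k][j] *= f; r[j][k] = r[k][j]   # reduce slightly
                    r[i][j] /= f; r[j][i] = r[i][j]   # increase slightly
    return r

m = smooth([[1.0, 0.5, 0.9], [0.5, 1.0, 0.9], [0.9, 0.9, 1.0]])
# the previously "illogical" triple is now (approximately) consistent
print(abs(m[0][2] * m[2][1] - m[0][1]) < 0.01)
```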
Another idea was to include the quantitative correlation coefficients obtained in the first stage of
the coefficient calculation process into calculating categorical coefficients. Instead of simply
looking at how many rules there are with each quantitative attribute for a given categorical attribute, one could take the similarity between the quantitative attributes into account. Integrating
this into the calculation made a significant improvement of the perceived quality of the obtained
categorical correlation matrix.
However, the coefficients still did not fully correspond to the intuitively expected results. From
the user’s perspective, the statistical co-occurrence of attributes in rules is of little significance.
What is important is that the system reacts as the user would expect. Intuitively, bathroom-related questions are perceived as similar independently of what the set of responses or the set of
rules suggests. Moving a rule to the watched list or the trash list should affect other rules that
are perceived as similar, not necessarily those that are statistically similar. Moreover, the automatic coefficient calculation scheme above is sensitive to the input data, which is not guaranteed to be consistent each time it arrives at the input of the DME. Quantitative coefficients will
vary with available response data, while categorical coefficients will change even when mining
thresholds such as minimum support and diff change.
To avoid this instability and to make the coefficients consistent with users’ intuition, it was decided to give up the idea of unguided coefficient deduction. Instead of calculating the coefficients automatically, the coefficients have to be specified in the setup interface. It would of
course be tedious to fill in the coefficients by hand, so the idea is to group the various attributes
into logical groups. Different attributes within the same group have a fixed similarity coefficient
that is less than 1. Correlation between attributes from different groups is 0. The procedure
could be further simplified by using the fact that the attributes were already grouped in the host
application for other purposes. Thus, only two correlation coefficients needed to be specified:
the correlation between two different attributes in the same group of categorical attributes, as
well as the corresponding value for quantitative attributes.
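A sketch of the resulting coefficient construction, with an invented same-group coefficient of 0.7 and invented attribute names (the real value is set in the setup interface):

```python
def group_similarity(attributes, groups, same_group_coeff=0.7):
    """Build a similarity matrix from a manual grouping of attributes.
    Identical attributes get 1, different attributes in the same group get a
    fixed coefficient below 1, attributes from different groups get 0."""
    sim = {}
    for a in attributes:
        for b in attributes:
            if a == b:
                sim[a, b] = 1.0
            elif groups[a] == groups[b]:
                sim[a, b] = same_group_coeff
            else:
                sim[a, b] = 0.0
    return sim

groups = {"bathroom_clean": "room", "bed_comfort": "room", "waiter": "staff"}
sim = group_similarity(list(groups), groups)
print(sim["bathroom_clean", "bed_comfort"])  # → 0.7
print(sim["bathroom_clean", "waiter"])       # → 0.0
```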
When data that should be mined is sent over to the DME, a set of mining preferences is sent as
well. In particular, the two correlation matrices calculated as described above are sent as part of
the preferences data structure. If a more fine-grained coefficient calculation scheme is implemented later, only the corresponding module in the application will have to be updated. The DME
and the rule server are independent of the coefficient calculation scheme.
6.4 Comments mining
Text comments from the responses behind a rule can provide valuable insights into the underlying factors that can help explain an unexpected satisfaction mean. In practice, the suggested
scheme for selection of text comments does not give consistently satisfactory results. A typical
survey has many satisfaction questions but few comment fields, so the comments can cover a
whole range of issues or a specific issue that does not necessarily coincide with the studied satisfaction parameter. With the implemented comments selection mechanism it is not uncommon
that most of the text comments shown in the rule explorer are of no particular interest in the
context of the rule the user is focused on, or that a comment is quite long and only a small part of
it concerns the question of the rule. What is more, people tend not to comment on positive experiences and mostly write about negative incidents. This is why showing text comments for
positive rules in fact rarely works at all, as most of these concern other factors that people were
not content with.
Clearly, a more rigorous technique for handling text fields is called for. Our early reviews of the
product made it clear that the clients realize the high value of text comments in rule exploration,
including the case of positive rules. Text mining can be employed to identify the topics raised in
the comments as well as the emotional charge of the opinions (positive or negative). With this
type of classification engine in place, only comments that actually have something to do with
the question of the rule will be shown, and only those that have the right tone, i.e. positive for
rules with a positive diff and negative for rules with a negative diff.
Text mining was not studied as part of this project, but rather as a separate parallel
project. In future phases of the data mining project, the text mining module will be employed in
the rule explorer for automatic selection of relevant text comments.
The following relevant text mining literature can be recommended for the interested reader willing to implement a text mining solution in a setting similar to ours: [40], [41], [42] and [43].
6.5 Streamlined workflow
An idea that emerged during one of the product evaluation sessions is to streamline the rule list
further. The rule list as presented in Section 6.2 is still relatively complex and takes time and
energy to work through. A user like a business unit manager is typically a busy person anyway
and an easier, more immediate and less overwhelming presentation would in many cases be
appreciated, especially when it comes to the number of rules shown. Knowing right away that
there are many rules to go through may discourage a busy user from working with the module.
A possible solution to this problem is to implement a lightweight version of the rule list that
only presents a brief set of rules (the top rules from the rule list, a selection of rules from the
watched list, or possibly a combination of both as suggested in Section 3.4) and lets the user
take action on them or dive in to see the details by linking to the rule explorer, similarly to the
Google Reader home page module, which lets the user mark items, see what has been read, get a
tool tip and dive deeper. This lightweight rule list can be shown on the user’s start page in the
application, together with other relevant summaries. This automatically brings the idea of working with rules into the everyday set of activities. The user does not necessarily need to actively
go into a separate page devoted to data mining. On the other hand, users who work with rules
more actively and want to study existing rules closer might prefer to use the standard rule list. A
possible visual design for the streamlined rule list is illustrated in Figure 23.
Also, the watched rules list needs to be developed further. Presently, it simply lists the watched
rules and it is left to the user to track any changes in the watched rules. A smarter solution
should be worked out. For example, the user could get an automatic notification if a significant
change in the average response value of a watched rule has been detected, or if the diff of a rule
has changed from negative to positive.
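A minimal sketch of such a notification check; the threshold, rule identifiers and data layout below are invented for illustration:

```python
def watched_rule_alerts(old, new, change_threshold=0.5):
    """Flag watched rules whose average response changed significantly or
    whose diff flipped from negative to positive. Each rule maps to a
    (average, diff) pair from the previous and the current mining run."""
    alerts = []
    for rule_id, (old_avg, old_diff) in old.items():
        new_avg, new_diff = new[rule_id]
        if abs(new_avg - old_avg) >= change_threshold:
            alerts.append((rule_id, "average changed"))
        elif old_diff < 0 <= new_diff:
            alerts.append((rule_id, "diff turned positive"))
    return alerts

old = {"r1": (6.0, -1.2), "r2": (7.0, -0.3)}
new = {"r1": (7.1, 0.1), "r2": (7.1, 0.2)}
print(watched_rule_alerts(old, new))
```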
Another possibility for making the system easier to work with is to replace all technical terms
with suitable counterparts from commonly used language. It is discussed closer in Section 3.4.
Figure 23. Short summary of the top relevant rules (insights).
6.6 Difference value in the GUI
A relevant issue that was identified is what diff value to show to users. Currently, the diff calculated with the restrictive model described in Section 2.5 is used both internally (in particular, to
calculate hotness of rules) and externally (it is shown to the user in the GUI). For rules with a
single categorical attribute, diff is the difference between the average response in the given
segment and the average at the whole unit group. However, for rules with several categorical
attributes the calculation is more complex. Consider the following example:
· unit A → satisfaction = 8.0
· unit A, attribute1 → satisfaction = 7.1
· unit A, attribute2 → satisfaction = 7.5
· unit A, attribute1, attribute2 → satisfaction = 2.0
The last relation is clearly interesting. Let us look at the corresponding diff value. The data mining engine will calculate m_exp = 8.0 - (8.0 - 7.1) - (8.0 - 7.5) = 8.0 - 0.9 - 0.5 = 6.6, identify the expected
interval to be [6.6, 7.5] and finally find a diff of 4.6, and only if this diff is statistically significant will it report the rule. This diff, however, is incomprehensible for a regular user of the host
application. On the contrary, comparing with 8.0 would be easy to understand (and consistent
with the single-category rules). This could be presented roughly as “The average for the clients
at unit A with attribute1 and attribute2 is 2.0, which is 6.0 points below the average for unit A”.
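The numbers above can be reproduced in a few lines: the restrictive expected mean, the internal diff, and the user-facing comparison with the unit average:

```python
def expected_mean(unit_avg, parent_avgs):
    """Expected mean for a multi-attribute segment under the restrictive model:
    start from the unit average and subtract each parent's deviation from it."""
    return unit_avg - sum(unit_avg - p for p in parent_avgs)

unit_avg = 8.0                    # unit A
parents = [7.1, 7.5]              # unit A + attribute1, unit A + attribute2
observed = 2.0                    # unit A + attribute1 + attribute2

m_exp = expected_mean(unit_avg, parents)   # 8.0 - 0.9 - 0.5 = 6.6
internal_diff = m_exp - observed           # 4.6, used by the DME
user_diff = unit_avg - observed            # 6.0, easier to present in the GUI
print(round(m_exp, 1), round(internal_diff, 1), round(user_diff, 1))  # → 6.6 4.6 6.0
```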
In theory, it may also be possible to understand a comparison with the direct parents of the itemset of the rule. This would lead to the following statement: “The average for the clients at unit A
with attribute1 and attribute2 is 2.0, which is 5.1 points below the average for clients with attribute1 at unit A”, or “The average for the clients at unit A with attribute1 and attribute2 is 2.0,
which is 5.1 points below the average for clients with attribute2 at unit A”. However, it is not
clear which of the two statements should be shown to the user. In addition, it is a more complex
statement than just comparing with the average response value at the unit as a whole.
All in all, when showing rules to the user, it might be better to show the difference from the unit
group average. However, the question remains which difference to use when calculating hotness
and impact. Using the internal diff may be misleading, and it makes impact hard to relate to. On
the other hand, in our experiments rules with several attributes were significantly fewer than
rules with a single attribute, so the side effects of changing the diff are likely to be limited. Further experiments are necessary to see whether this solution is good enough.
6.7 Interestingness
The individual strengths and weaknesses of impact and hotness have already been discussed in
detail in Chapter 4 where these interestingness measures were introduced and described.
When combined, we noticed that the two interestingness values shown in the rule list may be
somewhat confusing for average users. Our idea with providing these two measures was that as
they are so different, they can be used together to give a better picture of the relevance of rules.
Real users wonder instead which of the two parameters is most important for judging rule relevance. Rather unexpectedly, some users looked at the diff more often as they found it easier to
relate to.
One possibility is to hide the hotness column, but order rules by hotness anyway (this is especially suitable for the lightweight rule list described in Section 6.5). The problem with not having that column is that if the user then sorts on one of the other columns, it will not be possible
to get back to the original ordering again. Then perhaps sorting on other columns should not be
allowed. Thus, hotness is not shown, but the rules are sorted by the hotness value. For business
unit users this may be suitable because the number of rules is low enough that it is possible to
go through all of them without needing to sort by different parameters. On the other hand, this
could be a more severe limitation for higher level users.
More importantly, however, hotness cannot as of now be relied upon as the single appropriate
rule relevance measure. Although the implementation behaves as expected in accordance with
the models from sections 4.3-4.6, no strict enough formal justification of why these models are
appropriate was given, apart from an explanation relying on intuitive expectations. Of course, as
it was pointed out earlier in this text, interestingness as such is a subjective term, and there is no
single right answer to the question of which rules are truly interesting. If there were a way to
know the “right” interestingness values for at least a small example set of rules, and if the hotness model were believed to be fully appropriate, optimization techniques could be used to
automatically find suitable model parameters that have so far been chosen by hand. Unfortunately, no
such training set is available.
In an attempt to find a solution to this problem, it can be worth experimenting with ideas by
Joachims summarized in Section 2.3.5. The simplest starting point is to try out his scheme for
evaluating relative quality of two retrieval functions, by comparing sorting by hotness and by
impact, or by hotness and by diff, or by diff and randomly, etc.
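As a rough sketch of the comparison idea, and not Joachims' exact algorithm: interleave the two orderings, record which retrieval function contributed each clicked rule, and prefer the function whose contributions attracted more clicks (all names below are ours):

```python
def interleave(ranking_a, ranking_b):
    """Merge two rankings by alternating picks and skipping duplicates;
    stops when one ranking is exhausted. Each item remembers its source."""
    merged, seen = [], set()
    it_a, it_b = iter(ranking_a), iter(ranking_b)
    take_a = True
    while True:
        it, src = (it_a, "A") if take_a else (it_b, "B")
        for item in it:
            if item not in seen:
                merged.append((item, src))
                seen.add(item)
                break
        else:
            break  # current ranking exhausted
        take_a = not take_a
    return merged

def preferred(merged, clicked):
    """The function whose contributed items attracted more clicks wins."""
    a = sum(1 for item, src in merged if src == "A" and item in clicked)
    b = sum(1 for item, src in merged if src == "B" and item in clicked)
    return "A" if a > b else "B" if b > a else "tie"

merged = interleave(["x", "y", "z"], ["y", "x", "w"])
print(preferred(merged, clicked={"x", "z"}))  # → A
```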
Further, as shown in [22], click-through data can be used to build effective interestingness models. The question is which features should identify a rule in this setting and should be
used as input variables in the corresponding optimization problem. Presence or absence of categorical and quantitative attributes could be used, but then a combination measure like the one
suggested in Section 4.6 will still have to be employed. Even if that works out in practice, such
a solution would not be as responsive as the hotness model. In order to take new feedback into
account, the optimization problem needs to be solved again. Thus, this would have to be done
periodically. Further, depending on the performance of the SVM and the amount of available
click-through data, calculating personal rating with this method can become unfeasible.
In any case, implementation of a mechanism for collecting click-through data should be prioritized to make sure that the necessary feedback is accumulated while further research and experiments are carried out. So far, it is recommended to leave both hotness and impact visible in
the rule list and gather more feedback from the users.
6.8 Visualization
When it comes to visualization of rules, the three graphs from Chapter 5 were implemented. The
segment-to-item multiple rule graph was implemented in Flash in its simplest form that can only
handle rules within a single business unit. The distribution and trend graphs for individual rules
are on the other hand generated on each load of the rule explorer as non-interactive static images. The graphs, and especially the segment-to-item visualization, are extensively discussed
one by one in Chapter 5. Some of the concerns and considerations for the future, apart from
what has already been said, are presented below.
An important implementation consideration is how to select data for building the distribution
and trend graphs. If a rule has high support, plotting all individual responses on the trend graph
will make it unreadable and take a long time. However, even when the support of the rule is low,
the average response curve for the whole business unit for the whole mining period is displayed,
and the number of responses there can be unnecessarily large. When the number of available
responses is large, a random selection of a fixed number of responses should be used instead. To
make the graph stable when the rule explorer page is reloaded, a fixed seed value should be
used. Another obvious idea is to reuse the graphs that have already been rendered.
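A seeded selection along these lines would keep the plotted subset stable across reloads; the seed-per-rule convention and the function names are our assumptions:

```python
import random

def stable_sample(responses, max_points, rule_id):
    """Pick at most max_points responses for plotting. Seeding the RNG with
    the rule id makes the selection identical on every page reload."""
    if len(responses) <= max_points:
        return list(responses)
    rng = random.Random(rule_id)   # fixed seed per rule
    return rng.sample(responses, max_points)

data = list(range(1000))
a = stable_sample(data, 50, "rule-42")
b = stable_sample(data, 50, "rule-42")
print(a == b)  # → True: the same selection on reload
```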
However, ideally a different implementation should be used for the distribution and trend
graphs. To make them more useful and to involve user into data exploration, the graphs should
be made interactive. Among other things, the individual responses in the trend graph should be
made clickable and link to a page with detailed information about each response. Also, focusing
and zooming on a part of the graph should be implemented. If the number of available responses
is large, this will allow studying the individual responses in a given area that are not shown otherwise. For the distribution graph, it should be possible to focus on a bar and explore the responses behind it.
In the future, other visualization techniques should be looked at as well. As it was already suggested in Section 3.2, decision trees can be useful for getting a better understanding of data. At
least two possibilities should be experimented with: building a tree based on the unit group and
the quantitative attribute of the rule (constructing trees with quantitative target attributes is not
straightforward though), and building trees freely as an additional tool for data exploration.
Also, it may be worthwhile to experiment with more advanced graphs than the relatively simple
distribution and trend graphs. In particular, visualizing data in a 3D graph as it was done in [25]
can be both helpful and stimulating for the user. For example, for rules with two categorical
attributes a 3D graph can be built showing the two categories on the x-y plane and the quantitative attribute on the z-axis. Note that it is also possible to use the segment-to-item graph for a
similar type of visualization by looking at it from the category-to-category perspective.
The somewhat over-ambitious rummaging approach presented in [9] might be interesting in the
context of data exploration rather than rule exploration. If a comprehensive visual data exploration module is integrated, the user can be given the possibility to explore the whole available
data set interactively by focusing on various attributes and attribute groups.
6.9 General remarks
The general system architecture presented in Section 3.5 has been found adequate so far. Based
on it, a real functioning rule mining system was implemented and integrated into a live product.
Although the architecture and the initially suggested functionality were substantially simplified
as described in detail in Section 6.1, most of the simplified and postponed features are planned
to be implemented in a fuller scale in later phases of the project.
Overall system performance as known at the time of writing is fully satisfactory. In spite of the
initial performance concerns, the system actually performed better than expected. Mining a
hundred thousand surveys with dozens of categorical and quantitative attributes (tested on hospitality, electronics retail, weather and economy data sets) takes a few minutes on a regular
portable computer from initiating a mining run in the setup interface and until the new rule set
can be browsed in the rule list, including data preparation and communication between the
components of the system. In conjunction with this, the possibility to couple the data miner and
the rule server with the main application so that they could be started and stopped synchronously was requested as a separate feature. The system’s response to user actions is smooth. On
the GUI side most of the time is spent in dealing with technicalities imposed by the host system.
More details will be available once a framework for unit testing of GUI features is in place, and
it is under construction at the time of writing. The rule server handles multiple clients and users fast
enough, as confirmed by stress testing with unit tests. Scalability and applicability of the architecture were discussed closer in Section 3.5.5.
Naturally, there are numerous possibilities for optimization, from improvements of purely technical nature (for instance, optimizing serialization of communicated objects in order to speed up
data transfer between the components of the system) to more substantial algorithmic changes.
For example, if the system’s response to incoming user feedbacks is found to be too slow, one
possibility is to employ clustering of rule types so that the most distant neighbours are skipped
without being looked at when the collective rank of rule types is updated. However, such optimizations are left for the point in time when performance becomes an issue.
In its current form, the data mining module will be most valuable for business unit managers.
Implementing support for rules at higher levels in particular and arbitrary unit groups in general
(both in terms of mining and working with them via the GUI) will make this tool significantly
more universal and applicable. As it was mentioned earlier, the mining part of this upgrade is
relatively straightforward. The main question will be how to present such rules to make them
truly useful – whether rules at the current unit group and its subgroups should be merged in a
single list, or whether they should be summarized separately, and how exactly it should be done
to make sure the user is not overwhelmed with too much information to consider.
Further, when mining of rules at different levels is implemented, different mining parameters
will have to be used. At the business unit level it might be interesting to see rules with a small
support (for example, to be able to identify a problem in a single room at a hotel), while such
rules are probably irrelevant at the region or country level. On the other hand, it might be still
interesting to see rules with a small support at individual properties for users at the middle level,
for example in order to inspect small problems as a means of controlling the quality of service at
the associated business units. When rules at multiple levels become available, it will be necessary to find a proper balance between the flexibility of the system and the ease of using it.
Another high-priority update is to implement the control module, or at least parts of its functionality. The most urgent function is to be able to schedule mining runs. In the beginning, the
simplest form of scheduling can be implemented via the setup interface of the main application,
where the user specifies how often mining should be re-run and the system initiates updates
automatically irrespective of other tasks executed by the application or by the target DME.
7 Conclusions
This chapter wraps up the thesis. We summarize what has been done and look at how it
corresponds to the goals of the project. We briefly reiterate the ideas about how this
work can be extended and what areas should be studied further.
In the current paper, we developed a practical framework for quantitative association rule mining and gave recommendations on how to present the discovered rules to end-users. In particular, we accomplished the following:
· Designed a system for mining and working with quantitative association rules.
· Selected a suitable mining technique and suggested an alternative mechanism for generating rules from trend rule candidates that is believed to be more appropriate than the original scheme.
· Developed a workflow allowing users with limited technical background to make effective use of the data mining results to support their everyday work.
· Suggested a general architecture for the rule mining system that covers a range of practical applications, spreads logically separate functionality between several independent modules and scales well with the number of clients served by the system.
· Reasoned about what constitutes a relevant rule in the setting studied in this project and built two interestingness models that together cover the factors relevant for rule interestingness reasonably well. Impact is a measure that is easy to understand and that connects well with the business context. Hotness is an experimental interestingness measure that gives a user-specific interestingness value for each rule. It combines statistical significance of the rule with the collective intelligence of all users and the personal preferences of the current user.
· Suggested the segment-to-item technique for visualizing multiple rules in a compact and accessible format, as well as the metasegment-to-item approach, a further generalization that groups rules by structure. In addition, data visualization for illustrating individual rules was discussed.
· Implemented a working system based on these ideas and integrated it into an industrial software product, proving that the developed techniques and designs are feasible in a practical application.
· Listed concrete ideas for improvement and areas for future research, including:
o Further simplification of the workflow. The lightweight rule list was introduced and the importance of term substitution was emphasized.
o Evaluating the interestingness models suggested in this paper through user tests and by means of analyzing click-through data. In particular, the possibility of deriving an interestingness model from click-through data should be investigated closer. A possible approach to how this can be done was sketched out.
o Further development of visualization tools through improved interactivity and more advanced graphs.
o Implementing a text mining module capable of extracting feature and opinion information from text comments in order to give better context information about the rules.
Therefore we conclude that the goals of the project outlined in Section 1.2.1 were achieved.
It should be emphasized that this paper presents a work in progress. The results presented here
are not final. Although much has been done, a lot is left for subsequent project phases. In particular, proper evaluation of results has not been carried out yet. Many of the suggested ideas
are based on the intuition about what is necessary and appropriate, and more reliable conclusions can be drawn when the solution described here has gone through a few more development
phases and has been exposed to real users for some time.
It is believed that the ideas, requirements, comments and general reasoning presented in this
paper will be of practical value for others developing a rule mining system. This paper discusses
a broad range of practical and theoretical details that need to be considered in a similar context.
Probably the most significant contribution of this work is that it shows how association rule
mining can be brought closer to real users in all parts of the executive hierarchy and how it can
be transformed into a tool that can be used on a daily basis to help people do their job better,
rather than reserving the technique for expert users or using it to facilitate exclusive high-end
consultant projects.
References
The books and articles referenced in the current thesis are listed below, ordered alphabetically by the names of the authors.
[1] Aggarwal, C.C. Towards effective and interpretable data mining by visual interaction. SIGKDD Explorations 3(2), pages 11-22, 2002.
[2] Agrawal, R., Imielinski, T., and Swami, A. Mining association rules between sets of items in large databases. Proceedings of the 1993 ACM SIGMOD international conference on Management of data, pages 207-216, New York, NY, USA, 1993. ACM Press.
[3] Agrawal, R. and Srikant, R. Fast algorithms for mining association rules. Proceedings of the 20th International Conference on Very Large Databases, pages 487-499, San Francisco, CA, USA, 1994. Morgan Kaufmann Publishers Inc.
[4] Alag, S. Collective intelligence in action. Manning, 2008. ISBN 1933988312.
[5] Aumann, Y. and Lindell, Y. A statistical theory for quantitative association rules. Journal of Intelligent Information Systems, 20(3): pages 255-283, 2003.
[6] Bennedich, M. Mining survey data. Technical report, Royal Institute of Technology, Department of Numerical Analysis and Computer Science, Sweden, 2008.
[7] Bennett, P. N. Assessing the calibration of naive Bayes’ posterior estimates. Technical report CMU-CS-00-155, Carnegie Mellon University, School of Computer Science, Pittsburgh, PA, 2000.
[8] Berry, M. and Linoff, G. Data mining techniques: for marketing, sales, and customer relationship management. 2nd edition. Wiley, 2004. ISBN 0471470643.
[9] Blanchard, J., Guillet, F., and Briand, H. Exploratory visualization for association rule rummaging. In MDM/KDD’03, Washington, DC, USA, 2003.
[10] Boser, B. E., Guyon, I. M., and Vapnik, V. A training algorithm for optimal margin classifiers. In 5th Annual Workshop on Computational Learning Theory, pages 144-152, Pittsburgh, 1992. ACM Press.
[11] Burges, C. A tutorial on support vector machines for pattern recognition. Data Mining and
Knowledge Discovery 2, pages 121-167, 1998.
[12] Chen, Q. Mining exceptions and quantitative association rules in OLAP data cube. Technical report, Simon Fraser University, Canada, 1999.
[13] Cortes, C. and Vapnik, V. Support vector networks. Machine Learning, 20: pages 273-297, 1995.
[14] Frei, H. and Schäuble, P. Determining the effectiveness of retrieval algorithms. Information Processing and Management, 27(2/3): pages 153-164, 1991.
[15] Fukuda, T. and Morishita, S. A visualization method for association rules. Technical report DE95-6, Institute of Electronics, Information and Communication Engineers, pages
41-48, 1995.
[16] Hahsler, M. Annotated bibliography on association rule mining. Visited on 2008-09-17.
http://michael.hahsler.net/research/bib/association_rules/
[17] Hand, D., Mannila, H., and Smyth, P. Principles of data mining. MIT Press, Cambridge,
MA, 2001. ISBN 0-262-08290-X.
[18] Hutcheson, M. C. Trimmed resistant weighted scatterplot smooth. Technical report, Cornell University, Ithaca, NY, 1995.
[19] Jain, A. and Dubes, R. Algorithms for clustering data. Prentice Hall, 1988.
[20] Jaroszewicz, S. and Scheffer, T. Fast discovery of unexpected patterns in data, relative to
a Bayesian network. Proceedings of 2005 ACM Int. Conf. on Knowledge Discovery in
Databases, pages 118-127, 2005.
[21] Joachims, T. Evaluating retrieval performance using clickthrough data. Proceedings of the
SIGIR Workshop on Mathematical/Formal Methods in Information Retrieval, 2002.
[22] Joachims, T. Optimizing search engines using clickthrough data. Proceedings of the ACM
Conference on Knowledge Discovery and Data Mining (KDD), 2002.
[23] Keim, D.A. Information visualization and visual data mining. IEEE Transactions on Visualization and Computer Graphics 8(1), pages 1-8, 2002.
[24] Liu, B., Hsu, W., Chen, S., and Ma, Y. Analyzing the subjective interestingness of association rules. IEEE Intelligent Systems, 2000.
[25] Nagel, H.R., Granum, E., and Musaeus, P. Methods for visual mining of data in virtual
reality. In PKDD International Workshop on Visual Data Mining, 2001.
[26] Natarajan, R. and Shekar, B. Interestingness of association rules in data mining: Issues
relevant to e-commerce. Sadhana Vol. 30, Parts 2 & 3, pages 291-309, 2005.
[27] Obata, Y. and Yasuda, S. Data mining apparatus for discovering association rules existing
between attributes of data. US patent US006272478B1, 2001.
[28] Pearl, J. Bayesian Networks: A model of self-activated memory for evidential reasoning.
In Proceedings of the 7th Conference of the Cognitive Science Society, University of California, Irvine, CA, pages 329-334, 1985.
[29] Säfström, M. Analysis of a support vector machine for visual classification. Technical report TRITA-NA-E05166, Royal Institute of Technology, Department of Numerical Analysis and Computer Science, Sweden, 2005.
[30] Segaran, T. Programming collective intelligence. O’Reilly, 2007. ISBN 0596529325.
[31] Shen, X. and Zhai, C. Active feedback in ad hoc information retrieval. Proceedings of the
28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 59-66, 2005.
[32] Srikant, R. and Agrawal, R. Mining quantitative association rules in large relational tables.
Proceedings of the ACM SIGMOD Conference on Management of Data, 1996.
[33] Tan, P., Kumar, V., and Srivastava, J. Selecting the right interestingness measure for association patterns. Proceedings of the 8th ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, ACM Press, pages 32-41, 2002.
[34] Venkatesh, V., Morris, M., Davis, G., and Davis, F. User acceptance of information technology: toward a unified view. MIS Quarterly Vol. 27 No. 3, pages 425-478, 2003.
[35] Webb, G.I. Discovering associations with numeric variables. Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 383-388, New York, NY, USA, 2001. ACM Press.
[36] Wong, P.C., Whitney, P., and Thomas, J. Visualizing association rules for text mining.
Proceedings of the IEEE Symposium on Information Visualization, pages 120-123, 1999.
[37] Xin, D., Shen, X., Mei, Q., and Han, J. Discovering interesting patterns through user’s
interactive feedback. Proceedings of the 12th ACM SIGKDD international conference on
Knowledge discovery and data mining, pages 773-778, 2006.
[38] Zhang, H., Jiang, L., and Su, J. Augmenting naive Bayes for ranking. Proceedings of the
22nd International Conference on Machine Learning, Bonn, Germany, 2005.
[39] Zhang, Z., Lu, Y., and Zhang, B. An effective partitioning-combining algorithm for discovering quantitative association rules. Proceedings of the First Pacific-Asia Conference
on Knowledge Discovery and Data Mining, 1997.
Several relevant sources on text mining that were not directly used in this thesis are listed below for the benefit of the interested reader.
[40] Hu, M. and Liu, B. Mining opinion features in customer reviews. AAAI-2004, San Jose,
USA, July 2004.
[41] Hu, M. and Liu, B. Mining and summarizing customer reviews. KDD’04, Seattle, Washington, USA, August 2004.
[42] Liu, B. Web data mining. Springer, 2007. ISBN 978-3-540-37881-5.
[43] Spangler, S. and Kreulen, J. Mining the talk: unlocking the business value in unstructured
information. IBM Press, 2007. ISBN 978-0-13-233953-7.