Investigation of sub-patterns discovery and its applications
School of Computer and Information Science (Honours)
Applying pattern discovery methods to healthcare data
Academic Supervisor: Associate Prof Jiuyong Li
Student: Xun Lu
ID: 100047661
Email: [email protected]
Disclaimer
I declare the following to be my own work, unless otherwise referenced, as defined by the
University’s policy on plagiarism.
Xun Lu
Abstract
Data mining is one of the most exciting information science technologies of the 21st century. It has become an important mechanism for interpreting the information hidden in data into human-understandable knowledge. It has been heavily involved in a wide range of profiling practices, including finance, marketing, bioinformatics, genetics and medical study.
The data to be studied, in terms of their properties and relations, can vary greatly, from relational data, sequential data, graphs and models to classifiers, or combinations of these. Different data mining methods and algorithms can be adopted to analyse different forms of data presentation so that the results are assured to be interpretable and understandable.
Contrast pattern mining, or more generally contrast group mining or contrast set mining, is one of the most challenging and vital techniques in data mining research. Patterns, or groups, are collections of items which satisfy certain properties and carry interesting information [1]. In other words, patterns represent different classes of objects, for example, American males and Russian males, or the income changes from 2004 through 2009. Contrast patterns are conjunctions of attributes and values that differ meaningfully in their distribution across groups [2].
Contrast patterns of various kinds differ greatly, for example, pattern and rule based contrasts, data cube contrasts, sequence based contrasts, graph based and model based contrasts. However, no specific paper or piece of research has laid emphasis on comparing the similarities and differences between them.
This research, therefore, is intended to make a clear and comprehensive comparison of different contrast pattern techniques. It firstly provides background knowledge, which gives a grounding in data mining; then annotations on relevant literature are presented, along with a summary of the deficiencies of the different algorithms implemented for various contrast sets. The thesis also provides a critical survey of existing contrast pattern discovery methods.
One of the major data sources used in the research is from Domiciliary Care SA, a government organisation which cares for disabled and elderly people. The different algorithms discussed in this thesis all employ the same data source from Domiciliary Care SA to ensure that the results generated are comparable. A detailed description of the data is presented in Chapter 4.
Key words: data mining, contrast patterns, association rules, risk patterns,
subgroup discovery.
Contents

Chapter 1 Introduction
1.1 Background
1.2 Motivation
1.3 Research Questions
1.4 Thesis Plan
Chapter 2 Literature survey
2.1 Basic Knowledge
2.2 Related work
Chapter 3 Methodology
3.1 Distinguishing different data mining techniques
3.1.1 What is correlation analysis, and why do we need it?
3.1.2 What are the correlation measures?
3.1.2.1 Chi-square χ²
3.1.2.2 Lift
3.1.2.3 Leverage
3.2 Distinguishing different data mining algorithms
3.2.1 STUCCO
3.2.1.1 What is STUCCO
3.2.1.2 What technique does it use and how does it work
3.2.1.3 Advantages
3.2.1.4 How STUCCO determines significant contrast sets
3.2.1.5 How does STUCCO do the pruning
3.2.1.5.1 Effect size pruning
3.2.1.5.2 Interesting based pruning
3.2.2 Magnum Opus
3.2.2.1 What is Magnum Opus
3.2.2.2 What technique does it use
3.2.2.3 Pruning technique from OPUS algorithm
3.2.3 MORE
3.2.3.1 What is MORE
3.2.3.2 Risk Patterns
3.2.3.3 What technique does MORE use?
3.2.3.4 Advantages
3.2.3.5 Results generated from MORE
Chapter 4 Data description
4.1 Data
4.2 Description of each data field
Chapter 5 Mining results discussion
5.1 Algorithm MORE
5.1.1 Data preparation for algorithm MORE
5.1.2 Result discussion for algorithm MORE
5.2 Algorithm Opus
5.2.1 Data preparation for algorithm Opus
5.2.2 Result discussion for algorithm Opus
Chapter 6 Conclusion and future work
Appendix A – Annotated Bibliography
Appendix B – Two other correlated measures: all_confidence and cosine
Appendix C – OPUS Algorithms
Appendix D – Background knowledge relating algorithm MORE
Appendix E – Data Field Description
References
List of Figures and Tables

Figure 2.1 Comparing UCI applications over 1993-1998
Figure 2.2 Association rules for Bachelor and PhD degree holders
Figure 3.1 Illustration of how the leverage of a rule is computed
Figure 3.2 Attribute-value pairs in a set-enumeration tree structure
Figure 3.3 A top-down pruning on tree structure
Figure 3.4 Half-way through the pruning process
Figure 3.5 The outcome of ordinary pruning
Figure 3.6 A re-ordered tree structure for OPUS
Figure 3.7 The outcome of OPUS pruning
Figure 5.1 A snapshot of the .data file for both MORE and Magnum Opus
Figure 5.2 The .name file for application Li-rule
Table 2.1 Mushroom table
Table 3.1 2×2 contingency table
Table 3.2 2×2 contingency table (with expected values)
Table 3.3 RR can be memorised easily from a contingency table
Table 4.1 Data source from Domiciliary Care SA
Chapter 1 Introduction

1.1 Background
Data mining is one of the most exciting information science technologies of the twenty-first century. It has become an important mechanism for interpreting the information hidden in data into human-understandable knowledge. It has been heavily involved in a wide range of profiling practices, including finance, marketing, bioinformatics, genetics, engineering and medical study.
The data to be studied, according to their properties and relations, can vary greatly, from relational data and sequential data to graphs, models, classifiers, or combinations of these. Different data mining methods and algorithms can be adopted to analyse different forms of data presentation so that the results are assured to be interpretable and understandable.
1.2 Motivation
Why is contrast pattern research being undertaken so intensively? A comment from the US cartoon Get Fuzzy by Darby Conley (2001) may shed some light on this question:
Sometimes it is good to contrast what you like with something else. It makes you appreciate it even more. A general example: if someone is only looking for the highest mountain in the world, then even though he has successfully found one, e.g. Mount Everest, he will probably not be able to gain much insight into it. However, if it is compared with other mountains, such as the Alps, he may gain a better understanding of the tallest mountain he is researching. A more specialised example: focusing on the sales changes from 1998 through 2008 in just one department may be less informative and tell fewer details than comparing the sales figures over the same period across two or more similar departments.
The two examples above tell us that by comparing or contrasting with other objects, more information can be discovered. However, contrast patterns of various kinds differ greatly, for example, pattern and rule based contrasts, data cube contrasts, sequence based contrasts, graph based and model based contrasts. Each contrast pattern mining method targets a different data presentation, e.g. relational data, sequential data, graphs, models, classifiers, or combinations of these.
Different methods and algorithms are involved in solving different types of contrast pattern. Currently, [3] classifies contrast patterns into the categories shown as follows:
• Pattern/Rule based contrasts
  o Border differential algorithm
  o Tree based algorithm
  o Projection based algorithm
  o ZBDD based algorithm
• Contrast for rare class datasets
  o Synthesising
• Data cube contrasts
  o Basic OLAP operations
  o LiveSet-Driven Algorithm
• Sequence based contrasts
  o g-MDS Mining Problem
  o ConSGapMiner algorithm
• Graph based contrasts
• Model based contrasts
  o Rand index and Jaccard index
  o Mutual information
  o Clustering error
Under each first level dot point, the second level dot points list some of the methods or algorithms related to solving the corresponding contrast patterns. There are, however, many more methods and algorithms not listed here, which can easily cause confusion when choosing among them. Given one contrast mining problem, what methods are available? What are the pros and cons of using a certain method? Are there any improvements to be made to the methods? This research, therefore, is intended to make a clear and comprehensive comparison of different contrast pattern techniques and provide improvements where possible. This leads to the research questions of the thesis.
1.3 Research Questions
This thesis aims to address the following research question:
How can contrast patterns be discovered efficiently in large data sets?
This research question can be broken down into three sub-questions which make the problem more precise and manageable.
1. What methods are available?
2. What are their strengths and weaknesses?
3. How to improve the methods?
1.4 Thesis Plan
The thesis contains the following sections. Section One provides some basic knowledge, which gives a grounding in data mining. Section Two annotates relevant literature. Section Three provides a critical evaluation of existing contrast pattern discovery methods and explains the methods that will be adopted to answer the research questions. Section Four, the final section, discusses the test results and concludes the thesis.
Chapter 2 Literature survey
This section gives basic knowledge and an overview of the early works related to the research topic.
2.1 Basic Knowledge
The basic knowledge for this research is drawn from various sources in the literature. Two basic concepts of data mining are Support and Confidence.
They are two measurements of an association rule. Support represents the usefulness and Confidence reflects the certainty of discovered rules [22]. For example, digital camera → memory card (support 2%, confidence 54%) means that the transactions in which a digital camera and a memory card are purchased together make up 2% of all transactions (support); a confidence of 54% means that 54% of customers who buy a digital camera also buy a memory card.
Many data mining algorithms can be involved in discovering patterns, such as association rules, decision trees, clustering, regression, etc.
Association rules are one of the most popular techniques in data mining [10]. They take an "If…Then…" format [16]. Take the association rule "Bread → Milk" as an
example, "IF a customer purchases bread, THEN this customer is likely to purchase milk". Association rules exhaustively look for hidden patterns, making them suitable for discovering predictive rules involving subsets of the medical data set attributes [13]. This, on the other hand, is also a problem of association rules [15]: they tend to discover too many rules that are trivial and repetitive, particularly when the minimum support is set low. Some other methods, such as mining non-redundant association rules [20], association rule discovery with the train and test approach [17], mining the most interesting rules [18] and mining k-optimal rules [19], try to overcome this problem; however, they still suffer from other deficiencies, for example, being difficult for end users to understand [15].
The fundamental task of data analysis is to understand the meaning and the differences
between contrasting groups. This led to the development of a new special purpose data
mining technique, Contrast-set mining [4].
Contrast pattern mining, or more generally contrast group mining or contrast set mining, is one of the most challenging and vital techniques in data mining research.
Definition 1: Patterns (or Itemsets) are collections of items which satisfy certain properties and carry interesting information [1]. In other words, patterns represent different classes of objects.
Example 1: {computer, digital camera, memory card};
In Example 1, computer, digital camera and memory card are three items. This three-item
transaction from a computer store is a pattern. In other words, each single transaction can
be regarded as a pattern.
Before giving the definition of Contrast Patterns, it is necessary to define what an attribute-value pair is. An attribute-value pair is a fundamental data representation in which each item record may be expressed as a collection of tuples <attribute name, value>; each element is an attribute-value pair [31].
Definition 2: Contrast Patterns are conjunctions of attribute-value pairs that differ meaningfully in their distribution across groups [2].

Example 2.1: MaritalStatus = divorced ∧ Age ∈ [60, 69]
Example 2.2: Degree = Ph.D ∧ Gender = female ∧ Income ≥ $10,000
In Example 2.1, the contrast pattern contains two attribute-value pairs, i.e. people who are divorced and people who are aged between 60 and 69. The conjunction of these two pairs forms a contrast pattern.
In Example 2.2, the contrast pattern consists of three attribute-value pairs, i.e. people with a PhD degree, who are female, and whose income is over 10 thousand dollars. Anyone who falls into this contrast pattern satisfies the attributes defined within it, i.e. a female PhD holder whose income is above $10,000.
Contrasting specific groups of interest plays a key role in social science research. Bay & Pazzani (2001, p.213) aim to automatically detect all the differences between contrasting groups from observational multivariate data. Following Bay and Pazzani (2001, p.217), the data is a set of groups G1, G2, …, Gl, where each group is a collection of objects O1, O2, …, On, and each object Oi is a set of k attribute-value pairs, one for each of the attributes A1, A2, …, Ak. Attribute Aj has values drawn from the value domain Vj1, Vj2, …, Vjm.
They search for contrast sets, conjunctions of attributes and values that have different levels of support in different groups. According to [4], a contrast set is a set of attribute-value pairs with no attribute Ai occurring more than once. This is equivalent to an itemset in association-rule discovery when applied to attribute-value data. [4] also states that contrast set discovery is to find all contrast sets whose support differs meaningfully across groups. This is defined as seeking all contrast sets cset that meet both requirements:
Eq.1: ∃ij : P(cset | Gi) ≠ P(cset | Gj)

and

Eq.2: max over ij of | supp(cset, Gi) − supp(cset, Gj) | ≥ δ
where δ is a user-defined threshold called the minimum support-difference. Contrast sets that meet the first requirement (Eq.1) are called significant. If the second requirement (Eq.2) is met, the contrast sets are called large. It is also important to note that a contrast set must differ meaningfully across groups: while Eq.1 provides the basis of a statistical test of 'meaningful', Eq.2 provides a quantitative test thereof. If both requirements are met, such a contrast set is called a deviation. Eq.1 means that the probability of the contrast set (cset) in some group i must differ from its probability in some other group j; in other words, the contrast set represents a true difference between groups. In Eq.2, supp(cset, Gi) denotes the support of cset in group Gi and supp(cset, Gj) its support in group Gj; the difference between them must be at least a pre-defined minimum support-difference, denoted δ. This equation ensures that every contrast pattern discovered has a big enough effect to be considered important by users [2].
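As an illustration of the "large" requirement (Eq.2), the following Python sketch computes per-group supports for a candidate contrast set and checks the support-difference threshold. The records, groups and δ value are invented for illustration, and the statistical test behind Eq.1 (e.g. chi-square) is deliberately omitted here:

    # Sketch of the "large" test (Eq.2). Groups are lists of records;
    # each record is a set of attribute-value pairs. All data and the
    # delta threshold are invented for illustration.
    def group_support(cset, group):
        return sum(1 for record in group if cset <= record) / len(group)

    def is_large(cset, groups, delta):
        # max pairwise support difference equals max support - min support
        supports = [group_support(cset, g) for g in groups]
        return max(supports) - min(supports) >= delta

    g1 = [{("degree", "PhD"), ("gender", "female")},
          {("degree", "PhD"), ("gender", "male")}]
    g2 = [{("degree", "BSc"), ("gender", "female")},
          {("degree", "BSc"), ("gender", "male")},
          {("degree", "PhD"), ("gender", "female")}]

    print(is_large({("degree", "PhD")}, [g1, g2], delta=0.3))  # True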
According to [1], recently proposed patterns for knowledge discovery from databases are called Emerging Patterns (EPs). EPs, which capture significant changes and differences between datasets, are defined as itemsets whose supports increase significantly from one dataset, D1, to another, D2 [5]. More precisely, EPs are itemsets whose growth rates are larger than a given threshold, where the growth rate is the ratio of the support in D2 over that in D1 [5]. EPs can capture useful contrasts between classes, for example male vs. female, poisonous vs. edible.
The following is a typical example [3] which shows what the growth rate is and how an EP captures the significant change in class X to predict whether a mushroom is edible:

X = {(Odour = none), (Stalk_Surface = smooth), (Ring_Number = one)}

EP    Supp of Poisonous    Supp of Edible    Growth Rate
X     0.2%                 57.6%             288

Table 2.1: Mushroom table
It can be seen from the table that the support increases drastically from just 0.2% in the poisonous class to 57.6% in the edible class, a growth rate of 288. We can predict that this mushroom is 99.6% likely to be edible: 99.6% ≈ 57.6% ÷ (57.6% + 0.2%).
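The arithmetic of this example can be reproduced directly; the short Python sketch below recomputes the growth rate and the class estimate from the two supports in Table 2.1:

    # Recomputing Table 2.1: growth rate of the EP and the class estimate.
    supp_poisonous = 0.002            # support of X among poisonous (0.2%)
    supp_edible = 0.576               # support of X among edible (57.6%)

    growth_rate = supp_edible / supp_poisonous
    p_edible = supp_edible / (supp_edible + supp_poisonous)

    print(round(growth_rate))         # 288
    print(round(p_edible, 3))         # 0.997, i.e. roughly 99.6-99.7%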
EPs with very large growth rates are notable differentiating characteristics between edible and poisonous mushrooms, and they have been useful for building powerful classifiers [5]. The advantages of EPs are that they can be easily understood and used directly by people, and EPs have been widely used for predicting the likelihood of diseases.
There are two different types of groups in data mining (Bay and Pazzani, 2001, p.214). One is based on time, with groups falling into different years, which forms a trend: for instance, observations spaced through time with one observation per time point, as in the examples shown in Figure 2.1. The other type is based on multiple observations at a few discrete points in time. For example, we could have thousands of World Cup soccer match records from the years 1998, 2002 and 2006.
Figure 2.1 (a)
Figure 2.1 (b)
Fig.2.1: Comparing UCI applications over 1993-1998. (a) Admitted ICS students with SAT Math > 700 ∧ SAT Verbal > 700; (b) UCI applicants with Admit = Yes ∧ Home Location = Los Angeles County ∧ School Type = Public. (Bay & Pazzani 2001, p.214)
Other mining approaches such as Decision Tree and Rule Learner* can also be used to discover patterns. Despite the advantage of being fast, [2] comment on their weaknesses:
• Rule learners and decision trees are not complete;
• Rule learners and decision trees may miss groups which are important;
• The interpretation of a rule may be difficult if all previous rules were not satisfied;
• It is difficult to specify useful criteria in the classification framework.
However, they fail to provide more details on why these disadvantages arise, which leaves a very good opportunity for my research to undertake a further investigation of them.
* Decision Tree, Rule Learner and some other typical pattern mining methods will be presented in more detail in Chapter 2.2 Related work.
Association rule mining [10] is closely related to contrast set discovery. Association rules are relations between variables of the form X → Y, where X and Y can represent any items, such as beer and diapers respectively: X → Y then means that people who bought beer also bought diapers. X and Y can also represent categorical data such as degree = PhD or income ≥ $10,000; X → Y then means that people who are PhD degree holders earn over $10,000. According to [2], the techniques involved in finding association rules and mining contrast sets have many commonalities, because both require search through a space of conjunctions of items or attribute-value pairs. In association rule mining we look for sets that have support greater than a certain threshold; for contrast sets, on the other hand, we look for sets which represent substantial differences in the underlying probability distributions. Bay and Pazzani therefore tried to apply association rule mining algorithms directly to find contrast sets, but it turned out that the results are really hard to interpret and, moreover, pruning opportunities which can significantly improve the mining efficiency may be lost.
Figure 2.2 (a)
Figure 2.2 (b)
Fig.2.2: Association rules for Bachelor and PhD degree holders. Rules are in the form X → Y (support, confidence). (a) First 10 of 26796 association rules for Bachelor holders; (b) First 10 of 1674 association rules for PhD holders. [1]
As can be seen from Fig.2.2, it is really difficult to draw a conclusion about the relationship between Bachelor and PhD degree holders because:
• There are too many rules to compare;
• The results are difficult to interpret as they are not consistent contrasts* [11];
• A proper statistical comparison is needed to see if differences in support and confidence are significant.
* A consistent contrast means the attributes compared in separate groups are the same.
To handle these problems, [2] present a method called STUCCO (Search and Testing for Understandable Consistent Contrasts), which runs efficiently and can mine at low support differences without being overwhelmed by large itemsets.
[4] tried to prove that contrast-set mining is a special case of the more general rule-discovery task by comparing three alternative data mining techniques: STUCCO, Magnum Opus and C4.5. These three methods were selected because STUCCO is the only data mining approach designed for identifying contrasts between groups, while Magnum Opus and C4.5 are considered well suited to performing this type of contrast analysis. Magnum Opus is a general purpose rule-discovery system which implements the OPUS_AR rule-discovery algorithm [4]. It does not require the specification of a minimum support because it does not use the frequent-itemset strategy; however, it has association-rule functionality. C4.5 [8], on the other hand, discovers classification rules by first discovering a decision tree, then converting that tree to an equivalent set of rules, then simplifying those rules.
After observing the results produced by the three data mining techniques (STUCCO, Magnum Opus and C4.5) independently, [4] found that Magnum Opus produced rules corresponding to all contrast sets found by STUCCO. Contrast set discovery and general rule discovery constrained to use only the group identifier as a consequent both seek to identify equivalent situations; they differ only in the way they assess whether a group difference is meaningful. In addition, according to [4], the main distinction between STUCCO and Magnum Opus is in the application of filters that seek to identify and remove spurious contrast sets. Magnum Opus uses a binomial sign test while STUCCO uses a chi-square test, which is believed to be a better test for the contrast-discovery task because it is more sensitive to a small range of extreme forms of contrast.
One of the tasks of this research is to answer how to efficiently find patterns (i.e. association rules, A (antecedent) → C (consequent)) in a data set. There are many ways of improving the efficiency of association rule discovery algorithms. [7] argues that in some applications direct search for association rules can be more efficient than Apriori-style algorithms. A more detailed description is given in Chapter 3.
2.2 Related work
A related piece of work that I have been doing since 2008 is investigating risk patterns in a data set provided by Domiciliary Care SA. Domiciliary Care SA is a state government organisation that provides services to people with reduced ability to care for themselves, helping them to stay in their own homes, living close to their loved ones, family and local community [21]. This work was about analysing the results (risk patterns) generated by the Lirule application. The application uses the MORE (Mining Optimal Risk pattErn sets) algorithm [15]. What makes the MORE algorithm stand out from other association rule mining algorithms is that it makes use of the anti-monotone property to efficiently prune the search space.
The anti-monotone property of frequent itemsets is defined here: an itemset is frequent if its support is greater than the minimum support, and an itemset is potentially frequent only if all its subsets are frequent [15]. This property limits the number of itemsets to be searched and, consequently, improves the efficiency of the algorithm.
Chapter 5 will present and discuss the results generated by application Lirule.
Chapter 3 Methodology

3.1 Distinguishing different data mining techniques
Before comparing the differences between the following techniques, it is essential to understand the meanings and characteristics of the measurement rules, i.e. the correlation measures between them.
3.1.1 What is correlation analysis, and why do we need it?
Using the association rule measurement A → B [support, confidence] is a very basic and fundamental approach to determine whether a rule is interesting to users or not. On the other hand, because of this basic measurement, there is a possibility that the results are misleading or incorrect. To overcome this, rather than using the association rule A → B [support, confidence] alone, we add one more measurement, correlation, giving correlation rules of the form A → B [support, confidence, correlation].
A number of correlation measures are widely used, such as lift, χ² (chi-square test), all_confidence and cosine.
The following 2×2 contingency table is an example showing why using support and confidence alone does not suffice to make a sound determination. Consider the data below:
          Milk    ¬Milk   Σrow
Sugar     160     140     300
¬Sugar    80      20      100
Σcolumn   240     160     400

Table 3.1
In the table, "Milk" represents the number of transactions that contain milk whereas "¬Milk" represents the number of transactions that do not contain milk. The same reading applies to "Sugar". So, among the 400 transactions in total, we have 160 transactions that contain both Milk and Sugar and 140 transactions that contain Sugar but no Milk. Similarly, 80 transactions contain Milk but no Sugar and 20 transactions contain neither Milk nor Sugar.
According to the association rule measurement A → B [support, confidence], we have Milk → Sugar [40%, 66.7%], calculated from support = 160/400 = 40% and confidence = 160/240 = 66.7%. Suppose the given minimum support threshold and minimum confidence threshold are 35% and 65% respectively. The results show that not only does the rule Milk → Sugar satisfy the minimum support and confidence, it also seems very interesting and significant to users. However, if we take a closer look at the data in the table, it is not difficult to find that the percentage of transactions purchasing Sugar, P(Sugar), is 300/400 = 75%, which is much higher than the confidence of purchasing Sugar given Milk (66.7%). This means purchasing Milk does not increase the likelihood of purchasing Sugar; Milk in fact negatively affects the likelihood of buying Sugar.
The example above shows that the confidence of a rule A → B can be misleading in that it is only an estimate of the conditional probability of itemset B given itemset A [22]. It does not reflect the actual relationship between itemsets A and B. This is the reason why correlation analysis is introduced alongside association analysis.
3.1.2 What are the correlation measures?
Some popular correlation measures are χ² (chi-square), lift, all_confidence and cosine. The following illustrates how each measure is calculated and determines whether a given rule is interesting or not. At the end, their advantages and disadvantages will be discussed.
3.1.2.1 Chi-square χ²
Chi-square is used for calculating the correlation relationship between two attributes A and B with categorical data [22]. The result shows whether attributes A and B are independent or not.
It is computed as:

χ² = Σ (i = 1..r) Σ (j = 1..c) (o_ij − e_ij)² / e_ij   (Equation 1)

where o_ij is the observed count of the joint event (Ai, Bj) and e_ij is the expected frequency of (Ai, Bj), which is computed from:

e_ij = (count(A = ai) × count(B = bj)) / N   (Equation 2)

where N is the total number of data records, count(A = ai) is the number of rows which have value ai for A and, similarly, count(B = bj) is the number of rows which have value bj for B.
The χ² value is summed over all r rows × c columns of the contingency table. To illustrate how χ² works, we again use the same 2×2 contingency table:
          Milk    ¬Milk   Σrow
Sugar     160     140     300
¬Sugar    80      20      100
Σcolumn   240     160     400
In the table, we have two attributes: Milk and Sugar. The observed frequencies are shown in every cell of the table. Now we need to calculate the expected frequency of every joint event (Ai, Bj).
According to Equation 2, the expected frequency for cell (Milk, Sugar) is:

e_Milk,Sugar = (count(Milk) × count(Sugar)) / N = (240 × 300) / 400 = 180;

The expected frequency for cell (¬Milk, Sugar) is:

e_¬Milk,Sugar = (count(¬Milk) × count(Sugar)) / N = (160 × 300) / 400 = 120;

The expected frequency for cell (Milk, ¬Sugar) is:

e_Milk,¬Sugar = (count(Milk) × count(¬Sugar)) / N = (240 × 100) / 400 = 60;

And lastly, the expected frequency for cell (¬Milk, ¬Sugar) is:

e_¬Milk,¬Sugar = (count(¬Milk) × count(¬Sugar)) / N = (160 × 100) / 400 = 40;
          Milk        ¬Milk       Σrow
Sugar     160 (180)   140 (120)   300
¬Sugar    80 (60)     20 (40)     100
Σcolumn   240         160         400

Table 3.2
Every expected value is written in brackets in the corresponding cell of the table. It should be noted that the sum of the expected frequencies is equal to the sum of the observed counts.
Now we are able to calculate the chi-square value based on Equation 1:

χ² = (160 − 180)²/180 + (140 − 120)²/120 + (80 − 60)²/60 + (20 − 40)²/40
   = 400/180 + 400/120 + 400/60 + 400/40
   = 2.22 + 3.33 + 6.67 + 10 = 22.22
For this 2×2 table, the degrees of freedom are (2−1) × (2−1) = 1. For 1 degree of freedom, the χ² value needed to reject the independence hypothesis at the 0.001 significance level is 10.828 [22]. Since the chi-square value we calculated (22.22) is higher than 10.828, attributes Milk and Sugar are not independent; in other words, they are correlated.
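The whole calculation above can be checked with a few lines of Python; the sketch below recomputes the expected frequencies (Equation 2) and the χ² value (Equation 1) from the observed counts of Table 3.1:

    # Recomputing the chi-square value for the Milk/Sugar table.
    observed = [[160, 140],   # Sugar:     Milk, not-Milk
                [80, 20]]     # not-Sugar: Milk, not-Milk

    n = sum(map(sum, observed))
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]

    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n   # Equation 2
            chi2 += (o - e) ** 2 / e                # Equation 1

    print(round(chi2, 2))  # 22.22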
3.1.2.2 Lift
"Lift" is probably the most commonly used metric for measuring the performance of targeting models in marketing applications [23]. Compared to chi-square, lift is a simple correlation measure that does not require much calculation to tell whether two attributes are correlated (dependent) or not. It is computed as:

lift(A, B) = P(A ∩ B) / (P(A) P(B))   (Equation 3)

where P(A ∩ B) is the probability that A and B occur together.
Lift measures whether the given two attributes, A and B, are independent or not. At this stage, we are only concerned with whether the value of lift is equal to 1, greater than 1 or less than 1.
① When lift(A, B) = 1, i.e. P(A ∩ B) = P(A) P(B): the probability of A and B together is equal to the probability of A times the probability of B, so they are independent;

② When lift(A, B) > 1, i.e. P(A ∩ B) > P(A) P(B): we say A and B are positively correlated (dependent), which means the occurrence of one implies the occurrence of the other;

③ When lift(A, B) < 1, i.e. P(A ∩ B) < P(A) P(B): we say A and B are negatively correlated (dependent), which means the occurrence of one makes the occurrence of the other less likely.
Both situations ② and ③ tell us that attributes A and B are dependent, so what is the difference between them; how do positively correlated and negatively correlated differ? To answer this question and overcome the problem discussed in section 3.1.1, let us use the Milk and Sugar example again.
lift(Milk, Sugar) = P(Milk ∩ Sugar) / (P(Milk) P(Sugar)) = (160/400) / ((240/400) × (300/400)) = 0.4 / 0.45 = 0.89
The computed lift value of 0.89 is less than 1, which tells us that Milk and Sugar are negatively correlated. What this actually means is that customers who buy Milk in a transaction are less likely to buy Sugar, and customers who buy Sugar in a transaction are less likely to buy Milk. Because Milk and Sugar negatively influence each other, they are unlikely to be purchased together.
We can also reach the same explanation by looking at the fraction 0.4/0.45. The value 0.4 represents the probability of milk and sugar being purchased together, whereas 0.45 represents what that likelihood would have been if the two purchases were completely independent [22].
This clearly explains why, although the confidence of Milk → Sugar (66.7%) is high, it does not necessarily mean that the association rule Milk → Sugar is strong, or that the probability of purchasing sugar is increased by the purchase of milk. This also shows that results generated using support and confidence alone can be misleading or even incorrect.
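For completeness, the lift calculation for this example can be reproduced as follows; the counts come from Table 3.1:

    # Lift for Milk -> Sugar (Equation 3), from the counts in Table 3.1.
    n, n_milk, n_sugar, n_both = 400, 240, 300, 160

    lift = (n_both / n) / ((n_milk / n) * (n_sugar / n))
    print(round(lift, 2))  # 0.89: below 1, so negatively correlated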
3.1.2.3 Leverage
Leverage is the correlation measure used in the algorithm Magnum Opus.
By definition [24],
“The leverage of a rule is the number of additional cases covered by both the
LHS (Left-Hand-Side) and RHS (Right-Hand-Side) above those expected if
the LHS and RHS were independent of each other. This is a measure of the
importance of the rule that reflects both the strength and the coverage of the
rule.”
Here is an example to illustrate how the leverage of a rule A(LHS) → B(RHS) is computed:
[Figure: of 1000 total cases, the LHS covers 200 cases, the RHS covers 100 cases, and 50 cases are covered by both.]

Figure 3.1
Consider the fact that there are 1000 items (i.e. cases) in total; item A (LHS) covers 200 cases and item B (RHS) covers 100 cases. The co-occurrences of items A and B (A ∩ B) number 50.
Obviously, the proportion of A ∩ B is 50/1000 = 5%. This is the proportion from the raw count. The expected proportion of A ∩ B is calculated from the proportion of A times the proportion of B: (200/1000) × (100/1000) = 2%, which is 20 cases (2% × 1000 = 20), provided item A is independent of item B.
Hence:

Leverage count: 50 − 20 = 30
Leverage proportion: 30 / 1000 = 0.03
As the definition of leverage implies, the value computed indicates the importance of the rule: the higher the leverage, the more important the rule. This is quite straightforward. In the example above, the leverage count of rule A → B is 30, which means items A and B are correlated; otherwise there would be no difference between the raw count and the expected value, i.e. the leverage count would be zero, meaning items A and B are independent.
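The leverage arithmetic of Figure 3.1 can be reproduced with the short sketch below:

    # Leverage for the Figure 3.1 example: observed co-occurrence minus
    # what independence would predict.
    n, n_lhs, n_rhs, n_both = 1000, 200, 100, 50

    expected = (n_lhs / n) * (n_rhs / n) * n      # 20 cases under independence
    leverage_count = n_both - expected            # 30.0
    leverage_proportion = leverage_count / n      # 0.03
    print(leverage_count, leverage_proportion)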
There are other useful correlation measures, such as all_confidence and cosine. However, these measures have not been applied in the data mining algorithms discussed in this thesis. A detailed comparison between them and the three measures above is given in Appendix B.
3.2 Distinguishing different data mining algorithms
Often, once all the contrast sets that satisfy the significance and largeness requirements are found, they may still not meet the real needs of users, because most of the contrast sets found are of no interest to them. It is an open challenge in data mining to decide whether a given contrast set is interesting. Certain techniques must be applied to handle this problem; these are generally called post-pruning techniques. This chapter focuses on three algorithms which use different methods to post-prune uninteresting contrast sets: STUCCO from Stephen D. Bay and Michael J. Pazzani; Magnum Opus from Geoffrey Webb; and MORE from Jiuyong Li.
3.2.1 STUCCO
3.2.1.1 What is STUCCO
STUCCO is an acronym for Search and Testing for Understandable Consistent Contrasts. It is essentially a pruning rule incorporating heuristic functions to limit the complexity of searching for contrast sets in a huge search space [2].
For example, if the given support is set very low, this will result in a huge number of qualifying contrast sets, which increases the difficulty and slows the searching speed of the mining algorithms. In practice, however, STUCCO is able to handle this efficiently. This will be discussed in detail in section 3.2.1.2.
3.2.1.2 What technique does it use and how does it work
Let us first examine attribute-value pairs represented in a set-enumeration tree structure [25, 26]. The domain of the attributes is {milk, tea, sugar, coffee}.
Figure 3.2
We can regard each itemset enclosed in curly brackets as the items purchased in one transaction. For example, node {tea, sugar, coffee} means these three items appeared together in one of the transactions from a customer.
STUCCO organises the nodes with the same parent into one group [2]. For example, node {tea, sugar} and node {tea, coffee} at the third level from the top belong to one group, because they have the same parent, also called the heading, {tea}. Similarly, node {milk, tea, sugar} and node {milk, tea, coffee} belong to one group with the heading {milk, tea}, which, in turn, shares the same heading {milk} with the two nodes {milk, sugar} and {milk, coffee}.
Within each group, STUCCO also maintains two lists of items [2]: the head h(g) and the tail t(g). The head h(g), as just demonstrated, is the common prefix of the group. The tail t(g) is all the items apart from the prefix h(g). For example, consider the group {milk, tea}, {milk, sugar}, {milk, coffee}: the head h(g) is {milk} and the tail t(g) is {tea, sugar, coffee}.
3.2.1.3 Advantages
As mentioned above, the purpose of STUCCO organising the items in each group into two lists, the head h(g) and the tail t(g), is to improve the efficiency of searching for interesting itemsets.
First, the headings: given any itemset, simply check whether its prefix matches any of the headings already found to fail the requirements. If it does, we can immediately judge that this itemset does not meet our requirements, without further checking each combination of items in the itemset. For example, suppose we already know that the itemset {milk, sugar} does not meet our requirements. When given another itemset {milk, sugar, coffee}, we can be sure that this itemset is not interesting, because it matches the heading {milk, sugar}, which has been marked as uninteresting. In other words, STUCCO only examines itemsets whose prefixes are still interesting and ignores those that are not, so the searching speed is improved in this way.
The second advantage of keeping the head and tail is to define the upper bound and lower bound of the support. This is a very important and useful technique used in the pruning method.
The upper and lower bounds of the support are computed by counting the support of h(g) ∪ {i} for every item i in the tail t(g), as well as h(g) ∪ t(g) [2]. The upper bound of the support is the maximum value the support of a given itemset can possibly take. We know that the support of any itemset is less than or equal to the support of each of its subsets. For example, if the support of {sugar, coffee} is 60%, then we can be sure that the support of its superset {sugar, coffee, milk} cannot be greater than 60%. STUCCO counts the support of every itemset that is in the group of the same heading.
3.2.1.4 How STUCCO determines significant contrast sets
Based on the definition in [27], the p-value is a statistical term that represents the probability of obtaining a test statistic at least as extreme as the one actually observed, assuming that the null hypothesis is true. The lower the p-value, the less likely the null hypothesis is to be true and the more "significant" the result. If the p-value is less than 0.05, the null hypothesis will usually be rejected.
STUCCO applies the chi-square test to check the independence of variables in contingency tables. It then compares the p-value obtained from the χ² value against α (the maximum probability of rejecting the null hypothesis when it is true, usually set to 0.05 for a single test). If the p-value is less than 0.05, the null hypothesis of independence is rejected and the contrast set is considered significant.
3.2.1.5 How does STUCCO do the pruning
STUCCO applies three techniques to prune contrast sets which do not meet the requirements and interestingness criteria: effect size pruning, statistical significance pruning and interesting based pruning. In the following sections, I will concentrate on effect size pruning and interesting based pruning.
3.2.1.5.1 Effect size pruning
The effect size pruning technique optimises the support: STUCCO prunes contrast sets whose upper bound on support difference is below δ.
The support threshold for a contrast set is generally decided by experienced users. However, the quality of the generated results will be reduced if the support is not set properly. If the support is set too high, many interesting contrast sets will be missed; if it is set too low, the results may contain too much trivial information of no value to users at all. Moreover, too many uninteresting contrast sets will increase the searching and pruning time, which in turn decreases efficiency. The following theorem [2] is what STUCCO relies on to determine the bound:
Theorem 1. Let U[i] be an upper bound and L[j] a lower bound on the support for groups i and j. Then the following is an upper bound on the support difference between any two groups:

δ_max = max over all i, j with i ≠ j of (U[i] − L[j])

This takes every pair of groups into consideration and takes the maximum as an upper bound [2]. Any contrast set whose δ_max is lower than δ will be pruned.
3.2.1.5.2 Interesting based pruning
Some contrast sets pass the minimum support threshold but are not interesting, because they present no useful or new information to users.
For example, if a contrast set Mobile_Bill_Expense > $50 ∧ Gender = Male meets the requirements and is interesting to the user, another, specialised contrast set Mobile_Bill_Expense > $50 ∧ Gender = Male ∧ has_Mobile = Yes also meets the requirements and is interesting to the user. But the latter does not provide more information than the first, because once we know the first contrast set, it is implied that people with a mobile phone bill must have a mobile phone.
To conclude, for interesting based pruning STUCCO has two conditions to determine whether a contrast set should be pruned [2]:

∀i: P(cset = True | Gi) = P(cset′ = True | Gi)   (Equation 4)

max over i of | supp(cset, Gi) − supp(cset′, Gi) | ≤ ε_s   (Equation 5)

where cset′ is a contrast set that is a specialisation of contrast set cset. Using the example above, Mobile_Bill_Expense > $50 ∧ Gender = Male ∧ has_Mobile = Yes is a specialised contrast set of Mobile_Bill_Expense > $50 ∧ Gender = Male, which is denoted cset.
It is obvious that if the support of cset′ is the same as the support of cset in every group, then cset′ provides no new information; it simply has the same meaning as cset. Therefore, if ∀i: P(cset = True | Gi) = P(cset′ = True | Gi), this specialised contrast set should be pruned.
In Equation 5, ε_s is normally set to 1% (Bay & Pazzani 2001). This means that if the maximum support difference between a contrast set and its specialised contrast set is not greater than ε_s, then the specialised contrast set should also be pruned.
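The two pruning conditions can be expressed compactly in code. The following sketch, with invented per-group supports, checks Equations 4 and 5 for a contrast set and one of its specialisations:

    # Checking Equations 4 and 5 for a contrast set and one of its
    # specialisations; per-group supports are invented for illustration.
    EPSILON_S = 0.01    # 1%, as suggested by Bay & Pazzani (2001)

    def should_prune(supp_cset, supp_spec, eps=EPSILON_S):
        # Eq.4: identical support in every group -> no new information
        identical = all(a == b for a, b in zip(supp_cset, supp_spec))
        # Eq.5: maximum support difference across groups at most eps
        max_diff = max(abs(a - b) for a, b in zip(supp_cset, supp_spec))
        return identical or max_diff <= eps

    print(should_prune([0.30, 0.10], [0.298, 0.099]))  # True -> prune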
3.2.2 Magnum Opus

3.2.2.1 What is Magnum Opus
Magnum Opus is a commercial application which applies the OPUS_AR rule discovery algorithm [4]. OPUS_AR is an extended version of the OPUS search algorithm. Magnum Opus uses the OPUS systematic search approach to perform association rule search.
Where OPUS_AR differs from OPUS is that OPUS_AR does not limit the RHS of a rule to a single attribute-value pair [4]. However, this improvement of OPUS_AR will not be a major concern in this thesis. What will be discussed are the searching and pruning principles adopted by both OPUS_AR and OPUS.
3.2.2.2 What technique does it use
The rule measurements that Magnum Opus uses are support, confidence (strength), lift, coverage and leverage; the default measure is leverage [4].
Leverage measures the magnitude of the difference between the actual (observed) frequency count and the expected frequency count. The equation is as follows:

leverage(a → c) = supp(a ∪ {c}) − supp(a) × supp({c})

where supp(a ∪ {c}) represents the actual support of the rule a → c and supp(a) × supp({c}) is the expected value.
Magnum Opus prunes all rules that do not satisfy the following condition:

∀x ⊂ a: P(c | a) > P(c | x)
where a is a specialisation of x (i.e. x ⊂ a). The condition means that for every contrast set x contained in contrast set a, the proportion of the consequent under a must be greater than the proportion under x; if it is not, contrast set a should be pruned.
For example:

x = (1, 2)
a1 = (1, 2, 4)
a2 = (1, 2, 6)

If the consequent c is, for example, set to (5, 9), we now have:

x → c, i.e. (1, 2) → (5, 9)
a1 → c, i.e. (1, 2, 4) → (5, 9)
a2 → c, i.e. (1, 2, 6) → (5, 9)
If the proportion for x → c is 60% and the proportion for a1 → c is also 60%, then the rule a1 → c should be pruned, because it does not add any extra useful information over its parent rule: its proportion is not greater than that of its parent node.
3.2.2.3 Pruning technique from OPUS algorithm
In this section, I will focus on how the OPUS algorithm increases its search speed by improving its pruning method.
Let us use an example to illustrate this pruning technique. Using the example from section 3.2.1.2, we have four elements in the search space: milk, tea, sugar and coffee. According to the grocery transaction data, the following transaction records were made:
{milk},
{tea},
{sugar},
{coffee},
{milk, tea}, {milk, sugar}, {milk, coffee},
{tea, sugar}, {tea, coffee},
{sugar, coffee},
{milk, tea, sugar}, {milk, tea, coffee}, {milk, sugar, coffee},
{tea, sugar, coffee},
{milk, tea, sugar, coffee}
All these transaction records can be arranged into a tree structure which facilitates the pruning process. The structure is usually fixed, hence called a fixed-structure tree. The fixed-structure tree looks like this:
Figure 3.3
Figure 3.3 above shows the fixed-structure tree; an ordinary search algorithm will follow top-down order to perform the searching and pruning. Suppose, for example, an algorithm is searching and finds that the itemset {sugar, coffee} is not significant (not interesting to the user) and should be pruned.
Figure 3.4
In general, a rule discovery algorithm knows that if the support of a given itemset is below a predefined threshold, then none of the supersets of that itemset can contain solutions; therefore they should also be pruned. In this example, any itemset that contains {sugar} (except the single item itself, i.e. itemset {sugar}, which will be kept) will be eliminated. These itemsets are: {milk, tea, sugar}, {milk, tea, sugar, coffee}, {milk, sugar}, {milk, sugar, coffee}, {tea, sugar} and {tea, sugar, coffee}.
Ultimately, after the pruning, the result will look like the figure below:
Figure 3.5
The purpose of this pruning is to make the next rule discovery step as simple as possible. As can easily be noticed, the efficiency of doing it this way is very low: only after every node has been searched (coloured in red) and the target reached can a single node be pruned. To achieve the result in figure 3.5, the ordinary algorithm needs to go through the fixed-structure tree seven times. This is because every search only eliminates one node which contains no result, and there are seven nodes containing the element {sugar}, so seven top-down searches are needed.
The search space of the example contains only four conditions (milk, tea, coffee and sugar) to be analysed. When the number of conditions reaches ten thousand, the search space grows exponentially, to 2^10000 [4]. A search space of this size is impossible to prune node by node; therefore, an improved search algorithm is needed.
The following shows the strategy by which OPUS improves on the traditional search sequence by changing the fixed-structure tree into a reordered-structure tree, in order to reduce the number of searches.
The idea is simple: reorder the search space so that any node to be pruned is placed before all the nodes that are not to be pruned.
Figure 3.6
As figure 3.6 shows, the biggest difference is that the tree has been restructured so that all the nodes containing the unwanted element {sugar} have higher priority than the ones without: they are placed before the others. The advantage of this is that when the pruning algorithm searches through the tree to prune the uninteresting nodes, a single search achieves the same result as seven searches in the fixed-structure tree. Efficiency is greatly increased in this way.
The outcome in figure 3.7 is identical to the outcome in figure 3.5.
Figure 3.7
The detail of the OPUS search algorithm is given in Appendix C.
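The following toy sketch illustrates the underlying idea (it is not the actual OPUS algorithm of Appendix C): once an itemset fails the productivity test, its entire subtree is cut at that single test rather than being revisited branch by branch. Unlike the worked example above, the single item {sugar} is dropped here too, for simplicity:

    # Toy sketch of propagated pruning: an unproductive node's whole
    # subtree is cut at a single test instead of branch by branch.
    def expand(prefix, remaining, is_productive, out):
        items = list(remaining)
        while items:
            item = items.pop(0)
            node = prefix | {item}
            if not is_productive(node):
                continue              # cut: supersets below are never built
            out.append(node)
            expand(node, list(items), is_productive, out)

    is_productive = lambda s: "sugar" not in s
    result = []
    expand(set(), ["milk", "tea", "sugar", "coffee"], is_productive, result)
    print(result)                     # the seven itemsets without sugar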
3.2.3 MORE
3.2.3.1 What is MORE
MORE stands for Mining Optimal Risk pattErn sets. The MORE algorithm discovers risk patterns in data, especially medical data [15].
Definition of Risk Patterns: Risk patterns are patterns whose local support and relative risk are higher than the user-specified minimum local support and relative risk threshold, respectively [15].
3.2.3.2 Risk Patterns
The definition of a pattern is similar to the definition of a contrast set: a pattern is a set of attribute-value pairs. For example, {Nationality: Japanese, Gender: Male, Height: [165cm-169cm], Weight: 80kg} is a pattern (contrast set) with four attribute-value pairs.
In the study of risk patterns, we have a new concept called the target attribute, which takes two values: abnormal and non-abnormal [15]. In medical terms, target attribute value abnormal means that, given a set of attribute-value pairs, the patient is regarded or classified as at risk, e.g. more likely to get a certain disease. On the other hand, target attribute value non-abnormal, i.e. normal, means the person is healthy and low risk.
Taking similar attribute-value pairs as an example again, for people who belong to the pattern P = {Nationality: Japanese, Gender: Female, Height: [165cm-169cm], Weight: [70kg-79kg]}, we regard the probability of getting heart disease as high, denoted (P → a), where the target attribute value a stands for abnormal.
The value relative risk (RR) measures the likelihood of abnormality in the cohort with certain attributes compared with the cohort without those attributes. RR is defined as:

RR(P → a) = prob(a | P) / prob(a | ¬P)
          = (prob(P, a) / prob(P)) / (prob(¬P, a) / prob(¬P))
          = (supp(Pa) / supp(P)) / (supp(¬Pa) / supp(¬P))

This formula can be memorised easily from a contingency table:
        Abnormal (a)     Non-abnormal (n)    Total
P       ① prob(P, a)     prob(P, n)          ② prob(P)
¬P      ③ prob(¬P, a)    prob(¬P, n)         ④ prob(¬P)
Total   prob(a)          prob(n)             1

Table 3.3
RR(P → a) = (① × ④) ÷ (③ × ②).
The cohort with pattern P is classified abnormal if the value of RR is greater than 1, and non-abnormal if the value of RR is smaller than 1 [28].
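The relative risk calculation can be written directly from the four cells of Table 3.3; the counts in the example call below are invented for illustration:

    # Relative risk from the four cells of Table 3.3; the counts in the
    # example call are invented for illustration.
    def relative_risk(pa, pn, npa, npn):
        # pa: pattern & abnormal, pn: pattern & normal,
        # npa: no pattern & abnormal, npn: no pattern & normal
        return (pa / (pa + pn)) / (npa / (npa + npn))

    print(relative_risk(30, 70, 10, 90))   # (30/100) / (10/100) = 3.0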
3.2.3.3 What technique does MORE use?
The principle that the MORE algorithm applies is simple yet efficient.
According to Li et al. (2008), ordinary methods first need to go through the whole data set to find all frequent patterns (a frequent pattern is a pattern whose support is higher than the minimum support), then use relative risk in place of confidence to form rules, and lastly apply post-pruning to remove the huge number of frequent but uninteresting rules. When the minimum support is set low, a large number of rules are generated, which in turn makes the pruning process take much longer.
Where the MORE algorithm is distinguished from other ordinary rule mining algorithms, and how it overcomes this issue, is in making use of the anti-monotone property.

Anti-monotone property: if {a, b} is infrequent, then all supersets of {a, b} are infrequent.
The MORE algorithm is based on a lemma and a corollary to efficiently mine optimal risk pattern sets.

Lemma: if supp(Px¬a) = supp(P¬a), then pattern Px and all its super patterns do not occur in the optimal risk pattern set (Li et al., 2008).

Corollary: if supp(Px) = supp(P), then pattern Px and all its super patterns do not occur in the optimal risk pattern set (Li et al., 2008).
The details of the MORE algorithm and the proof of the lemma are in Appendix D.
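A compact sketch of Apriori-style level-wise mining using the anti-monotone property is shown below; the transactions and threshold are invented for illustration, and this shows the generic property MORE exploits rather than the MORE algorithm itself:

    # Level-wise mining with the anti-monotone property: a k-itemset
    # becomes a candidate only if all its (k-1)-subsets are frequent,
    # so supersets of an infrequent itemset are never counted.
    from itertools import combinations

    transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a"}]
    min_support = 0.4

    def frequent(itemset):
        return sum(1 for t in transactions if itemset <= t) / len(transactions) >= min_support

    items = sorted({i for t in transactions for i in t})
    level = [frozenset([i]) for i in items if frequent(frozenset([i]))]
    all_frequent = list(level)

    while level:
        candidates = {a | b for a in level for b in level if len(a | b) == len(a) + 1}
        level = [c for c in candidates
                 if all(frozenset(s) in level for s in combinations(c, len(c) - 1))
                 and frequent(c)]
        all_frequent += level

    print([set(s) for s in all_frequent])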
3.2.3.4 Advantages
According to Li et al. (2008), most association rule mining methods face three challenges:

(1) such association rule mining approaches do not suit medical data analysis, because most mining algorithms rely upon confidence and lift, whereas the MORE algorithm uses relative risk;
(2) truly interesting rules are overwhelmed by too many uninteresting or trivial rules;
(3) when the minimum support is set low, the efficiency of an association mining algorithm becomes very low.
3.2.3.5 Results generated from MORE
The output of MORE is contingency tables (patterns) represented in a tree structure. Each table represents a cohort of people with the same pattern.
A recent study used MORE to analyse data from Domiciliary Care SA. After running the Lirule program (a program which uses the MORE algorithm to generate results), over 60 risk patterns were generated, but not all of these patterns are useful or practical. Patterns with a very low odds ratio (OR) are not considered. The odds ratio is a measure of effect size that is particularly important in Bayesian statistics and logistic regression [29]. A low odds ratio means the pattern is not significant enough to distinguish its cohort from ordinary patients.
A risk pattern with odds ratio 4.34, for example, means that the patients with these specific attributes are 4.34 times more likely to stay short compared with those without this attribute combination. So, a high odds ratio deserves more investigation.
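To make this interpretation concrete, here is a matching sketch (again my own illustration, not output from Li-rule) that computes the odds ratio from the same four contingency-table cells used for RR above.

def odds_ratio(pa, pn, npa, npn):
    # Odds of the outcome inside the pattern cohort divided by the
    # odds outside it: (pa / pn) / (npa / npn) = (pa * npn) / (pn * npa).
    return (pa * npn) / (pn * npa)

# Hypothetical counts: odds 30/70 versus 20/180 give OR of roughly 3.86.
print(odds_ratio(30, 70, 20, 180))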
In addition, where several patterns share a similar attribute combination, only one of them (or a subset of these similar patterns) will be selected. The reason for this is to differentiate between the patterns as much as possible: the more distinguishable the patterns are, the more informative the result will be.
A risk pattern with minimum support 5%, for example, means that the attribute combination in this risk pattern occurs together in at least 5% of all the records under study; similarly, a minimum support of 10% means that the combination occurs together in at least 10% of the records.
For example, the table below is one of the contingency tables generated by Li-rule:

Rule 1
Length = 2
Odds Ratio = 11.9
Relative Risk = 1.1922
Service Frequency: Twice per week
High blood pressure
Cohort Size = 62, Percentage = 8.84%
Contingency Table

                       Pattern     Non-Pattern     Total
EpisodeLength = 1      62          536             598
EpisodeLength = 3      0           103             103
Total                  62          639             701

Table 3.4
This table tells us that 62 people (out of 701, i.e. 8.84% of the studied sample) have the following characteristics:
• Service frequency: twice per week
• High blood pressure
This cohort of people is 11.9 times more likely to stay shorter than one year than other people.
More discussion is presented in the next two chapters.
Chapter 4
Data description
The data to be used to test the aforementioned algorithms is from Domiciliary Care South
Australia.
Domiciliary Care SA is a state government organisation that provides services to people with reduced ability to care for themselves, helping them to stay in their own homes, living close to their loved ones, family and local community. The services provided by Domiciliary Care SA include physical assistance, rehabilitation and personal care, as well as respite and support for carers (Domiciliary Care SA website 2008).
Domiciliary Care SA currently uses the ONI+ assessment (an electronic, telephone-based assessment tool) to measure (or predict) the length of time a patient will stay.
The shorter the length of stay for a patient, the less benefit (or the "riskier" the client, in data mining terminology) Domiciliary Care obtains, because Domiciliary Care cannot provide a thorough service for that client. However, such measures are not precise enough for the purpose of categorising clients by risk. The project will assess the current risk assessment measures and find new, effective indicators.
4.1
Data
Sample data consisting of 783 records has been provided for investigation. The data were stored in an Excel file. Each record represents one client from Domiciliary Care SA and consists of at most 42 attributes. The number of attributes varies because patients suffer from different health conditions.
Data has been supplied for a 36-month period from 1st July 2005 to 30th June 2008 for
Domiciliary Care SA clients who were active and whose episodes have closed within this
period.
Length of stay was calculated using the Episode Open Date and Episode Close Date fields. Only the data that were current for the client when their episode was closed were included; otherwise there would have been multiple lines of data for some clients.
Domiciliary Care SA currently categorises clients into five (5) streams – Basic, Package,
Rehabilitation, Dementia and Palliative. Palliative clients have been excluded from this
data.
Clients are assigned a coordination level depending on their needs. Each client can have
multiple Health Condition Codes, but they are not listed in the order of importance – these
have therefore been listed separately but in no specific order.
It is believed within the agency that the average length of stay of a client is approximately
3 years.
For confidentiality reasons, all client numbers have been de-identified and no names have been provided, to protect the identity of Domiciliary Care SA clients.
4.2
Description of each data field
Appendix E is a detailed description of each data field. The names in the Field Name column are listed in the exact left-to-right order of the data. The Field Order of Priority column indicates the importance of each data field: a higher priority deserves more investigative effort, as those data fields may have greater influence on the final result.
For convenience of data processing, a code or number is provided to substitute for the actual data. The codes/numbers are listed under the Data code or type column, and their corresponding explanations are given under the Data description column.
Here are snapshots of the actual data:
The data shown above have been pre-processed. Age in the second column is a numeric value, so it needs to be categorised into sections. There are four categories for age: [20-76], [77-83], [84-88], [89-101]. The category boundaries and their spans are determined by the number of occurrences, i.e. the ages are evenly categorised so that each category holds roughly the same number of clients.
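This equal-frequency discretisation can be reproduced with a few lines of pandas; the snippet below is a sketch under the assumption that the ages sit in a column named Age (the column name and the toy values are mine).

import pandas as pd

ages = pd.Series([72, 81, 85, 90, 78, 84, 95, 66], name="Age")  # toy data
# qcut splits on quantiles, so each of the four bins receives roughly
# the same number of clients, mirroring the evenly occupied categories.
age_cat = pd.qcut(ages, q=4)
print(age_cat.value_counts())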
Column SubAreas is generated from SuburbCode. Each SuburbCode represents a suburb, which belongs to one of four areas: North, West, South and East.
There are many columns with number codes, where each number code represents one particular value. For example, Accommodation Type has 22 different number codes, where code 1 represents Private residence – owned, code 2 represents Private residence – private rental, etc. The details of each number representation are listed in Appendix E.
The number code representation makes the data easier for the algorithms to parse and analyse.
The columns named with numbers represent diseases. The ten most common diseases have been selected for investigation; the remaining, less common diseases have been ignored. For the values of each disease, "Y" means the patient has the disease, whereas "?" means the patient does not suffer from it.
The last column, EpisodeLength(yr), is the target attribute. It is assumed that a length of stay of less than one year is considered short, and a stay of three years or more is considered long.
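Combining the episode dates described earlier in this chapter with this assumption, the sketch below derives the target attribute; it is my own illustration (the column names follow the field names in Appendix E, and stays between one and three years are simply left unlabelled, since only the two extremes are contrasted).

import pandas as pd

df = pd.DataFrame({  # two hypothetical clients
    "EpisodeOpenDate":  ["01/07/2005", "15/08/2005"],
    "EpisodeCloseDate": ["01/03/2006", "20/09/2008"],
})
opened = pd.to_datetime(df["EpisodeOpenDate"], dayfirst=True)
closed = pd.to_datetime(df["EpisodeCloseDate"], dayfirst=True)
years = (closed - opened).dt.days / 365.25

# 1 = short stay (< 1 year), 3 = long stay (>= 3 years)
df["EpisodeLength(yr)"] = pd.NA
df.loc[years < 1, "EpisodeLength(yr)"] = 1
df.loc[years >= 3, "EpisodeLength(yr)"] = 3
print(df)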
Chapter 5
Mining results discussion
5.1
Algorithm MORE
5.1.1
Data preparation for algorithm MORE
Data need to be transformed into the correct format before running the algorithm. Algorithm MORE requires two files. One is the .data file (figure 5.1), which, as its name suggests, contains all the data. This file is transformed directly from the .xls file into a .csv file, with a comma ',' separating the columns. We can also see many question marks in the .data file: a question mark means the value in that cell is missing, i.e. is a null value. This is a very common situation in data mining research. The other file is the .name file (figure 5.2), which contains all the attributes (columns) to be analysed and all their possible values (domains).
It is worth noting that I put the target attribute EpisodeLength(yr) at the top, with values 1 and 3 respectively representing less than one year and greater than three years, so that the algorithm knows how the result should be generated.
We can also write ignore after any attribute that we do not want the algorithm to analyse, for example the attributes ClientNumber and SuburbCode.
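A small sketch of this conversion step is given below; the record values are toy data, and the .name syntax is paraphrased from figure 5.2 rather than quoted, so treat the exact layout as an approximation.

import csv

rows = [  # toy records: (EpisodeLength(yr), Gender, Age)
    ("1", "F", "[77-83]"),
    ("3", "M", "[84-88]"),
    ("1", "F", "?"),        # '?' marks a missing (null) value
]

# .data file: plain comma-separated values, one record per line
with open("domcare.data", "w", newline="") as f:
    csv.writer(f).writerows(rows)

# .name file: the target attribute first, then each attribute and its domain
with open("domcare.names", "w") as f:
    f.write("EpisodeLength(yr): 1, 3.\n")   # target on top
    f.write("Gender: F, M.\n")
    f.write("Age: [20-76], [77-83], [84-88], [89-101].\n")
    f.write("ClientNumber: ignore.\n")      # excluded from the analysis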
Figure 5.1    a snapshot of the .data file for both MORE and Magnum Opus
Figure 5.2    the .name file for application Li-rule
5.1.2
Result discussion for algorithm MORE
The application Li-rule, which implements MORE, can compute two kinds of patterns: Risk Patterns and Preventive Patterns.
If we regard the patients who stay less than 12 months in Domiciliary Care as the risk target and want to know their common characteristics, we can let Li-rule compute Risk Patterns. In contrast, if we regard the patients who stay longer than three years in Domiciliary Care as the safe target, Li-rule can employ the Preventive Pattern mode to calculate what these patients have in common.
MORE Test 1: with minimum support 0.1, maximum LHS size 4 and RHS Episode Length = 3 (years)
Rule 1:
Explanation:
84 people have the following characteristics: (11.98% in the studied sample)
• ServiceTypeProvidedToClientCode = 1 (Domestic Assistance)
• ServiceFrequencyCode = 11 (Fortnightly)
This cohort of people is 4.14 times more likely to stay longer than three years than other people.
Rule 2:
Explanation:
30 people have the following characteristics: (4.28% in the studied sample)
• CoordinationLevel = 2 (Medium)
• Stream = 2 (Packaged care)
• ServiceFrequencyCode = 11 (Fortnightly)
This cohort of people is 6.63 times more likely to stay longer than three years than other people.
Rule 3:
Explanation:
22 people have the following characteristics: (3.14% in the studied sample)
• Ages ranging from 80 – 89
• CarerAvailabilityCode = 2 (Has no carer)
• CoordinationLevel = 3 (Low)
• ServiceTypeProvidedToClientCode = 1 (Domestic Assistance)
This cohort of people is 6.38 times more likely to stay longer than three years than other people.
Rule 4:
Explanation:
27 people have the following characteristics: (3.85% in the studied sample)
• Ages ranging from 80 – 89
• Stream = 1 (Basic)
• ServiceTypeProvidedToClientCode = 1 (Domestic Assistance)
This cohort of people is 6.0 times more likely to stay longer than three years than other people.
Rule 5:
Explanation:
28 people have the following characteristics: (3.99% in the studied sample)
• ServiceTypeProvidedToClientCode = 1 (Domestic Assistance)
• CarerAvailabilityCode = 2 (Has no carer)
• AccomodationTypeCode = 3 (Private residence – public rental)
This cohort of people is 5.61 times more likely to stay longer than three years than other people.
MORE Test 2: with minimum support 0.1, maximum LHS size 4 and RHS Episode Length = 1 (year)
Rule 1
Explanation:
62 people have the following characteristics: (8.84% in the studied sample)
• ServiceFrequencyCode = 5 (i.e. Twice per week)
• High blood pressure (the code is 921)
This cohort of people is 11.9 times more likely to stay shorter than one year than other people.
Rule 2
Explanation:
95 people have the following characteristic: (13.55% in the studied sample)
• ServiceTypeProvidedToClientCode = 5 (i.e. Social Support)
This cohort of people is 19.0 times more likely to stay shorter than one year than other people.
Rule 3
Explanation:
64 people have the following characteristics: (9.13% in the studied sample)
• ServiceFrequencyCode = 5 (i.e. Twice per week)
• High blood pressure (the code is 921)
This cohort of people is 11.9 times more likely to stay shorter than one year than other people.
Rule 4:
Explanation:
63 people have the following characteristics: (8.99% in the studied sample)
• CarerAvailabilityCode = 1 (i.e. Has carer)
• Dementia – Alzheimer's disease (the code is 500)
This cohort of people is 5.7 times more likely to stay shorter than one year than other people.
Rule 5:
Explanation:
111 people have the following characteristics: (15.83% in the studied sample)
• CoordinationLevel = 1 (i.e. High)
• Stream = 201 (i.e. Dementia)
This cohort of people is 5.4 times more likely to stay shorter than one year than other people.
I have only selected the top five rules from each of the Risk Pattern and Preventive Pattern modes. However, more than 100 rules (and sub-rules) can be generated in each mode. The rules are listed by Relative Risk (RR), with the highest RR first.
5.2
Algorithm Opus
The application Magnum Opus (demo version) is used to test the results generated by the Opus algorithm.
The demo version of Magnum Opus is sufficient for this testing, as we only have 783 records to analyse and the demo version can handle up to 1000 records.
5.2.1
Data preparation for algorithm Opus
The process of data preparation for Opus is very similar to the one used for Li-Rule: it also requires two files, a .nam file and a .data file. Figure 5.3 below illustrates the .nam file, which contains all the attributes to be analysed. The first difference from the .name file in Li-Rule is that here the target attribute (episode length) needs to be written down. The second difference is that it does not require a '.' at the end of each attribute line. The .data file is identical to the one used in Li-Rule.
Figure 5.3    the .nam file for application Magnum Opus Demo
The following shows the steps to generate the Opus results using the same Domiciliary Care data source.
5.2.2
Result discussion for algorithm Opus
After successfully loading the .nam file and the .data file into Magnum Opus, the user interface will look like this:
By default, every attribute with every value is selected on both the LHS and the RHS. However, the user can decide which attributes appear on either side.
I will need to search for rules using different correlation measures and different minimum supports. For the filter method, I chose Filter Out Insignificant, because the Insignificant filter is useful for discarding rules and itemsets that are very likely to be spurious [30].
Opus test 1: with minimum support 0.1, maximum size 4 and RHS Episode Length = 3
The result for this test is shown below: 16 rules were found.
Opus test 2: with minimum support 0.05, maximum size 4 and RHS Episode Length = 3
Because the rules found with minimum support 0.1 must be contained in the rules found with minimum support 0.05, I have excluded the rules already generated in Opus test 1. Details below:
Opus test 3: with minimum support 0.05, maximum size 4 and RHS Episode Length = 1
Only 5 rules were found.
It is surprising to note that most of the rules generated by Magnum Opus are general and simple. Take Opus test 1 as an example: the first and second rules are stream=2 → EL=3 and Gender=F → EL=3, and the third rule is a simple combination of the first two, stream=2 & Gender=F → EL=3. All three results satisfy the criteria; however, this is not what users really want to see, because it is obvious that if rule three is true, rules one and two must be true as well, and rules one and two should therefore be pruned. Alternatively, Magnum Opus could modify its presentation so that rules one and two are placed as sub-rules of rule three.
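One way to realise the sub-rule presentation suggested above is to group rules whose left-hand side strictly contains the left-hand side of another rule with the same right-hand side. The sketch below is my own illustration of that grouping, not a feature of Magnum Opus.

rules = [  # (LHS as a frozenset, RHS) -- the three rules discussed above
    (frozenset({"stream=2"}), "EL=3"),
    (frozenset({"Gender=F"}), "EL=3"),
    (frozenset({"stream=2", "Gender=F"}), "EL=3"),
]

for lhs, rhs in rules:
    # a rule is a sub-rule of another if it shares the RHS and its
    # LHS is a proper superset of the other rule's LHS
    parents = [set(l) for l, r in rules if r == rhs and l < lhs]
    if parents:
        print(f"{set(lhs)} -> {rhs} groups the sub-rules {parents}")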
If I change the Search by option from Support to Leverage, Lift, Strength or others, the results are not affected, apart from the order in which they are listed.
Chapter 6
Conclusion and future work
In this research, I have compared and analysed the differences between the mining algorithms STUCCO, Magnum Opus and MORE, and how these differences can benefit each other. Three algorithms representing different mining techniques were selected for comparison: STUCCO, representing χ2 (chi-square); Magnum Opus, representing lift and leverage; and MORE, representing relative risk.
χ2, lift and leverage were described in chapter 3.1.2; relative risk is explained in chapter 3.2.3.2; and all_confidence and cosine are described in Appendix B.
The results generated by different mining algorithms depend greatly on which correlation measures they use. As can be seen from the discussion in the previous chapter, the Magnum Opus results differ from the MORE results in many ways, in the number of rules generated as well as in the quality and interestingness of the rules.
Magnum Opus generates only a few rules, and most of them are too general and simple. On the other hand, most of the rules generated by MORE contain attributes which cannot be found in the Magnum Opus rules. Even when I narrowed the search down to the specific attributes appearing in a MORE rule, Magnum Opus was unable to generate the same rule.
Future work on improving the MORE algorithm
The ideal steps for implementing the data analysis are: first, use all_confidence or cosine to analyse the data, then inspect whether the results are weakly positively or negatively correlated. The next step is to apply additional tests such as lift, leverage or chi-square to obtain a more precise result [22].
MORE could benefit from the above techniques as follows. First, before starting the MORE computation, STUCCO can be run to obtain the optimal support threshold, which saves the time and effort of guessing a proper support. However, I failed to obtain a working application implementing the STUCCO algorithm during the research, so I can only assume what the outcome would be. Second, all_confidence or cosine can be applied to check whether the given dataset is positively or negatively correlated. This step is particularly important if the dataset is big and contains null-transactions.
When it comes to the post-pruning stage, the pruning strategy from the OPUS algorithm can theoretically be used in MORE, because what MORE generates is also a tree-structured result represented by contingency tables. The results can be sorted into a specific order in which the nodes containing uninteresting items are placed first, so that they can be pruned at the first stage. However, as demonstrated in the three Opus tests in the last chapter, Magnum Opus failed to show us the rules we want: the results contain many redundant and trivial rules, which is undesirable.
The conclusion is that, in terms of evaluating and analysing the medical data, Li-Rule outperforms Magnum Opus, which means relative risk is the first option to be employed for generating patterns in medical research.
This research can be continued by obtaining a working version of a STUCCO application to run on the given data from Domiciliary Care. Secondly, a more functional and comprehensive version of the Magnum Opus application may be helpful for performing a further investigation on the data, which may bring different results.
Appendix A – Annotated Bibliography
Adhikari, A & Rao, PR 2008, 'Mining conditional patterns in a database', Pattern
Recognition Letters, vol. 29, no. 10, pp. 1515-1523,
The authors, researchers at SP Chowgule College and Goa University, introduce the notion of conditional patterns in a database. Conditional patterns are interesting and useful for solving many problems. Normally, what I have been doing in data mining research is based on frequent itemsets. This paper mainly focuses on analysing the characteristics of a database in more detail, by mining various patterns with respect to frequent itemsets using a proposed algorithm for conditional patterns. In addition, it also studies a kind of association among items which is not immediately available from frequent itemsets and association rules. This is a very good attempt, as other interesting patterns may be found in the hidden data itemsets.
Bastide, Y, Pasquier, N, Taouil, R, Stumme, G & Lakhal, L 2000, 'Mining Minimal
Non-redundant Association Rules Using Frequent Closed Itemsets', in Computational
Logic — CL 2000, pp. 972-986,
This paper mainly discusses how the authors use the closure of the Galois connection and two new bases for association rules to identify association rules. As there is no free copy of this paper available on the internet, I cannot judge whether this paper is useful or not.
Bay, SD & Pazzani, MJ 2001, 'Detecting Group Differences: Mining Contrast Sets', Data
Mining and Knowledge Discovery, vol. 5, pp. 213-246,
This paper, written by Bay, S. D. and Pazzani, M. J., researchers from the University of California, tries to find contrast groups, i.e. conjunctions of attributes and values that differ meaningfully in their distribution across groups. This allows us to answer questions like "How are History and Computer Science students different?" and "What has changed from 1993 through 1998?". They use the mining algorithm STUCCO for the discovery of contrast sets.
This paper is quite relevant to my research and is worth reading in more detail.
Charles, EM 2006, 'Receiver Operating Characteristic Analysis: A Tool for the
Quantitative Evaluation of Observer Performance and Imaging Systems', vol. 3, no. 6, pp.
413-422,
The author, Charles E. Metz, provides detailed information about receiver operating characteristic (ROC) analysis in this paper; ROC is supposed to be very important knowledge in the data mining field. Unfortunately, this paper is not free on the internet, but similar information can be obtained from Wikipedia to compensate for the loss.
Cortes, C & Pregibon, D 2001, 'Signature-Based Methods for Data Streams', Data Mining
and Knowledge Discovery, vol. 5, pp. 167-182,
Although this paper involves data mining technology, it does not really emphasise using or developing data mining tools or algorithms to solve a problem; instead, it is more about data streams and signatures. So I think this paper is irrelevant to my research.
Giudici, P & Castelo, R 2001, 'Association Models for Web Mining', Data Mining and
Knowledge Discovery, vol. 5, pp. 183-196,
This paper is more about graphical models using web data mining technology. It has little information that can be adopted.
Giudici, P, Heckerman, D & Whittaker, J 2001, 'Statistical Models for Data Mining',
Data Mining and Knowledge Discovery, vol. 5, pp. 163-165,
This paper may be interesting and useful for my research topic as it is about statistical modelling in data mining, but again, it is unfortunately unavailable on the internet.
Han, J & Pei, J 2000, 'Mining frequent patterns by pattern-growth: methodology and
implications', SIGKDD Explor. Newsl., vol. 2, no. 2, pp. 14-20,
The authors, Jiawei Han and Jian Pei at Simon Fraser University, have introduced a new methodology, frequent pattern growth, which mines frequent patterns without candidate generation in large databases. The intended audience of this paper could be anyone interested in mining frequent patterns using other methods. This paper could be a very good reference for comparing different mining algorithms. The method has superior efficiency to other methods, which is desirable, and thus this is a very useful and relevant paper for my research topic.
Han, J, Pei, J, Yin, Y & Mao, R 2004, 'Mining Frequent Patterns without Candidate
Generation: A Frequent-Pattern Tree Approach', Data Mining and Knowledge Discovery,
vol. 8, no. 1, pp. 53-87,
This paper is similar to "Mining frequent patterns by pattern-growth: methodology and implications". It introduces a new data mining approach which overcomes the inefficiency of candidate set generation-and-test methods. It proposes a novel frequent-pattern tree (FP-tree) structure for mining the complete set of frequent patterns by pattern growth. This paper is very close to my research topic, so it is obviously relevant.
Hanley, J & McNeil, B 1982, 'The meaning and use of the area under a receiver operating
characteristic (ROC) curve', Radiology, vol. 143, no. 1, April 1, 1982, pp. 29-36,
This paper can be regarded as a complement to "Receiver Operating Characteristic Analysis: A Tool for the Quantitative Evaluation of Observer Performance and Imaging Systems". It provides some very essential data mining knowledge. Although this may not be closely related to my research, it should be read by anyone who wishes to do research in the data mining field.
Hu, J & Mojsilovic, A 2007, 'High-utility pattern mining: A method for discovery of
high-utility item sets', Pattern Recognition, vol. 40, no. 11, pp. 3317-3324,
This paper is about an algorithm for frequent itemset mining that identifies high-utility item combinations. The difference between traditional association rule mining techniques and this algorithm is that this algorithm finds segments of data, defined through combinations of a few items (rules), which satisfy certain conditions as a group and maximise a predefined objective function. However, I don't think this is very relevant to my research, but it is still worth going through.
Igor, K 2001, 'Machine learning for medical diagnosis: history, state of the art and
perspective', Artificial intelligence in medicine, vol. 23, no. 1, pp. 89-109,
The paper provides an overview of the development of intelligent data analysis in medicine from a machine learning perspective: a historical view, a state-of-the-art view, and a view of some future trends in this subfield of applied artificial intelligence. This is not a paper that focuses very specifically on one topic; however, it is a very good paper for gaining some basic knowledge about data mining, e.g. the naive Bayesian classifier, neural networks and decision trees. But there is no free copy of this paper available on the internet, so it may be disregarded.
Li, J & Wong, L 2003, 'Using Rules to Analyse Bio-medical Data: A Comparison
between C4.5 and PCL', in Advances in Web-Age Information Management, pp. 254-265
This paper provides a good comparison between two data mining algorithms: C4.5 and PCL, a new algorithm invented by the authors which is supposed to overcome the weaknesses of C4.5. This paper will be helpful to my research, as I will be using different kinds of algorithms to compare their efficiency.
Liu, B, Hsu, W & Ma, Y 1999, 'Mining association rules with multiple minimum
supports', paper presented at the Proceedings of the fifth ACM SIGKDD international
conference on Knowledge discovery and data mining, San Diego, California, United
States,
This paper is very important and will be very helpful to my research. It discusses the issue of setting a proper minimum support in data mining. The general problem, called the rare item problem, means that if the minimum support is set too high, some rules may not be found; if the minimum support is set too low, too many rules, including useless ones, are found. This paper proposes a novel technique that allows the user to specify multiple minimum supports to reflect the natures of the items and their varied frequencies in the database.
Loekito, E & Bailey, J 2006, 'Fast mining of high dimensional expressive contrast
patterns using zero-suppressed binary decision diagrams', paper presented at the
Proceedings of the 12th ACM SIGKDD international conference on Knowledge
discovery and data mining, Philadelphia, PA, USA,
This paper attempts to solve the problem of datasets where the number of dimensions is large. However, I don't think it is very closely related to my project research, as what I am supposed to do is analyse a generated data result, and I do not need to pay too much attention to the result generating process.
Lucas, P 2004, 'Bayesian analysis, pattern analysis, and data mining in health care',
Current Opinion in Critical Care, vol. 10, no. 5, pp. 399-403,
From its abstract, this paper seems relevant to my research, as it discusses the current role of data mining and Bayesian methods in biomedicine and health care, in particular critical care. Meanwhile, techniques from machine learning are being used to solve biomedical and health-care problems. But unfortunately, it is not freely available on the internet.
Mitchell, T, Buchanan, B, DeJong, G, Dietterich, T, Rosenbloom, P & Waibel, A 1990,
'Machine Learning', Annual Review of Computer Science, vol. 4, no. 1, pp. 417-433,
Machine learning is a fundamental subject of data mining. Although this journal article is not specifically related to my research, it is worth reading through to understand some important concepts in it.
Mitchell, TM 1999, 'Machine learning and data mining', Commun. ACM, vol. 42, no. 11,
pp. 30-36,
This journal article provides general knowledge about machine learning and the relationship between machine learning and data mining. It has no specific topic to emphasise, but it is ideal for anyone who wants to learn some fundamental knowledge of data mining.
Novak, PK & Lavraˇc, N 2009, 'Supervised Descriptive Rule Discovery: A Unifying
Survey of Contrast Set, Emerging Pattern and Subgroup Mining', p. 27,
The authors of this paper, Petra and Nada from Monash University, contribute a new understanding of data mining by presenting a unified terminology, by explaining the apparent differences between the learning tasks as variants of a unique supervised descriptive rule discovery task, and by exploring the apparent differences between the approaches. This paper is required for my research and needs careful reading to understand concepts such as contrast set and subgroup mining.
Ordonez, C 2006, 'Association rule discovery with the train and test approach for heart
disease prediction', Information Technology in Biomedicine, IEEE Transactions on, vol.
10, no. 2, pp. 334-343,
This paper, written by Carlos Ordonez, introduces an algorithm that uses search constraints to reduce the number of rules, searches for association rules on a training set, and finally validates them on an independent test set. It uses heart disease prediction as an example throughout, which is really desirable for my research. The method to be used in my research can be compared to the algorithm mentioned here.
Ordonez, C 2006, 'Comparing association rules and decision trees for disease prediction',
paper presented at the Proceedings of the international workshop on Healthcare
information and knowledge management, Arlington, Virginia, USA,
The author of this paper, Carlos Ordonez from the University of Houston, compares association rules and decision trees with respect to accuracy, interpretability and applicability in the context of heart disease prediction. This paper is closely related to my research topic. In addition, it uses simple language and examples to illustrate the constraints of the different data mining techniques being used.
Ordonez, C, Ezquerra, N & Santana, C 2006, 'Constraining and summarizing association
rules in medical data', Knowledge and Information Systems, vol. 9, no. 3, pp. 1-2,
This is also very useful work in comparison with the previous ones, as it focuses on the medical domain, where data sets are generally high dimensional and small. It also discusses the disadvantages of mining association rules in a high dimensional data set. This paper is closely related to the research I have been doing.
Quinlan, JR 1986, 'Induction of decision trees', Machine Learning, vol. 1, no. 1, pp.
81-106,
Although this paper does not focus on the medical field, it summarises an approach to synthesising decision trees that has been used in a variety of systems, and it describes one such system, ID3, in detail. Results from recent studies show ways in which the methodology can be modified to deal with information that is noisy and/or incomplete. It is useful material for learning about decision trees, a very important methodology in the data mining field.
Rindfleisch, T 1997, 'Privacy, information technology, and health care', Communications
of the ACM, vol. v40, no. n8, p. p92(9),
From the abstract of this paper, it is more about health care and information security management than about data mining. So this is not a desirable paper to read.
Turner, DA 1978, 'An Intuitive Approach to Receiver Operating Characteristic Curve
Analysis', J Nucl Med, vol. 19, no. 2, February 1, 1978, pp. 213-220,
This paper, written by David A. Turner, gives a very detailed explanation of the use of ROC curves and also provides a comparison between ROC curves and sensitivity, specificity, and percentage accuracy. It is a very good source for learning about these concepts and will also be useful for my research.
Viveros, MSN, J. P. Rothman, M. J. 1996, 'Applying Data Mining Techniques to a Health
Insurance Information System', PROCEEDINGS OF THE INTERNATIONAL
CONFERENCE ON VERY LARGE DATA BASES, no. 22, pp. 286-294
This journal article focuses on a health insurance system, which may be related to my research, but it has no free copy available on the internet, so this paper has been disregarded.
Webb, GI 2000, 'Efficient search for association rules', paper presented at the
Proceedings of the sixth ACM SIGKDD international conference on Knowledge
discovery and data mining, Boston, Massachusetts, United States,
This paper discusses the efficiency of the Apriori algorithm, which can impose large computational overheads when the number of frequent itemsets is very large. The author presents an algorithm for association rule analysis based on the efficient OPUS search algorithm. This paper will be relatively helpful to my research.
Webb, GI & Zhang, S 2005, 'K-Optimal Rule Discovery', Data Mining and Knowledge
Discovery, vol. 10, no. 1, pp. 39-79,
This paper mainly focuses on the efficiency comparison across a wide range of k-optimal rule discovery tasks. But it is not directly related to my research topic.
Zeng, Z, Wang, J & Zhou, L 2008, 'Efficient Mining of Minimal Distinguishing Subgraph
Patterns from Graph Databases', in Advances in Knowledge Discovery and Data Mining,
pp. 1062-1068,
This paper studies the problem of mining the complete set of minimal DGPs (distinguishing graph patterns) with any number of positive graphs, arbitrary positive support and negative support. My supervisor required this paper to be read through.
Appendix B – Two other correlated measures:
all_confidence and cosine
In Appendix B, we describe two more useful correlation measures: all_confidence and cosine.
• all_confidence
Assuming an itemset X contains k items, X = {i1, i2, …, ik}, then according to Han & Kamber 2006, all_confidence is defined as:

all_conf(X) = sup(X) / max_item_sup(X) = sup(X) / max{sup(ij) | ij ∈ X}

where max{sup(ij) | ij ∈ X}, called the max_item_sup of the itemset X, is the support of the item with the highest support among all the items in X. The all_confidence of X is the minimal confidence among the set of rules ij → (X − ij), where (X − ij) represents the itemset X without the item ij.
• cosine
Given two itemsets A and B, the correlation measure cosine of A and B is defined as follows:

cosine(A, B) = P(A ∪ B) / √(P(A) × P(B)) = sup(A ∪ B) / √(sup(A) × sup(B))

The formula of cosine is similar to the formula of lift, except for an important difference: cosine takes the square root of P(A) × P(B) where lift does not. The square root ensures that the cosine value is not influenced by the total number of transactions, i.e. the total number of itemsets in the whole database being analysed (Han & Kamber, 2006).
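A small sketch of both measures over a toy transaction list is given below; the itemset representation and the support helper are my own, chosen to mirror the definitions above.

from math import sqrt

transactions = [  # toy market-basket data
    {"a", "b"}, {"a", "b", "c"}, {"a"}, {"b", "c"}, {"a", "b"},
]

def sup(itemset):
    # Fraction of transactions containing every item of the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

def all_confidence(itemset):
    # sup(X) divided by the largest single-item support within X.
    return sup(itemset) / max(sup({i}) for i in itemset)

def cosine(a, b):
    # sup(A u B) / sqrt(sup(A) * sup(B)); the square root makes the
    # measure insensitive to the total number of transactions.
    return sup(a | b) / sqrt(sup(a) * sup(b))

print(all_confidence({"a", "b"}))   # 0.6 / max(0.8, 0.8) = 0.75
print(cosine({"a"}, {"b"}))         # 0.6 / sqrt(0.8 * 0.8) = 0.75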
Appendix C – OPUS Algorithms
Algorithm: OPUS_AR(CurrentLHS, AvailableLHS, AvailableRHS)
com CurrentLHS is the set of conditions in the LHS of the rule
currently being considered.
com AvailableLHS is the set of conditions that may be added
to the LHS of rules to be explored below this point.
com AvailableRHS is the set of conditions that may appear on
the RHS of a rule in the search space at this point and
below.
1. SoFar := {}
2. FOR EACH P in AvailableLHS
   (a) NewLHS := CurrentLHS ∪ {P}
   (b) AvailableLHS := AvailableLHS – P
   (c) IF pruning rules cannot determine that
       ∀x ∈ AvailableLHS: ∀y ∈ AvailableRHS:
       ¬credible(x ∪ NewLHS → y) THEN
       i.   NewAvailableRHS := AvailableRHS
       ii.  FOR EACH Q in AvailableRHS
            A. IF credible(NewLHS → Q) THEN
               record NewLHS → Q
            B. IF pruning rules determine that
               ∀x ⊆ AvailableLHS: x = {} ∨
               ¬credible(x ∪ NewLHS → Q) THEN
               NewAvailableRHS := NewAvailableRHS – Q
       iii. IF NewAvailableRHS ≠ {} THEN OPUS_AR(NewLHS,
            SoFar, NewAvailableRHS)
       iv.  SoFar := SoFar ∪ {P}
Appendix D – Background knowledge relating to algorithm MORE
Algorithm MORE
Input: data set D, the minimum support σ (set by the user) in the
abnormal class a, and the minimum relative risk threshold θ.
Output: optimal risk pattern set R
Global data structure: l-pattern sets for l ≥ 1 (an l-pattern
contains l attribute-value pairs)
1.  Set R = Ø
2.  Count supports of 1-patterns in the abnormal class
3.  Generate(1-pattern set)
4.  Select risk patterns and add them to R
5.  New pattern set ← Generate(2-pattern set)
6.  While the new pattern set is not empty
7.    Count supports of candidates in the new pattern set
8.    Prune(new pattern set)
9.    Add patterns with relative risk greater than θ to R
10.   Prune remaining superfluous patterns in R
11.   New pattern set ← Generate(next level pattern set)
12. Return R

Where words in bold represent functions:

Function Generate((l+1)-pattern set)
1. Let the (l+1)-pattern set be the empty set
2. For each pair of patterns S(l-1)p and S(l-1)q in the l-pattern set
3.   Insert candidate S(l-1)pq into the (l+1)-pattern set
4.   For all Sl ⊂ S(l-1)pq
5.     If Sl does not exist in the l-pattern set
6.     Then remove candidate S(l-1)pq
7. Return the (l+1)-pattern set

Function Prune((l+1)-pattern set)
1. For each pattern S in the (l+1)-pattern set
2.   If supp(Sa)/supp(a) ≤ σ then remove pattern S
3.   Else if there is a sub-pattern S' in the l-pattern set such
     that supp(S') = supp(S) or supp(S'¬a) = supp(S¬a)
4.   Then remove pattern S
5. Return
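For readers who prefer executable code, the following is a compact Python sketch of the level-wise loop above. It is my own paraphrase written for this thesis, not the Li-rule implementation: patterns are frozensets of attribute-value pairs, records are plain sets, candidate generation simply extends each surviving pattern by one item instead of joining pairs as in the Generate function, and the thresholds are ordinary parameters. The two continue branches correspond to steps 2 and 3 of the Prune function.

def more_sketch(records, abnormal, min_sup=0.1, min_rr=1.5):
    n = len(records)
    a_recs = [r for r in records if abnormal in r]        # abnormal class
    na_recs = [r for r in records if abnormal not in r]   # the rest
    items = {i for r in records for i in r} - {abnormal}

    def supp(p, rs):  # absolute support of pattern p within rs
        return sum(p <= r for r in rs)

    def rel_risk(p):  # RR(p -> abnormal), cells as in Table 3.3
        pa, ptot = supp(p, a_recs), supp(p, records)
        npa, nptot = len(a_recs) - pa, n - ptot
        if npa * ptot == 0:
            return float("inf")
        return (pa * nptot) / (npa * ptot)

    result, level = [], {frozenset({i}) for i in items}
    while level:
        survivors = set()
        for p in level:
            if supp(p, a_recs) < min_sup * len(a_recs):
                continue  # infrequent in the abnormal class
            if len(p) > 1 and any(
                    supp(p, records) == supp(p - {i}, records)
                    or supp(p, na_recs) == supp(p - {i}, na_recs)
                    for i in p):
                continue  # lemma/corollary: p cannot be optimal
            if rel_risk(p) >= min_rr:
                result.append(p)
            survivors.add(p)
        # extend each survivor by one unseen item (simplified Generate)
        level = {p | {i} for p in survivors for i in items if i not in p}
    return result

# toy demo with a hypothetical target item "short"
data = [
    {"gender=F", "hbp=Y", "short"},
    {"gender=F", "hbp=Y", "short"},
    {"gender=M", "hbp=N", "short"},
    {"gender=M", "hbp=N"},
    {"gender=F", "hbp=N"},
]
print(more_sketch(data, "short", min_sup=0.3, min_rr=1.2))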
Proof of lemma
Let PQx be a proper super pattern of PQ; PQx = Px and PQ = P when Q = Ø. To prove this lemma, we need to show that RR(PQx → a) ≤ RR(PQ → a). First,

RR(PQ → a) = [supp(PQa) × supp(¬(PQ))] / [supp(¬(PQ)a) × supp(PQ)]
           = conf(PQ → a) / conf(¬(PQ) → a)

By step (1) below, conf(PQ → a) ≥ conf(PQx → a), so

RR(PQ → a) ≥ conf(PQx → a) / conf(¬(PQ) → a)                    (1)

By step (2) below, conf(¬(PQx) → a) ≥ conf(¬(PQ) → a), so

RR(PQ → a) ≥ conf(PQx → a) / conf(¬(PQx) → a) = RR(PQx → a)     (2)

It can also be deduced from supp(P¬a) = supp(Px¬a) that supp(PQ¬a) = supp(PQx¬a) for any Q; this fact is used in both steps.
Proof of step (1):
Note that f(y) = y / (y + α) monotonically increases with y when the constant α > 0, and that supp(PQ) ≥ supp(PQx) > 0. Hence

conf(PQ → a) = supp(PQa) / supp(PQ)
             = supp(PQa) / [supp(PQa) + supp(PQ¬a)]
             ≥ supp(PQxa) / [supp(PQxa) + supp(PQx¬a)]
             = conf(PQx → a)

because supp(PQa) ≥ supp(PQxa) while supp(PQ¬a) = supp(PQx¬a).
Proof of step (2):
From supp(PQ¬a) = supp(PQx¬a) we know that supp(PQ¬x¬a) = 0. Hence

conf(¬(PQx) → a) = supp(¬(PQx)a) / supp(¬(PQx))
                 = [supp(¬(PQx)) − supp(¬(PQx)¬a)] / supp(¬(PQx))
                 = [supp(¬(PQx)) − supp(¬(PQ)¬a) − supp(PQ¬x¬a)] / supp(¬(PQx))
                 = [supp(¬(PQx)) − supp(¬(PQ)¬a)] / supp(¬(PQx))    (because supp(PQ¬x¬a) = 0)
                 ≥ [supp(¬(PQ)) − supp(¬(PQ)¬a)] / supp(¬(PQ))
                 = supp(¬(PQ)a) / supp(¬(PQ))
                 = conf(¬(PQ) → a)

where the inequality holds because supp(¬(PQx)) ≥ supp(¬(PQ)) and g(y) = (y − c)/y increases with y for a constant c > 0.
From this lemma (and its corollary), if the support of a pattern (or itemset) {a, b, ¬c} is equal to the support of the pattern {a, ¬c}, we can be certain that {a, b, ¬c} and all its super patterns (e.g. {a, b, d, ¬c}) will not occur in the optimal risk pattern set and should be pruned (Li et al. 2008).
Appendix E – Data Field Description
The columns of the table are: Field Order of Priority; Field Name; Field Description; Data code or type; Data description. The fields below are listed in the exact left-to-right order of the data.

Client Number – Unique client identifier. Data type: 6 digit number.

Age (priority 3) – Age in years when episode was closed. Data type: Number.

Gender (priority 6) – Data type: Text. Values: Male/Female.

Suburb (priority 7) – Suburb where the client resides. Data type: Text.

Episode Open Date (priority 1) – Date that this current episode opened. Data type: dd/mm/yyyy. (The episode dates are used to calculate length of stay.)

Episode Close Date (priority 2) – Date that this current episode closed. Data type: dd/mm/yyyy.

Pension Type (priority 12) – Type of pension this client receives. Pension Type Codes:
1 = Aged; 2 = DVA; 2.1 = DVA – Gold; 2.2 = DVA – White; 2.3 = DVA – Spouse;
3 = Disability; 4 = Carer; 5 = Unemployment Related; 6.1 = Widows; 6.2 = Sole Parent;
6.3 = Sickness; 6.4 = Workers Compensation; 6.5 = Part Pension; 6.6 = Blind Pension;
6.7 = Health Care Card; 7 = No government pension; 7.1 = Superannuation;
7.2 = Overseas pension; 99 = Not stated.

Carer Availability (priority 5) – Does the client have a carer available? Carer Availability Codes:
1 = Has carer; 2 = Has no carer; 9 = Not stated; 0 = Not applicable.

Carer Residency (priority 4) – Does this carer reside with the client? If the client has no carer (as per Carer Availability) this field will be blank. Carer Residency Codes:
1 = Co-resident carer; 2 = Non-resident carer.

Carer Relationship (priority 16) – What is the relationship to the client of this carer? Carer Relationship Codes:
1 = Wife/female partner; 2 = Husband/male partner; 3 = Mother; 4 = Father; 5 = Daughter;
6 = Son; 7 = Daughter-in-law; 8 = Son-in-law; 9 = Other relative/female;
10 = Other relative/male; 11 = Friend/Neighbour – Female; 12 = Friend/Neighbour – Male;
13 = Private employee; 99 = Not stated.

Usual Living Arrangements (priority 17) – Usual living arrangements of the client. Usual Living Arrangement Codes:
1 = Lives alone; 2 = Lives with family; 3 = Lives with others; 9 = Not stated; 0 = Not applicable.

Fee Waiver Status (priority 8) – If client is eligible for a waiver. Codes: N = No Waiver; Y = Waiver.

Fee Waiver Reason (priority 9) – The reason a financial waiver has been granted. Waiver Reason Codes:
1 = Financial Waiver – ongoing; 2 = High Risk; 3 = Program exemption; 5 = Deceased/Closed;
6 = Financial (RDNS); 7 = Financial (Short term); 8 = Financial (Health Cover); 9 = No Waiver.

Fee Waiver Effective Date / Fee Waiver Expiry Date (priority 11) – Dates the waiver was effective from and to. Data type: dd/mm/yyyy.

Coordination Level (priority 13), with coordination level start and end dates (dd/mm/yyyy) – Coordination level that this client requires. Codes: 1 = High; 2 = Medium; 3 = Low.

Stream (priority 14), with stream start and end dates (dd/mm/yyyy) – How Domiciliary Care SA has streamed this client. Codes: 01 = Basic; 02 = Packaged Care; 0201 = Dementia; 0202 = Rehabilitation.

Services Provided to Client (priority 15), with service start and end dates (dd/mm/yyyy) – Personal Care; Domestic Assistance; Respite; Social Support; Recovery Support; Dementia Day Program.

Service Frequency – Service Frequency Codes:
1 = 1/day x 5 days per week; 2 = 2/day x 5 days per week; 3 = 3/day x 5 days per week;
4 = Once per week; 5 = Twice per week; 6 = 3 times per week; 7 = 1/day x 7 days per week;
8 = 2/day x 7 days per week; 9 = 3/day x 7 days per week; 10 = 4 times per week;
11 = Fortnightly; 13 = Once off visit; 14 = Full day; 15 = Once every 4 weeks; 98 = Other.

Accommodation Type (priority 18) – Accommodation Type Codes:
1 = Private residence – owned; 2 = Private residence – private rental;
3 = Private residence – public rental; 4 = Private residence – mobile home;
5 = Independent living unit within retirement village; 6 = Boarding house;
7 = Emerg/transitional accommodation; 8 = Supported living facility;
9 = Supported accommodation facility; 10 = Residential aged care facility;
11 = Mental health care facility; 12 = Public place/temporary shelter;
13 = Private residence rented from the Aboriginal community; 14 = Temporary Aboriginal shelter;
15 = Hospital; 16 = Extended care/rehab facility; 17 = Palliative care facility/hospice;
18 = Not applicable; 19 = Other; 21 = Private residence – family member; 99 = Not stated.

Health Condition Codes (priority 10) – Data type: Text. See separate listing.
References
[1] Ramamohanarao, K, Bailey, J & Fan, H 2005, 'Efficient Mining of Contrast Patterns and Their Applications to Classification', IEEE Computer Society.
[2] Bay, SD & Pazzani, MJ 2001, 'Detecting Group Differences: Mining Contrast Sets', Data Mining and Knowledge Discovery, vol. 5, pp. 213-246.
[3] Bailey, J & Dong, G 2007, 'Contrast Data Mining: Methods and Applications', IEEE ICDM, 28-31.
[4] Webb, GI, Butler, S & Newlands, D 2003, 'On detecting differences between groups', ACM, Washington, D.C.
[5] Dong, G & Li, J 1999, 'Efficient mining of emerging patterns: discovering trends and differences', ACM, San Diego, California, United States.
[6] Dong, G, Zhang, X, Wong, L & Li, J 1999, 'CAEP: Classification by Aggregating Emerging Patterns', technical report, March 1999.
[7] Webb, GI 2000, 'Efficient search for association rules', ACM, Boston, Massachusetts, United States.
[8] Quinlan, JR 1993, C4.5: Programs for Machine Learning, Morgan Kaufmann, San Mateo, CA.
[9] Ruggles, S & Sobek, M 1997, Integrated public use microdata series: Version 2.0, <http://www.ipums.umn.edu/>.
[10] Agrawal, R, Imielinski, T & Swami, A 1993, 'Mining associations between sets of items in massive databases', in Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 207-216.
[11] Davies, J & Billman, D 1996, 'Hierarchical categorization and the effects of contrast inconsistency in an unsupervised learning task', in Proceedings of the Eighteenth Annual Conference of the Cognitive Science Society, p. 750.
[12] Webb, GI 1995, 'OPUS: An efficient admissible algorithm for unordered search', Journal of Artificial Intelligence Research, vol. 3, pp. 431-465.
[13] Ordonez, C, Omiecinski, E, de Braal, L, Santana, C & Ezquerra, N 2001, 'Mining constrained association rules to predict heart disease', in IEEE ICDM Conference, pp. 433-440.
[14] Domiciliary Care SA 2008, South Australia, viewed 3 November 2008, <http://www.domcare.sa.gov.au/>.
[15] Li, J, Fu, A & Fahey, P 2009, 'Efficient Discovery of Risk Patterns in Medical Data', Artificial Intelligence in Medicine, vol. 45, pp. 77-89.
[16] Petrus, K, Li, J & Fahey, P 2007, 'Comparing Decision Tree and Optimal Risk Pattern Mining for Analysing Emergency Ultra Short Stay Unit'.
[17] Ordonez, C 2006, 'Association rule discovery with the train and test approach for heart disease prediction', IEEE Transactions on Information Technology in Biomedicine, vol. 10, no. 2, pp. 334-343.
[18] Bayardo, RJ & Agrawal, R 1999, 'Mining the most interesting rules', in Fayyad, U, Chaudhuri, S & Madigan, D (eds), Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, pp. 145-154.
[19] Webb, GI & Zhang, S 2005, 'K-optimal rule discovery', Data Mining and Knowledge Discovery, vol. 10, no. 1, pp. 39-79.
[20] Zaki, MJ 2004, 'Mining non-redundant association rules', Data Mining and Knowledge Discovery, vol. 9, no. 3, pp. 223-248.
[21] Domiciliary Care SA 2008, South Australia, viewed 3 November 2008, <http://www.domcare.sa.gov.au/>.
[22] Han, J & Kamber, M 2006, Data Mining: Concepts and Techniques, Elsevier Inc., San Francisco, USA.
[23] Information Management 2002, United States, viewed 11 March 2010, <http://www.information-management.com/news/5329-1.html>.
[24] Leverage 2009, Australia, viewed 13 March 2010, <http://www.giwebb.com/Doc/MOleverage.html>.
[25] Bayardo, RJ, Agrawal, R & Gunopulos, D 1999, 'Constraint-based rule mining in large, dense databases', in Proceedings of the 15th International Conference on Data Engineering.
[26] Rymon, R 1992, 'Search through systematic set enumeration', in Third International Conference on Principles of Knowledge Representation and Reasoning.
[27] P-Value 2010, Wikipedia, viewed 18 March 2010, <http://en.wikipedia.org/wiki/P-value>.
[28] Triola, MM & Triola, MF 2004, Biostatistics for the Biological and Health Sciences, 2nd edn, Boston.
[29] Wikipedia 2008, Odds ratio, viewed 4 November 2009, <http://en.wikipedia.org/wiki/Odds_ratio>.
[30] Association Discovery with Magnum Opus 4.3, Australia, viewed 13 June 2010, <http://www.giwebb.com/tutorialfr.html>.
[31] Wikipedia 2010, Attribute-value pair, viewed 1 August 2010, <http://en.wikipedia.org/wiki/Attribute-value_pair>.