Investigation of sub-patterns discovery and its applications

School of Computer and Information Science (Honours)
Applying pattern discovery methods to healthcare data

Academic Supervisor: Associate Prof Jiuyong Li
Student: Xun Lu
ID: 100047661
Email: [email protected]

Disclaimer

I declare the following to be my own work, unless otherwise referenced, as defined by the University's policy on plagiarism.

Xun Lu

Abstract

Data mining is one of the most exciting information science technologies of the 21st century. It has become an important mechanism for interpreting the information hidden in data as human-understandable knowledge, and it is heavily involved in a wide range of profiling practices, including finance, marketing, bioinformatics, genetics and medicine. The data to be studied, in terms of their properties and relations, can vary greatly, from relational data, sequential data, graphs and models to classifiers, or combinations of these. Different data mining methods and algorithms can be adopted to analyse different forms of data presentation so that the results are interpretable and understandable.

Contrast pattern mining, or more generally contrast group mining or contrast set mining, is one of the most challenging and vital techniques in data mining research. Patterns, or groups, are collections of items which satisfy certain properties that carry interesting information [1]. In other words, patterns represent different classes of objects, for example American males and Russian males, or the income changes from 2004 through 2009. Contrast patterns are conjunctions of attributes and values that differ meaningfully in their distribution across groups [2]. Contrast patterns of various kinds differ greatly, for example pattern and rule based contrasts, data cube contrasts, sequence based contrasts, graph based contrasts and model based contrasts. However, no single paper or research project has laid the emphasis on comparing the similarities and differences between them. This research, therefore, is intended to make a clear and comprehensive comparison of different contrast pattern techniques. It first provides background knowledge, which gives a grounding in data mining; annotations on the relevant literature are then given, along with a summary of the deficiencies of the algorithms implemented for various contrast sets. The thesis also provides a critical survey of existing contrast pattern discovery methods.

One of the major data sources used in the research is from Domiciliary Care SA, a government organisation which cares for people with disabilities and the elderly. The different algorithms discussed in this thesis all use the same data source from Domiciliary Care SA, so that the results generated are comparable. A detailed description of the data is presented in Chapter 4.

Key words: data mining, contrast patterns, association rules, risk patterns, subgroup discovery.
Contents

Chapter 1 Introduction
  1.1 Background
  1.2 Motivation
  1.3 Research Questions
  1.4 Thesis Plan
Chapter 2 Literature survey
  2.1 Basic Knowledge
  2.2 Related work
Chapter 3 Methodology
  3.1 Distinguishing different data mining techniques
    3.1.1 What is correlation analysis, and why do we need it?
    3.1.2 What are the correlation measures?
      3.1.2.1 Chi-square χ2
      3.1.2.2 Lift
      3.1.2.3 Leverage
  3.2 Distinguishing different data mining algorithms
    3.2.1 STUCCO
      3.2.1.1 What is STUCCO
      3.2.1.2 What technique does it use and how does it work
      3.2.1.3 Advantages
      3.2.1.4 How STUCCO determines significant contrast sets
      3.2.1.5 How does STUCCO do the pruning
        3.2.1.5.1 Effect size pruning
        3.2.1.5.2 Interestingness based pruning
    3.2.2 Magnum Opus
      3.2.2.1 What is Magnum Opus
      3.2.2.2 What technique does it use
      3.2.2.3 Pruning technique from the OPUS algorithm
    3.2.3 MORE
      3.2.3.1 What is MORE
      3.2.3.2 Risk Patterns
      3.2.3.3 What technique does MORE use?
      3.2.3.4 Advantages
      3.2.3.5 Results generated from MORE
Chapter 4 Data description
  4.1 Data
  4.2 Description of each data field
Chapter 5 Mining results discussion
  5.1 Algorithm MORE
    5.1.1 Data preparation for algorithm MORE
    5.1.2 Result discussion for algorithm MORE
  5.2 Algorithm Opus
    5.2.1 Data preparation for algorithm Opus
    5.2.2 Result discussion for algorithm Opus
Chapter 6 Conclusion and future work
Appendix A – Annotated Bibliography
Appendix B – Two other correlation measures: all_confidence and cosine
Appendix C – OPUS Algorithms
Appendix D – Background knowledge relating to algorithm MORE
Appendix E – Data Field Description
References

List of Figures and Tables

Figure 2.1 Comparing UCI applications over 1993-1998
Figure 2.2 Association rules for Bachelor and PhD degree holders
Figure 3.1 Illustration of how the leverage of a rule is computed
Figure 3.2 Attribute-value pairs in a set-enumeration tree structure
Figure 3.3 A top-down pruning on a tree structure
Figure 3.4 Half-way through the pruning process
Figure 3.5 The outcome of ordinary pruning
Figure 3.6 A re-ordered tree structure for OPUS
Figure 3.7 The outcome of OPUS pruning
Figure 5.1 A snapshot of the .data file for both MORE and Magnum Opus
Figure 5.2 The .name file for application Li-rule
Table 2.1 Mushroom table
Table 3.1 2×2 contingency table
Table 3.2 2×2 contingency table (with expected values)
Table 3.3 RR can be memorised easily from a contingency table
Table 4.1 Data source from Domiciliary Care SA

Chapter 1 Introduction

1.1 Background

Data mining is one of the most exciting information science technologies of the twenty-first century. It has become an important mechanism for interpreting the information hidden in data as human-understandable knowledge. It is heavily involved in a wide range of profiling practices, including finance, marketing, bioinformatics, genetics, engineering and medicine.
The data to be studied, according to their properties and relations, can vary greatly, from relational data and sequential data to graphs, models, classifiers, or combinations of these. Different data mining methods and algorithms can be adopted to analyse different forms of data presentation so that the results are interpretable and understandable.

1.2 Motivation

Why is contrast pattern research being undertaken so intensively? A comment from the US cartoon Get Fuzzy by Darby Conley (2001) may shed some light on this question:

Sometimes it is good to contrast what you like with something else. It makes you appreciate it even more.

A general example: if someone is only looking for the highest mountain in the world, then even after he has successfully found it, e.g. Mount Everest, he will probably not gain much insight into it. However, if it is compared with other mountains, such as the Alps, he may gain more understanding of the tallest mountain he is researching. A more specialised example: focusing on the sales changes from 1998 through 2008 in just one department is likely to be less informative than comparing the sales figures over the same period across two or more similar departments.

The two examples above tell us that by comparing or contrasting with other objects, more information can be discovered. However, contrast patterns of various kinds differ greatly, for example pattern and rule based contrasts, data cube contrasts, sequence based contrasts, graph based and model based contrasts. Each contrast pattern mining method aims to handle a different data presentation, e.g. relational data, sequential data, graphs, models, classifiers, or combinations of these, and different methods and algorithms are involved in solving different types of contrast pattern. Currently, [3] classifies contrast patterns into the following categories:

- Pattern/Rule based contrasts
  - Border differential algorithm
  - Tree based algorithm
  - Projection based algorithm
  - ZBDD based algorithm
- Contrasts for rare class datasets
  - Synthesising
- Data cube contrasts
  - Basic OLAP operations
  - LiveSet-Driven Algorithm
- Sequence based contrasts
  - g-MDS Mining Problem
  - ConSGapMiner algorithm
- Graph based contrasts
- Model based contrasts
  - Rand index and Jaccard index
  - Mutual information
  - Clustering error

Under each first level point, the second level points list some of the methods or algorithms related to solving the corresponding contrast patterns. There are, however, still many more methods and algorithms that have not been listed here, which may easily cause confusion about when to use them. Given one contrast mining problem, what methods are available? What are the pros and cons of using a certain method? Are there any improvements to the methods? This research, therefore, is intended to make a clear and comprehensive comparison of different contrast pattern techniques and to provide improvements where possible. This leads to the research questions of the thesis.

1.3 Research Questions

This thesis aims to address the following research question: How can contrast patterns be discovered efficiently in large data sets? This research question can be broken down into three sub-questions which make the problem more precise and manageable.

1. What methods are available?
2. What are their strengths and weaknesses?
3. How can the methods be improved?

1.4 Thesis Plan

The thesis contains the following sections. Section One provides some basic knowledge, which gives a grounding in data mining. Section Two annotates the relevant literature. Section Three provides a critical evaluation of existing contrast pattern discovery methods and explains the methods that will be adopted to answer the research questions. Section Four, the final section, discusses the test results and concludes the thesis.

Chapter 2 Literature survey

This chapter gives basic knowledge and an overview of the early works related to the research topic.

2.1 Basic Knowledge

The basic knowledge for this research is drawn from various sources in the literature.

Two basic concepts of data mining are Support and Confidence. They are two measurements of an association rule: support represents the usefulness and confidence reflects the certainty of discovered rules [22]. For example, digital camera ⇒ memory card (support 2%, confidence 54%) means that transactions in which a digital camera and a memory card are purchased together make up 2% of all transactions (support), and that 54% of customers who buy a digital camera also buy a memory card (confidence).

Many data mining algorithms can be involved in discovering patterns, such as association rules, decision trees, clustering, regression, etc. Association rules are one of the most popular techniques in data mining [10]. They take an "If...Then..." form [16]. Take the association rule Bread ⇒ Milk as an example: "IF a customer purchases bread, THEN this customer is likely to purchase milk". Association rules exhaustively look for hidden patterns, making them suitable for discovering predictive rules involving subsets of the medical data set attributes [13]. This exhaustiveness, on the other hand, is also a problem of association rules [15]: they tend to discover too many rules that are trivial and repetitive, particularly when the minimum support is set low. Some other methods, such as mining non-redundant association rules [20], association rule discovery with the train and test approach [17], mining the most interesting rules [18] and mining k-optimal rules [19], try to overcome this problem; however, they still suffer from other deficiencies, for example, they are difficult for end users to understand [15].

The fundamental task of data analysis is to understand the meaning of, and the differences between, contrasting groups. This led to the development of a new special purpose data mining technique, contrast-set mining [4]. Contrast pattern mining, or more generally contrast group mining or contrast set mining, is one of the most challenging and vital techniques in data mining research.

Definition 1: Patterns (or itemsets) are collections of items which satisfy certain properties that carry interesting information [1]. In other words, patterns represent different classes of objects.

Example 1: {computer, digital camera, memory card}

In Example 1, computer, digital camera and memory card are three items. This three-item transaction from a computer store is a pattern. In other words, each single transaction can be regarded as a pattern.
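To make the two measures concrete, the following is a minimal sketch (mine, not from the cited literature) of how support and confidence can be computed over a list of transactions; the transaction data are invented for illustration:

```python
def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimated P(consequent | antecedent): supp(A ∪ C) / supp(A)."""
    return (support(set(antecedent) | set(consequent), transactions)
            / support(antecedent, transactions))

# Hypothetical transaction data for illustration only.
transactions = [
    {"bread", "milk"}, {"bread", "milk", "sugar"},
    {"bread"}, {"milk", "sugar"}, {"bread", "milk"},
]
print(round(support({"bread", "milk"}, transactions), 2))       # 0.6
print(round(confidence({"bread"}, {"milk"}, transactions), 2))  # 0.75
```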
Before giving the definition of contrast patterns, it is necessary to define what an attribute-value pair is.

An attribute-value pair is a fundamental data representation in which each record may be expressed as a collection of tuples <attribute name, value>; each element is an attribute-value pair [31].

Definition 2: Contrast patterns are conjunctions of attribute-value pairs that differ meaningfully in their distribution across groups [2].

Example 2.1: MaritalStatus = divorced ∧ Age ∈ [60, 69]

Example 2.2: Degree = Ph.D ∧ Gender = female ∧ Income ≥ $10,000

In Example 2.1, the contrast pattern contains two attribute-value pairs, i.e. people who are divorced and people who are aged between 60 and 69; the conjunction of these two pairs forms a contrast pattern. In Example 2.2, the contrast pattern consists of three attribute-value pairs, i.e. people who hold a PhD degree, who are female, and whose income is over ten thousand dollars. Anyone who falls into this contrast pattern satisfies the attributes defined within it, i.e. a female PhD holder whose income is above $10,000.

Contrasting specific groups of interest plays a key role in social science research. Bay & Pazzani (2001, p.213) aim to automatically detect all the differences between contrasting groups from observational multivariate data. Following Bay and Pazzani (2001, p.217), the data is a set of groups G1, G2, ..., Gl, where each group is a collection of objects O1, O2, ..., On, and each object Oi is a set of k attribute-value pairs, one for each of the attributes A1, A2, ..., Ak. Attribute Aj has values drawn from the value domain set {Vj1, Vj2, ..., Vjm}. They search for contrast sets: conjunctions of attributes and values that have different levels of support in different groups.

According to [4], a contrast set is a set of attribute-value pairs with no attribute Ai occurring more than once. This is equivalent to an itemset in association-rule discovery when applied to attribute-value data. [4] also states that contrast set discovery is the task of finding all contrast sets whose support differs meaningfully across groups. This is defined as seeking all contrast sets cset that meet both of the following requirements:

Eq. 1: $\exists ij : P(\mathit{cset} \mid G_i) \neq P(\mathit{cset} \mid G_j)$

Eq. 2: $\max_{ij} |\mathrm{supp}(\mathit{cset}, G_i) - \mathrm{supp}(\mathit{cset}, G_j)| \geq \delta$

where δ is a user-defined threshold called the minimum support-difference. Contrast sets that meet the first requirement (Eq. 1) are called significant; contrast sets that meet the second requirement (Eq. 2) are called large. It is also important to note that a contrast set must differ meaningfully across groups: while Eq. 1 provides the basis of a statistical test of 'meaningful', Eq. 2 provides a quantitative test thereof. If both requirements are met, the contrast set is called a deviation.

Eq. 1 means that the probability of the contrast set (cset) in some group i must differ from its probability in some other group j; in other words, the contrast set represents a true difference between groups. In Eq. 2, supp(cset, Gi) denotes the support of the contrast set in group i, and supp(cset, Gj) its support in group j. The maximum difference between these supports must be at least the pre-defined minimum support-difference, denoted δ. This requirement ensures that every contrast pattern that is discovered has a big enough effect to be considered important by users [2].
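As a minimal sketch of the second requirement (the statistical test in Eq. 1 is deliberately omitted here; a real implementation such as STUCCO applies a chi-square test, discussed in Chapter 3):

```python
def is_large(cset_supports, delta):
    """Eq. 2: the maximum pairwise support difference across groups
    must be at least delta.

    cset_supports: list of supp(cset, G_i), one value per group, each
    the fraction of that group's records matching the contrast set.
    """
    return max(cset_supports) - min(cset_supports) >= delta

# Hypothetical supports of one contrast set in three groups.
supports = [0.42, 0.15, 0.38]
print(is_large(supports, delta=0.2))  # True: 0.42 - 0.15 = 0.27 >= 0.2
```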
According to [1], a recently proposed kind of pattern for knowledge discovery from databases is the Emerging Pattern (EP). EPs, which capture significant changes and differences between datasets, are defined as itemsets whose supports increase significantly from one dataset, D1, to another, D2 [5]. More precisely, EPs are itemsets whose growth rates are larger than a given threshold ρ, where the growth rate is the ratio of an itemset's support in D2 over its support in D1 [5]. EPs can capture useful contrasts between classes, for example male vs. female, or poisonous vs. edible. The following is a typical example [3] which shows what the growth rate is and how an EP captures the significant change for itemset X, predicting whether a mushroom is edible or not:

X = {(Odour = none), (Stalk_Surface = smooth), (Ring_Number = one)}

  EP   Supp. in Poisonous   Supp. in Edible   Growth Rate
  X    0.2%                 57.6%             288

Table 2.1: Mushroom table

As can be seen from the table, the support increases drastically from just 0.2% among poisonous mushrooms to 57.6% among edible ones, a growth rate of 288. We can predict that a mushroom matching X is 99.6% likely to be edible, since 99.6% = 57.6% ÷ (57.6% + 0.2%). EPs with very large growth rates are notable differentiating characteristics, here between the edible and poisonous mushrooms, and they have been useful for building powerful classifiers [5]. The advantages of EPs are that they can be easily understood and used directly by people, and EPs have been widely used for predicting the likelihood of diseases.
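A minimal sketch of the growth-rate computation behind Table 2.1 (the supports are taken from the table; the function names are mine):

```python
def growth_rate(supp_d1, supp_d2):
    """Growth rate of an itemset from dataset D1 to D2: supp2 / supp1."""
    if supp_d1 == 0:
        return float("inf")  # itemset absent from D1: a "jumping" EP
    return supp_d2 / supp_d1

def class_likelihood(supp_d1, supp_d2):
    """Naive estimate used in the mushroom example:
    supp2 / (supp1 + supp2)."""
    return supp_d2 / (supp_d1 + supp_d2)

print(round(growth_rate(0.002, 0.576)))          # 288
print(round(class_likelihood(0.002, 0.576), 3))  # 0.997, i.e. ~99.6% edible
```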
There are two different types of groups in data mining (Bay and Pazzani, 2001, p.214). One is based on time, with groups falling into different time periods, which forms a trend: for instance, observations spaced through time with one observation per time point, as in the examples shown in Figure 2.1. The other type is based on multiple observations at a few discrete points in time; for example, we could have thousands of World Cup soccer match data records from the years 1998, 2002 and 2006.

[Figure 2.1 (a) and (b)]

Fig. 2.1: Comparing UCI applications over 1993-1998. (a) Admitted ICS students with SAT Math > 700 ∧ SAT Verbal > 700; (b) UCI applicants with Admit = Yes ∧ Home Location = Los Angeles County ∧ School Type = Public. (Bay & Pazzani 2001, p.214)

Other mining approaches, such as decision trees and rule learners*, can also be used to discover patterns. Despite the advantage of being fast, [2] comments on their weaknesses:

- Rule learners and decision trees are not complete;
- Rule learners and decision trees may miss groups which are important;
- The interpretation of a rule may be difficult if all previous rules were not satisfied;
- It is difficult to specify useful criteria in the classification framework.

However, [2] fails to provide more detail on why they have these disadvantages, which leaves a very good opportunity for my research to investigate them further.

* Decision trees, rule learners and some other typical pattern mining methods are presented in more detail in Section 2.2, Related work.

Association rule mining [10] is closely related to contrast set discovery. Association rules are relations between variables of the form X ⇒ Y, where X and Y can represent any items, such as beer and diapers respectively; the rule then means that people who bought beer also bought diapers. X and Y can also represent categorical data, such as Degree = PhD or Income ≥ $10,000; then X ⇒ Y means that people who are PhD degree holders earn over $10,000. According to [2], the techniques for finding association rules and for mining contrast sets have many commonalities, because both require search through a space of conjunctions of items or attribute-value pairs. In association rule mining, we look for sets whose support is greater than a certain threshold; for contrast sets, on the other hand, we look for sets which represent substantial differences in the underlying probability distributions. Bay and Pazzani therefore tried to apply association rule mining algorithms directly to find contrast sets, but it turned out that the results were really hard to interpret; moreover, pruning opportunities that can significantly improve mining efficiency may be lost.

[Figure 2.2 (a) and (b)]

Fig. 2.2: Association rules for Bachelor and PhD degree holders. Rules are in the form X ⇒ Y (support, confidence). (a) First 10 of 26796 association rules for Bachelor holders; (b) First 10 of 1674 association rules for PhD holders. [1]

As can be seen from Fig. 2.2, it is really difficult to draw a conclusion about the relationship between Bachelor and PhD degree holders because:

- There are too many rules to compare;
- The results are difficult to interpret, as they are not consistent contrasts* [11];
- A proper statistical comparison is needed to see whether the differences in support and confidence are significant.

* A consistent contrast means the attributes compared in separate groups are the same.

To handle these problems, [2] presents a method called STUCCO (Search and Testing for Understandable Consistent Contrasts) which runs efficiently and can mine at low support differences without being overwhelmed with large itemsets. [4] tries to show that contrast-set mining is a special case of the more general rule-discovery task by comparing three alternative data mining techniques: STUCCO, Magnum Opus and C4.5. These three methods were selected because STUCCO is the only data mining approach designed specifically for identifying contrasts between groups, while Magnum Opus and C4.5 are considered well suited to performing this type of contrast analysis. Magnum Opus is a general purpose rule-discovery system which implements the OPUS_AR rule-discovery algorithm [4]. It does not require the specification of a minimum-support constraint because it does not use the frequent-itemset strategy; however, it has association-rule functionality. C4.5 [8], on the other hand, discovers classification rules by first inducing a decision tree, then converting that tree to an equivalent set of rules, and then simplifying those rules.

After observing the results produced by the three data mining techniques (STUCCO, Magnum Opus and C4.5) independently, [4] found that Magnum Opus produced rules corresponding to all contrast sets found by STUCCO. Contrast set discovery, and general rule discovery constrained to use only the group identifier as the consequent, both seek to identify equivalent situations; they differ only in how they assess whether a group difference is meaningful. In addition, according to [4], the main distinction between STUCCO and Magnum Opus is in the application of filters that seek to identify and remove spurious contrast sets.
Magnum Opus uses a binomial sign test, while STUCCO uses a chi-square test, which is believed to be a better test for the contrast-discovery task because it is more sensitive to a small range of extreme forms of contrast.

One of the tasks of this research is to answer how to efficiently find the patterns (i.e. association rules, A (antecedent) ⇒ C (consequent)) in a data set. There are many ways of improving the efficiency of association rule discovery algorithms. [7] argues that for some applications a direct search for association rules can be more efficient than Apriori-style algorithms. A more detailed description is given in Chapter 3.

2.2 Related work

A related work that I have been doing since 2008 is investigating risk patterns in a data set provided by Domiciliary Care SA. Domiciliary Care SA is a state government organisation that provides services to people with a reduced ability to care for themselves, helping them to stay in their own homes, living close to their loved ones, family and local community [21].

This work was about analysing the results (risk patterns) generated by the Li-rule application. The application uses the MORE (Mining Optimal Risk pattErn sets) algorithm [15]. What makes the MORE algorithm stand out from other association rule mining algorithms is that it makes use of the anti-monotone property to efficiently prune the search space. The anti-monotone property of frequent itemsets is defined here: an itemset is frequent if its support is greater than the minimum support, and an itemset is potentially frequent only if all its subsets are frequent [15]. This property limits the number of itemsets to be searched and, consequently, improves the efficiency of the algorithm. Chapter 5 will present and discuss the results generated by the Li-rule application.
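A minimal sketch of how the anti-monotone property just described prunes candidates in a levelwise (Apriori-style) search; this is a generic illustration, not the MORE implementation itself:

```python
from itertools import combinations

def frequent_itemsets(transactions, min_supp):
    """Levelwise search: a k-itemset is a candidate only if all of its
    (k-1)-subsets were frequent (the anti-monotone property)."""
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    frequent = {frozenset([i]) for i in items
                if sum(i in t for t in transactions) / n > min_supp}
    result, current = set(frequent), frequent
    while current:
        k = len(next(iter(current)))
        candidates = {a | b for a in current for b in current
                      if len(a | b) == k + 1}
        # Prune: drop any candidate with an infrequent (k)-subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in result
                             for s in combinations(c, k))}
        current = {c for c in candidates
                   if sum(c <= t for t in transactions) / n > min_supp}
        result |= current
    return result

ts = [{"milk", "tea"}, {"milk", "sugar"}, {"milk"}, {"tea", "sugar"}]
print(frequent_itemsets(ts, min_supp=0.4))  # the three single items
```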
Chapter 3 Methodology

3.1 Distinguishing different data mining techniques

Before comparing the following techniques, it is essential to understand the meanings and characteristics of the measures used to assess rules, i.e. the correlation measures.

3.1.1 What is correlation analysis, and why do we need it?

Using the association rule measurement A ⇒ B [support, confidence] is a very basic and fundamental approach to determining whether a rule is interesting to users. However, because of the limits of this measurement, there is a possibility that the results are misleading or incorrect. To overcome this, rather than using only A ⇒ B [support, confidence], we add one more measurement, correlation, giving correlation rules of the form A ⇒ B [support, confidence, correlation]. A number of correlation measures are widely used, such as lift, χ2 (the chi-square test), all_confidence and cosine.

The following 2×2 contingency table is an example showing why using support and confidence alone does not suffice to make a sound determination. Consider the data below:

            Milk   ¬Milk   Σrow
  Sugar     160    140     300
  ¬Sugar    80     20      100
  Σcolumn   240    160     400

Table 3.1

In the table, "Milk" represents the transactions that contain milk, whereas "¬Milk" represents the transactions that do not contain milk; the same reading applies to "Sugar". So, among the 400 transactions in total, we have 160 transactions that contain both Milk and Sugar, 140 transactions that contain Sugar but no Milk, 80 transactions that contain Milk but no Sugar, and 20 transactions that contain neither Milk nor Sugar.

According to the association rule form A ⇒ B [support, confidence], we have Milk ⇒ Sugar [40%, 66.7%], calculated from support = 160/400 = 40% and confidence = 160/240 = 66.7%. Suppose the given minimum support threshold and minimum confidence threshold are 35% and 65% respectively. The results show that the rule Milk ⇒ Sugar not only satisfies the minimum support and confidence, it also seems very interesting and significant to users. However, if we take a closer look at the data in the table, it is not difficult to find that the overall percentage of transactions purchasing Sugar, P(Sugar), is 300/400 = 75%, which is much higher than the confidence of purchasing Sugar given Milk (66.7%). This means that purchasing Milk does not increase the likelihood of purchasing Sugar; Milk in fact negatively affects the likelihood of buying Sugar.

The example above shows that the confidence of a rule A ⇒ B can be misleading, in that it is only an estimate of the conditional probability of itemset B given itemset A [22]. It does not reflect the actual relationship between itemsets A and B. This is the reason why correlation analysis is introduced alongside association analysis.

3.1.2 What are the correlation measures?

Some popular correlation measures are χ2 (chi-square), lift, all_confidence and cosine. The following illustrates how each measure is calculated and how it determines whether a given rule is interesting; their advantages and disadvantages are discussed at the end.

3.1.2.1 Chi-square χ2

Chi-square is used for calculating the correlation between two attributes A and B with categorical data [22]. The result shows whether attributes A and B are independent or not. It is computed as:

$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}$  (Equation 1)

where $o_{ij}$ is the observed count of the joint event $(A_i, B_j)$ and $e_{ij}$ is the expected frequency of $(A_i, B_j)$, computed as:

$e_{ij} = \frac{count(A = a_i) \times count(B = b_j)}{N}$  (Equation 2)

where N is the total number of records, count(A = a_i) is the number of rows which have value a_i for A, and similarly count(B = b_j) is the number of rows which have value b_j for B. The χ2 statistic sums over all the r rows × c columns of the contingency table.

To illustrate how χ2 works, we again use the same 2×2 contingency table:

            Milk   ¬Milk   Σrow
  Sugar     160    140     300
  ¬Sugar    80     20      100
  Σcolumn   240    160     400

In the table, we have two attributes, Milk and Sugar, and the observed frequencies are shown in each cell. Now we need to calculate the expected frequency of every joint event $(A_i, B_j)$.
According to Equation 2, the expected frequency for cell (Milk, Sugar) is:

$e_{Milk,Sugar} = \frac{count(Milk) \times count(Sugar)}{N} = \frac{240 \times 300}{400} = 180$

The expected frequency for cell (¬Milk, Sugar) is:

$e_{\neg Milk,Sugar} = \frac{count(\neg Milk) \times count(Sugar)}{N} = \frac{160 \times 300}{400} = 120$

The expected frequency for cell (Milk, ¬Sugar) is:

$e_{Milk,\neg Sugar} = \frac{count(Milk) \times count(\neg Sugar)}{N} = \frac{240 \times 100}{400} = 60$

And lastly, the expected frequency for cell (¬Milk, ¬Sugar) is:

$e_{\neg Milk,\neg Sugar} = \frac{count(\neg Milk) \times count(\neg Sugar)}{N} = \frac{160 \times 100}{400} = 40$

            Milk        ¬Milk       Σrow
  Sugar     160 (180)   140 (120)   300
  ¬Sugar    80 (60)     20 (40)     100
  Σcolumn   240         160         400

Table 3.2

Each expected value is written in brackets in the corresponding cell of the table. It should be noted that the sum of the expected frequencies is equal to the sum of the observed counts. Now we are able to calculate the chi-square value using Equation 1:

$\chi^2 = \frac{(160-180)^2}{180} + \frac{(140-120)^2}{120} + \frac{(80-60)^2}{60} + \frac{(20-40)^2}{40} = 2.22 + 3.33 + 6.67 + 10 = 22.22$

For this 2×2 table, the degrees of freedom are (2-1) × (2-1) = 1. For 1 degree of freedom, the χ2 value needed to reject the independence hypothesis at the 0.001 significance level is 10.828 [22]. Since the chi-square value we calculated is higher than 10.828, attributes Milk and Sugar are not independent; in other words, they are correlated.

3.1.2.2 Lift

"Lift" is probably the most commonly used metric to measure the performance of targeting models in marketing applications [23]. Compared to chi-square, lift is a simple correlation measure that does not require much calculation to tell whether two attributes are correlated (dependent) or not. It is computed as:

$lift(A, B) = \frac{P(A \cup B)}{P(A) P(B)}$  (Equation 3)

Lift measures whether the given two attributes, A and B, are independent. At this stage, we are only concerned with whether the value of lift is equal to 1, greater than 1, or less than 1.

① When lift(A, B) = 1, i.e. $P(A \cup B) = P(A) P(B)$: the probability of A and B occurring together is equal to the probability of A times the probability of B, so A and B are independent.

② When lift(A, B) > 1, i.e. $P(A \cup B) > P(A) P(B)$: we say A and B are positively correlated (dependent), which means the occurrence of one implies the occurrence of the other.

③ When lift(A, B) < 1, i.e. $P(A \cup B) < P(A) P(B)$: we say A and B are negatively correlated (dependent), which means the occurrence of one makes the occurrence of the other less likely.

Both situations ② and ③ tell us that attributes A and B are dependent; so how do positively correlated and negatively correlated differ in practice? To answer this question, and to resolve the problem discussed in Section 3.1.1, let us use the Milk and Sugar example again:

$lift(Milk, Sugar) = \frac{P(Milk \cup Sugar)}{P(Milk) P(Sugar)} = \frac{160/400}{(240/400) \times (300/400)} = \frac{0.4}{0.45} = 0.89$

The computed lift of 0.89 is less than 1, which tells us that Milk and Sugar are negatively correlated. What this actually means is that customers who buy Milk in one transaction are less likely to buy Sugar, and customers who buy Sugar in one transaction are less likely to buy Milk. Because Milk and Sugar negatively influence each other, they are unlikely to occur together.
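A minimal sketch of both measures over a 2×2 contingency table, reproducing the worked numbers above (function and variable names are mine):

```python
def chi_square(table):
    """Pearson chi-square for an r x c contingency table of observed counts."""
    n = sum(map(sum, table))
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    return sum((obs - rt * ct / n) ** 2 / (rt * ct / n)
               for row, rt in zip(table, row_totals)
               for obs, ct in zip(row, col_totals))

def lift(p_ab, p_a, p_b):
    """lift(A, B) = P(A and B) / (P(A) * P(B))."""
    return p_ab / (p_a * p_b)

# Table 3.1: rows = Sugar/¬Sugar, columns = Milk/¬Milk.
table = [[160, 140], [80, 20]]
print(round(chi_square(table), 2))     # 22.22
print(round(lift(0.4, 0.6, 0.75), 2))  # 0.89
```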
We can also get the same explanation by looking at the fraction 0.4/0.45. The value 0.4 represents the probability of milk and sugar being purchased together, whereas the value 0.45 represents what that likelihood would have been if the two purchases were completely independent [22]. This clearly explains why, although the confidence of Milk ⇒ Sugar is high (66.7%), it does not necessarily mean that the association rule is strong, or that the probability of purchasing sugar is increased by the purchase of milk. It also shows, again, that results generated using support and confidence alone can be misleading or even incorrect.

3.1.2.3 Leverage

Leverage is the correlation measure used in the algorithm behind Magnum Opus. By definition [24], "The leverage of a rule is the number of additional cases covered by both the LHS (Left-Hand-Side) and RHS (Right-Hand-Side) above those expected if the LHS and RHS were independent of each other. This is a measure of the importance of the rule that reflects both the strength and the coverage of the rule."

Here is an example to illustrate how the leverage of a rule A (LHS) ⇒ B (RHS) is computed:

[Figure 3.1: two overlapping sets, LHS (200 cases) and RHS (100 cases), sharing 50 cases, out of 1000 cases in total]

Consider the fact that there are 1000 items (i.e. cases) in total, item A (LHS) covers 200 cases, and item B (RHS) covers 100 cases. The co-occurrences of items A and B (A ∪ B) number 50. Obviously, the proportion of A ∪ B is 50/1000 = 5%; this is the proportion from the raw count. The expected proportion of A ∪ B, assuming item A is independent of item B, is the proportion of A times the proportion of B: (200/1000) × (100/1000) = 2%, which is 20 cases (2% × 1000 = 20). Hence:

Leverage count: 50 − 20 = 30
Leverage proportion: 30/1000 = 0.03

As the definition of leverage implies, the computed value indicates the importance of the rule: the higher the leverage, the more important the rule. This is quite straightforward. In the example above, the leverage count of the rule A ⇒ B is 30, which means items A and B are correlated; otherwise there would have been no difference between the raw count and the expected value, i.e. a leverage count of zero, which would mean items A and B are independent.

There are other useful correlation measures, such as all_confidence and cosine. However, these measures are not applied in the data mining algorithms discussed in this thesis; a detailed comparison between them and the three measures above is given in Appendix B.
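Before moving on, here is a minimal sketch of the leverage computation, reproducing the numbers from the example above:

```python
def leverage(count_ab, count_a, count_b, n):
    """Leverage of A => B: observed joint proportion minus the
    proportion expected under independence."""
    return count_ab / n - (count_a / n) * (count_b / n)

# Example from Figure 3.1: 1000 cases, |A| = 200, |B| = 100, |A and B| = 50.
lev = leverage(50, 200, 100, 1000)
print(round(lev, 2))        # 0.03 (proportion)
print(round(lev * 1000))    # 30   (leverage count)
```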
3.2 Distinguishing different data mining algorithms

Often, once all the contrast sets that satisfy the significance and largeness requirements are found, they may still not meet the real needs of users, because most of the contrast sets found are of no interest to them. It is an open challenge in data mining to decide whether a given contrast set is interesting. Certain techniques must be applied to handle this problem; they are generally called post-pruning techniques. This chapter focuses on three algorithms which use different methods to post-prune uninteresting contrast sets: STUCCO, from Stephen D. Bay and Michael J. Pazzani; Magnum Opus, from Geoffrey Webb; and MORE, from Jiuyong Li.

3.2.1 STUCCO

3.2.1.1 What is STUCCO

STUCCO is an acronym for Search and Testing for Understandable Consistent Contrasts. It is essentially a pruning rule incorporated with heuristic functions to limit the complexity of searching for contrast sets in a huge search space [2]. For example, if the given support threshold is set very low, a huge number of qualifying contrast sets results, which increases the difficulty and slows the search speed of the mining algorithm. In practice, however, STUCCO is able to handle this efficiently, as discussed in detail in Section 3.2.1.2.

3.2.1.2 What technique does it use and how does it work

Let us first examine attribute-value pairs represented in a set-enumeration tree structure [25, 26]. The domain of the attributes is {milk, tea, sugar, coffee}.

[Figure 3.2: attribute-value pairs in a set-enumeration tree structure]

We can regard each set enclosed in curly brackets as the items purchased in one transaction. For example, the node {tea, sugar, coffee} means these three items appeared together in one of the customers' transactions. STUCCO organises the nodes with the same parent into one group [2]. For example, the nodes {tea, sugar} and {tea, coffee} at the third level from the top belong to one group, because they have the same parent, also called the heading, {tea}. Similarly, the nodes {milk, tea, sugar} and {milk, tea, coffee} belong to one group with the heading {milk, tea}, which, in turn, shares the same heading {milk} with the two nodes {milk, sugar} and {milk, coffee}.

Within each group, STUCCO also maintains two lists of items [2]: the head h(g) and the tail t(g). The head h(g) is what we just demonstrated: the common prefix of one group. The tail t(g) is all the items apart from the prefix h(g). For example, consider the group {milk, tea}, {milk, sugar}, {milk, coffee}: the head h(g) is {milk} and the tail t(g) is {tea, sugar, coffee}.

3.2.1.3 Advantages

As mentioned above, the purpose of STUCCO organising the items in each group into two lists, i.e. the head h(g) and the tail t(g), is to improve the efficiency of searching for interesting itemsets.

First, the headings: given any itemset, we simply check whether its prefix matches any of the predefined headings. If it does not, we can immediately judge that this itemset does not meet our requirements, and we do not need to check each combination of items in the itemset. For example, if we already know that the itemset {milk, sugar} does not meet our requirements, then when given another itemset {milk, sugar, coffee} we can be sure that this itemset is not interesting, because it matches the heading {milk, sugar}, which has already been identified as uninteresting. In other words, STUCCO only examines the itemsets that match an interesting prefix and ignores those that do not, so the search speed is improved.

The second advantage of keeping the head and tail is to derive upper and lower bounds on the support. This is a very important and useful technique in the pruning method. The bounds are computed by counting the support of h(g) ∪ {i} for every item i in the tail t(g) [2]. The upper bound of the support is the maximum value the support of a given itemset can possibly take: we know that the support of any itemset is less than or equal to the support of each of its subsets. For example, if the support of {sugar, coffee} is 60%, then we can be sure that the support of its superset {sugar, coffee, milk} cannot be greater than 60%.
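A minimal sketch of this anti-monotone bound: the support of a candidate itemset can be bounded above by the smallest support among its already-counted subsets (illustrative code, not STUCCO's actual data structures):

```python
def support_upper_bound(itemset, known_supports):
    """Upper bound on supp(itemset): the support of an itemset is at
    most the support of each of its subsets, so take the minimum over
    all known subsets."""
    bounds = [s for subset, s in known_supports.items()
              if subset <= itemset]
    return min(bounds) if bounds else 1.0

known = {frozenset({"sugar", "coffee"}): 0.60,
         frozenset({"milk"}): 0.45}
print(support_upper_bound(frozenset({"sugar", "coffee", "milk"}), known))
# 0.45: its support can exceed neither 0.60 nor 0.45
```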
In this way, STUCCO counts the support of every itemset in the group under the same heading.

3.2.1.4 How STUCCO determines significant contrast sets

By definition [27], the p-value is a statistical term that represents the probability of obtaining a test statistic at least as extreme as the one actually observed, assuming that the null hypothesis is true. The lower the p-value, the less likely the null hypothesis is to be true, and the more "significant" the result. If the p-value is less than 0.05, the null hypothesis will usually be rejected.

STUCCO applies the chi-square test to test the independence of variables in contingency tables. It then compares the resulting p-value against α, the maximum acceptable probability of rejecting the null hypothesis when it is true, usually set to 0.05 for a single test. If the p-value is less than α, the contrast set's support is not independent of group membership, i.e. the contrast set is significant.

3.2.1.5 How does STUCCO do the pruning

STUCCO applies three techniques to prune contrast sets which do not meet the requirements and interestingness criteria: effect size pruning, statistical significance pruning and interestingness based pruning. The following sections concentrate on effect size pruning and interestingness based pruning.

3.2.1.5.1 Effect size pruning

The effect size pruning technique optimises the support threshold. STUCCO prunes contrast sets whose upper bound on the support difference is below δ. The support threshold is generally decided by experienced users; however, the quality of the generated results will be reduced if the support is not set properly. If the support is set too high, many interesting contrast sets will be missed; whereas if the support is set too low, the results may contain too much trivial information of no value to users at all. Moreover, too many uninteresting contrast sets increase the search and pruning time, which in turn decreases efficiency. The following theorem [2] is the basis on which STUCCO bounds the support difference:

Theorem 1. Let U[i] be an upper bound and L[j] a lower bound on the support of a contrast set for groups i and j. Then the following is an upper bound on the support difference between any two groups:

$\delta_{max} = \max_{ij,\, i \neq j} \left( U[i] - L[j] \right)$

This equation takes every pair of groups into consideration and takes the maximum difference as the upper bound [2]. Any contrast set whose bound δmax is lower than δ will be pruned.

3.2.1.5.2 Interestingness based pruning

Some contrast sets pass the minimum support threshold but are not interesting to users, because they present no useful or new information. For example, if the contrast set Mobile_Bill_Expense > $50 ∧ Gender = Male meets the requirements and is interesting to the user, then the specialised contrast set Mobile_Bill_Expense > $50 ∧ Gender = Male ∧ Has_Mobile = Yes also meets the requirements and is interesting. But the latter provides no more information than the former, because knowing the first contrast set already implies that people with a mobile phone bill must have a mobile phone.
To conclude interestingness based pruning: STUCCO has two conditions to determine whether a contrast set should be pruned [2]:

$\forall i : P(\mathit{cset} = \mathrm{True} \mid G_i) = P(\mathit{cset}' = \mathrm{True} \mid G_i)$  (Equation 4)

$\max_i |\mathrm{supp}(\mathit{cset}, G_i) - \mathrm{supp}(\mathit{cset}', G_i)| < s$  (Equation 5)

where cset′ is a contrast set that is a specialisation of contrast set cset. Using the example above, Mobile_Bill_Expense > $50 ∧ Gender = Male ∧ Has_Mobile = Yes is a specialised contrast set of Mobile_Bill_Expense > $50 ∧ Gender = Male, which is denoted cset. It is obvious that if the support of cset′ is the same as that of cset in every group (Equation 4), then cset′ provides no new information; it simply has the same meaning as cset, and this specialised contrast set should therefore be pruned. In Equation 5, s is normally set to 1% (Bay & Pazzani 2001). This means that if the maximum support difference between a contrast set and its specialised contrast set is not greater than s, then the specialised contrast set should also be pruned.

3.2.2 Magnum Opus

3.2.2.1 What is Magnum Opus

Magnum Opus is a commercial application which applies the OPUS_AR rule discovery algorithm [4]. OPUS_AR is an extended version of the OPUS search algorithm; Magnum Opus uses the OPUS systematic search approach to perform association rule search. Where OPUS_AR differs from OPUS is that OPUS_AR does not limit the RHS of a rule to a single attribute-value pair [4]. However, this improved characteristic of OPUS_AR is not a major concern in this thesis; what will be discussed are the searching and pruning principles adopted by both OPUS_AR and OPUS.

3.2.2.2 What technique does it use

The rule measurements that Magnum Opus uses are support, confidence (strength), lift, coverage and leverage; the default measure is leverage [4]. Leverage measures the magnitude of the difference between the actual (observed) frequency count and the expected frequency count:

$\mathrm{leverage}(a \rightarrow c) = \mathrm{supp}(a \cup \{c\}) - \mathrm{supp}(a) \times \mathrm{supp}(\{c\})$

where supp(a ∪ {c}) represents the actual support of the rule a → c and supp(a) × supp({c}) is the expected value.

Magnum Opus prunes all rules a → c that do not satisfy the following condition:

$\forall x \subset a : P(c \mid a) > P(c \mid x)$

where a is a specialisation of x (equivalently, x is a generalisation of a). The condition means that if, for some generalisation x of the antecedent a, the proportion of the consequent under a is not greater than the proportion of the consequent under x, then the rule a → c should be pruned. For example, let:

x = (1, 2)
a1 = (1, 2, 4)
a2 = (1, 2, 6)

and let the consequent c be, for example, (5, 9). We then have the rules x ⇒ c, i.e. (1, 2) ⇒ (5, 9); a1 ⇒ c, i.e. (1, 2, 4) ⇒ (5, 9); and a2 ⇒ c, i.e. (1, 2, 6) ⇒ (5, 9). If the proportion (confidence) of x ⇒ c is 60% and the proportion of a1 ⇒ c is also 60%, then the rule a1 ⇒ c should be pruned, because it adds no extra useful information over its parent rule: its confidence is not greater than that of its parent.
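A minimal sketch of this pruning test (a generic illustration, not Magnum Opus's internals; it checks only the immediate generalisations formed by dropping one condition, which suffices in a top-down search where parents are tested first, and it assumes the antecedent covers at least one transaction):

```python
def confidence(antecedent, consequent, transactions):
    """P(consequent | antecedent) estimated from the data."""
    a, c = set(antecedent), set(consequent)
    covered = [t for t in transactions if a <= t]
    return sum(c <= t for t in covered) / len(covered)

def is_redundant(antecedent, consequent, transactions):
    """Prune a -> c if removing any single condition from the
    antecedent does not lower the confidence."""
    full = confidence(antecedent, consequent, transactions)
    return any(confidence(set(antecedent) - {item}, consequent,
                          transactions) >= full
               for item in antecedent)
```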
3.2.2.3 Pruning technique from the OPUS algorithm

This section focuses on how the OPUS algorithm increases its search speed by improving its pruning method. Let us use an example to illustrate the pruning technique. Using the example from Section 3.2.1.2, we have four elements, milk, tea, sugar and coffee, in the search space. According to the grocery transaction data, the following transaction records were made:

{milk}, {tea}, {sugar}, {coffee}, {milk, tea}, {milk, sugar}, {milk, coffee}, {tea, sugar}, {tea, coffee}, {sugar, coffee}, {milk, tea, sugar}, {milk, tea, coffee}, {milk, sugar, coffee}, {tea, sugar, coffee}, {milk, tea, sugar, coffee}

All these transaction records can be arranged into a tree structure which facilitates the pruning process. The structure is usually fixed, and so is called a fixed-structure tree:

[Figure 3.3: the fixed-structure tree]

Figure 3.3 shows the fixed-structure tree that an ordinary search algorithm follows in top-down order to perform the searching and pruning. Suppose, for example, the algorithm finds that the itemset {sugar, coffee} is not significant (not interesting to the user) and should be pruned.

[Figure 3.4: half-way through the pruning process]

In general, a rule discovery algorithm knows that if the support of a given itemset is below a predefined threshold, then no superset of that itemset can contain a solution; the supersets should therefore also be pruned. In this example, any itemset that contains {sugar} (except the single-item itemset {sugar} itself, which is kept) will be eliminated. These itemsets are: {milk, tea, sugar}, {milk, tea, sugar, coffee}, {milk, sugar}, {milk, sugar, coffee}, {tea, sugar} and {tea, sugar, coffee}. After the pruning, the result looks like this:

[Figure 3.5: the outcome of ordinary pruning]

The purpose of this pruning is to keep the subsequent rule discovery step as simple as possible. As can easily be noticed, the efficiency of doing it this way is very low: only after every node has been searched (coloured in red) and the target reached can a single node be pruned. To achieve the result in Figure 3.5, the ordinary algorithm needs to traverse the fixed-structure tree seven times, because each search eliminates only one node that contains no result, and there are seven nodes containing the element {sugar}, so seven top-down searches are needed.

The search space of this example contains only four conditions (milk, tea, coffee and sugar) to be analysed. When the number of conditions reaches ten thousand, the search space grows exponentially, to 2^10000 [4]. A search space of this size cannot be pruned in this way, so an improved search algorithm is needed. The following shows how OPUS improves on the traditional search sequence by changing the fixed-structure tree into a reordered-structure tree, in order to reduce the number of searches. The idea is simple: reorder the search space so that any node to be pruned is placed before all the nodes that are not to be pruned.

[Figure 3.6: a re-ordered tree structure for OPUS]

As Figure 3.6 shows, the biggest difference is that the tree structure has been reorganised so that all the nodes containing the unwanted element {sugar} have higher priority than the ones without; they are placed before the others. The advantage of this is that when the pruning algorithm searches through the tree to prune the uninteresting nodes, a single search achieves the same result as seven searches of the fixed-structure tree. Efficiency is greatly increased in this way. The outcome in Figure 3.7 is identical to the outcome in Figure 3.5.

[Figure 3.7: the outcome of OPUS pruning]
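A minimal sketch of the superset-pruning step itself, independent of tree ordering (illustrative only):

```python
def prune_supersets(search_space, pruned_itemset):
    """Remove every proper superset of a pruned itemset from the
    search space; the itemset itself is kept, as in Figure 3.4."""
    return [s for s in search_space
            if not (pruned_itemset < s)]

space = [frozenset(s) for s in (
    {"milk"}, {"sugar"}, {"milk", "sugar"},
    {"tea", "sugar"}, {"milk", "tea", "sugar"}, {"milk", "tea"})]
remaining = prune_supersets(space, frozenset({"sugar"}))
print(sorted(map(sorted, remaining)))
# [['milk'], ['milk', 'tea'], ['sugar']]
```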
The detail of the OPUS search algorithm is given in Appendix C.

3.2.3 MORE

3.2.3.1 What is MORE

MORE stands for Mining Optimal Risk pattErn sets. The MORE algorithm discovers risk patterns in data, especially in medical data [15].

Definition of risk patterns: risk patterns are patterns whose local support and relative risk are higher than the user-specified minimum local support and relative risk thresholds, respectively [15].

3.2.3.2 Risk Patterns

The definition of a pattern is similar to the definition of a contrast set: a pattern is a set of attribute-value pairs. For example, {Nationality: Japanese, Gender: Male, Height: [165cm-169cm], Weight: 80kg} is a pattern (contrast set) with four attribute-value pairs.

In the study of risk patterns, we have a new concept called the target attribute, which takes two values: abnormal and non-abnormal [15]. In medical terms, the target attribute value abnormal means that, given a set of attribute-value pairs, the patient is regarded or classified as at risk, e.g. more likely to develop a certain disease. On the other hand, the target attribute value non-abnormal, i.e. normal, means the person is healthy and at low risk. Taking a similar pattern as an example, for people who match P = {Nationality: Japanese, Gender: Female, Height: [165cm-169cm], Weight: [70kg-79kg]}, we may regard the probability of getting heart disease as high, denoted P → a, where the target attribute value a stands for abnormal.

The relative risk (RR) measures the likelihood of abnormal in the cohort with certain attributes compared with the cohort without those attributes. RR is defined as:

$RR(P \rightarrow a) = \frac{\mathrm{prob}(a \mid P)}{\mathrm{prob}(a \mid \neg P)} = \frac{\mathrm{prob}(P, a) / \mathrm{prob}(P)}{\mathrm{prob}(\neg P, a) / \mathrm{prob}(\neg P)} = \frac{\mathrm{supp}(Pa) / \mathrm{supp}(P)}{\mathrm{supp}(\neg P a) / \mathrm{supp}(\neg P)}$

This formula can be memorised easily from a contingency table:

           Abnormal (a)     Non-abnormal (n)   Total
  P        ① prob(P, a)     prob(P, n)         ② prob(P)
  ¬P       ③ prob(¬P, a)    prob(¬P, n)        ④ prob(¬P)
  Total    prob(a)          prob(n)            1

Table 3.3

RR(P → a) = (① × ④) ÷ (③ × ②). The cohort with pattern P is classified as abnormal if the value of RR is greater than 1, and as non-abnormal if the value of RR is smaller than 1 [28].

3.2.3.3 What technique does MORE use?

The principle that the MORE algorithm applies is simple yet efficient. According to Li et al. (2008), ordinary methods first need to go through the whole data set to find all frequent patterns (a frequent pattern is a pattern whose support is higher than the minimum support); they then use relative risk, in place of confidence, to form rules; lastly, post-pruning is applied to remove the huge number of frequent but uninteresting rules. When the minimum support is set low, a large number of rules will be generated, which in turn makes the pruning process take much longer.

The way the MORE algorithm distinguishes itself from other ordinary rule mining algorithms and overcomes this issue is by making use of the anti-monotone property: if {a, b} is infrequent, then all supersets of {a, b} are infrequent. The MORE algorithm relies on a lemma and a corollary to efficiently mine optimal risk pattern sets.

Lemma: if supp(Px¬a) = supp(P¬a), then pattern Px and all its super patterns do not occur in the optimal risk pattern set (Li et al. 2008).

Corollary: if supp(Px) = supp(P), then pattern Px and all its super patterns do not occur in the optimal risk pattern set (Li et al. 2008).
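A minimal sketch of the relative risk computation from the four counts of the contingency table (here as raw counts rather than probabilities; the ratios are the same):

```python
def relative_risk(n_pa, n_p, n_npa, n_np):
    """RR(P -> a) = [supp(Pa)/supp(P)] / [supp(¬Pa)/supp(¬P)].

    n_pa:  cases matching P that are abnormal        (cell 1)
    n_p:   cases matching P                          (cell 2)
    n_npa: cases not matching P that are abnormal    (cell 3)
    n_np:  cases not matching P                      (cell 4)
    """
    return (n_pa / n_p) / (n_npa / n_np)

# Hypothetical cohort: 30 of 100 patients with pattern P are abnormal,
# versus 20 of 200 patients without P.
print(round(relative_risk(30, 100, 20, 200), 2))  # 3.0: P triples the risk
```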
The detail of the MORE algorithm and the proof of the lemma are given in Appendix D.

3.2.3.4 Advantages

According to Li et al. (2008), most association rule mining methods face three challenges: (1) many association rule mining approaches do not suit medical data analysis, because most mining algorithms rely upon confidence and lift, whereas the MORE algorithm uses relative risk; (2) truly interesting rules are overwhelmed by too many uninteresting or trivial rules; (3) when the minimum support is set low, the efficiency of an association mining algorithm becomes very low.

3.2.3.5 Results generated from MORE

The output of MORE is a set of contingency tables (patterns) represented in a tree structure; each table represents a cohort of people with the same pattern.

A recent research project uses MORE to analyse data from Domiciliary Care SA. After running the Li-rule program (a program that uses the MORE algorithm to generate results), over 60 risk patterns were generated, but not all of these patterns are useful or practical. Patterns with a very low odds ratio (OR) are not considered. The odds ratio is a measure of effect size that is particularly important in Bayesian statistics and logistic regression [29]. A low odds ratio means the pattern is not significant enough to distinguish its cohort from ordinary patients. A risk pattern with an odds ratio of 4.34, for example, means that the patients with this specific attribute combination are 4.34 times more likely to be staying for a short period than those without it; a high odds ratio therefore deserves more investigation. In addition, from a set of patterns with similar attribute combinations, only one pattern, or a subset of the similar patterns, is selected. The reason for this is to differentiate between patterns as much as possible: the more the patterns are distinguished from each other, the more informative the result. A risk pattern with a minimum support of 5%, for example, means that the attribute combination in this risk pattern occurs together in 5% of all the records under study; similarly, a minimum support of 10% means that the attribute combination occurs together in 10% of the records.

For example, the table below is one of the contingency tables generated by Li-rule:

Rule 1: Length = 2, Odds Ratio = 11.9, Relative Risk = 1.1922
  Service Frequency: Twice per week
  High blood pressure
Cohort Size = 62, Percentage = 8.84%

Contingency Table 1:

           Pattern   Non-Pattern   Total
           62        536           598
           0         103           103
  Total    62        639           701

Table 3.4

This table tells us that there are 62 people (out of 701, i.e. 8.84% of the studied sample) with the following characteristics: service frequency of twice per week, and high blood pressure. This cohort of people is 11.9 times more likely to stay for less than one year than other people. More discussion is presented in the next two chapters.
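A minimal sketch of the odds ratio computation for a 2×2 table of this kind. Because Table 3.4 contains a zero cell, the sketch applies a standard 0.5 continuity correction so the ratio stays finite; whether Li-rule applies such a correction is not stated here, and the example counts are hypothetical:

```python
def odds_ratio(a, b, c, d, correction=0.5):
    """Odds ratio for a 2x2 table [[a, b], [c, d]]: (a*d) / (b*c).

    If any cell is zero, add `correction` (the Haldane-Anscombe
    adjustment) to every cell so the ratio stays finite."""
    if 0 in (a, b, c, d):
        a, b, c, d = (x + correction for x in (a, b, c, d))
    return (a * d) / (b * c)

# Hypothetical counts: 30 exposed cases, 70 exposed controls,
# 10 unexposed cases, 90 unexposed controls.
print(round(odds_ratio(30, 70, 10, 90), 2))  # 3.86
```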
Chapter 4 Data description

The data used to test the aforementioned algorithms come from Domiciliary Care South Australia. Domiciliary Care SA is a state government organisation that provides services to people with a reduced ability to care for themselves, helping them to stay in their own homes, living close to their loved ones, family and local community. The services provided by Domiciliary Care SA include physical assistance, rehabilitation and personal care, as well as respite and support for carers (Domiciliary Care SA website 2008).

Domiciliary Care SA currently uses the ONI+ assessment (an electronic, telephone-based assessment tool) to measure, or predict, how long a patient will stay. The shorter a patient's length of stay, the less benefit (the "riskier" the patient, in data mining terminology) for Domiciliary Care, because Domiciliary Care cannot provide a thorough service for that client. However, such measures are not precise enough for the purpose of categorising clients by risk. The project will assess the current risk assessment measures and look for new, effective indicators.

4.1 Data

A sample of 783 records was provided for the investigation. The data were stored in an Excel file. Each record represents one client of Domiciliary Care SA and consists of at most 42 attributes; records contain different numbers of attributes because patients suffer from different health conditions.

Data have been supplied for a 36-month period, from 1st July 2005 to 30th June 2008, for Domiciliary Care SA clients who were active and whose episodes closed within this period. Length of stay was calculated using the Episode Open Date and Episode Close Date fields (a sketch of this calculation is given at the end of this section). Only the data that were current for the client when their episode was closed were included; otherwise there would have been multiple lines of data for some clients.

Domiciliary Care SA currently categorises clients into five streams: Basic, Package, Rehabilitation, Dementia and Palliative. Palliative clients have been excluded from this data. Clients are assigned a coordination level depending on their needs. Each client can have multiple Health Condition Codes, but they are not listed in order of importance; these have therefore been listed separately, in no specific order. It is believed within the agency that the average length of stay of a client is approximately 3 years.

For confidentiality reasons, all client numbers have been de-identified and no names have been provided, to protect the identity of Domiciliary Care SA clients.
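As mentioned above, length of stay is derived from the Episode Open Date and Episode Close Date fields. The following is a minimal sketch of that derivation under the thresholds used in this thesis (short = less than one year, long = three years or more); the function name and the handling of stays between one and three years are my own assumptions.

```python
# Derive the target attribute from the two episode date fields (dd/mm/yyyy).
from datetime import datetime

def episode_length_label(open_date: str, close_date: str):
    """Return 1 for a short stay (< 1 year), 3 for a long stay (>= 3 years),
    and None for stays in between, which this study does not target."""
    opened = datetime.strptime(open_date, "%d/%m/%Y")
    closed = datetime.strptime(close_date, "%d/%m/%Y")
    years = (closed - opened).days / 365.25
    if years < 1:
        return 1
    if years >= 3:
        return 3
    return None

print(episode_length_label("01/07/2005", "30/06/2006"))  # 1 (short stay)
print(episode_length_label("01/07/2005", "01/08/2008"))  # 3 (long stay)
```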
4.2 Description of each data field

Appendix E gives a detailed description of each data field. The field names in the Field Name column are listed in the exact order in which the data appear, from left to right. The Field Order of Priority column indicates the importance of each data field: higher-priority fields deserve more investigation, as they may have greater influence on the final result. For convenience of data processing, a code or number is provided to substitute for the actual data; the codes/numbers are listed under the Data code or type column, and their corresponding explanations are given under the Data description column. Here is a snapshot of the actual data:

The data shown above have been pre-processed. Age, in the second column, is a numeric value, so it needs to be discretised into categories. There are four categories for age: [20-76], [77-83], [84-88] and [89-101]. The boundaries of the categories are determined by the number of occurrences, i.e. the categories are evenly populated. The SubAreas column is generated from SuburbCode: each SuburbCode represents a suburb that belongs to one of four areas: North, West, South and East.

There are many columns with number codes, where each number code represents one particular value. For example, Accommodation Type has 22 different number codes: code 1 represents "Private residence – owned", code 2 represents "Private residence – private rental", and so on. Details of each number code are listed in Appendix E. The number code representation makes the data easier for the algorithms to process.

The columns named with numbers represent diseases. The ten most common diseases have been selected for investigation, and the less common diseases have been ignored. For the values of each disease column, "Y" means the patient has the disease, whereas "?" means the patient does not suffer from it.

The last column, EpisodeLength(yr), is the target attribute. A length of stay of less than one year is considered short; a stay of three years or longer is considered long.

Chapter 5 Mining results discussion

5.1 Algorithm MORE

5.1.1 Data preparation for algorithm MORE

Data need to be transformed into the correct formats before running the algorithm. Algorithm MORE requires two files. One is the .data file (Figure 5.1), which, as its name suggests, contains all the data. It is transformed directly from the .xls file into a .csv file, with a comma ',' separating every column. There are also many question marks in the .data file: a question mark means the value in that cell is missing, i.e. it contains a null value. This is a very common situation in data mining research.

The other file is the .name file (Figure 5.2), which lists all the attributes (columns) to be analysed and all their possible values (domains). It is worth noting that I put the target attribute EpisodeLength(yr) at the top, with the values 1 and 3 respectively representing "less than one year" and "greater than three years", so that the algorithm knows which results to generate. We can also write "ignore" after any attribute name that we do not want the algorithm to analyse, for example the ClientNumber and SuburbCode attributes. A sketch of this file preparation is given below.

Figure 5.1 A snapshot of the .data file for both MORE and Magnum Opus

Figure 5.2 The .name file for application Li-rule

5.1.2 Result discussion for algorithm MORE

The application Li-rule, which implements MORE, can compute two kinds of patterns: risk patterns and preventive patterns. If we regard the patients who stay less than 12 months in Domiciliary Care as the risk target and want to know their common characteristics, we can let Li-rule compute risk patterns. By contrast, if we regard the patients who stay longer than three years in Domiciliary Care as the safe target, Li-rule can use its preventive pattern mode to calculate what these patients have in common.
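Referring back to the file preparation described in Section 5.1.1, the sketch below illustrates how the .data and .name files could be produced: age is discretised into the four categories from Section 4.2, missing values are written as '?', the target attribute is placed at the top of the .name file, and unwanted attributes are marked "ignore". The column subset, record layout and file names are assumptions made for illustration; this is not the actual Li-rule tooling.

```python
# Produce miniature .data and .name files in the spirit of Figures 5.1-5.2.
import csv

AGE_BINS = [(20, 76), (77, 83), (84, 88), (89, 101)]  # from Section 4.2

def age_category(age):
    for lo, hi in AGE_BINS:
        if lo <= age <= hi:
            return f"[{lo}-{hi}]"
    return "?"                                # out of range / missing

def write_name_file(path="domcare.name"):
    lines = [
        "EpisodeLength(yr): 1, 3.",           # target attribute at the top
        "ClientNumber: ignore.",              # attributes excluded from analysis
        "SuburbCode: ignore.",
        "Age: [20-76], [77-83], [84-88], [89-101].",
        "Gender: M, F.",
    ]
    with open(path, "w") as f:
        f.write("\n".join(lines) + "\n")

def write_data_file(records, path="domcare.data"):
    """records: dicts keyed by attribute name; '?' marks missing values."""
    with open(path, "w", newline="") as f:
        w = csv.writer(f)
        for r in records:
            w.writerow([r.get("EpisodeLength", "?"),
                        age_category(r["Age"]) if r.get("Age") else "?",
                        r.get("Gender", "?")])

write_name_file()
write_data_file([{"Age": 85, "Gender": "F", "EpisodeLength": 1},
                 {"Age": 70, "Gender": "M", "EpisodeLength": 3}])
```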
MORE Test 1: minimum support 0.1, maximum LHS size 4 and RHS EpisodeLength = 3 (years)

Rule 1: 84 people have the following characteristics (11.98% of the studied sample):

ServiceTypeProvidedToClientCode = 1 (Domestic Assistance)
ServiceFrequencyCode = 11 (Fortnightly)

This cohort of people is 4.14 times more likely to stay longer than three years than other people.

Rule 2: 30 people have the following characteristics (4.28% of the studied sample):

CoordinationLevel = 2 (Medium)
Stream = 2 (Packaged care)
ServiceFrequencyCode = 11 (Fortnightly)

This cohort of people is 6.63 times more likely to stay longer than three years than other people.

Rule 3: 22 people have the following characteristics (3.14% of the studied sample):

Ages ranging from 80-89
CarerAvailabilityCode = 2 (Has no carer)
CoordinationLevel = 3 (Low)
ServiceTypeProvidedToClientCode = 1 (Domestic Assistance)

This cohort of people is 6.38 times more likely to stay longer than three years than other people.

Rule 4: 27 people have the following characteristics (3.85% of the studied sample):

Ages ranging from 80-89
Stream = 1 (Basic)
ServiceTypeProvidedToClientCode = 1 (Domestic Assistance)

This cohort of people is 6.0 times more likely to stay longer than three years than other people.

Rule 5: 28 people have the following characteristics (3.99% of the studied sample):

ServiceTypeProvidedToClientCode = 1 (Domestic Assistance)
CarerAvailabilityCode = 2 (Has no carer)
AccommodationTypeCode = 3 (Private residence – public rental)

This cohort of people is 5.61 times more likely to stay longer than three years than other people.

MORE Test 2: minimum support 0.1, maximum LHS size 4 and RHS EpisodeLength = 1 (year)

Rule 1: 62 people have the following characteristics (8.84% of the studied sample):

ServiceFrequencyCode = 5 (Twice per week)
High blood pressure (code 921)

This cohort of people is 11.9 times more likely to stay shorter than one year than other people.

Rule 2: 95 people have the following characteristic (13.55% of the studied sample):

ServiceTypeProvidedToClientCode = 5 (Social Support)

This cohort of people is 19.0 times more likely to stay shorter than one year than other people.

Rule 3: 64 people have the following characteristics (9.13% of the studied sample):

ServiceFrequencyCode = 5 (Twice per week)
High blood pressure (code 921)

This cohort of people is 11.9 times more likely to stay shorter than one year than other people.

Rule 4: 63 people have the following characteristics (8.99% of the studied sample):

CarerAvailabilityCode = 1 (Has carer)
Dementia – Alzheimer's disease (code 500)

This cohort of people is 5.7 times more likely to stay shorter than one year than other people.

Rule 5: 111 people have the following characteristics (15.83% of the studied sample):

CoordinationLevel = 1 (High)
Stream = 201 (Dementia)
This cohort of people is 5.4 times more likely to stay shorter than one year than other people.

I have selected only the top five rules from each of the risk pattern and preventive pattern modes; in fact, over 100 rules (and sub-rules) can be generated in each mode. The rules are listed in descending order of relative risk (RR), with the highest first.

5.2 Algorithm Opus

The application Magnum Opus (demo version) is used to test the results generated by the Opus algorithm. The demo version of Magnum Opus is sufficient for this testing, as we only have 783 records to analyse and the demo version can handle up to 1000 records.

5.2.1 Data preparation for algorithm Opus

The process of data preparation for Opus is very similar to the one employed for Li-rule: it also requires two files, a .nam file and a .data file. Figure 5.3 below illustrates the .nam file, which contains all the attributes to be analysed. The first difference from the .name file in Li-rule is that here the target attribute (episode length) needs to be written down; the second difference is that it does not require a '.' at the end of each attribute line. The .data file is identical to the one used for Li-rule.

Figure 5.3 The .nam file for application Magnum Opus Demo

The following shows the steps used to generate the Opus results from the same Domiciliary Care data source.

5.2.2 Result discussion for algorithm Opus

After successfully loading the .nam file and the .data file into Magnum Opus, the user interface shows the loaded attributes. By default, every attribute with every value is selected on both the LHS and the RHS; however, the user can decide which attributes may appear on each side. I searched for rules using different correlation measures and different minimum supports. For the filter method, I chose Filter Out Insignificant, because the insignificant filter is useful for discarding rules and itemsets that are very likely to be spurious [30].

Opus test 1: minimum support 0.1, maximum size 4 and RHS EpisodeLength = 3. Sixteen rules were found.

Opus test 2: minimum support 0.05, maximum size 4 and RHS EpisodeLength = 3. Because every rule found with minimum support 0.1 must also appear among the rules found with minimum support 0.05, I have excluded the rules already generated in test 1.

Opus test 3: minimum support 0.05, maximum size 4 and RHS EpisodeLength = 1. Only 5 rules were found.

It is surprising to note that most of the rules generated by Magnum Opus are general and simple. Take Opus test 1 as an example: the first and second rules are Stream=2 → EL=3 and Gender=F → EL=3, and the third rule is simply the combination of the first two, Stream=2 & Gender=F → EL=3.
All three of these rules satisfy the criteria; however, this is not what users really want to see, because it is obvious that if rule three is true, rules one and two must be true as well, so rules one and two should be pruned. Alternatively, Magnum Opus could modify its presentation so that rules one and two are placed as sub-rules of rule three.

If I change the Search by setting from Support to Leverage, Lift, Strength or another measure, the result set is not affected, apart from the order in which the rules are listed.

Chapter 6 Conclusion and future work

In this research, I have compared the mining algorithms STUCCO, Magnum Opus and MORE, identified their differences, and considered how these differences can benefit each other. Three algorithms representing different mining techniques were selected for comparison: STUCCO, representing χ2 (chi-square); Magnum Opus, representing lift and leverage; and MORE, representing relative risk. χ2, lift and leverage were described in Section 3.1.2; relative risk was explained in Section 3.2.3.2; and all_confidence and cosine are described in Appendix B.

The results generated by different mining algorithms depend greatly on the correlation measures they use. As the discussion in the previous chapter shows, the Magnum Opus results differ from the MORE results in many ways, both in the number of rules generated and in the quality and interestingness of the rules. Magnum Opus generates only a few rules, and most of them are too general and simple. On the other hand, most of the rules generated by MORE contain attributes that cannot be found in the Magnum Opus results; even when I restricted the search to the specific attributes appearing in a MORE rule, Magnum Opus was unable to generate the same rule.

Future work on improving the MORE algorithm

The ideal steps for the data analysis are: first use all_confidence or cosine for the analysis, then inspect whether the results are weakly positively or negatively correlated; the next step is to apply additional tests such as lift, leverage or chi-square to obtain a more precise result [22].

MORE could benefit from these techniques in two ways. First, STUCCO could be run beforehand to obtain a suitable support threshold, saving the time and effort of guessing a proper support; however, I was unable to obtain a working implementation of the STUCCO algorithm during the research, so I can only assume what the outcome would be. Second, all_confidence or cosine could be applied to check whether the given dataset is positively or negatively correlated; this step is particularly important if the dataset is big and contains null-transactions.

When it comes to the post-pruning stage, the pruning strategy from the OPUS algorithm can, in theory, be used in MORE, because what MORE generates is also a tree-structured result represented by contingency tables. The results can be sorted into a specific order in which nodes containing uninteresting items are placed first, so that they can be pruned at the first stage. However, as demonstrated in the three Opus tests in the last chapter, Magnum Opus failed to show us the rules we want: the results contain a lot of redundant and trivial rules, which is undesirable.
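One concrete form such post-filtering could take is sketched below: my own illustration (not part of MORE or Magnum Opus) that drops a general rule whenever a strictly more specific rule with the same RHS is at least as strong, so that specific rules subsume their generalisations, as argued for the three Opus rules above. The RR values in the example are invented for illustration.

```python
# Post-filter: remove a rule if a strictly more specific rule (superset LHS,
# same RHS) has relative risk at least as high.
def filter_general_rules(rules):
    """rules: list of (lhs, rr) pairs sharing one RHS; lhs is a frozenset."""
    kept = []
    for lhs, rr in rules:
        subsumed = any(lhs < other_lhs and other_rr >= rr
                       for other_lhs, other_rr in rules)
        if not subsumed:
            kept.append((lhs, rr))
    return kept

rules = [(frozenset({"Stream=2"}), 2.0),
         (frozenset({"Gender=F"}), 1.8),
         (frozenset({"Stream=2", "Gender=F"}), 2.4)]
print(filter_general_rules(rules))  # only the combined rule survives
```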
The conclusion is that, in terms of evaluating and analysing the medical data, Li-rule outperforms Magnum Opus, which suggests that relative risk should be the first choice of measure for generating patterns in medical research.

This research could be continued by obtaining a working version of a STUCCO application to run on the given Domiciliary Care data. Secondly, a more functional and comprehensive version of the Magnum Opus application may help to perform a further investigation of the data, which may bring different results.

Appendix A – Annotated Bibliography

Adhikari, A & Rao, PR 2008, 'Mining conditional patterns in a database', Pattern Recognition Letters, vol. 29, no. 10, pp. 1515-1523.
The authors, researchers at SP Chowgule College and Goa University, introduce the notion of a conditional pattern in a database. Conditional patterns are interesting and useful for solving many problems. Normally, my data mining work is based on frequent itemsets; this paper analyses the characteristics of a database in more detail by mining various patterns with respect to frequent itemsets, using a proposed algorithm, and conditional patterns in the database. In addition, it studies associations among items that are not immediately available from frequent itemsets and association rules. This is a very good attempt, as other interesting patterns may be found in the hidden itemsets.

Bastide, Y, Pasquier, N, Taouil, R, Stumme, G & Lakhal, L 2000, 'Mining Minimal Non-redundant Association Rules Using Frequent Closed Itemsets', in Computational Logic — CL 2000, pp. 972-986.
This paper mainly discusses how the authors use the closure of the Galois connection and two new bases for association rules to identify association rules. As there is no free copy of this paper available on the internet, I cannot judge whether it is useful or not.

Bay, SD & Pazzani, MJ 2001, 'Detecting Group Differences: Mining Contrast Sets', Data Mining and Knowledge Discovery, vol. 5, pp. 213-246.
This paper, written by S. D. Bay and M. J. Pazzani, researchers from the University of California, tries to find contrast groups, i.e. conjunctions of attributes and values that differ meaningfully in their distribution across groups. This allows us to answer questions like "How are History and Computer Science students different?" and "What has changed from 1993 through 1998?". The authors use the mining algorithm STUCCO for the discovery of contrast sets. This paper is quite relevant to my research and is worth reading in more detail.

Charles, EM 2006, 'Receiver Operating Characteristic Analysis: A Tool for the Quantitative Evaluation of Observer Performance and Imaging Systems', vol. 3, no. 6, pp. 413-422.
The author, Charles E. Metz, provides detailed information about receiver operating characteristic (ROC) analysis; ROC is very important knowledge in the data mining field. Unfortunately, this paper is not free on the internet, but similar information can be obtained from Wikipedia to compensate.

Cortes, C & Pregibon, D 2001, 'Signature-Based Methods for Data Streams', Data Mining and Knowledge Discovery, vol. 5, pp.
167-182.
Although this paper involves data mining technology, it does not really emphasise using or developing data mining tools or algorithms to solve a problem; instead, it is more about data streams and signatures. So I think this paper is irrelevant to my research.

Giudici, P & Castelo, R 2001, 'Association Models for Web Mining', Data Mining and Knowledge Discovery, vol. 5, pp. 183-196.
This paper is more about graphical models using web mining technology. It has little information that can be adopted.

Giudici, P, Heckerman, D & Whittaker, J 2001, 'Statistical Models for Data Mining', Data Mining and Knowledge Discovery, vol. 5, pp. 163-165.
This paper may be interesting and useful for my research topic, as it is about statistical modelling in data mining, but again it is unfortunately unavailable on the internet.

Han, J & Pei, J 2000, 'Mining frequent patterns by pattern-growth: methodology and implications', SIGKDD Explorations Newsletter, vol. 2, no. 2, pp. 14-20.
The authors, Jiawei Han and Jian Pei at Simon Fraser University, introduce a new methodology, frequent pattern growth, which mines frequent patterns in large databases without candidate generation. The intended audience could be anyone interested in mining frequent patterns by other methods, and the paper could be a very good reference for comparing different mining algorithms. The method has superior efficiency over other methods, which is desirable, and thus this is a very useful and relevant paper for my research topic.

Han, J, Pei, J, Yin, Y & Mao, R 2004, 'Mining Frequent Patterns without Candidate Generation: A Frequent-Pattern Tree Approach', Data Mining and Knowledge Discovery, vol. 8, no. 1, pp. 53-87.
This paper is similar to 'Mining frequent patterns by pattern-growth: methodology and implications'. It introduces an approach that overcomes the inefficiency of candidate set generation-and-test, proposing a novel frequent-pattern tree (FP-tree) structure for mining the complete set of frequent patterns by pattern growth. This paper is very close to my research topic, so it is obviously relevant.

Hanley, J & McNeil, B 1982, 'The meaning and use of the area under a receiver operating characteristic (ROC) curve', Radiology, vol. 143, no. 1, pp. 29-36.
This paper can be regarded as a complement to 'Receiver Operating Characteristic Analysis: A Tool for the Quantitative Evaluation of Observer Performance and Imaging Systems'. It provides very essential data mining knowledge. Although it may not be closely related to my research, it should be read by anyone who wishes to do research in the data mining field.

Hu, J & Mojsilovic, A 2007, 'High-utility pattern mining: A method for discovery of high-utility item sets', Pattern Recognition, vol. 40, no. 11, pp. 3317-3324.
This paper is about an algorithm for frequent itemset mining that identifies high-utility item combinations. Unlike traditional association rule mining techniques, this algorithm finds segments of data, defined through combinations of a few items (rules), which satisfy certain conditions as a group and maximise a predefined objective function. I do not think this is very relevant to my research, but it is still worth going through.
Igor, K 2001, 'Machine learning for medical diagnosis: history, state of the art and perspective', Artificial Intelligence in Medicine, vol. 23, no. 1, pp. 89-109.
The paper provides an overview of the development of intelligent data analysis in medicine from a machine learning perspective: a historical view, a state-of-the-art view, and a view on future trends in this subfield of applied artificial intelligence. It does not focus on one specific topic, but it is a very good paper for gaining basic knowledge about data mining, e.g. the naive Bayesian classifier, neural networks and decision trees. However, there is no free copy available on the internet, so it may be disregarded.

Li, J & Wong, L 2003, 'Using Rules to Analyse Bio-medical Data: A Comparison between C4.5 and PCL', in Advances in Web-Age Information Management, pp. 254-265.
This paper provides a good comparison between two data mining algorithms: C4.5 and PCL, a new algorithm invented by the authors that is intended to overcome the weaknesses of C4.5. This paper will be helpful to my research, as I will be comparing the efficiency of different kinds of algorithms.

Liu, B, Hsu, W & Ma, Y 1999, 'Mining association rules with multiple minimum supports', in Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, California, United States.
This paper is very important and will be very helpful to my research. It discusses the issue of setting a proper minimum support in data mining. The general problem, called the rare item problem, is that if the minimum support is set too high, some rules will not be found; if it is set too low, too many rules, including useless ones, are found. The paper proposes a novel technique that allows the user to specify multiple minimum supports to reflect the nature of the items and their varied frequencies in the database.

Loekito, E & Bailey, J 2006, 'Fast mining of high dimensional expressive contrast patterns using zero-suppressed binary decision diagrams', in Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Philadelphia, PA, USA.
This paper aims to solve the problem of datasets with a large number of dimensions. However, I do not think it is very closely related to my project, as my task is to analyse generated results rather than the result-generating process itself.

Lucas, P 2004, 'Bayesian analysis, pattern analysis, and data mining in health care', Current Opinion in Critical Care, vol. 10, no. 5, pp. 399-403.
From its abstract, this paper seems relevant to my research, as it discusses the current role of data mining and Bayesian methods in biomedicine and health care, in particular critical care, and how machine learning techniques are being used to solve biomedical and health-care problems. Unfortunately, it is not freely available on the internet.

Mitchell, T, Buchanan, B, DeJong, G, Dietterich, T, Rosenbloom, P & Waibel, A 1990, 'Machine Learning', Annual Review of Computer Science, vol. 4, no. 1, pp.
417-433.
Machine learning is a fundamental subject of data mining, and although this journal article is not specifically related to my research, it is worth reading through to understand the important concepts in it.

Mitchell, TM 1999, 'Machine learning and data mining', Communications of the ACM, vol. 42, no. 11, pp. 30-36.
This journal article provides general knowledge about machine learning and the relationship between machine learning and data mining. It has no specific topic of emphasis, but it is ideal for anyone who wants to learn the fundamentals of data mining.

Novak, PK & Lavrač, N 2009, 'Supervised Descriptive Rule Discovery: A Unifying Survey of Contrast Set, Emerging Pattern and Subgroup Mining', p. 27.
The authors contribute a new understanding of data mining by presenting a unified terminology, by explaining the apparent differences between the learning tasks as variants of a single supervised descriptive rule discovery task, and by exploring the apparent differences between the approaches. This paper is required for my research and needs careful reading to understand concepts such as contrast sets and subgroup mining.

Ordonez, C 2006, 'Association rule discovery with the train and test approach for heart disease prediction', IEEE Transactions on Information Technology in Biomedicine, vol. 10, no. 2, pp. 334-343.
This paper, written by Carlos Ordonez, introduces an algorithm that uses search constraints to reduce the number of rules, searches for association rules on a training set, and finally validates them on an independent test set. It uses heart disease prediction as an example throughout, which is really desirable for my research; the method to be used in my research can be compared with the algorithm mentioned here.

Ordonez, C 2006, 'Comparing association rules and decision trees for disease prediction', in Proceedings of the International Workshop on Healthcare Information and Knowledge Management, Arlington, Virginia, USA.
The author, Carlos Ordonez from the University of Houston, compares association rules and decision trees with respect to accuracy, interpretability and applicability in the context of heart disease prediction. This paper is closely related to my research topic. In addition, it uses simple language and examples to illustrate the constraints of the different data mining techniques used.

Ordonez, C, Ezquerra, N & Santana, C 2006, 'Constraining and summarizing association rules in medical data', Knowledge and Information Systems, vol. 9, no. 3, pp. 1-2.
This is also very useful work in comparison with the previous ones, as it focuses on the medical domain, where data sets are generally high-dimensional and small. It also discusses the disadvantages of mining association rules in a high-dimensional data set. This paper is closely related to the research I have been doing.

Quinlan, JR 1986, 'Induction of decision trees', Machine Learning, vol. 1, no. 1, pp. 81-106.
Although this paper does not focus on the medical field, it summarises an approach to synthesising decision trees that has been used in a variety of systems, and it describes one such system, ID3, in detail.
Results from recent studies show ways in which the methodology can be modified to deal with information that is noisy and/or incomplete. It is useful material for learning about decision trees, a very important methodology in the data mining field.

Rindfleisch, T 1997, 'Privacy, information technology, and health care', Communications of the ACM, vol. 40, no. 8, pp. 92-100.
From its abstract, this paper is more about health care and information security management than about data mining, so it is not a desirable paper to read.

Turner, DA 1978, 'An Intuitive Approach to Receiver Operating Characteristic Curve Analysis', Journal of Nuclear Medicine, vol. 19, no. 2, pp. 213-220.
This paper, written by David A. Turner, gives a very detailed explanation of the use of ROC curves and also compares ROC curves with sensitivity, specificity and percentage accuracy. It is a very good source for learning these concepts and will be useful for my research as well.

Viveros, MS, Nearhos, JP & Rothman, MJ 1996, 'Applying Data Mining Techniques to a Health Insurance Information System', in Proceedings of the 22nd International Conference on Very Large Data Bases, pp. 286-294.
This article focuses on a health insurance system, which may be related to my research, but it has no free copy available on the internet, so it has been disregarded.

Webb, GI 2000, 'Efficient search for association rules', in Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Boston, Massachusetts, United States.
This paper discusses the efficiency of the Apriori algorithm, which can impose large computational overheads when the number of frequent itemsets is very large. The author presents an algorithm for association rule analysis based on the efficient OPUS search algorithm. This paper will be relatively helpful to my research.

Webb, GI & Zhang, S 2005, 'K-Optimal Rule Discovery', Data Mining and Knowledge Discovery, vol. 10, no. 1, pp. 39-79.
This paper mainly focuses on the efficiency comparison across a wide range of k-optimal rule discovery tasks, but it is not directly related to my research topic.

Zeng, Z, Wang, J & Zhou, L 2008, 'Efficient Mining of Minimal Distinguishing Subgraph Patterns from Graph Databases', in Advances in Knowledge Discovery and Data Mining, pp. 1062-1068.
This paper studies the problem of mining the complete set of minimal DGPs (distinguishing graph patterns) with any number of positive graphs and arbitrary positive and negative support. My supervisor requires this paper to be read through.

Appendix B – Two other correlation measures: all_confidence and cosine

This appendix describes two more useful correlation measures: all_confidence and cosine.

all_confidence

Assuming an itemset X contains k items, X = {i1, i2, ..., ik}, then according to Han & Kamber 2006, all_confidence is defined as:

all_conf(X) = sup(X) / max_item_sup(X) = sup(X) / max{sup(ij) | ij ∈ X}

where max{sup(ij) | ij ∈ X}, called the max_item_sup of itemset X, is the support of the item with the highest support among all the items in X. The all_confidence of X is the minimal confidence among the set of rules ij → (X − ij), where (X − ij) denotes itemset X without the item ij.

cosine

Given two itemsets A and B, the cosine correlation measure of A and B is defined as:

cosine(A, B) = P(A ∪ B) / √(P(A) × P(B)) = sup(A ∪ B) / √(sup(A) × sup(B))

The formula of cosine is similar to the formula of lift, except for one important difference: cosine takes the square root of P(A) × P(B) where lift does not. The square root ensures that the cosine value is not influenced by the total number of transactions, i.e. the total number of itemsets in the whole database being analysed (Han & Kamber, 2006).
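Both measures are simple to compute over a transaction database, as the following sketch shows; the toy transactions are illustrative only.

```python
# all_confidence and cosine over a toy transaction database.
from math import sqrt

transactions = [{"a", "b"}, {"a", "b", "c"}, {"a"}, {"b", "c"}, {"a", "b"}]

def sup(itemset):
    """Support: the fraction of transactions containing the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def all_conf(itemset):
    """all_conf(X) = sup(X) / max{sup({i}) | i in X}."""
    return sup(itemset) / max(sup({i}) for i in itemset)

def cosine(a, b):
    """cosine(A, B) = sup(A u B) / sqrt(sup(A) * sup(B))."""
    return sup(a | b) / sqrt(sup(a) * sup(b))

print(round(all_conf({"a", "b"}), 3))  # 0.75: sup = 0.6, max item sup = 0.8
print(round(cosine({"a"}, {"b"}), 3))  # 0.75: 0.6 / sqrt(0.8 * 0.8)
```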
Appendix C – OPUS Algorithms

Algorithm: OPUS_AR(CurrentLHS, AvailableLHS, AvailableRHS)
com CurrentLHS is the set of conditions in the LHS of the rule currently being considered.
com AvailableLHS is the set of conditions that may be added to the LHS of rules to be explored below this point.
com AvailableRHS is the set of conditions that may appear on the RHS of a rule in the search space at this point and below.

1. SoFar := {}
2. FOR EACH P in AvailableLHS
   (a) NewLHS := CurrentLHS ∪ {P}
   (b) AvailableLHS := AvailableLHS − {P}
   (c) IF pruning rules cannot determine that, for all x ⊆ AvailableLHS and all y ∈ AvailableRHS, x ∪ NewLHS → y is not credible, THEN
       i. NewAvailableRHS := AvailableRHS
       ii. FOR EACH Q in AvailableRHS
           A. IF credible(NewLHS → Q) THEN record NewLHS → Q
           B. IF pruning rules determine that, for all non-empty x ⊆ AvailableLHS, x ∪ NewLHS → Q is not credible, THEN NewAvailableRHS := NewAvailableRHS − {Q}
       iii. IF NewAvailableRHS ≠ {} THEN OPUS_AR(NewLHS, SoFar, NewAvailableRHS)
       iv. SoFar := SoFar ∪ {P}

Appendix D – Background knowledge relating to algorithm MORE

Algorithm MORE
Input: data set D, the minimum (local) support λ in the abnormal class a (set by the user), and the minimum relative risk threshold θ.
Output: optimal risk pattern set R
Global data structure: l-pattern sets for l ≥ 1 (an l-pattern contains l attribute-value pairs)

1. Set R = Ø
2. Count the support of 1-patterns in the abnormal class
3. Generate(1-pattern set)
4. Select risk patterns and add them to R
5. New pattern set ← Generate(2-pattern set)
6. WHILE the new pattern set is not empty
7.   Count the supports of the candidates in the new pattern set
8.   Prune(new pattern set)
9.   Add patterns with relative risk greater than θ to R
10.  Prune remaining superfluous patterns in R
11.  New pattern set ← Generate(next level pattern set)
12. Return R

Here Generate and Prune are functions:

Function Generate((l+1)-pattern set)
1. Let the (l+1)-pattern set be empty
2. FOR each pair of patterns S(l−1)p and S(l−1)q in the l-pattern set
3.   Insert candidate S(l−1)pq into the (l+1)-pattern set
4.   FOR all l-sub-patterns Sl ⊂ S(l−1)pq
5.     IF Sl does not exist in the l-pattern set
6.     THEN remove candidate S(l−1)pq
7. Return the (l+1)-pattern set

Function Prune((l+1)-pattern set)
1. FOR each pattern S in the (l+1)-pattern set
2.   IF supp(Sa)/supp(a) ≤ λ THEN remove pattern S
3.   ELSE IF there is a sub-pattern S′ in the l-pattern set such that supp(S′) = supp(S) or supp(S′¬a) = supp(S¬a)
4.   THEN remove pattern S
5. Return

Proof of the lemma
Let PQx be a proper super pattern of PQ; note that PQx = Px and PQ = P when Q = Ø. To prove the lemma, we need to show that RR(PQx → a) ≤ RR(PQ → a).
From supp(P¬a) = supp(Px¬a) it can be deduced that supp(PQ¬a) = supp(PQx¬a) for any Q. Then:

RR(PQ → a) = [supp(PQa)/supp(PQ)] ÷ [supp(¬(PQ)a)/supp(¬(PQ))]
           = conf(PQ → a) / conf(¬(PQ) → a)
           ≥ conf(PQx → a) / conf(¬(PQ) → a)      by step (1)
           ≥ conf(PQx → a) / conf(¬(PQx) → a)     by step (2)
           = RR(PQx → a)

Proof of step (1), conf(PQ → a) ≥ conf(PQx → a): note that f(y) = y/(y + α) monotonically increases with y for any constant α > 0, and that supp(PQ) ≥ supp(PQx) > 0. Hence:

conf(PQ → a) = supp(PQa)/supp(PQ)
             = supp(PQa) / (supp(PQa) + supp(PQ¬a))
             = supp(PQa) / (supp(PQa) + supp(PQx¬a))
             ≥ supp(PQxa) / (supp(PQxa) + supp(PQx¬a))
             = conf(PQx → a).

Proof of step (2), conf(¬(PQx) → a) ≥ conf(¬(PQ) → a): from supp(PQ¬a) = supp(PQx¬a) we know that supp((PQ)¬x¬a) = 0, and hence supp((PQ)¬xa) = supp((PQ)¬x). Therefore:

conf(¬(PQx) → a) = supp(¬(PQx)a) / supp(¬(PQx))
                 = [supp(¬(PQ)a) + supp((PQ)¬xa)] / [supp(¬(PQ)) + supp((PQ)¬x)]
                 = [supp(¬(PQ)a) + supp((PQ)¬x)] / [supp(¬(PQ)) + supp((PQ)¬x)]   (because supp((PQ)¬x¬a) = 0)
                 ≥ supp(¬(PQ)a) / supp(¬(PQ))
                 = conf(¬(PQ) → a).

From this lemma and its corollary, if the support of the pattern (itemset) {a, b, ¬c} is equal to the support of the pattern {a, ¬c}, we can be certain that {a, b, ¬c} and all its super patterns (e.g. {a, b, d, ¬c} and {a, b, d, e, ¬c}) will not occur in the optimal risk pattern set and should be pruned (Li et al. 2008).

Appendix E – Data Field Description
Client Number (priority not given): Unique client identifier. Type: 6 digit number.

Age (priority 3): Age in years when episode was closed. Type: number.

Gender (priority 6): Type: text (Male/Female).

Suburb (priority 7): Suburb where the client resides. Type: text.

Episode Open Date (priority 1): Date that this current episode opened. Type: dd/mm/yyyy. The episode dates are used to calculate length of stay.

Episode Close Date (priority 2): Date that this current episode closed. Type: dd/mm/yyyy.

Pension Type (priority 12): Type of pension this client receives. Pension Type Codes: 1 = Aged; 2 = DVA; 2.1 = DVA – Gold; 2.2 = DVA – White; 2.3 = DVA – Spouse; 3 = Disability; 4 = Carer; 5 = Unemployment Related; 6.1 = Widows; 6.2 = Sole Parent; 6.3 = Sickness; 6.4 = Workers Compensation; 6.5 = Part Pension; 6.6 = Blind Pension; 6.7 = Health Care Card; 7 = No government pension; 7.1 = Superannuation; 7.2 = Overseas pension; 99 = Not stated.

Carer Availability (priority 5): Does the client have a carer available? Carer Availability Codes: 1 = Has carer; 2 = Has no carer; 9 = Not stated; 0 = Not applicable.

Carer Residency (priority 4): Does this carer reside with the client? If the client has no carer (as per Carer Availability), this field will be blank. Carer Residency Codes: 1 = Co-resident carer; 2 = Non-resident carer.

Carer Relationship (priority 16): What is the relationship of this carer to the client? Carer Relationship Codes: 1 = Wife/female partner; 2 = Husband/male partner; 3 = Mother; 4 = Father; 5 = Daughter; 6 = Son; 7 = Daughter-in-law; 8 = Son-in-law; 9 = Other relative/female; 10 = Other relative/male; 11 = Friend/Neighbour – Female; 12 = Friend/Neighbour – Male; 13 = Private employee; 99 = Not stated.

Usual Living Arrangements (priority 17): Usual living arrangements of the client. Usual Living Arrangement Codes: 1 = Lives alone; 2 = Lives with family; 3 = Lives with others; 9 = Not stated; 0 = Not applicable.

Fee Waiver Status (priority 8): Whether the client is eligible for a waiver. Codes: N = No Waiver; Y = Waiver.

Fee Waiver Reason (priority 9): The reason a financial waiver has been granted. Waiver Reason Codes: 1 = Financial Waiver – ongoing; 2 = High Risk; 3 = Program exemption; 5 = Deceased/Closed; 6 = Financial (RDNS); 7 = Financial (Short term); 8 = Financial (Health Cover); 9 = No Waiver.

Fee Waiver Effective Date (priority not given): Date the waiver was effective from. Type: dd/mm/yyyy.

Fee Waiver Expiry Date (priority not given): Date the waiver was effective to. Type: dd/mm/yyyy.

Coordination Level (priority 11): Coordination level that this client requires. Codes: 1 = High; 2 = Medium; 3 = Low. Coordination level start date and end date: dd/mm/yyyy.

Stream (priority 13): How Domiciliary Care SA has streamed this client. Codes: 01 = Basic; 02 = Packaged Care; 0201 = Dementia; 0202 = Rehabilitation. Stream start date and end date: dd/mm/yyyy.

Services Provided to Client (priority 14): Personal Care, Domestic Assistance, Respite, Social Support, Recovery Support, Dementia Day Program. Service start date and end date: dd/mm/yyyy.

Service Frequency (priority 15): Service Frequency Codes: 1 = 1/day x 5 days per week; 2 = 2/day x 5 days per week; 3 = 3/day x 5 days per week; 4 = Once per week; 5 = Twice per week; 6 = 3 times per week; 7 = 1/day x 7 days per week; 8 = 2/day x 7 days per week; 9 = 3/day x 7 days per week; 10 = 4 times per week; 11 = Fortnightly; 13 = Once off visit; 14 = Full day; 15 = Once every 4 weeks; 98 = Other.

Accommodation Type (priority 18): Accommodation Type Codes: 1 = Private residence – owned; 2 = Private residence – private rental; 3 = Private residence – public rental; 4 = Private residence – mobile home; 5 = Independent living unit within retirement village; 6 = Boarding house; 7 = Emerg/transitional accommodation; 8 = Supported living facility; 9 = Supported accommodation facility; 10 = Residential aged care facility; 11 = Mental health care facility; 12 = Public place/temporary shelter; 13 = Private residence rented from the Aboriginal community; 14 = Temporary Aboriginal shelter; 15 = Hospital; 16 = Extended care/rehab facility; 17 = Palliative care facility/hospice; 18 = Not applicable; 19 = Other; 21 = Private residence – family member; 99 = Not stated.

Health Condition Codes (priority 10): Type: text. See separate listing.
References

[1] Ramamohanarao, K, Bailey, J & Fan, H 2005, 'Efficient Mining of Contrast Patterns and Their Applications to Classification', IEEE Computer Society.
[2] Bay, SD & Pazzani, MJ 2001, 'Detecting Group Differences: Mining Contrast Sets', Data Mining and Knowledge Discovery, vol. 5, pp. 213-246.
[3] Bailey, J & Dong, G 2007, 'Contrast Data Mining: Methods and Applications', IEEE ICDM, pp. 28-31.
[4] Webb, GI, Butler, S & Newlands, D 2003, 'On detecting differences between groups', ACM, Washington, D.C.
[5] Dong, G & Li, J 1999, 'Efficient mining of emerging patterns: discovering trends and differences', ACM, San Diego, California, United States.
[6] Dong, G, Zhang, X, Wong, L & Li, J 1999, 'CAEP: Classification by Aggregating Emerging Patterns', technical report, March 1999.
[7] Webb, GI 2000, 'Efficient search for association rules', ACM, Boston, Massachusetts, United States.
[8] Quinlan, JR 1993, C4.5: Programs for Machine Learning, Morgan Kaufmann, San Mateo, CA.
[9] Ruggles, S & Sobek, M 1997, Integrated Public Use Microdata Series: Version 2.0, <http://www.ipums.umn.edu/>.
[10] Agrawal, R, Imielinski, T & Swami, A 1993, 'Mining associations between sets of items in massive databases', in Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 207-216.
[11] Davies, J & Billman, D 1996, 'Hierarchical categorization and the effects of contrast inconsistency in an unsupervised learning task', in Proceedings of the Eighteenth Annual Conference of the Cognitive Science Society, p. 750.
[12] Webb, GI 1995, 'OPUS: An efficient admissible algorithm for unordered search', Journal of Artificial Intelligence Research, vol. 3, pp. 431-465.
[13] Ordonez, C, Omiecinski, E, de Braal, L, Santana, C & Ezquerra, N 2001, 'Mining constrained association rules to predict heart disease', in IEEE ICDM Conference, pp. 433-440.
[14] Domiciliary Care SA 2008, South Australia, viewed 3 November 2008, <http://www.domcare.sa.gov.au/>.
[15] Li, J, Fu, A & Fahey, P 2009, 'Efficient Discovery of Risk Patterns in Medical Data', Artificial Intelligence in Medicine, vol. 45, pp. 77-89.
[16] Petrus, K, Li, J & Fahey, P 2007, 'Comparing Decision Tree and Optimal Risk Pattern Mining for Analysing Emergency Ultra Short Stay Unit'.
[17] Ordonez, C 2006, 'Association rule discovery with the train and test approach for heart disease prediction', IEEE Transactions on Information Technology in Biomedicine, vol. 10, no. 2, pp. 334-343.
[18] Bayardo, RJ & Agrawal, R 1999, 'Mining the most interesting rules', in Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, pp. 145-154.
[19] Webb, GI & Zhang, S 2005, 'K-optimal rule discovery', Data Mining and Knowledge Discovery, vol. 10, no. 1, pp. 39-79.
[20] Zaki, MJ 2004, 'Mining non-redundant association rules',
Data Mining and Knowledge Discovery, vol. 9, no. 3, pp. 223-248.
[21] Domiciliary Care SA 2008, South Australia, viewed 3 November 2008, <http://www.domcare.sa.gov.au/>.
[22] Han, J & Kamber, M 2006, Data Mining: Concepts and Techniques, Elsevier, San Francisco, USA.
[23] Information Management 2002, United States, viewed 11 March 2010, <http://www.information-management.com/news/5329-1.html>.
[24] Leverage 2009, Australia, viewed 13 March 2010, <http://www.giwebb.com/Doc/MOleverage.html>.
[25] Bayardo, RJ, Agrawal, R & Gunopulos, D 1999, 'Constraint-based rule mining in large, dense databases', in Proceedings of the 15th International Conference on Data Engineering.
[26] Rymon, R 1992, 'Search through systematic set enumeration', in Proceedings of the Third International Conference on Principles of Knowledge Representation and Reasoning.
[27] P-Value 2010, Wikipedia, viewed 18 March 2010, <http://en.wikipedia.org/wiki/P-value>.
[28] Triola, MM & Triola, MF 2004, Biostatistics for the Biological and Health Sciences, 2nd edition, Boston.
[29] Wikipedia 2008, 'Odds ratio', viewed 4 November 2009, <http://en.wikipedia.org/wiki/Odds_ratio>.
[30] Association Discovery with Magnum Opus 4.3, Australia, viewed 13 June 2010, <http://www.giwebb.com/tutorialfr.html>.
[31] Wikipedia 2010, 'Attribute-value pair', viewed 1 August 2010, <http://en.wikipedia.org/wiki/Attribute-value_pair>.