Analysis of Hepatitis Dataset using Multirelational Association Rules

Luciene Cristina Pizzi1, Marcela Xavier Ribeiro2, Marina Teresa Pires Vieira3

1 Department of Computer Science, Federal University of São Carlos, São Carlos, SP, Brazil {luciene_pizzi,marina}@dc.ufscar.br
2 Department of Computer Science, University of São Paulo, São Carlos, SP, Brazil [email protected]
3 Faculty of Exact and Natural Sciences, Methodist University of Piracicaba, Piracicaba, SP, Brazil [email protected]

Abstract. The hepatitis dataset was analyzed based on the mining of multirelational association rules. Experiments were conducted to analyze data on blood and urine exams and biopsy results to infer information on the behavior of degrees of fibrosis. Multirelational association rules were obtained using a new algorithm called Connection, which identifies patterns in different tables without joining them. The results of our analysis are discussed here, as is the Connection algorithm.

1 Introduction

The Hepatitis dataset compiled by Chiba University Hospital contains information on patients’ exams dating from 1982 to 2001. The dataset contains a large amount of data distributed irregularly through the period in question, making a direct analysis by specialists impossible. An interesting approach, therefore, is to apply data mining techniques to uncover valuable information from this dataset. We report here on the use of an association rule algorithm called Connection [1, 2] to reveal multirelational association rules in Hepatitis data. This algorithm identifies patterns in different tables that have at least one attribute in common and for which the join operation is inadequate because it produces spurious tuples.
Section 2 of this paper sets forth the concepts of multirelational data mining and describes the Connection algorithm, Section 3 introduces the approach used for mining the dataset and discusses the pre-processing procedure, Section 4 reports on the results and, lastly, Section 5 presents our conclusions.

2 Multirelational Data Mining

2.1 Rationale

Figure 1 depicts the main tables involved in the hepatitis database. The patients’ various exams are not directly related, so joining these tables for a common analysis fails to provide a suitable dataset for discovering association rules with traditional data mining algorithms such as Apriori [4] and FP-Growth [5]. In other words, the results derived from joined tables may contain redundant data, which in turn distorts the calculation of the support and confidence measures of interest.

[Fig. 1 – Hepatitis dataset tables: Patient, Hematological Analysis, Interferon Therapy, Results of Biopsy, In-Hospital Examination, Out-Hospital Examination]

To better explain this problem, consider the three tables in Figure 2: two of them contain data on urinalysis and biopsy results, and the third joins them on the attributes {MID, Month, Year}. Consider, also, that our aim is to determine whether these two types of exams are related. In Figure 2, note that the tuple (MID=772, Month=2, Year=1999, Fibrosis=F4) accounts for 20% of the tuples of the Biopsy table, while the same data accounts for 50% of the tuples of the Joint Table. This distortion is due to the spurious tuples that result in the Joint Table, which is not in the Fourth Normal Form¹. This difference can distort the calculation of the measures of interest of the rules derived from mining the Joint Table, or prevent the discovery of interesting patterns. Therefore, to analyze the datasets correctly and obtain rules for the biopsy results and the other types of exams, we applied the Connection algorithm to the hepatitis database.
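The distortion described above can be reproduced on a small scale. The following is a minimal sketch in Python, not part of the original study: the rows are hypothetical toy data (they are not the values of Figure 2) chosen so that one biopsy tuple matches four exam results, and the helper names `join_on_key` and `share` are introduced here only for illustration.

```python
# Hypothetical toy rows; these are NOT the values shown in Fig. 2.
# Biopsy rows: (MID, Month, Year, Fibrosis)
BIOPSY = [
    (1, 2, 1994, "F1"),
    (2, 3, 1999, "F2"),
    (2, 2, 1999, "F4"),
    (3, 4, 1989, "F2"),
    (4, 8, 1992, "F3"),
]
# Urinalysis rows: (MID, Month, Year, Result)
URINALYSIS = [
    (1, 2, 1994, "I-BIL_N"),
    (1, 2, 1994, "GOT_H"),
    (2, 3, 1999, "TP_N"),
    (2, 3, 1999, "GOT_H"),
    (2, 2, 1999, "PLT_VL"),
    (2, 2, 1999, "T-BIL_N"),
    (2, 2, 1999, "GPT_H"),
    (2, 2, 1999, "GOT_H"),
    (5, 5, 1982, "ALP_N"),
]


def join_on_key(left, right):
    """Natural join on the shared attributes (MID, Month, Year)."""
    return [l + (r[3],) for l in left for r in right if l[:3] == r[:3]]


def share(rows, key, fibrosis):
    """Fraction of rows that carry the given key and fibrosis value."""
    hits = sum(1 for row in rows if row[:3] == key and row[3] == fibrosis)
    return hits / len(rows)


joined = join_on_key(BIOPSY, URINALYSIS)
share_biopsy = share(BIOPSY, (2, 2, 1999), "F4")
share_joined = share(joined, (2, 2, 1999), "F4")
print(share_biopsy, share_joined)
```

With these toy rows the F4 tuple makes up one fifth of the biopsy rows but half of the joined rows, because the join repeats it once for every matching exam result.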
2.2 The Connection Algorithm

The Connection algorithm mines Boolean association rules from several tables that have at least one attribute in common, without joining the tables. This algorithm was originally proposed to mine data from data warehouses [1, 2], but the method can also be used to mine multiple tables of a relational database.

¹ From the Normalization Theory for the Relational Model.

[Fig. 2 – Exam tables and their combination: Urinalysis (MID, Month, Year, Result), Biopsy (MID, Month, Year, Fibrosis) and Joint Table (MID, Month, Year, Fibrosis, Result)]

The Connection algorithm uses some new measures of interest and relies on the concepts of blocks and segments. A block is a set of tuples of one table that share the values of one or more common attributes. Figure 2 shows the blocks of the Biopsy and Urinalysis tables in alternate colors. The attributes in common here are MID, Month and Year. A block is the unit of analysis of a mining process. Blocks from different tables that have the same values for the attributes in common are grouped into a set during the mining of association rules. This set of blocks is called a segment. The arrows in Figure 2 indicate correspondence among blocks. Some blocks do not form segments, as illustrated by the values in bold in Figure 2.

The various parameters proposed in the literature to quantify the interest of a rule are not directly applicable to multirelational association rules. Therefore, Ribeiro [1, 2] adapted the support and confidence parameters so that they can be applied to data from multiple related tables.
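The notions of block and segment can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the Connection implementation of [1, 2]: the rows are hypothetical toy data, a segment is assumed to require one block from every table, and the function names (`blocks`, `segments`, `support_sharp`, `confidence_sharp`) are introduced here. The last two functions compute segment-level ratios of the kind this section uses for support# and confidence#.

```python
from collections import defaultdict

# Hypothetical toy rows; these are NOT the values shown in Fig. 2.
BIOPSY = [  # (MID, Month, Year, Fibrosis)
    (1, 2, 1994, "F1"), (2, 3, 1999, "F2"), (2, 2, 1999, "F4"),
    (3, 4, 1989, "F2"), (4, 8, 1992, "F3"),
]
URINALYSIS = [  # (MID, Month, Year, Result)
    (1, 2, 1994, "I-BIL_N"), (1, 2, 1994, "GOT_H"),
    (2, 3, 1999, "TP_N"), (2, 3, 1999, "GPT_H"),
    (2, 2, 1999, "PLT_VL"), (2, 2, 1999, "T-BIL_N"), (2, 2, 1999, "GPT_H"),
    (5, 5, 1982, "ALP_N"),
]


def blocks(rows):
    """Group one table's rows by (MID, Month, Year); keep the item column."""
    grouped = defaultdict(set)
    for *key, item in rows:
        grouped[tuple(key)].add(item)
    return dict(grouped)


def segments(*tables):
    """Assumption: a segment exists for each key with a block in every table."""
    per_table = [blocks(table) for table in tables]
    shared_keys = set(per_table[0]).intersection(*per_table[1:])
    return {key: set().union(*(b[key] for b in per_table)) for key in shared_keys}


def support_sharp(segs, itemset):
    """Fraction of segments in which every item of the itemset occurs."""
    return sum(1 for items in segs.values() if itemset <= items) / len(segs)


def confidence_sharp(segs, x, y):
    """Among segments containing x, the fraction that also contain y."""
    with_x = [items for items in segs.values() if x <= items]
    return sum(1 for items in with_x if y <= items) / len(with_x)


segs = segments(BIOPSY, URINALYSIS)
sup = support_sharp(segs, {"GPT_H", "F4"})
conf = confidence_sharp(segs, {"GPT_H"}, {"F4"})
print(len(segs), round(sup, 2), conf)
```

No join is materialized: each table is grouped once, and only keys present in every table yield a segment, so blocks without a counterpart (three in this toy data) are left out of the counts.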
In the definition of these measures of interest, T represents the set of related tables, while X and Y represent itemsets of T. The support# of an itemset X is the ratio between the number of segments of T in which X occurs and the total number of segments of T, while the support# of a multirelational association rule X→Y of T is the ratio between the number of segments in which X and Y occur together and the total number of segments of T. The confidence# of a multirelational association rule X→Y of T is the ratio between the number of segments in which X and Y occur together and the number of segments of T in which X occurs.

Having defined these measures of interest, we can say that an itemset X is a frequent# itemset if it satisfies the minimum support#, and a rule is a strong# rule if it satisfies the minimum values established for confidence# and support#. Mining multirelational association rules consists of finding all the strong# rules in a set T of two or more related tables.

Example: Consider the rule GPT_H → F4, sup#=0.33, conf#=0.5, found by the Connection algorithm in the tables of Figure 2. Items GPT_H and F4 occurred together in 33% of the segments, and item F4 occurred in 50% of the segments in which item GPT_H occurred.

The multirelational data mining performed by the Connection algorithm is more complex than the traditional data mining process, since it must identify segments in order to relate the data of multiple tables. The Connection algorithm consists of the following steps: identification of the segments; calculation of the support# of the itemsets of each table; determination of the local frequent# itemsets; determination of the global frequent# itemsets; and generation of the multirelational association rules.

3 Approach Adopted

Among the topics suggested by the Hepatitis dataset, we proposed to evaluate whether the stage of liver fibrosis can be estimated based on laboratory tests.
Our aim was to discover whether biopsies can be replaced by lab tests, since the former involve invasive procedures. The approach adopted here consisted of analyzing blood and urine tests together with biopsy results, seeking patterns that might indicate a correlation between the patients’ exam results and the degree of their fibrosis.

We began our analysis with a pre-processing phase due to the irregularity of the available data. In this phase, null values and noisy data were eliminated and the data were organized in different intervals of time. We also reduced the data by selecting only the most important exams, i.e., GOT, GPT, ZTT, TTT, T-BIL, D-BIL, I-BIL, ALB, CHE, T-CHO, TP, WBC, PLT, RBC, HGB, HCT and MCV. These choices were based on the work reported in ref. [3]. In this experiment, only the results of exams performed in a period for which the degree of fibrosis of the same patient had been measured contributed to the analysis.

To properly analyze the time period involved, the exam date was divided into two attributes: year and month. The period of time considered for the analysis was one month. When a patient had more than one result for the same type of exam in a one-month period, the results were averaged. Biopsies were treated differently: fibrosis was considered stable from 500 days before to 500 days after a biopsy, according to ref. [3]. The degrees of fibrosis and exam results were discretized according to thresholds specified in ref. [3], whose values are not presented here due to space limitations.

The most interesting attributes for the analysis were then selected. Blood and urine exam attributes were MID, Month, Year and Result, and biopsy attributes were MID, Month, Year and Fibrosis. After the pre-processing phase, a GOT_N value indicated that the patient had a Normal value for the GOT exam, while an F3 value indicated that the patient’s degree of fibrosis was equal to 3.
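The pre-processing steps described above can be outlined as follows. This is a minimal sketch under loud assumptions, not the authors' pipeline: the raw records are hypothetical, the cut-offs in `HYPOTHETICAL_BINS` are placeholders (the thresholds actually used come from ref. [3] and are not listed in this paper), and the function names are introduced here for illustration.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

# Hypothetical raw records: (MID, exam date, exam name, numeric value).
RAW = [
    (772, date(1999, 2, 3), "PLT", 60.0),
    (772, date(1999, 2, 20), "PLT", 80.0),
    (772, date(1999, 3, 5), "PLT", 150.0),
]

# Placeholder cut-offs only; the real thresholds (ref. [3]) are not given here.
# Each pair is (exclusive upper bound, label), in ascending order.
HYPOTHETICAL_BINS = {"PLT": [(100.0, "VL"), (130.0, "L"), (float("inf"), "N")]}


def monthly_average(records):
    """Average repeated results of one exam within the same month."""
    groups = defaultdict(list)
    for mid, day, exam, value in records:
        groups[(mid, day.month, day.year, exam)].append(value)
    return {key: mean(values) for key, values in groups.items()}


def discretize(exam, value):
    """Map a numeric value to an item such as 'PLT_VL'."""
    for upper, label in HYPOTHETICAL_BINS[exam]:
        if value < upper:
            return f"{exam}_{label}"
    raise ValueError(f"no bin for {exam}={value}")


def biopsy_covers(biopsy_day, exam_day, window_days=500):
    """Fibrosis is taken as stable within 500 days of the biopsy."""
    return abs((exam_day - biopsy_day).days) <= window_days


items = sorted(
    (mid, month, year, discretize(exam, avg))
    for (mid, month, year, exam), avg in monthly_average(RAW).items()
)
print(items)
```

The two February results are averaged before discretization, so each (MID, Month, Year) key carries at most one item per exam, which matches the exam attributes MID, Month, Year and Result listed above.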
The next section presents the results obtained with the Connection algorithm.

4 Results and Discussion

The rules obtained by applying the Connection algorithm revealed some tendencies: there were exams whose values were related to the patient’s degree of fibrosis, and exams whose results did not show this correlation. To better visualize our results, the graphs below illustrate the confidence# of the rules of the form examination → Fibrosis, where the results of the exam are ascribed values of Ultra Low (UL), Very Low (VL), Low (L), Normal (N), High (H), Very High (VH) and Ultra High (UH), depending on the exam analyzed, and Fibrosis is ascribed values ranging from F0 to F4. The points where the confidence# is equal to zero indicate that the Connection algorithm did not identify the corresponding rule. The tendencies were organized into three groups, as listed below.

Group 1 refers to the exams whose values change according to the stage of liver fibrosis of the patient. The exams that followed this pattern were RBC, ALB, PLT and D-BIL. To illustrate this situation, we have chosen PLT for discussion. The confidence# of the rule PLT_VL → F4 is equal to 50%, while the confidence# of the rule PLT_VL → F1 is equal to 14%, as indicated on the left-hand side of the graph in Figure 3. In contrast, the confidence# of the rule PLT_N → F4 is equal to 5%, while the confidence# of the rule PLT_N → F1 is equal to 61%.

[Fig. 3 – Confidence# of the rules found (left: PLT Analysis; right: CHE Analysis; vertical axis: confidence#; one series per fibrosis level, F0 to F4)]

Therefore, we can state that Very Low values in the PLT analysis tend to occur in patients with high levels of fibrosis, such as F3 and mainly F4. In addition, the highest values found in the PLT analysis, in this case the Normal values, tend to occur in patients with fibrosis level F1.
Data relating to patients with fibrosis level F0 were of minor significance in the rules obtained due to their rare occurrence in the analyzed data. The situation discussed here indicates the existence of a correlation between this group’s exams and the stage of liver fibrosis of the patient.

Another tendency found (Group 2) relates to the exams whose confidence# varied little among the different degrees of fibrosis of the patients. This group’s exams were HCT, HGB, MCV, WBC, GOT, GPT, I-BIL, T-BIL, T-CHO, TP and TTT. The results of the analysis of these exams indicated that they could not be used to estimate the stage of liver fibrosis.

Group 3 refers to the CHE and ZTT exams, whose behavior did not reveal a specific tendency. The graph on the right-hand side of Figure 3 presents the confidence# of the rules involving the CHE exam. As can be seen, the values of confidence# varied considerably at all levels of fibrosis, and no degree of fibrosis was associated with rules of markedly dissimilar confidence#. Thus, the analysis of the rules obtained by the algorithm that combine these exams with others may be useful, e.g.: CHE_L, HCT_L, MCV_H → F4, sup#=0.01, conf#=0.70.

In the three groups, the support# of the rules involving fibrosis level F1 was much higher than that of the others. Except for fibrosis level F0, which occurred in a minor portion of the data, the rules involving low degrees of fibrosis, such as F1 and F2, showed higher support#.

5 Conclusions

We presented a strategy to analyze the Hepatitis dataset for the purpose of estimating the stage of liver fibrosis of patients based on lab tests. The strategy consisted of using a multirelational data mining algorithm that discovers rules among several related tables. The rules obtained here successfully identified exams whose values were related to patients’ levels of fibrosis, and also exams that were probably uncorrelated with levels of fibrosis.
Finally, it should be noted that the results obtained here were influenced by the quality of the analyzed data. The irregularity of the data may have prevented the discovery of interesting patterns involved in some exams.

Acknowledgements

L.C. Pizzi and M.X. Ribeiro gratefully acknowledge the scholarships granted to them, respectively, by CAPES and FAPESP (Brazil).

References

1. Ribeiro, M.X. Data Mining in Multiples Fact Tables of Data Warehouses. 131 pp. Master Dissertation. Department of Computer Science, Federal University of São Carlos, Brazil, 2004.
2. Ribeiro, M.X.; Vieira, M.T.P. A New Approach for Mining Association Rules in Data Warehouses. In 6th International Conference On Flexible Query Answering Systems, Lyon, France, 2004.
3. Watanabe, T.; Suzuki, E.; Yokoi, H.; Takabayashi, K. Application of prototypeline to chronic hepatitis data. In Working note of ECML/PKDD-2003 Discovery Challenge, p. 166–177, 2003.
4. Agrawal, R.; Srikant, R. Fast algorithms for mining association rules. In Proc. of the Int'l Conf. on Very Large Databases, Santiago de Chile, Chile, 1994.
5. Han, J.; Pei, J.; Yin, Y. Mining frequent patterns without candidate generation. In Proc. of the ACM SIGMOD Int'l Conf. on Management of Data, Vol. 29, Dallas, Texas, USA, 2000. p. 1-12.