Analysis of Hepatitis Dataset using Multirelational
Association Rules
Luciene Cristina Pizzi¹, Marcela Xavier Ribeiro², Marina Teresa Pires Vieira³
¹ Department of Computer Science, Federal University of São Carlos, São Carlos, SP, Brazil
{luciene_pizzi,marina}@dc.ufscar.br
² Department of Computer Science, University of São Paulo, São Carlos, SP, Brazil
[email protected]
³ Faculty of Exact and Natural Sciences, Methodist University of Piracicaba, Piracicaba, SP, Brazil
[email protected]
Abstract. The hepatitis dataset was analyzed based on the mining of
multirelational association rules. Experiments were conducted to analyze data on
blood and urine exams and biopsy results to infer information on the behavior of
degrees of fibrosis. Multirelational association rules were obtained using a new
algorithm called Connection that identifies patterns in different tables without
joining them. The results of our analysis are discussed here, as is the Connection
algorithm.
1 Introduction
The Hepatitis dataset compiled by Chiba University Hospital contains information on
patients’ exams dating from 1982 to 2001. The dataset contains a large amount of data
distributed irregularly through the period in question, making a direct analysis by
specialists impossible. An interesting approach, therefore, is to apply data mining
techniques to uncover valuable information from this dataset.
We report here on the use of an association rule algorithm called Connection [1, 2] to
reveal multirelational association rules in Hepatitis data. This algorithm identifies patterns
in different tables that have at least one common attribute, in cases where the join operation
on these tables is inadequate because it produces spurious tuples.
Section 2 of this paper sets forth the concepts of multirelational data mining and
describes the Connection algorithm. Section 3 introduces the approach used for
mining the dataset and discusses the pre-processing procedure. Section 4 reports on the
results and, lastly, section 5 presents our conclusions.
2 Multirelational Data Mining
2.1 Rationale
Figure 1 depicts the main tables involved in the hepatitis database. The patients’
various exams are not directly related, so joining these tables for a common analysis fails
to provide a suitable dataset for discovering association rules based on traditional data
mining algorithms such as Apriori [4] and FP-Growth [5]. In other words, the results derived
from joined tables may lead to data redundancy and thence to distortions in the calculation
of the support and confidence measures of interest.
[Figure 1: diagram of the main tables of the hepatitis database: Patient, Hematological Analysis, Interferon Therapy, Results of Biopsy, In-Hospital Examination and Out-Hospital Examination.]
Fig. 1 – Hepatitis dataset tables
To better explain this problem, consider the three tables in Figure 2: the Biopsy and
Urinalysis tables, which contain data on biopsy and urinalysis results, and a third table
that joins them on the attributes {MID, Month, Year}. Consider, also, that our aim is to
determine whether these two types of exams are related.
In Figure 2, note that the tuple (MID=772, Month=2, Year=1999, Fibrosis=F4)
accounts for 20% of the tuples of the Biopsy table, while the same data account for 50% of
the tuples of the joint table. This distortion is due to the spurious tuples produced in the
Joint Table, which is not in the Fourth Normal Form¹. This difference can distort the
calculation of the measures of interest of the rules mined from the joint table, or
prevent the discovery of interesting patterns.
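The effect can be reproduced on a small scale. The Python sketch below uses hypothetical rows, chosen only so that the proportions match the 20% and 50% quoted above; it is not a transcription of Figure 2, and the helper names `join_on_key` and `share` are introduced here purely for illustration.

```python
# Hypothetical rows (MID, Month, Year, value); not a transcription of Figure 2.
biopsy = [
    (760, 2, 1994, "F1"),
    (772, 3, 1999, "F2"),
    (772, 2, 1999, "F4"),
    (773, 4, 1989, "F2"),
    (894, 8, 1992, "F3"),
]
urinalysis = [
    (760, 2, 1994, "I-BIL_N"),
    (760, 2, 1994, "GOT_H"),
    (772, 3, 1999, "TP_N"),
    (772, 2, 1999, "PLT_VL"),
    (772, 2, 1999, "T-BIL_N"),
    (772, 2, 1999, "GPT_H"),
    (893, 5, 1982, "ALP_N"),
]


def join_on_key(left, right, key_len=3):
    """Equi-join two lists of tuples on their first key_len fields."""
    return [a + b[key_len:] for a in left for b in right
            if a[:key_len] == b[:key_len]]


def share(rows, prefix):
    """Fraction of rows whose leading fields equal the given prefix."""
    return sum(1 for row in rows if row[:len(prefix)] == prefix) / len(rows)


joined = join_on_key(biopsy, urinalysis)
target = (772, 2, 1999, "F4")
print(share(biopsy, target))  # 0.2: one of five biopsy tuples
print(share(joined, target))  # 0.5: three of six joined tuples
```

The single F4 biopsy is repeated once per urinalysis result of the same month, so any frequency computed on the joined rows overweights patients with many exams.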
Therefore, to analyze the datasets correctly and obtain rules for the biopsy results and
the other types of exams, we applied the Connection algorithm to the hepatitis database.
2.2 The Connection Algorithm
The Connection algorithm mines Boolean association rules from several tables that
have at least one attribute in common, without joining the tables. This algorithm was
originally proposed to mine data from data warehouses [1, 2], but the proposed method
can be used to mine multiple tables of a relational database.
¹ From the Normalization Theory for the Relational Model.
[Figure 2: the Biopsy table (MID, Month, Year, Fibrosis) and the Urinalysis table (MID, Month, Year, Result), joined on {MID, Month, Year} into the Joint Table (MID, Month, Year, Fibrosis, Result).]
Fig. 2 – Exam tables and their combination
The Connection algorithm uses some new measures of interest and relies on the
concepts of blocks and segments. A block is a set of tuples of one table that share the
values of one or more common attributes. Figure 2 shows the blocks of the Biopsy and
Urinalysis tables in alternate colors. The attributes in common here are MID, Month and
Year. A block is the unit of analysis of the mining process.
Blocks from different tables, having the same values for the attributes in common, are
related into a set through the process of mining association rules. This set of blocks is called a
segment. The arrows in Figure 2 indicate correspondence among blocks. Some blocks do not
form segments, as illustrated by the values in bold in Figure 2.
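As an illustration of these two concepts, the following Python sketch groups the tuples of each table into blocks and then pairs up blocks that share the same values of the common attributes. The rows are hypothetical, the function names are ours, and each table is assumed to carry a single item column after the key; this is not the authors' implementation.

```python
from collections import defaultdict

# Hypothetical rows (MID, Month, Year, item).
biopsy = [
    (760, 2, 1994, "F1"),
    (772, 2, 1999, "F4"),
    (773, 4, 1989, "F2"),
]
urinalysis = [
    (760, 2, 1994, "I-BIL_N"),
    (760, 2, 1994, "GOT_H"),
    (772, 2, 1999, "PLT_VL"),
    (772, 2, 1999, "GPT_H"),
    (893, 5, 1982, "ALP_N"),
]


def blocks(rows, key_len=3):
    """Group the items of one table by the values of the common attributes."""
    grouped = defaultdict(set)
    for row in rows:
        grouped[row[:key_len]].add(row[key_len])
    return dict(grouped)


def segments(*tables, key_len=3):
    """Keep only the keys present in every table; each segment lists one block per table."""
    per_table = [blocks(table, key_len) for table in tables]
    shared = set(per_table[0]).intersection(*per_table[1:])
    return {key: [b[key] for b in per_table] for key in shared}


segs = segments(biopsy, urinalysis)
for key in sorted(segs):
    print(key, segs[key])
```

The blocks keyed by (773, 4, 1989) and (893, 5, 1982) appear in only one table, so they form no segment and do not contribute to the measures defined next.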
The various parameters proposed in the literature to quantify the interest of a rule are
not directly applicable to multirelational association rules. Therefore, Ribeiro [1,2]
adapted the support and confidence parameters to be applied in cases involving data from
multiple related tables. In the definition of these measures of interest, T represents the set
of related tables, while X and Y represent itemsets of T.
The support# of an itemset X is the ratio between the number of segments of T in
which X occurs and the total number of segments of T, while the support# of a
multirelational association rule X→Y of T is the ratio between the number of segments in
which X and Y occur together and the total number of segments of T.
The confidence# of a multirelational association rule X→Y of T is the ratio between
the number of segments in which X and Y occur together and the number of segments of
T in which X occurs.
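These definitions can be written compactly. The symbol S(T) for the set of segments of T is introduced here only for convenience, and a segment s is treated as the set of items occurring in its blocks, so that "X occurs in s" reads as X ⊆ s:

```latex
\mathrm{support}^{\#}(X) = \frac{\left|\{\, s \in S(T) : X \subseteq s \,\}\right|}{\left| S(T) \right|}
\qquad
\mathrm{support}^{\#}(X \rightarrow Y) = \frac{\left|\{\, s \in S(T) : X \cup Y \subseteq s \,\}\right|}{\left| S(T) \right|}

\mathrm{confidence}^{\#}(X \rightarrow Y) = \frac{\left|\{\, s \in S(T) : X \cup Y \subseteq s \,\}\right|}{\left|\{\, s \in S(T) : X \subseteq s \,\}\right|}
```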
Having defined these measures of interest, we can say that an itemset X is a frequent#
itemset if it satisfies the minimum support#, and a rule A is a strong# rule if it satisfies the
minimum values established for confidence# and support#. Mining multirelational
association rules consists of finding all the strong# rules in a set T of two or more related tables.
Example: Consider the rule GPT_H → F4, sup#=0.33, conf#=0.5, found by the
Connection algorithm in the tables of Figure 2. This rule was found in 33% of the segments,
and in 50% of the segments where item GPT_H occurred, item F4 also occurred.
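The arithmetic behind such a rule can be sketched in Python. The three segments below are hypothetical, chosen only so that the rule GPT_H → F4 yields the same sup# and conf# as in the example; each segment is represented as the set of all items in its blocks, which is a simplification made here.

```python
# Hypothetical segments, each flattened to the set of items in its blocks.
segments = [
    {"F1", "I-BIL_N", "GPT_H"},
    {"F2", "TP_N", "GOT_H"},
    {"F4", "PLT_VL", "T-BIL_N", "GPT_H"},
]


def occurrences(segments, itemset):
    """Number of segments in which every item of the itemset occurs."""
    return sum(1 for s in segments if set(itemset) <= s)


def support(segments, itemset):
    """support#: share of segments containing the itemset."""
    return occurrences(segments, itemset) / len(segments)


def confidence(segments, antecedent, consequent):
    """confidence#: share of the antecedent's segments that also contain the consequent."""
    base = occurrences(segments, antecedent)
    if base == 0:
        return 0.0
    return occurrences(segments, set(antecedent) | set(consequent)) / base


print(round(support(segments, {"GPT_H", "F4"}), 2))  # 0.33
print(confidence(segments, {"GPT_H"}, {"F4"}))       # 0.5
```

Counting segments rather than joined rows means a month with many exam results weighs the same as a month with one.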
The multirelational data mining performed by the Connection algorithm is more
complex than the traditional data mining process, since it must identify segments in order
to relate the data of multiple tables. The Connection algorithm consists of the following
steps: identification of the segments; calculation of the support# of the itemsets of each table;
determination of the local frequent# itemsets; determination of the global frequent#
itemsets; and generation of the multirelational association rules.
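The steps above can be sketched as follows. This is a naive, brute-force illustration under our own assumptions, not the Connection implementation of [1, 2]: local frequent# itemsets are enumerated exhaustively per table, global candidates are formed by taking one local itemset from each table, and all names and rows are hypothetical.

```python
from collections import defaultdict
from itertools import combinations, product


def blocks(rows, key_len=3):
    """Group the items of one table by the common attributes."""
    grouped = defaultdict(set)
    for row in rows:
        grouped[row[:key_len]].add(row[key_len])
    return grouped


def count(transactions, itemset):
    """Number of transactions that contain the whole itemset."""
    return sum(1 for t in transactions if itemset <= t)


def frequent_itemsets(transactions, min_sup):
    """Exhaustive level-wise search; stops at the first level with no frequent itemset."""
    items = sorted(set().union(*transactions))
    frequent = {}
    for size in range(1, len(items) + 1):
        level = {}
        for cand in map(frozenset, combinations(items, size)):
            sup = count(transactions, cand) / len(transactions)
            if sup >= min_sup:
                level[cand] = sup
        if not level:
            break
        frequent.update(level)
    return frequent


def connection_sketch(tables, min_sup, min_conf, key_len=3):
    # Step 1: identify the segments (keys present in every table).
    per_table = [blocks(t, key_len) for t in tables]
    keys = sorted(set(per_table[0]).intersection(*per_table[1:]))
    if not keys:
        return []
    # Steps 2-3: support# per table and local frequent# itemsets.
    local = [frequent_itemsets([b[k] for k in keys], min_sup) for b in per_table]
    # Step 4: global frequent# itemsets, one local itemset from each table.
    merged = [set().union(*(b[k] for b in per_table)) for k in keys]
    rules = []
    for combo in product(*local):
        itemset = frozenset().union(*combo)
        both = count(merged, itemset)
        if both == 0 or both / len(merged) < min_sup:
            continue
        # Step 5: generate the rules that reach the minimum confidence#.
        for size in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(sorted(itemset), size)):
                conf = both / count(merged, antecedent)
                if conf >= min_conf:
                    rules.append((antecedent, itemset - antecedent,
                                  both / len(merged), conf))
    return rules


# Hypothetical rows (MID, Month, Year, item).
biopsy = [(760, 2, 1994, "F1"), (772, 3, 1999, "F2"), (772, 2, 1999, "F4")]
urinalysis = [(760, 2, 1994, "GOT_H"), (772, 3, 1999, "TP_N"),
              (772, 2, 1999, "PLT_VL"), (772, 2, 1999, "GPT_H")]
rules = connection_sketch([biopsy, urinalysis], min_sup=0.3, min_conf=0.9)
for antecedent, consequent, sup, conf in rules:
    print(sorted(antecedent), "->", sorted(consequent), round(sup, 2), conf)
```

The exhaustive enumeration is exponential in the number of items and is only meant to make the order of the steps concrete on toy data.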
3 Approach Adopted
Among the topics suggested by the Hepatitis dataset, we proposed to evaluate whether the
stage of liver fibrosis can be estimated based on laboratory tests. Our aim was to discover
whether biopsies can be replaced by lab tests, since the former involve invasive procedures.
The approach adopted here consisted of analyzing blood and urine tests together with
biopsy results, seeking patterns that might indicate a correlation between the patients’
exam results and the degree of their fibrosis.
We began our analysis with a pre-processing phase due to the irregularity of the
available data. In this phase, null values and noisy data were eliminated and the data were
organized in different intervals of time.
We also reduced the data by selecting only the most important exams, i.e., GOT, GPT,
ZTT, TTT, T-BIL, D-BIL, I-BIL, ALB, CHE, T-CHO, TP, WBC, PLT, RBC, HGB, HCT
and MCV. These choices were based on the work reported in [3]. In this
experiment, only the results of exams done over a period in which the degree of fibrosis
of the same patient was also measured contributed to the analysis.
To properly analyze the time period involved, the exam date was divided into two
attributes: year and month. The period of time considered for the analysis was one month.
The exam results of patients with more than one result for the same type of exam in a one-month period were averaged.
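A minimal sketch of this step in Python is given below. The record layout (patient identifier, exam date, exam name, numeric value) and the function name are assumptions made for illustration.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def monthly_average(records):
    """Average repeated results of the same exam for the same patient and month.

    records: iterable of (mid, exam_date, exam_name, value) tuples (assumed layout).
    Returns a dict keyed by (mid, month, year, exam_name).
    """
    grouped = defaultdict(list)
    for mid, exam_date, exam, value in records:
        grouped[(mid, exam_date.month, exam_date.year, exam)].append(value)
    return {key: mean(values) for key, values in grouped.items()}


# Hypothetical records.
records = [
    (772, date(1999, 2, 3), "GOT", 40),
    (772, date(1999, 2, 20), "GOT", 60),
    (772, date(1999, 3, 1), "GOT", 55),
]
averaged = monthly_average(records)
print(averaged[(772, 2, 1999, "GOT")])  # 50
```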
Biopsies were treated differently. Fibrosis was considered stable 500 days before and
500 days after a biopsy, according to ref. [3]. The degrees of fibrosis and exam results
were discretized according to thresholds specified by [3], whose values are not presented
here due to space limitations.
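The stability window can be illustrated with a short Python sketch. The choice of the nearest biopsy when several fall within the window is our assumption, since the text does not say how such cases were resolved; the function name and dates are hypothetical.

```python
from datetime import date, timedelta

# Fibrosis is treated as stable 500 days before and after a biopsy.
WINDOW = timedelta(days=500)


def fibrosis_at(exam_date, biopsies):
    """Return the fibrosis degree assumed valid on exam_date, or None.

    biopsies: list of (biopsy_date, degree). If several biopsies fall within
    the window, the nearest one is used (an assumption of this sketch).
    """
    candidates = [(abs(exam_date - day), degree)
                  for day, degree in biopsies
                  if abs(exam_date - day) <= WINDOW]
    return min(candidates)[1] if candidates else None


biopsies = [(date(1999, 2, 10), "F4")]
print(fibrosis_at(date(1999, 12, 1), biopsies))  # F4
print(fibrosis_at(date(2001, 1, 1), biopsies))   # None
```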
The most interesting attributes for the analysis were then selected. Blood and urine
exam attributes were MID, Month, Year and Result, and biopsy attributes were MID,
Month, Year and Fibrosis.
After the pre-processing phase, a GOT_N value indicated that the patient had a Normal
value for the GOT exam, while an F3 value indicated that the patient’s degree of fibrosis
was equal to 3. The next section presents the results obtained with the Connection algorithm.
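The naming of the items can be sketched as below. The cut points are placeholders invented for this illustration and are not the thresholds of [3], which are not reproduced in this paper; the function name is also ours.

```python
from bisect import bisect_right

# Placeholder cut points, NOT the clinical thresholds of ref. [3].
# Values below the first cut point get the first label, and so on.
CUTS = {"GOT": ([40, 100], ["N", "H", "VH"])}


def to_item(exam, value, cuts=CUTS):
    """Turn a numeric exam result into an item such as 'GOT_N'."""
    bounds, labels = cuts[exam]
    return f"{exam}_{labels[bisect_right(bounds, value)]}"


print(to_item("GOT", 25))   # GOT_N
print(to_item("GOT", 150))  # GOT_VH
```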
4 Results and Discussion
The rules obtained through the application of the Connection algorithm revealed some
tendencies indicating the existence of exams whose values were related to the patient’s
degree of fibrosis, and exams whose results did not show this correlation.
To better visualize our results, the graphs below illustrate the confidence# of the rules
of the form examination → Fibrosis, where the results of the exam are ascribed values of
Ultra Low (UL), Very Low (VL), Low (L), Normal (N), High (H), Very High (VH) and
Ultra High (UH), depending on the exam analyzed, and Fibrosis is ascribed values
ranging from F0 to F4. The points where the confidence# is equal to zero indicate that the
Connection algorithm did not identify this rule.
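The values plotted in such a graph can be tabulated as in the Python sketch below, which computes the confidence# of every rule of the form exam level → fibrosis degree over a list of segments. The segments are hypothetical and flattened to item sets, and the function name is ours. Note one difference from the graphs: here a zero simply means that the two items never occur together, whereas in the graphs it means that the rule was not returned by the algorithm.

```python
def confidence_grid(segments, exam, levels, degrees):
    """confidence# of each rule exam_level -> degree, as a dict keyed by (level, degree)."""
    grid = {}
    for level in levels:
        item = f"{exam}_{level}"
        with_item = [s for s in segments if item in s]
        for degree in degrees:
            both = sum(1 for s in with_item if degree in s)
            grid[(level, degree)] = both / len(with_item) if with_item else 0.0
    return grid


# Hypothetical segments.
segments = [
    {"PLT_VL", "F4"},
    {"PLT_VL", "F1"},
    {"PLT_N", "F1"},
    {"PLT_N", "F1"},
    {"PLT_N", "F2"},
]
grid = confidence_grid(segments, "PLT", ["VL", "L", "N", "H"],
                       ["F0", "F1", "F2", "F3", "F4"])
print(grid[("VL", "F4")])  # 0.5
print(grid[("H", "F4")])   # 0.0
```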
The tendencies were organized into three groups, as listed below.
Group 1 refers to the exams whose values change according to the stage of liver
fibrosis of the patient. The exams that follow this pattern were RBC, ALB, PLT
and D-BIL. To illustrate this situation, we have chosen PLT for discussion.
The confidence# of the rule PLT_VL → F4 is equal to 50%, while the confidence# of
the rule PLT_VL → F1 is equal to 14%, as indicated in the graph on the left-hand side of
Figure 3. In contrast, the confidence# of the rule PLT_N → F4 is equal to 5%, while the
confidence# of the rule PLT_N → F1 is equal to 61%.
[Figure 3: two graphs, "PLT Analysis" and "CHE Analysis". Each plots Confidence# (0 to 1) against the discretized exam result (from UL to VH), with one series per degree of fibrosis (F0 to F4).]
Fig. 3 – Confidence# of the rules found
Therefore, we can state that Very Low values for the PLT analysis tend to occur in
patients with high levels of fibrosis, such as F3 and mainly F4. In addition, the highest values
found for the PLT analysis, in this case the Normal values, tend to occur in patients with
fibrosis level F1. Data relating to patients with fibrosis level F0 were of minor
significance in the rules obtained due to their rare occurrence in the analyzed data. The
situation discussed here indicates the existence of a correlation between this group’s
exams and the stage of liver fibrosis of the patient.
Another tendency found (Group 2) related to the exams whose confidence# varied
little among the different degrees of fibrosis of the patients. This group’s exams were:
HCT, HGB, MCV, WBC, GOT, GPT, I-BIL, T-BIL, T-CHO, TP and TTT. The results of the
analysis of these exams indicated that they could not be used to estimate the stage of liver
fibrosis.
Group 3 refers to the CHE and ZTT exams, whose behavior did not reveal a specific
tendency. The graph on the right-hand side of Figure 3 presents the confidence# of the
rules involving the CHE exam. As can be seen, the values of confidence# varied
considerably at all levels of fibrosis, and no degree of fibrosis was associated with rules
of markedly dissimilar confidence#. Thus, the analysis of the rules obtained by the
algorithm that combine these exams with others may be useful, e.g.:
CHE_L, HCT_L, MCV_H → F4, sup#=0.01, conf#=0.70.
In the three groups, the support# of the rules involving fibrosis level F1 was much higher
than that of the others. Except for fibrosis level F0, which occurred in a minor portion of the
data, the rules involving low degrees of fibrosis, such as F1 and F2, showed higher support#.
5 Conclusions
We presented a strategy to analyze the Hepatitis dataset for the purpose of estimating
the stage of liver fibrosis of patients based on lab tests. The strategy consisted of using a
multirelational data mining algorithm which discovers rules among several related tables.
The rules obtained here successfully identified exams whose values are related to patients’
levels of fibrosis, and also exams that are probably uncorrelated with levels of fibrosis.
Finally, it should be noted that the results obtained here were influenced by the quality
of the analyzed data. The irregularity of the data may have prevented the discovery of
interesting patterns involved in some exams.
Acknowledgements
L.C. Pizzi and M.X. Ribeiro gratefully acknowledge the scholarships granted to them,
respectively, by CAPES and FAPESP (Brazil).
References
1. Ribeiro, M.X. Data Mining in Multiples Fact Tables of Data Warehouses. 131 pp. Master
Dissertation. Department of Computer Science, Federal University of São Carlos, Brazil, 2004.
2. Ribeiro, M.X.; Vieira, M.T.P. A New Approach for Mining Association Rules in Data Warehouses.
In 6th International Conference on Flexible Query Answering Systems, Lyon, France, 2004.
3. Watanabe, T.; Suzuki, E.; Yokoi, H.; Takabayashi, K. Application of prototypeline to chronic
hepatitis data. In Working note of ECML/PKDD-2003 Discovery Challenge, p. 166–177, 2003.
4. Agrawal, R.; Srikant, R. Fast algorithms for mining association rules. In Proc. of the Int'l Conf. on
Very Large Databases, Santiago de Chile, Chile, 1994.
5. Han, J.; Pei, J.; Yin, Y. Mining frequent patterns without candidate generation. In Proc. of the
ACM SIGMOD Int'l Conf. on Management of Data, Vol. 29, Dallas, Texas, USA, 2000. p. 1-12.