USING FUZZY UNORDERED RULE INDUCTION ALGORITHM FOR
CANCER DATA CLASSIFICATION
Madara Gasparovica and Ludmila Aleksejeva
Riga Technical University
Institute of Information Technology
Department of Modelling and Simulation
1 Kalku Street, Riga, LV-1658
Latvia
[email protected], [email protected]
Abstract: This paper studies the use of fuzzy logic in the analysis and classification of bioinformatics data.
Bioinformatics data typically combine a large number of attributes with a small number of records, which calls for
methods that can process such data and induce comprehensible, easily interpretable IF-THEN classification rules.
Experiments were performed on Breast cancer, Prostate cancer, Gastric cancer, Gastric intestinal disease and Healthy
donor data sets using the Fuzzy Unordered Rule Induction Algorithm (FURIA). The study also describes how the FURIA
algorithm works. To put the classification results for these data sets in context, additional experiments were carried
out on bioinformatics data sets frequently used in the literature. The paper also draws conclusions about which
parameters increase, decrease or do not affect classification accuracy.
Keywords: Fuzzy Logic; Bioinformatics data; Cancer; Fuzzy IF-THEN rules; Classification; FURIA.
1 Introduction
Fuzzy algorithms are popular because they produce classification rules that are easy for humans to understand, can
process linguistic data and can incorporate expert opinion; their use is also not limited to classification. Fuzzy
algorithms have great potential for processing bioinformatics data, which are characterized by a large number of
attributes (genes, antibodies and proteins) and a small number of records (patients). This ratio is to be expected
because biological tests are very expensive and time-consuming.
This study analyzes the Fuzzy Unordered Rule Induction Algorithm (FURIA), which can be used for classification of
bioinformatics data, describes its main working principles and draws conclusions about its application. The
experiments were carried out on Breast cancer, Prostate cancer, Gastric cancer, Gastric intestinal disease and
Healthy donor data sets from the Latvian Biomedical Research and Study Centre. These data are unique: they were
obtained recently using a novel technique at the only centre in Latvia that obtains and processes such data.
Searching these data for diagnostic patterns (classification potential) with data mining methods is therefore
promising.
This paper is organized as follows. Section 2 describes the FURIA algorithm in detail. Section 3 summarizes the data
sets, the experiments performed, their results and observations. Section 4 concludes the paper with the main
conclusions and directions for further research.
2 Fuzzy Unordered Rule Induction Algorithm (FURIA)
This section describes the FURIA algorithm in detail. A study of other fuzzy technologies that can be used for
bioinformatics data classification is available in [1]. FURIA was proposed by Hühn and Hüllermeier in 2009 [2]. It
builds on a modified version of the RIPPER algorithm [3]. A simplified scheme is shown in Fig. 1. FURIA learns fuzzy
rules and an unordered rule set. The algorithm induces rules for each class separately using the “one class – other
classes” dividing strategy.
Fig. 1. FURIA (modified RIPPER algorithm)
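The “one class – other classes” strategy can be illustrated with a minimal Python sketch. The function name
`one_vs_rest_tasks` and the sample labels are hypothetical; the sketch only shows how a multi-class label vector is
turned into one binary learning task per class, and the rule learner itself is not shown.

```python
def one_vs_rest_tasks(labels):
    """Build one binary task per class: 1 for the target class, 0 for all other classes."""
    tasks = {}
    for target in sorted(set(labels)):
        tasks[target] = [1 if label == target else 0 for label in labels]
    return tasks


# Hypothetical label vector with three classes.
labels = ["BrCa", "HD", "PrCa", "HD"]
for target, binary in one_vs_rest_tasks(labels).items():
    print(target, binary)
```

Because every class gets its own task, the order in which the classes are processed does not affect the learned rule
sets.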
When rules are learned for one class, the remaining classes are not distinguished from one another. As a result, the
rule set has no main (default) rule and the order of the classes in the training process is irrelevant. However, this
approach also has a shortcoming: if a record is covered equally by rules of two classes, a certainty factor has to be
calculated. The main improvements over the RIPPER algorithm concern the pruning modifications (see Fig. 1, building
phase). However, the main strength of the algorithm is the rule stretching method, which addresses the pressing
problem that new records to be classified may lie outside the space covered by the previously induced rules. The
representation of fuzzy rules is also more advanced: essentially, a fuzzy rule is obtained by replacing intervals with
fuzzy intervals, namely fuzzy sets with a trapezoidal membership function [2]. A fuzzy interval of this kind is
specified by four parameters and is written as
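$I^F = (\phi^{s,L}, \phi^{c,L}, \phi^{c,U}, \phi^{s,U})$:

\[
I^F(v) =
\begin{cases}
1 & \phi^{c,L} \le v \le \phi^{c,U} \\[4pt]
\dfrac{v - \phi^{s,L}}{\phi^{c,L} - \phi^{s,L}} & \phi^{s,L} < v < \phi^{c,L} \\[8pt]
\dfrac{\phi^{s,U} - v}{\phi^{s,U} - \phi^{c,U}} & \phi^{c,U} < v < \phi^{s,U} \\[8pt]
0 & \text{else}
\end{cases}
\tag{1}
\]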
$\phi^{c,L}$ and $\phi^{c,U}$ are the lower and upper bounds of the core of the fuzzy set; likewise, $\phi^{s,L}$ and
$\phi^{s,U}$ are the lower and upper bounds of the support [2]. The authors propose that antecedents are learned as a
list $\alpha_1, \alpha_2, \ldots, \alpha_m$. The idea is that the ordering reflects the importance of the antecedents,
an assumption that is justified by the underlying rule learning algorithm. Generalized rules are re-evaluated using
the measure

\[
\frac{p+1}{p+n+2} \times \frac{k+1}{m+2},
\tag{2}
\]

where p is the number of positive examples covered by the rule, n is the number of negative examples covered by the
rule, m is the size of the entire antecedent list and k is the number of antecedents kept in the generalized rule [2].
The FURIA rules are output in the Weka tool in the following form:
(1047 in [0, 0.797186,inf, inf]) and (993 in [0, 0.68965, inf, inf]) => Category = HD (CF = 0.98),
where 1047 and 993 are attribute names; the intervals between 0 and 0.797186 and between 0 and 0.68965 are the
stretched parts of the rule. The values inf and −inf mark a side of the interval that is unbounded. CF indicates the
confidence factor of the rule [2].
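The membership function (1), the evaluation of a rule such as the one above, and the measure (2) can be sketched in a
few lines of Python. This is a minimal illustration, not the Weka implementation. It assumes that the four numbers in
the Weka output are the parameters $(\phi^{s,L}, \phi^{c,L}, \phi^{c,U}, \phi^{s,U})$ of (1) and that the degree to
which a rule covers a record is the product of its antecedent memberships. The function names and the sample record
are hypothetical.

```python
import math


def trapezoid(v, s_l, c_l, c_u, s_u):
    """Membership of v in the fuzzy interval (s_l, c_l, c_u, s_u), following Eq. (1)."""
    if c_l <= v <= c_u:
        return 1.0  # inside the core
    if s_l < v < c_l:
        return (v - s_l) / (c_l - s_l)  # rising edge of the support
    if c_u < v < s_u:
        return (s_u - v) / (s_u - c_u)  # falling edge of the support
    return 0.0  # outside the support


def rule_membership(record, antecedents):
    """Degree to which a record is covered by a rule (assumed: product of memberships)."""
    degree = 1.0
    for attribute, interval in antecedents.items():
        degree *= trapezoid(record[attribute], *interval)
    return degree


def stretch_score(p, n, k, m):
    """Measure (2) for a generalized rule: p, n covered positives/negatives, k of m antecedents kept."""
    return (p + 1) / (p + n + 2) * (k + 1) / (m + 2)


# The rule shown above for class HD; assumes finite support bounds wherever core and support differ.
rule_hd = {
    "1047": (0.0, 0.797186, math.inf, math.inf),
    "993": (0.0, 0.68965, math.inf, math.inf),
}
record = {"1047": 0.9, "993": 0.9}  # hypothetical record inside both cores
print(rule_membership(record, rule_hd))
```

A record whose values fall between 0 and the core bound is covered only partially, which is what distinguishes the
fuzzy rule from a crisp interval rule.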
3 Practical experiments
This section explains the data sets used in the study, depicts their structure and describes the experiments carried out
and the obtained results.
3.1 Used data set description
The main data used in this study were obtained during the Latvian – Belarusian cooperation project; the data
characterize cancer and gastric intestinal disease and are provided by the Latvian Biomedical Research and Study
Centre. More information about the biological meaning of the data set can be found in [4]. The data set is described
by class in Table 1. The summary shows that the data set holds five classes: four diseases (breast, gastric and
prostate cancer, and gastric intestinal disease) and a fifth class of healthy donors. Because the number of records
varies between classes, the best choice is to divide the data into pairs consisting of one disease and the healthy
donor data.
Table 1. Data set description
Class name | Instances
Breast cancer (BrCa) | 13
Gastric cancer (GaCa) | 173
Prostate cancer (PrCa) | 52
Healthy donor (HD) | 155
Gastric intestinal disease (GIS) | 126
Attributes: 1229
Data format: Normalized originally, continuous data
The visualization of the data set with breast cancer and prostate cancer classes merged into one class labelled cancer
(CA) and the healthy donor data for two random attributes is shown in Fig. 2.
[Figure: scatter plot of attribute 13 (vertical axis) against attribute 1229 (horizontal axis) for classes HD and CA]
Fig. 2. Cancer data representation using two attributes
As can be seen from Fig. 2, the two classes are not easily separable using the combination of these two attributes;
therefore this combination cannot provide a rule with a large confidence factor.
To test the obtained classification results, the experiment plan is extended with experiments on data sets
frequently used in the literature. Summarized information about these data sets is given in Table 2.
Table 2. Data sets description
Data set | Number of attributes | Data type | Number of examples | Classes
Golub leukemia [5] | 5147 | Numerical | 72 | 2 (ALL-47, AML-25)
Childhood tumors [6] | 2308 | Numerical | 83 | 4 (EWS-29, BL-11, NB-18, RMS-25)
Prostate [7] | 12533 | Numerical | 102 | 2 (Normal-50, Tumor-52)
Lymphoma [8] | 7070 | Numerical | 77 | 2 (DLBCL-58, FL-19)
Lung [9] | 12600 | Numerical | 203 | 5 (AD-139, NL-17, SMCL-6, SQ-21, COID-20)
ML leukemia [10] | 12533 | Numerical | 72 | 3 (ALL-24, MLL-20, AML-28)
The number of classes in these data sets ranges from 2 to 5. The number of attributes also differs significantly,
ranging from 2308 to 12600, and the number of records varies between 72 and 203. Table 2 also shows that the number
of records differs between classes, so the domination of one class could be a considerable problem; nevertheless, no
records were left out because the number of records is already small.
3.2 Classification results with BMC data
Taking into account the number of classes and the number of records in each class, the data are divided into three
subsets: Gastric cancer and healthy donor data; Gastric intestinal disease and healthy donor data; and, considering
the small number of records in the Breast cancer and Prostate cancer data sets (13 and 52 records respectively), a
merge of both data sets together with the healthy donor data set for comparison.
An experiment plan was made for the following data sets: Gastric cancer and healthy donor; Gastric intestinal
disease and healthy donor; Breast cancer, Prostate cancer and healthy donor; all cancer types and healthy donor; all
cancer types merged into one class (cancer) with healthy donor as the second class. All described experiments were
carried out using the data mining tool Weka [11]. Since data normalization is a standard preprocessing stage in data
mining, comparative experiments were executed with values normalized to the interval [0,1] and with missing values
replaced by the mean value of the corresponding attribute within the same class. An overview of the performed
experiments is given in Table 3.
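The two preprocessing steps described above can be sketched in plain Python. This is a minimal illustration and not
the Weka filters used in the experiments; the function names, the use of `None` for a missing value and the order of
the steps (imputation first, normalization second) are assumptions made for the example.

```python
def impute_class_mean(rows, labels):
    """Replace each missing value (None) with the attribute mean of records from the same class."""
    n_attrs = len(rows[0])
    sums, counts = {}, {}
    for row, label in zip(rows, labels):
        s = sums.setdefault(label, [0.0] * n_attrs)
        c = counts.setdefault(label, [0] * n_attrs)
        for j, value in enumerate(row):
            if value is not None:
                s[j] += value
                c[j] += 1
    filled = []
    for row, label in zip(rows, labels):
        new_row = []
        for j, value in enumerate(row):
            # Leave the gap if the class has no observed value for this attribute.
            if value is None and counts[label][j] > 0:
                value = sums[label][j] / counts[label][j]
            new_row.append(value)
        filled.append(new_row)
    return filled


def min_max_normalize(rows):
    """Scale every attribute to the interval [0, 1]; constant attributes are mapped to 0."""
    out = [list(row) for row in rows]
    for j in range(len(rows[0])):
        observed = [row[j] for row in rows if row[j] is not None]
        if not observed:
            continue
        low, span = min(observed), max(observed) - min(observed)
        for row in out:
            if row[j] is not None:
                row[j] = (row[j] - low) / span if span > 0 else 0.0
    return out


# Hypothetical records with two attributes and one missing value per class.
rows = [[1.0, None], [3.0, 4.0], [None, 10.0], [5.0, 6.0]]
labels = ["HD", "HD", "CA", "CA"]
print(min_max_normalize(impute_class_mean(rows, labels)))
```

Because the fill value depends on the class label, computing the class means on the full data set before
cross-validation lets the labels of test records influence the data; computing them on the training folds only would
avoid this.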
Table 3. Experiments with cancer data sets
N. | Data set name | Number of rules | Time in seconds¹ | Accuracy | Error | Comments
Algorithm FURIA, 10-fold cross-validation
1 | Gastric cancer and healthy donor | 18 | 20.97 | 62% | 38% | original used data set (with missing values)
2 | Gastric cancer and healthy donor | 6 | 7.1 | 97% | 3% | normalized in interval [0,1], missing values replaced with average values in the same class
3 | Gastric cancer and healthy donor | 18 | 20.95 | 61% | 39% | normalized in interval [0,1] with missing values
4 | Gastric intestinal disease and healthy donor | 18 | 17.89 | 55% | 45% | original used data set (with missing values)
5 | Gastric intestinal disease and healthy donor | 5 | 5.36 | 95% | 5% | normalized in interval [0,1], missing values replaced with average values in the same class
6 | Gastric intestinal disease and healthy donor | 16 | 12.88 | 58% | 42% | normalized in interval [0,1] with missing values
7 | Breast cancer, Prostate cancer and healthy donor | 8 | 5.39 | 90% | 10% | original used data set (with missing values)
8 | Breast cancer, Prostate cancer and healthy donor | 7 | 2.94 | 79% | 21% | used 76 healthy donor samples
9 | Breast cancer, Prostate cancer and healthy donor | 7 | 3.17 | 84% | 16% | both cancer types merged into one class (CA)
10 | Breast cancer, Prostate cancer, Gastric cancer and healthy donor | 18 | 38.49 | 52% | 48% | original used data set (with missing values)
11 | All cancer types (BrCa, GaCa, PrCa) - (CA) and healthy donor (HD) | 17 | 27.69 | 62% | 38% | all cancer types merged into one class; original used data set (with missing values)
12 | All cancer types (BrCa, GaCa, PrCa), GIS and healthy donor (HD) | 34 | 97.16 | 35% | 64% | original used data set (with missing values)
13 | All cancer types (BrCa, GaCa, PrCa) and GIS | 26 | 36.22 | 52% | 48% | original used data set (with missing values)
Algorithm FLR, 10-fold cross-validation
14 | Gastric cancer and healthy donor | 7 | 1.06 | 58% | 42% | original used data set (with missing values)
15 | Gastric intestinal disease and healthy donor | 7 | 0.56 | 54% | 46% | original used data set (with missing values)
16 | Breast cancer, Prostate cancer and healthy donor | 5 | 0.59 | 79% | 21% | used 76 healthy donor samples
17 | Breast cancer, Prostate cancer and healthy donor | 5 | 0.56 | 77% | 23% | both cancer types merged into one class (CA)
18 | All cancer types merged into one class - cancer (CA) and healthy donor (HD) | 3 | 0.67 | 62% | 38% | all cancer types merged into one class; original used data set (with missing values)
Comparison of the results outlined in Table 3 shows that normalization in the interval [0,1] alone does not
significantly improve the results and in some cases even worsens them (see experiments No. 1, 3, 4 and 6). A relevant
increase in classifier accuracy was achieved by replacing the missing values with the mean values of the class
(experiments No. 2 and 5) compared to the results obtained on the analogous data sets with missing values
(experiments 3 and 6); the accuracy improved by roughly 36 percentage points. Good classification results were
attained (experiment 7) using the breast and prostate cancer data set, but this experiment used the original size of
all data sets (BrCa-13, PrCa-52 and HD-155), so the last class dominated the data. To balance the classes, the number
of records belonging to class HD was randomly decreased to 76 records; the new results (experiment 8) were slightly
worse, which is to be expected once the HD class no longer dominates. Then the results were tested by merging both
cancer classes of experiment 8 into one class. The results improved (see experiment 9), showing that distinguishing
between a healthy individual and a cancer patient is easier than differentiating between the types of cancer that
affect the individual. Experiment 11 does not confirm these results, but there is a valid reason: the cancer class
dominated among the records of the used data set. Another conclusion can be drawn from the results: the types of
cancer have a lot in common, because the classification accuracy is better for the cancer sets than for all diseases
together (experiments 11 and 12). To verify how well the different diseases can be classified without the healthy
donor class for comparison, experiment 13 was conducted. It gave a result identical to experiment 10; therefore it can
be concluded that inclusion or removal of healthy donor data does not change the classification accuracy. The
accuracy and error obtained in each experiment are shown in Fig. 3. This type of diagram makes it easier to evaluate
the classification accuracy and error visually. The highest accuracy (smallest error) is achieved in experiments 2,
5, 7 and 9.

¹ Time taken to build model in seconds
[Figure: bar chart of accuracy and error (0% to 100%) for experiments 1 to 18; horizontal axis: number of experiment]
Fig. 3. Classification accuracy and error
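The balancing step used for experiment 8, in which the healthy donor class was randomly reduced to 76 records, can be
sketched as follows. The function name `undersample`, the fixed random seed and the placeholder records are
assumptions for illustration; the records actually selected in the study are not known.

```python
import random
from collections import Counter


def undersample(records, labels, target_class, keep, seed=0):
    """Randomly keep only `keep` records of `target_class`; all other classes stay untouched."""
    rng = random.Random(seed)
    target_indices = [i for i, label in enumerate(labels) if label == target_class]
    if keep > len(target_indices):
        raise ValueError("cannot keep more records than the class contains")
    kept = set(rng.sample(target_indices, keep))
    selected = [i for i, label in enumerate(labels) if label != target_class or i in kept]
    return [records[i] for i in selected], [labels[i] for i in selected]


# Class sizes from Table 1; the records themselves are placeholders.
labels = ["BrCa"] * 13 + ["PrCa"] * 52 + ["HD"] * 155
records = list(range(len(labels)))
balanced_records, balanced_labels = undersample(records, labels, "HD", keep=76)

print(Counter(labels)["HD"] / len(labels))                    # share of HD before balancing
print(Counter(balanced_labels)["HD"] / len(balanced_labels))  # share of HD after balancing
```

With the original sizes the HD class makes up 155 of 220 records, so a classifier that always predicts HD would
already be right for about 70% of the records; after the reduction the share falls to 76 of 141.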
To test the results achieved by the FURIA algorithm, five complementary experiments were performed using the FLR
algorithm [12] (see experiments 14-18). As can be seen, the results differ only slightly, and FLR can even be
considered better because it induces fewer rules in each experiment, which means faster classification. However,
FURIA would probably be superior when classifying new records, because rule stretching supports the classification of
new, previously unseen records and thus ensures greater classification accuracy.
3.3 Classification results with most popular data sets
To test the classification results obtained with the BMC data, a series of experiments was conducted using
bioinformatics data sets frequently used in the related literature, so that the obtained classification results
could be compared. All data sets were normalized using the normalization tool built into the WEKA software. The
summarized classification results using 10-fold cross-validation for all six data sets and five different
classification methods are shown in Table 4.
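The 10-fold cross-validation protocol behind the reported accuracy and error can be sketched as follows. This is a
generic illustration, not the Weka implementation: it assumes stratified folds, and the names `stratified_folds`,
`cross_validated_accuracy` and `majority_learner` are hypothetical. The majority-class learner only stands in for a
real rule learner such as FURIA.

```python
import random
from collections import Counter, defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Split record indices into k folds with roughly equal class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    position = 0
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        for index in indices:  # deal the records of each class round-robin over the folds
            folds[position % k].append(index)
            position += 1
    return folds


def cross_validated_accuracy(records, labels, train_fn, k=10, seed=0):
    """Train on k-1 folds, test on the held-out fold, and pool the correct predictions."""
    correct = 0
    for test_indices in stratified_folds(labels, k, seed):
        held_out = set(test_indices)
        train_x = [r for i, r in enumerate(records) if i not in held_out]
        train_y = [y for i, y in enumerate(labels) if i not in held_out]
        predict = train_fn(train_x, train_y)
        correct += sum(predict(records[i]) == labels[i] for i in test_indices)
    return correct / len(labels)


def majority_learner(train_x, train_y):
    """Stand-in learner: always predicts the most frequent training class."""
    most_common = Counter(train_y).most_common(1)[0][0]
    return lambda record: most_common


# Class sizes of the Golub leukemia data set from Table 2; records are placeholders.
labels = ["ALL"] * 47 + ["AML"] * 25
records = [[0.0] for _ in labels]
accuracy = cross_validated_accuracy(records, labels, majority_learner)
print(accuracy, 1 - accuracy)  # accuracy and error of the majority baseline
```

The error is simply one minus the accuracy, and the majority baseline gives a reference point against which the
accuracy of a learned rule set can be read.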
Table 4. Experiments with most popular data sets
Data set name | Leukemia | Lymphoma | Lung tumor | ML leukemia | Prostate | Childhood tumors
Algorithm/author: FURIA [2]
Number of rules | 3 | 4 | 9 | 4 | 6 | 5
Time in seconds | 3.68 | 6.67 | 68.14 | 5.33 | 23.02 | 3.8
Accuracy | 83% | 77% | 90% | 81% | 81% | 92%
Error | 17% | 23% | 10% | 19% | 19% | 8%
Number of used genes | 5147 | 7070 | 12600 | 12533 | 12533 | 2308
Algorithm/author: Ho et al. [13]
Number of rules | 4 | 3 | 6 | 3 | 3 | 5
Accuracy | 94% | 91% | 88% | 85% | 91% | 92%
Error | 6% | 9% | 12% | 15% | 9% | 8%
Number of used genes | 717 | 483 | 1185 | 717 | 483 | 951
Algorithm/author: see last row
Data set name | Leukemia | Lymphoma | Lung tumor | ML leukemia | Prostate | Childhood tumors
Number of rules | 2 | - | 21 | - | - | 8
Accuracy | 79% | 94% | 99% | 91% | 92% | 76%
Error | 21% | 6% | 1% | 8% | 8% | 24%
Number of used genes | 2 | 8 | 200 | 8 | 8 | 8
Algorithm/author | Ohno-Machado et al. [14] | Mramor et al. [15] | Vinterbo et al. [16] | Mramor et al. [15] | Mramor et al. [15] | Ohno-Machado et al. [14]
Table 4 shows that two classification methods (FURIA and the method of Ho et al.) have results for all six data sets,
whereas the last block compiles the results of three different methods. Comparing FURIA and the method of Ho et al.
by the number of induced rules, it is evident that FURIA, generally using a larger number of rules, achieves an
identical or slightly worse result. However, it should be noted that the experiments with FURIA used the full data
sets, whereas Ho et al. applied feature selection and used attribute sets that were up to 25 times smaller.
Considering both the data used and the results obtained, the results for the Leukemia and Lung tumor data sets are in
line with the number of rules and the number of attributes used in classification; here FURIA shows an average
result. For the Lymphoma, ML leukemia and Prostate data sets, the classification results of FURIA are worse than the
accuracies of the other methods, but this is partially due to the aforementioned feature selection. For the Childhood
tumors data set, the results of FURIA are identical to those of the method of Ho et al., while the third method
demonstrates a worse result.
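The statement that the attribute sets of Ho et al. were up to 25 times smaller can be checked against the "Number of
used genes" rows of Table 4. The short calculation below assumes that the values of both methods are listed in the
same data set order; the variable names are illustrative.

```python
# Number of used genes per data set, taken from Table 4.
full = {
    "Leukemia": 5147, "Lymphoma": 7070, "Lung tumor": 12600,
    "ML leukemia": 12533, "Prostate": 12533, "Childhood tumors": 2308,
}
selected = {
    "Leukemia": 717, "Lymphoma": 483, "Lung tumor": 1185,
    "ML leukemia": 717, "Prostate": 483, "Childhood tumors": 951,
}

# How many times smaller the attribute set of Ho et al. is than the full data set used by FURIA.
reduction = {name: full[name] / selected[name] for name in full}
largest = max(reduction, key=reduction.get)

for name, factor in reduction.items():
    print(f"{name}: {factor:.1f}x")
print("largest reduction:", largest)
```

Under this pairing the largest reduction is 12533 / 483, just over 25, for the Prostate data set, and the smallest is
2308 / 951, about 2.4, for the Childhood tumors data set, where both methods reach the same accuracy.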
The acquired classification accuracy using six popular data sets and three methods is depicted in Fig.4. It is evident
that the FURIA method analyzed in this study achieves competitive results and the attained classification accuracy in all
experiments is above 75%.
[Figure: bar chart of classification accuracy (0% to 100%) of Ho et al., FURIA and other methods for the Leukemia,
Lymphoma, Lung tumor, ML leukemia, Prostate and Childhood tumors data sets]
Fig. 4. Classification accuracy
4 Conclusions
The theoretical part of the study analyzes the method proposed by Hühn et al. and gives a short overview of the FURIA
algorithm.
The practical part of this study comprised comparative experiments using the previously described disease and
healthy donor data sets. It can be concluded that:
• Data normalization in interval [0,1] does not give a significant improvement of classification results;
• Classification accuracy can be increased by replacing the missing values with the mean attribute value of the
same class;
• No class should be dominant if adequate results are to be obtained; the classes should have approximately equal
numbers of records;
• Better results can be achieved in a data set where all types of cancer are merged into one class ‘cancer’, i.e.,
it is easier to distinguish between cancer patients and healthy donors when the type of cancer is disregarded.
The classification results obtained by using data sets that are frequently employed in the corresponding literature
suggest that the FURIA algorithm is competitive and has good classification accuracy. To evaluate classification
accuracy, it is crucial to consider the number of used attributes and the number of induced rules that classify the data
set. The rule-stretching feature of FURIA could provide a significant advantage for data sets in which records that
differ from those used in training need to be classified.
Possible directions of future work include continuing the research on the FURIA algorithm to determine the results
that can be achieved with other frequently used bioinformatics data sets. It is also planned to study other fuzzy
techniques and algorithms that could be used in bioinformatics, in particular for the classification of the cancer,
healthy donor and gastric intestinal disease data used in this study, and to apply feature selection and evaluate
its impact on the classification results.
Acknowledgements: This work has been supported by the European Social Fund within the project «Support for the
implementation of doctoral studies at Riga Technical University». This work has been developed in LATVIA –
BELORUS Co-operation programme in Science and Engineering within the project «Development of a complex of
intelligent methods and medical and biological data processing algorithms for oncology disease diagnostics
improvement», Scientific Cooperation Project No. L7631. Thanks to Dr.habil.sc.comp. Professor Arkady Borisov (Riga
Technical University) for help and support.
References:
[1] Gasparoviča M., Novoselova N., Aleksejeva L., Using Fuzzy Logic to Solve Bioinformatics Tasks, Proceedings of
Riga Technical University. Issue 5, Computer Science. Information Technology and Management Science, Vol.44,
2010, pp.99-105.
[2] Hühn J., Hüllermeier, E., FURIA: An Algorithm for Unordered Fuzzy Rule Induction, Data Mining and
Knowledge Discovery, Vol.19, No.3, 2009, pp.293-319.
[3] Cohen W. W., Fast Effective Rule Induction. Proceedings of the Twelfth International Conference on Machine
Learning, 1995, pp. 115-123.
[4] Kalniņa Z., et al., Evaluation of T7 and Lambda Phage Display Systems for Survey of Autoantibody Profiles in
Cancer Patients, Journal of Immunological Methods, Vol. 334, No.1-2, 2008, pp.37-50.
[5] Golub T.R., Slonim D.K. et.al., Molecular Classification of Cancer: Class Discovery and Class prediction by Gene
Expression Monitoring, Science., Vol. 286, 1999, pp. 531-537.
[6] Khan J. et al., Classification and Diagnostic Prediction of Cancers Using Gene Expression Profiling and Artificial
Neural Networks, Nature Medicine, Vol. 7, No. 6, June 2001, pp.673-679.
[7] Singh D. et al., Gene Expression Correlates of Clinical Prostate Cancer Behavior, Cancer Cell, Vol. 1, No. 2,
March 2002, pp.203-209.
[8] Shipp M.A., et al., Diffuse Large B-Cell Lymphoma Outcome Prediction by Gene-Expression Profiling and
Supervised Machine Learning. Nat Med., Vol.8, No.1, January 2002, pp. 68-74.
[9] Bhattacharjee A., et al., Classification of Human Lung Carcinomas by MRNA Expression Profiling Reveals
Distinct Adenocarcinoma Subclasses, Proc. Natl. Acad. Sci. U.S.A. Vol. 98, No.24, 2001, pp.13790–13795.
[10] Armstrong A.S. et al., MLL Translocations Specify a Distinct Gene Expression Profile that Distinguishes a
Unique Leukemia, Nature Genetics, Vol.30, 2001, pp. 41 – 47.
[11] Hall M., Frank E., Holmes G., Pfahringer B., Reutemann P., Witten I.H., The WEKA Data Mining Software: An
Update; SIGKDD Explorations, Vol. 11, No 1, 2009, pp.10-18.
[12] Kaburlasos V. G., Athanasiadis I. N., Mitkas P.A., Fuzzy Lattice Reasoning (FLR) Classifier and Its Application
for Ambient Ozone Estimation, International Journal of Approximate Reasoning, Vol.45, No.1, 2007, pp.152-188.
[13] Ho S.-Y., Hsieh C.-H., Chen H.-M., Huang H.-L., Interpretable Gene Expression Classifier With an Accurate and
Compact Fuzzy Rule Base for Microarray Data Analysis, BioSystems, Vol. 85, 2006, pp.165-176.
[14] Ohno – Machado L., Vinterbo S., Weber G., Classification of Gene Expression Data Using Fuzzy Logic, J. Intell.
Fuzzy Syst., Vol. 12, 2002, pp. 19-24.
[15] Mramor M., Leban G., Demšar J., Zupan B., Visualization-based Cancer Microarray Data Classification Analysis,
Bioinformatics, Vol. 23, No.16, 2007, pp. 2147-2154.
[16] Vinterbo S.A., Kim E.-Y., Ohno – Machado L., Small, Fuzzy and Interpretable Gene Expression Based
Classifiers, Bioinformatics, Vol. 21, No. 9, 2005, pp. 1964-1970.