USING FUZZY UNORDERED RULE INDUCTION ALGORITHM FOR CANCER DATA CLASSIFICATION

Madara Gasparovica and Ludmila Aleksejeva
Riga Technical University, Institute of Information Technology, Department of Modelling and Simulation
1 Kalku Street, Riga, LV-1658, Latvia

Abstract: This paper studies the use of fuzzy logic in the analysis and classification of bioinformatics data. The specific character of bioinformatics data, a large number of attributes combined with a small number of records, calls for methods that can process such data and induce comprehensible, easily interpretable IF-THEN classification rules. Applied experiments were performed on Breast cancer, Prostate cancer, Gastric cancer, Gastric intestinal disease and Healthy donor data sets using the Fuzzy Unordered Rule Induction Algorithm (FURIA). The study also describes how the FURIA algorithm works. To test the classification results obtained on these data sets, additional experiments were carried out using bioinformatics data sets frequently used in the literature. The paper also draws conclusions about which parameters influence the classification results (by either increasing or decreasing the accuracy) and which are neutral to them.

Keywords: Fuzzy Logic; Bioinformatics data; Cancer; Fuzzy IF-THEN rules; Classification; FURIA.

1 Introduction

Fuzzy algorithms are among the most popular methods because of their positive features: classification rules that are easy for humans to understand, and the ability to process linguistic data and to incorporate expert opinion. Moreover, they can be used for more than classification. Fuzzy algorithms also have great potential in processing bioinformatics data with its specific character: a large number of attributes (genes, antibodies and proteins) and a small number of records (patients). This ratio is to be expected because biological tests are very expensive and time-consuming.
This study analyzes the Fuzzy Unordered Rule Induction Algorithm (FURIA), which can be used for classification of bioinformatics data, describes its main working principles and draws conclusions about its application. The applied experiments were carried out using Breast cancer, Prostate cancer, Gastric cancer, Gastric intestinal disease and Healthy donor data sets from the Latvian Biomedical Research and Study Centre. These data are unique because they were recently obtained using a novel technique at the centre, which is the only one in Latvia that obtains and processes such data. Finding diagnostic patterns (searching for classification potential) with data mining methods is therefore promising.

This paper is organized as follows. Section 2 describes the FURIA algorithm in detail. Section 3 summarizes the data sets, the experiments performed, their results and observations. Section 4 concludes the paper, giving the main conclusions and prospective directions for further research.

2 Fuzzy Unordered Rule Induction Algorithm (FURIA)

This section describes the FURIA algorithm in detail. A study of other fuzzy technologies that can be used in bioinformatics data classification is available in [1]. The algorithm was proposed by Hühn and Hüllermeier in 2009 [2]. It improves on the RIPPER algorithm [3], using a modified version of RIPPER as its basis. A simplified scheme is shown in Fig. 1. FURIA learns fuzzy rules and an unordered rule set. The algorithm induces rules for each class separately using the “one class – other classes” dividing strategy.

Fig. 1. FURIA (modified RIPPER algorithm)

When the classifier is trained for one class, the other classes are not considered separately. As a result, there is no single main rule, and the order of the classes in the training process is irrelevant.
However, this approach also has its shortcomings: if a record is covered equally by rules of two classes, a certainty factor has to be calculated. The main improvements over the RIPPER algorithm concern the pruning modifications (see Fig. 1, building phase). The main strength of the algorithm, however, is the rule stretching method, which addresses the problem that new records to be classified may lie outside the space covered by the previously induced rules.

The representation of fuzzy rules is also more advanced. Essentially, a fuzzy rule is obtained by replacing intervals with fuzzy intervals, namely fuzzy sets with a trapezoidal membership function [2]. A fuzzy interval of this kind is specified by four parameters and is written as $I^F = (\phi^{s,L}, \phi^{c,L}, \phi^{c,U}, \phi^{s,U})$:

$$
I^F(v) =
\begin{cases}
1, & \phi^{c,L} \le v \le \phi^{c,U} \\
\dfrac{v - \phi^{s,L}}{\phi^{c,L} - \phi^{s,L}}, & \phi^{s,L} < v < \phi^{c,L} \\
\dfrac{\phi^{s,U} - v}{\phi^{s,U} - \phi^{c,U}}, & \phi^{c,U} < v < \phi^{s,U} \\
0, & \text{else}
\end{cases}
\tag{1}
$$

Here $\phi^{c,L}$ and $\phi^{c,U}$ are the lower and upper bounds of the core of the fuzzy set; likewise, $\phi^{s,L}$ and $\phi^{s,U}$ are the lower and upper bounds of the support [2].

The authors propose that antecedents are learned as a list $\alpha_1, \alpha_2, \ldots, \alpha_m$. The idea is that the ordering reflects the importance of the antecedents, an assumption that is clearly justified by the underlying rule learning algorithm. Generalized rules can be re-evaluated using the measure

$$
\frac{p+1}{p+n+2} \times \frac{k+1}{m+2},
\tag{2}
$$

where p is the number of positive examples covered by the rule and n is the number of negative examples covered by the rule [2]; m is the number of antecedents in the original rule (the length of the list above) and k is the number of antecedents retained in the generalized rule.

FURIA rules are output by the Weka tool in the following form:

(1047 in [0, 0.797186, inf, inf]) and (993 in [0, 0.68965, inf, inf]) => Category = HD (CF = 0.98)

where 1047 and 993 are attribute names; the intervals between 0 and 0.797186 and between 0 and 0.68965 are the stretched parts of the rule. The values inf and -inf indicate that the interval is unbounded on that side. CF indicates the confidence factor of the rule [2].
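The membership function (1) and the measure (2) can be illustrated with a minimal Python sketch. It rests on several assumptions that are not stated in the text: the four numbers in the Weka output are read in the order of the fuzzy interval (support lower bound, core lower bound, core upper bound, support upper bound); the antecedents of a rule are combined with the product; k and m are read as the antecedent counts of the generalized and the original rule. The function names and the two example records are hypothetical.

```python
import math
from typing import Dict, Tuple

# (phi_sL, phi_cL, phi_cU, phi_sU): support lower, core lower, core upper, support upper
FuzzyInterval = Tuple[float, float, float, float]


def trapezoid_membership(v: float, interval: FuzzyInterval) -> float:
    """Membership degree I^F(v) of equation (1)."""
    s_low, c_low, c_up, s_up = interval
    if c_low <= v <= c_up:            # inside the core
        return 1.0
    if s_low < v < c_low:             # rising edge of the trapezoid
        return (v - s_low) / (c_low - s_low)
    if c_up < v < s_up:               # falling edge of the trapezoid
        return (s_up - v) / (s_up - c_up)
    return 0.0                        # outside the support


def rule_membership(record: Dict[str, float],
                    antecedents: Dict[str, FuzzyInterval]) -> float:
    """Degree to which a record satisfies all antecedents of a rule.

    Assumption: the antecedent degrees are combined with the product.
    """
    degree = 1.0
    for attribute, interval in antecedents.items():
        degree *= trapezoid_membership(record[attribute], interval)
    return degree


def generalization_measure(p: int, n: int, k: int, m: int) -> float:
    """Measure (2): (p + 1) / (p + n + 2) * (k + 1) / (m + 2)."""
    return (p + 1) / (p + n + 2) * (k + 1) / (m + 2)


# The rule from the Weka output, read as two fuzzy intervals (assumed order).
hd_rule = {
    "1047": (0.0, 0.797186, math.inf, math.inf),
    "993": (0.0, 0.68965, math.inf, math.inf),
}

# Hypothetical records, not taken from the data set.
clear_case = {"1047": 0.90, "993": 0.80}   # both values lie in the cores
border_case = {"1047": 0.40, "993": 0.80}  # attribute 1047 lies on the rising edge

print(rule_membership(clear_case, hd_rule))
print(rule_membership(border_case, hd_rule))
print(generalization_measure(p=8, n=0, k=1, m=2))
```

Under this reading, a record whose values fall into the cores of both intervals satisfies the rule fully, while a value between the support bound and the core bound satisfies it only partially. The second factor of (2) penalizes generalizations that keep few of the original antecedents.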
3 Practical experiments

This section explains the data sets used in the study, depicts their structure and describes the experiments carried out and the results obtained.

3.1 Used data set description

The main data used in this study were obtained during the Latvian – Belarusian cooperation project; the data characterize cancer and gastric intestinal disease and were provided by the Latvian Biomedical Research and Study Centre. More information about the biological meaning of the data set can be found in [4]. The data set description by classes is given in Table 1. The summary shows that the data set holds five classes: four of them are different diseases (breast, gastric and prostate cancer, and gastric intestinal disease) and the fifth class indicates healthy donors. The specifics of the data set (the varying number of records for each class) indicate that the best choice is to divide the classes into pairs of one disease and the healthy donor data set.

Table 1. Data set description

| Class name | Instances |
|---|---|
| Breast cancer (BrCa) | 13 |
| Gastric cancer (GaCa) | 173 |
| Prostate cancer (PrCa) | 52 |
| Healthy donor (HD) | 155 |
| Gastric intestinal disease (GIS) | 126 |
| Attributes | 1229 |
| Data format | Normalized originally, continuous data |

The visualization of the data set with the breast cancer and prostate cancer classes merged into one class labelled cancer (CA), together with the healthy donor data, for two random attributes is shown in Fig. 2.

[Fig. 2. Cancer data representation using two attributes (scatter plot of attribute 13 and attribute 1229; legend: HD, CA)]

As can be seen from Fig. 2, the two classes are not easily separable using the combination of these two attributes; therefore this combination cannot provide a rule with a large confidence factor.

To test the obtained classification results, the experiment plan is extended with experiments on data sets frequently used in the literature. Summarized information about these data sets is given in Table 2.

Table 2.
Data sets description

| Data set | Number of attributes | Data type | Number of examples | Classes |
|---|---|---|---|---|
| Golub leukemia [5] | 5147 | Numerical | 72 | 2 (ALL-47, AML-25) |
| Childhood tumors [6] | 2308 | Numerical | 83 | 4 (EWS-29, BL-11, NB-18, RMS-25) |
| Prostate [7] | 12533 | Numerical | 102 | 2 (Normal-50, Tumor-52) |
| Lymphoma [8] | 7070 | Numerical | 77 | 2 (DLBCL-58, FL-19) |
| Lung [9] | 12600 | Numerical | 203 | 5 (AD-139, NL-17, SMCL-6, SQ-21, COID-20) |
| ML leukemia [10] | 12533 | Numerical | 72 | 3 (ALL-24, MLL-20, AML-28) |

The number of classes in these data sets ranges from 2 to 5. The number of attributes also differs significantly, ranging from 2308 to 12600, and the number of records varies between 72 and 203. Table 2 also shows that the number of records differs between classes, so class domination could be a considerable problem; nevertheless, no records were left out, because the number of records is already small.

3.2 Classification results with BMC data

Taking into account the number of classes and the number of records in each class, the data were divided into three subsets: Gastric cancer and healthy donor data; Gastric intestinal disease and healthy donor data; and, considering the small number of records in the Breast cancer and Prostate cancer data sets (13 and 52 records respectively), a merge of both data sets together with the healthy donor data set for comparison. An experiment plan was made for the following data sets: Gastric cancer and healthy donor; Gastric intestinal disease and healthy donor; Breast cancer, Prostate cancer and healthy donor; all cancer types and healthy donor; and all cancer types merged into one class (cancer) with healthy donor as the second class. All described experiments were carried out using the data mining tool Weka [11].
Since data normalization is a standard preprocessing stage in data mining, comparative experiments were also run with the values normalized to the interval [0,1] and with the missing values replaced by the mean value of the corresponding attribute within the same class. An overview of the performed experiments is given in Table 3.

Table 3. Experiments with cancer data sets

Algorithm FURIA, 10-fold cross-validation

| No. | Data set name | Number of rules | Time in seconds¹ | Accuracy | Error | Comments |
|---|---|---|---|---|---|---|
| 1 | Gastric cancer and healthy donor | 18 | 20.97 | 62% | 38% | original used data set (with missing values) |
| 2 | Gastric cancer and healthy donor | 6 | 7.1 | 97% | 3% | normalized in interval [0,1], missing values replaced with average values in the same class |
| 3 | Gastric cancer and healthy donor | 18 | 20.95 | 61% | 39% | normalized in interval [0,1] with missing values |
| 4 | Gastric intestinal disease and healthy donor | 18 | 17.89 | 55% | 45% | original used data set (with missing values) |
| 5 | Gastric intestinal disease and healthy donor | 5 | 5.36 | 95% | 5% | normalized in interval [0,1], missing values replaced with average values in the same class |
| 6 | Gastric intestinal disease and healthy donor | 16 | 12.88 | 58% | 42% | normalized in interval [0,1] with missing values |
| 7 | Breast cancer, Prostate cancer and healthy donor | 8 | 5.39 | 90% | 10% | original used data set (with missing values) |
| 8 | Breast cancer, Prostate cancer and healthy donor | 7 | 2.94 | 79% | 21% | used 76 healthy donor samples |
| 9 | Breast cancer, Prostate cancer and healthy donor | 7 | 3.17 | 84% | 16% | both cancer types merged into one class (CA) |
| 10 | Breast cancer, Prostate cancer, Gastric cancer and healthy donor | 18 | 38.49 | 52% | 48% | original used data set (with missing values) |
| 11 | All cancer types (BrCa, GaCa, PrCa) - (CA) and healthy donor (HD) | 17 | 27.69 | 62% | 38% | all cancer types merged into one class; original used data set (with missing values) |
| 12 | All cancer types (BrCa, GaCa, PrCa), GIS and healthy donor (HD) | 34 | 97.16 | 35% | 64% | original used data set (with missing values) |
| 13 | All cancer types (BrCa, GaCa, PrCa) and GIS | 26 | 36.22 | 52% | 48% | original used data set (with missing values) |

Algorithm FLR, 10-fold cross-validation

| No. | Data set name | Number of rules | Time in seconds¹ | Accuracy | Error | Comments |
|---|---|---|---|---|---|---|
| 14 | Gastric cancer and healthy donor | 7 | 1.06 | 58% | 42% | original used data set (with missing values) |
| 15 | Gastric intestinal disease and healthy donor | 7 | 0.56 | 54% | 46% | original used data set (with missing values) |
| 16 | Breast cancer, Prostate cancer and healthy donor | 5 | 0.59 | 79% | 21% | used 76 healthy donor samples |
| 17 | Breast cancer, Prostate cancer and healthy donor | 5 | 0.56 | 77% | 23% | both cancer types merged into one class (CA) |
| 18 | All cancer types merged into one class - cancer (CA) and healthy donor (HD) | 3 | 0.67 | 62% | 38% | all cancer types merged into one class; original used data set (with missing values) |

¹ Time taken to build the model, in seconds.

Comparison of the results in Table 3 shows that normalization to the interval [0,1] alone does not improve the results significantly and may even worsen them (see experiments 1, 3, 4 and 6). A relevant increase in classifier accuracy was achieved by replacing the missing values with the class mean values (experiments 2 and 5) compared to the results obtained with analogous test and training sets (experiments 3 and 6): the accuracy improved by about 36 percentage points. Good classification results were attained (experiment 7) using the breast and prostate cancer data set, but this experiment used the original size of all data sets (BrCa-13, PrCa-52 and HD-155), so the last class dominated the data. To even out the classes, the number of records belonging to class HD was randomly decreased to 76; the new results (experiment 8) were slightly worse, for objective reasons. The results were then tested by merging both cancer classes of experiment 8 into one class. The results improved (see experiment 9), showing that distinguishing between a healthy individual and a cancer patient is easier than differentiating between the types of cancer that affect the individual.
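The three preprocessing steps compared in Table 3 (replacing missing values with the class mean, normalization to [0,1], and randomly reducing a dominant class) can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the experiments themselves were run in Weka, missing values are represented here as NaN, and the function names and the toy data are hypothetical.

```python
import math
import random
from collections import defaultdict


def impute_class_mean(rows, labels):
    """Replace NaN entries with the mean of that attribute within the same class."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(lambda: defaultdict(int))
    for row, label in zip(rows, labels):
        for j, value in enumerate(row):
            if not math.isnan(value):
                sums[label][j] += value
                counts[label][j] += 1
    imputed = []
    for row, label in zip(rows, labels):
        new_row = []
        for j, value in enumerate(row):
            # A value stays NaN if the whole class lacks this attribute.
            if math.isnan(value) and counts[label][j] > 0:
                value = sums[label][j] / counts[label][j]
            new_row.append(value)
        imputed.append(new_row)
    return imputed


def min_max_normalize(rows):
    """Scale every attribute to [0, 1]; NaN entries are left untouched."""
    normalized = [list(row) for row in rows]
    for j in range(len(rows[0])):
        observed = [row[j] for row in rows if not math.isnan(row[j])]
        if not observed:
            continue
        low, span = min(observed), max(observed) - min(observed)
        for row in normalized:
            if not math.isnan(row[j]):
                # Constant attributes are mapped to 0.0 (a choice made here).
                row[j] = (row[j] - low) / span if span > 0 else 0.0
    return normalized


def undersample_class(rows, labels, target_label, size, seed=0):
    """Randomly keep at most `size` records of one class, all records of the others."""
    rng = random.Random(seed)
    target = [i for i, label in enumerate(labels) if label == target_label]
    keep = set(rng.sample(target, min(size, len(target))))
    kept = [i for i, label in enumerate(labels)
            if label != target_label or i in keep]
    return [rows[i] for i in kept], [labels[i] for i in kept]


# Toy data with two attributes; the values are invented for illustration.
nan = math.nan
rows = [[0.2, nan], [0.4, 1.0], [nan, 3.0], [1.0, 5.0]]
labels = ["CA", "CA", "HD", "HD"]

imputed = impute_class_mean(rows, labels)
normalized = min_max_normalize(imputed)
balanced_rows, balanced_labels = undersample_class(normalized, labels, "HD", size=1)
print(imputed)
print(normalized)
print(balanced_labels)
```

In the setting of experiment 8, the last function would be called with the healthy donor label and size=76; fixing the seed keeps the random selection reproducible.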
Experiment 11 does not confirm these results, but there is a valid reason: the cancer class dominated among the records of the data set used. Another conclusion can be drawn from the results: the types of cancer have much in common, because the classification accuracy is better for the cancer sets than for all diseases together (experiments 11 and 12). To verify how well different diseases can be classified without the healthy donor class for comparison, experiment 13 was conducted. It gave a result identical to that of experiment 10, so it can be concluded that including or removing the healthy donor data does not change the classification accuracy.

The accuracy and error obtained in each experiment are shown in Fig. 3. This type of diagram makes it easier to evaluate the classification accuracy and error visually. It is apparent that the highest accuracy (smallest error) is achieved in experiments 2, 5, 7 and 9.

[Fig. 3. Classification accuracy and error (accuracy and error in percent for experiments 1 to 18)]

To test the results achieved by the FURIA algorithm, five complementary experiments were performed using the FLR [12] algorithm (see experiments 14-18). As can be seen, the results differ only slightly, and FLR even shows better results in the sense that the number of rules in each experiment is smaller, which means faster classification. However, FURIA would probably be superior in the classification of new records because of its rule stretching, which supports the classification of new, previously unseen records and thus ensures greater classification accuracy.

3.3 Classification results with most popular data sets

To test the classification results obtained with the BMC data, a series of experiments was conducted using bioinformatics data sets frequently used in the related literature. The aim was to compare the obtained classification results.
All data sets were normalized using the normalization tool built into the WEKA software. The summarized classification results using 10-fold cross-validation for all six data sets using five different classification methods are shown in Table 4.

Table 4. Experiments with most popular data sets

Algorithm/author: FURIA [2]

| Data set name | Number of rules | Time in seconds | Accuracy | Error | Number of used genes |
|---|---|---|---|---|---|
| Leukemia | 3 | 3.68 | 83% | 17% | 5147 |
| Lymphoma | 4 | 6.67 | 77% | 23% | 7070 |
| Lung tumor | 9 | 68.14 | 90% | 10% | 12600 |
| ML leukemia | 4 | 5.33 | 81% | 19% | 12533 |
| Prostate | 6 | 23.02 | 81% | 19% | 12533 |
| Childhood tumors | 5 | 3.8 | 92% | 8% | 2308 |

Algorithm/author: Ho et al. [13]

| Data set name | Number of rules | Accuracy | Error | Number of used genes |
|---|---|---|---|---|
| Leukemia | 4 | 94% | 6% | 717 |
| Lymphoma | 3 | 91% | 9% | 483 |
| Lung tumor | 6 | 88% | 12% | 1185 |
| ML leukemia | 3 | 85% | 15% | 717 |
| Prostate | 3 | 91% | 9% | 483 |
| Childhood tumors | 5 | 92% | 8% | 951 |

Algorithm/author: see last column

| Data set name | Number of rules | Accuracy | Error | Number of used genes | Algorithm/author |
|---|---|---|---|---|---|
| Leukemia | 2 | 79% | 21% | 2 | Ohno-Machado et al. [14] |
| Lymphoma | | 94% | 6% | 8 | Mramor et al. [15] |
| Lung tumor | 21 | 99% | 1% | 200 | Vinterbo et al. [16] |
| ML leukemia | | 91% | 8% | 8 | Mramor et al. [15] |
| Prostate | | 92% | 8% | 8 | Mramor et al. [15] |
| Childhood tumors | 8 | 76% | 24% | 8 | Ohno-Machado et al. [14] |

Table 4 shows that two classification methods (FURIA and the method of Ho et al.) have results for all six data sets, whereas the last block compiles results of three different methods from the literature. Comparing FURIA and the method of Ho et al. by the number of induced rules, it is evident that FURIA, using a larger number of rules, achieves an identical or slightly worse result. However, it should be noted that the experiments with the FURIA algorithm used the full data sets, while Ho et al. applied feature selection and used attribute sets that were up to 25 times smaller. If the results are compared by data set, it becomes apparent that the results for the Leukemia and Lung tumor data sets are commensurate with the number of rules and the number of attributes used in classification.
FURIA therefore shows an average result on these two data sets. For the Lymphoma, ML leukemia and Prostate data sets, the classification results of FURIA are worse than the accuracies of the other methods, but this is partially due to the aforementioned feature selection. For the Childhood tumors data set, the results of FURIA are identical to those of the method of Ho et al., while the third method shows a worse result. The classification accuracy obtained on the six popular data sets with the three methods is depicted in Fig. 4. It is evident that the FURIA method analyzed in this study achieves competitive results, and the attained classification accuracy is above 75% in all experiments.

[Fig. 4. Classification accuracy (accuracy in percent per data set: Leukemia, Lymphoma, Lung tumor, ML leukemia, Prostate, Childhood tumors; series: Ho et al., FURIA, Other)]

4 Conclusions

The theoretical part of the study analyzes the method proposed by Hühn et al. and gives a short overview of the FURIA algorithm. The practical part of the study consisted of comparative experiments using the previously described disease and healthy individual data sets. It can be concluded that:
• Data normalization to the interval [0,1] does not give a significant improvement of the classification results;
• Classification accuracy can be increased by replacing the missing values with the mean attribute value of the same class;
• No class should be dominant if adequate results are to be obtained; the classes should have an approximately equal number of records;
• Greater efficiency can be achieved in a data set where all types of cancer are replaced by one class ‘cancer’, i.e., it is easier to distinguish between cancer patients and healthy individuals when the type of cancer is not considered.

The classification results obtained on data sets that are frequently employed in the corresponding literature suggest that the FURIA algorithm is competitive and has good classification accuracy.
To evaluate classification accuracy, it is crucial to consider the number of attributes used and the number of induced rules that classify the data set. The rule-stretching feature of FURIA could provide a significant advantage in data sets where records that differ from those used in training need to be classified.

Possible directions of future work include continuing the research on the FURIA algorithm to determine the results that can be achieved with other frequently used bioinformatics data sets. It is also planned to proceed with research on other fuzzy techniques and algorithms that could be used in bioinformatics and, in particular, to apply them to the tasks of this study (classification of cancer, healthy donor and gastric intestinal disease data), as well as to use feature selection and evaluate its impact on the classification results.

Acknowledgements: This work has been supported by the European Social Fund within the project «Support for the implementation of doctoral studies at Riga Technical University». This work has been developed in LATVIA – BELORUS Co-operation programme in Science and Engineering within the project «Development of a complex of intelligent methods and medical and biological data processing algorithms for oncology disease diagnostics improvement», Scientific Cooperation Project No. L7631. Thanks to Dr.habil.sc.comp. Professor Arkady Borisov (Riga Technical University) for help and support.

References:
[1] Gasparoviča M., Novoselova N., Aleksejeva L., Using Fuzzy Logic to Solve Bioinformatics Tasks, Proceedings of Riga Technical University, Issue 5, Computer Science, Information Technology and Management Science, Vol. 44, 2010, pp. 99-105.
[2] Hühn J., Hüllermeier E., FURIA: An Algorithm for Unordered Fuzzy Rule Induction, Data Mining and Knowledge Discovery, Vol. 19, No. 3, 2009, pp. 293-319.
[3] Cohen W. W., Fast Effective Rule Induction, Proceedings of the Twelfth International Conference on Machine Learning, 1995, pp. 115-123.
[4] Kalniņa Z., et al., Evaluation of T7 and Lambda Phage Display Systems for Survey of Autoantibody Profiles in Cancer Patients, Journal of Immunological Methods, Vol. 334, No. 1-2, May 20, 2008, pp. 37-50.
[5] Golub T.R., Slonim D.K., et al., Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring, Science, Vol. 286, 1999, pp. 531-537.
[6] Khan J., et al., Classification and Diagnostic Prediction of Cancers Using Gene Expression Profiling and Artificial Neural Networks, Nature Medicine, Vol. 7, No. 6, June 2001, pp. 673-679.
[7] Singh D., et al., Gene Expression Correlates of Clinical Prostate Cancer Behavior, Cancer Cell, Vol. 1, No. 2, March 2002, pp. 203-209.
[8] Shipp M.A., et al., Diffuse Large B-Cell Lymphoma Outcome Prediction by Gene-Expression Profiling and Supervised Machine Learning, Nat. Med., Vol. 8, No. 1, January 2002, pp. 68-74.
[9] Bhattacharjee A., et al., Classification of Human Lung Carcinomas by mRNA Expression Profiling Reveals Distinct Adenocarcinoma Subclasses, Proc. Natl. Acad. Sci. U.S.A., Vol. 98, No. 24, 2001, pp. 13790-13795.
[10] Armstrong S.A., et al., MLL Translocations Specify a Distinct Gene Expression Profile that Distinguishes a Unique Leukemia, Nature Genetics, Vol. 30, 2001, pp. 41-47.
[11] Hall M., Frank E., Holmes G., Pfahringer B., Reutemann P., Witten I.H., The WEKA Data Mining Software: An Update, SIGKDD Explorations, Vol. 11, No. 1, 2009, pp. 10-18.
[12] Kaburlasos V. G., Athanasiadis I. N., Mitkas P.A., Fuzzy Lattice Reasoning (FLR) Classifier and Its Application for Ambient Ozone Estimation, International Journal of Approximate Reasoning, Vol. 45, No. 1, 2007, pp. 152-188.
[13] Ho S.-Y., Hsieh C.-H., Chen H.-M., Huang H.-L., Interpretable Gene Expression Classifier With an Accurate and Compact Fuzzy Rule Base for Microarray Data Analysis, BioSystems, Vol. 85, 2006, pp. 165-176.
[14] Ohno-Machado L., Vinterbo S., Weber G., Classification of Gene Expression Data Using Fuzzy Logic, J. Intell. Fuzzy Syst., Vol. 12, 2002, pp. 19-24.
[15] Mramor M., Leban G., Demšar J., Zupan B., Visualization-based Cancer Microarray Data Classification Analysis, Bioinformatics, Vol. 23, No. 16, 2007, pp. 2147-2154.
[16] Vinterbo S.A., Kim E.-Y., Ohno-Machado L., Small, Fuzzy and Interpretable Gene Expression Based Classifiers, Bioinformatics, Vol. 21, No. 9, 2005, pp. 1964-1970.