Proc. Int. Conf. on Computational Intelligence and Information Technology, CIIT. © Elsevier, 2012

Competence and Performance-Improving Approach for Maintaining Case-Based Reasoning Systems

Abir Smiti and Zied Elouedi
LARODEC, Université de Tunis, Institut Supérieur de Gestion de Tunis, 41 Avenue de la liberté, cité Bouchoucha, 2000 Le Bardo, Tunisia
[email protected], [email protected]

Abstract. The competence and the performance of a Case-Based Reasoning (CBR) system depend on the quality of the case base (CB) and on the speed of the retrieval process, which can become costly in time, especially when the number of cases grows large. To guarantee the system's quality, maintaining the contents of a CB becomes unavoidable. In this paper, we present a novel approach for automatically maintaining a CB so as to improve the competence and the performance of the CBR system. Our policy is mainly based on clustering and competence computing. We support our approach with an empirical evaluation.

Keywords: Case base maintenance, Clustering, DBSCAN, Gaussian Means, Competence, Coverage, Density

1 Introduction

Case-Based Reasoning [1, 2] is a variety of reasoning by analogy. To solve a problem, a CBR system recalls past cases: it is reminded of similar situations already encountered. It then compares them with the current situation to build a new solution which, in turn, is added to the case base (CB).

A CBR system is built to work over long periods of time, adding cases to the CB through the retain process. As a result, the CB can grow very fast, to the point of negatively affecting the quality of the CBR results. To ensure the system's improvement, maintaining CBR systems (CBM) becomes required.

Recently, the CBM issue has drawn more and more attention to two major gauges that contribute to the evaluation of a CB. The first one is the CB's performance [4, 9], which is the answer time needed to compute a solution for target cases. The second one is the CB's competence [6], which is the range of target problems that can be successfully solved. In order to build a high-quality CB, we need a CBM strategy that produces a CB of small size, eliminates undesirable cases such as noise, and is able to increase the classification accuracy while improving the competence.

This paper presents a novel approach for automatically maintaining CBs while improving both the competence and the performance of the CBR system. We name it Clustering, Competence Model using Coverage for Deletion method (CMCD).

The rest of the paper is organized as follows: Section 2 presents the competence model. Section 3 describes our new approach CMCD. Section 4 analyzes the experimental results.

2 Related Works on CBM: Competence Model

The motivation of this paper is inspired by the competence model of Smyth and McKenna (M&S) [3, 5, 8, 12]. It assumes that competence is based on a number of factors, including the size and the density of cases. The density of an individual case c (Dens) is defined as the average similarity between this case and the other cases of its cluster, called a competence group G (Equation 1). The density of a group of cases is then measured as the average local density over all cases in the group (Equation 2). The coverage of each competence group (Cov) is measured from the group size and density (Equation 3). In the final step, the overall competence of the case base is simply the sum of the coverage of each group.

Dens(c, G) = \frac{1}{|G| - 1} \sum_{c' \in G,\ c' \neq c} Sim(c, c')    (1)

Dens(G) = \frac{1}{|G|} \sum_{c \in G} Dens(c, G)    (2)

Cov(G) = 1 + |G| \cdot (1 - Dens(G))    (3)

where |G| is the number of cases in the group G.
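To make the competence model concrete, the following Python sketch computes the quantities of Equations 1-3 and the overall case-base competence for competence groups given as precomputed similarity matrices. It is only an illustration of the model, not code from the paper; the function names, the toy similarity values, and the assumption that similarities lie in [0, 1] are ours.

```python
import numpy as np

def case_density(i, sim):
    """Average similarity between case i and the other cases of its group (Equation 1)."""
    n = sim.shape[0]
    return (sim[i].sum() - sim[i, i]) / (n - 1)

def group_density(sim):
    """Average local density over all cases of the group (Equation 2)."""
    return float(np.mean([case_density(i, sim) for i in range(sim.shape[0])]))

def group_coverage(sim):
    """Coverage of a competence group from its size and density (Equation 3)."""
    n = sim.shape[0]
    return 1 + n * (1 - group_density(sim))

def case_base_competence(group_sims):
    """Overall competence: the sum of the coverage of each competence group."""
    return sum(group_coverage(sim) for sim in group_sims)

# Toy usage: two competence groups described by square similarity matrices in [0, 1].
g1 = np.array([[1.0, 0.9, 0.8],
               [0.9, 1.0, 0.7],
               [0.8, 0.7, 1.0]])
g2 = np.array([[1.0, 0.4],
               [0.4, 1.0]])
print(case_base_competence([g1, g2]))
```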
3 Clustering, Competence Model using Coverage for Deletion Method (CMCD)

In order to obtain a good CB quality, we should keep the cases whose deletion would directly reduce the competence of the system. Suppose that we have a group G of similar cases {x_1, x_2, ..., x_{n_G}}. Each case in this group has a different coverage value Cov_i, and one case in the set has the maximum coverage value Cov_max. An interesting question is: what is the best value that the group can have as a whole? The answer is reached when every case of G has a coverage value equal to Cov_max. Hence, the best value of coverage in G is defined as follows:

Cov_{Best}(G) = |G| \cdot Cov_{max}    (4)

From Equation 3, we can then define the best value of the group density as:

Dens_{Best}(G) = 1 - \frac{Cov_{Best}(G) - 1}{|G|}    (5)

Hence, in order to obtain the "best" subset of cases from the group G, its density Dens(G) should be equal to the Dens_Best(G) value. As a consequence, we can keep only the cases whose sum of densities satisfies this value:

\sum_{c \in S} Dens(c, G) \approx Dens_{Best}(G), \quad S \subseteq G \text{ the subset of kept cases}    (6)

To apply the idea described above, we first need to create multiple groups from the CB, each containing cases that are closely related to each other; in this way the coverage of a group can be defined. This can be done by a clustering technique. For each cluster, the cases whose sum of densities equals the best density of the group, and which in turn have the best coverage, are kept, while the rest of the cases are removed. We therefore obtain a new, small CB with high competence.

The basic process of our proposed CMCD maintenance method consists of the following steps:

1. Clustering cases: the CB is decomposed into groups of closely related cases.
2. For each cluster: we calculate the coverage of each case, select the cases which have the best coverage according to their density, and delete the other cases.

Step 1: Clustering cases using DBSCAN-GM

Among the several existing clustering approaches, we should ideally use a clustering method that discovers the structure of such data sets and satisfies as many as possible of the following properties: it is automatic, and it is capable of handling noisy cases in order to delete them. To meet these conditions, we use a clustering method called DBSCAN-GM [17], which combines Gaussian-Means [19, 20] with the density-based clustering method DBSCAN [18]. DBSCAN-GM benefits from the advantages of both algorithms to cover the conditions cited above. In the first stage, DBSCAN-GM runs Gaussian-Means to automatically generate a set of clusters with their centers, in order to estimate the parameters of DBSCAN; in this manner, the parameter identification problem of DBSCAN is solved. In the second stage, it runs DBSCAN with the determined parameters to handle noise and to discover clusters of arbitrary size and shape; in this fashion, the noisy-data shortcoming of Gaussian-Means is overcome.
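The following Python sketch illustrates this two-stage idea under simplifying assumptions: scikit-learn's KMeans stands in for Gaussian-Means, the number of clusters k and min_samples are illustrative defaults rather than the values DBSCAN-GM derives automatically, and Eps is estimated with a simple heuristic (median distance of cases to their cluster center) that is not taken from the original method [17].

```python
import numpy as np
from sklearn.cluster import KMeans, DBSCAN

def cluster_case_base(X, k=3, min_samples=4):
    """Two-stage clustering in the spirit of DBSCAN-GM (a sketch, not the
    authors' implementation): a centroid-based pass estimates DBSCAN's Eps,
    then DBSCAN labels noisy cases (-1) and discovers the final clusters."""
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # Heuristic Eps (an assumption): median distance of cases to their own cluster center.
    dists = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
    eps = np.median(dists)
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    # Keep only the non-noise clusters; cases labelled -1 are deleted as noise.
    return [X[labels == c] for c in set(labels) if c != -1]

# Usage on a toy case base described by numeric features: two well-separated groups.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(3, 0.3, (30, 2))])
print([len(c) for c in cluster_case_base(X, k=2)])
```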
Step 2: Competence model and coverage computing

Once the original case memory has been partitioned by DBSCAN-GM, our CMCD method deletes the noisy cases. Based on the partitioned CB and the deletion of noisy cases, smaller case bases of high quality are built from the clustering result. Each cluster is considered as a small, independent CB. For each cluster, the competence model is applied to remove the cases with low coverage. In order to find the "best" subset of cases which can realize the highest competence value, we compute for each case in a cluster its coverage using the similarity measure. A case is considered significant in the CB if it solves many similar cases, i.e. if its similarity value is greater than a threshold σ. The coverage of a case is defined as follows:

Cov(c) = |\{c' \in G : Sim(c, c') \geq \sigma\}|    (7)

For each cluster G_i, we compute the maximum coverage of the group (Equation 4) and the maximum density (Equation 5). We sort the cases in descending order according to their individual density values, select only the cases whose sum of densities comes close to the maximum density of the group, and remove the other cases. In this way, we obtain only the cases with the best coverage values.
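As a hedged illustration of this selection step (our reading of the procedure, not the authors' code), the Python sketch below computes per-case densities from a cluster's similarity matrix, sorts the cases by decreasing density, and keeps them greedily until the accumulated density reaches a given target density; the target dens_best is passed in as a parameter, standing for the value derived from Equations 4-5.

```python
import numpy as np

def reduce_cluster(sim, dens_best):
    """Greedy selection step of CMCD inside one cluster (a sketch).

    sim       : square similarity matrix of the cluster, values in [0, 1]
    dens_best : target "best" group density (Equations 4-5), assumed given
    Returns the indices of the kept cases."""
    n = sim.shape[0]
    dens = (sim.sum(axis=1) - np.diag(sim)) / (n - 1)   # per-case density (Equation 1)
    order = np.argsort(-dens)                           # decreasing density
    kept, total = [], 0.0
    for i in order:
        kept.append(int(i))
        total += dens[i]
        if total >= dens_best:                          # sum of densities reaches the target (Equation 6)
            break
    return kept

# Toy usage: keep the densest cases of a 4-case cluster until the target is reached.
sim = np.array([[1.0, 0.9, 0.8, 0.2],
                [0.9, 1.0, 0.7, 0.1],
                [0.8, 0.7, 1.0, 0.3],
                [0.2, 0.1, 0.3, 1.0]])
print(reduce_cluster(sim, dens_best=1.2))
```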
4 Results and Analysis

In order to evaluate the performance of CMCD, we test it on ten diverse data sets of different sizes. We use public data sets obtained from the U.C.I. repository of Machine Learning databases [7], including Iris, Ionosphere, Breast-W, Blood-T, Indian and Vehicle. We consider the following principal criteria: "Size (S%)", the average storage percentage; "PCC (PCC%)", the mean percentage of correct classification over stratified ten-fold cross-validation runs using a 1-Nearest-Neighbor (1NN) classifier; and "Time", the retrieval time in seconds spent in 1NN. Our CMCD method can be appreciated when compared with other well-known reduction techniques. Thus, we run the WCOID [11], COID [10], CNN [14], RNN [16], ENN [15] and IBL schemes [13] on the same data sets. From Tables 1, 2 and 3, we observe that the experimental results obtained with our CMCD method are remarkably better than the ones provided by the other policies.

Table 1. Comparing the storage size (S%) of CMCD with well-known reduction schemes

Dataset    CBR   CMCD    WCOID   COID    CNN     RNN     ENN     IB2     IB3
IR-150     100   52,00   57,33   27,63   93,33   95,33   24,00   24,00   47,30
IO-351     100   29,34   13,39   15,30   83,91   86,89   25,07   25,00   13,96
BW-698     100   25,50   9,46    16,30   16,87   81,69   35,48   30,46   6,59
BT-748     100   12,03   20,45   37,30   38,72   32,10   26,09   26,00   19,00
IN-768     100   3,12    15,34   43,66   42,08   24,39   22,12   21,00   16,00
V-846      100   17,77   20,80   43,62   43,63   76,48   46,57   50,73   51,89
MM-961     100   17,48   51,00   64,21   54,26   82,52   53,48   53,93   53,18
C-1023     100   11,57   32,45   62,30   72,89   58,46   84,85   87,85   39,29
Y-1483     100   51,30   8,07    12,34   14,02   69,59   69,79   69,98   8,22
AB-4176    100   25,86   30,67   38,50   58,55   87,50   51,92   51,92   35,58

Table 2. Comparing the classification accuracy (PCC%) of CMCD with well-known reduction schemes

Dataset    CBR     CMCD    WCOID   COID    CNN     RNN     ENN     IB2     IB3
IR-150     97,33   98,98   96,56   96,94   73,00   94,23   91,60   91,67   91,67
IO-351     93,16   98,57   97,77   98,90   70,83   70,83   98,36   95,45   94,89
BW-698     97,99   98,15   95,45   98,01   68,18   67,05   94,66   69,69   70,56
BT-748     78,81   99,97   79,56   79,12   67,94   66,65   71,63   74,69   74,21
IN-768     83,92   97,97   94,71   92,06   67,40   69,19   99,14   87,39   88,26
V-846      82,03   99,99   95,45   87,24   57,58   57,45   82,45   74,44   73,73
MM-961     84,92   89,98   89,12   89,31   70,82   78,60   77,04   66,28   66,42
C-1023     61,24   97,13   96,22   96,54   82,92   89,43   70,40   59,16   59,50
Y-1483     86,16   97,56   88,10   86,78   83,56   83,92   88,08   73,82   73,38
AB-4176    97,93   98,61   96,44   96,19   68,75   62,50   96,70   91,20   91,67

Table 3. Comparing the retrieval time (in seconds) of CMCD with well-known reduction schemes

Dataset    CBR      CMCD     WCOID    COID     CNN      RNN      ENN      IB2      IB3
IR-150     0,0163   0,014    0,0244   0,0252   0,011    0,0101   0,013    0,0024   0,0026
IO-351     0,0724   0,0352   0,5500   0,5800   0,006    0,8100   0,616    0,0122   0,1020
BW-698     0,1043   0,0341   0,1160   0,1340   0,430    0,3500   0,734    0,2440   0,2270
BT-748     0,0902   0,0078   0,0212   0,0760   0,098    0,1830   0,194    0,2035   0,1970
IN-768     0,1581   0,0044   0,0102   0,0263   0,050    0,0670   0,026    0,0102   0,0092
V-846      0,2521   0,0044   0,0755   0,0771   0,064    0,0604   0,134    0,0595   0,0581
MM-961     0,3157   0,0607   0,0664   0,0532   0,208    0,1990   0,815    0,3390   0,0327
C-1023     0,5324   0,0051   0,0671   0,0536   1,802    1,8420   0,931    0,2079   0,1746
Y-1483     0,3682   0,3449   0,5440   0,7710   0,640    0,6040   0,134    0,5950   0,5810
AB-4176    0,4023   0,0706   0,1890   0,1570   0,111    0,1010   0,137    0,2400   0,2600

5 Conclusion and future work

In this paper, we have proposed a case base maintenance method named CMCD (Clustering, Competence Model using Coverage for Deletion), which maintains case bases while improving both the performance and the competence of the CBR system. Our deletion policy can be improved in future work by introducing weighting methods in order to verify the reliability of our reduction policy.

References

1. Aamodt, A., Plaza, E.: Case-Based Reasoning: Foundational Issues, Methodological Variations, and System Approaches. In: Artificial Intelligence Communications, vol. 7, pp. 39-52 (1994)
2. Leake, D.B., Wilson, D.C.: Maintaining Case-Based Reasoners: Dimensions and Directions. In: Computational Intelligence, vol. 17, pp. 196-213 (2001)
3. Smyth, B., Keane, M.T.: Remembering To Forget: A Competence-Preserving Case Deletion Policy for Case-Based Reasoning. In: The 14th International Joint Conference on Artificial Intelligence, pp. 377-382 (1995)
4. Smyth, B., McKenna, E.: Building Compact Competent Case-Bases. In: Case-Based Reasoning Research and Development: Proceedings of the Third International Conference on Case-Based Reasoning, pp. 329-342 (1999)
5. Smyth, B., McKenna, E.: Competence guided incremental footprint-based retrieval. In: Journal of Knowledge-Based Systems, vol. 14, pp. 155-161 (2002)
6. Surma, J., Tyburcy, J.: A Study on Competence-Preserving Case Replacing Strategies in Case-Based Reasoning. In: Advances in Case-Based Reasoning, Proceedings of the 4th European Workshop on Case-Based Reasoning, EWCBR'98, Dublin, Ireland. Springer-Verlag, vol. 1488, pp. 233-238 (1998)
7. Asuncion, A., Newman, D.J.: UCI Machine Learning Repository. http://www.ics.uci.edu/mlearn. University of California, Irvine, School of Information and Computer Sciences (2007)
8. Smiti, A., Elouedi, Z.: Overview of Maintenance for Case based Reasoning Systems. In: International Journal of Computer Applications, Foundation of Computer Science, New York, USA, vol. 32, pp. 49-56 (2011)
9. Yang, Q., Wu, J.: Keep it simple: A case-base maintenance policy based on clustering and information theory. In: Canadian Conference on AI, pp. 102-114 (2000)
10. Smiti, A., Elouedi, Z.: COID: Maintaining case method based on Clustering, Outliers and Internal Detection. In: Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing, SNPD'10, Springer Berlin / Heidelberg, vol. 295, pp. 39-52 (2010)
11. Smiti, A., Elouedi, Z.: WCOID: Maintaining case-based reasoning systems using Weighting, Clustering, Outliers and Internal cases Detection. In: The Eleventh International Conference on Intelligent Systems Design and Applications, pp. 356-361 (2011)
12. McKenna, E., Smyth, B.: A Competence Model for Case-Based Reasoning. In: 9th Irish Conference on Artificial Intelligence and Cognitive Science, pp. 208-220 (1998)
13. Aha, D.W., Kibler, D., Albert, M.K.: Instance-based learning algorithms. In: Machine Learning, Springer Netherlands, vol. 6, pp. 37-66 (1991)
14. Chou, C.H., Kuo, B.H., Chang, F.: The Generalized Condensed Nearest Neighbor Rule as A Data Reduction Method. In: International Conference on Pattern Recognition, IEEE Computer Society, vol. 2, pp. 556-559 (2006)
15. Wilson, D.L.: Asymptotic Properties of Nearest Neighbor Rules Using Edited Data. In: IEEE Transactions on Systems, Man and Cybernetics, vol. 2, pp. 408-421 (1972)
16. Li, J., Manry, M.T., Yu, C., Wilson, D.R.: Prototype Classifier Design with Pruning. In: International Journal on Artificial Intelligence Tools, pp. 261-280 (2005)
17. Smiti, A., Elouedi, Z.: DBSCAN-GM: An improved clustering method based on Gaussian Means and DBSCAN techniques. In: International Conference on Intelligent Engineering Systems (INES), IEEE Computer Society, pp. 573-578 (2012)
18. Ester, M., Kriegel, H.P., Sander, J., Xu, X.: A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. In: Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, pp. 226-231 (1996)
19. Hamerly, G., Elkan, C.: Learning the k in k-means. In: Advances in Neural Information Processing Systems, MIT Press, vol. 17 (2003)
20. MacQueen, J.B.: Some Methods for Classification and Analysis of Multivariate Observations. In: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, University of California Press, vol. 1, pp. 281-297 (1967)