International Journal of Software Engineering and Its Applications, Vol. 4, No. 2, April 2010

Generating Better Radial Basis Function Network for Large Data Set of Census

Hyontai Sug
Division of Computer & Information Engineering, Dongseo University, Busan, 617-716, Korea
[email protected]

Abstract

Radial basis function networks are known to have good performance compared to other artificial neural networks such as multilayer perceptrons. Because the target data sets of data mining are very large and artificial neural networks, including radial basis function networks, require intensive computation, sampling is needed. Moreover, because the sample size must be kept relatively small due to the computational load of training radial basis function networks, simple random sampling at that size may not generate well-balanced samples. This paper suggests a better sampling technique for radial basis function networks, based on the branching information of a decision tree, for use when the target data set is very large, as census data is. Experiments with the census income data set in the UCI machine learning repository show a promising result.

Keywords: Data mining, census, radial basis function networks.

1. Introduction

Artificial neural networks have been developed eagerly as a subfield of machine learning for decades, and many good results have been reported [1]. Many kinds of artificial neural networks have been reported to be very successful [2, 3, 4, 5, 6]. Among them, radial basis function networks (RBFNs) are reported to have better performance than other neural networks for some domains [7, 8].

Traditionally, researchers in machine learning have not dealt with very large data sets, and the same is true for artificial neural networks. On the other hand, the main concern of researchers in data mining is to find hidden, important knowledge in very large data sets, so most data mining systems based on artificial neural networks rely on some form of sampling [9]. In the data mining field, neural networks are used mostly for prediction tasks, so obtaining neural networks with the smallest error rates for a given data set has been a major concern. But even though neural networks are among the most successful data mining and machine learning methodologies, there is room for improvement in performance, because they are built with greedy algorithms and configured mostly by expert knowledge.

Known weak points of RBFNs are poor performance in the presence of irrelevant features and erroneous data. Real-world data, however, often have these characteristics, so we may have some difficulty in applying RBFNs to data mining. Moreover, because most target databases for data mining are very large, a random sampling process over the target databases may be needed to train the neural networks. But random sampling might not generate perfect samples that are well balanced with respect to the whole population, so knowledge models built from random samples are exposed to sampling error. An alternative strategy is to use the original database as a whole, but this may not be a good idea: it can be computationally very expensive, and the generated neural networks may not be good enough to show any real improvement in error rates. In this paper we look for a better way of sampling for large data sets, such as census data, when the target artificial neural network is an RBFN.

This is an extended version of the paper for ICHIT'09 [10]. Section 2 reviews work related to our research, and section 3 presents our method. Section 4 reports experiments run to see the effect of the method. Finally, section 5 offers some conclusions.
2. Related work

Artificial neural networks have been very successful in the field of machine learning since the pioneering book 'Parallel Distributed Processing' [11]. There are two kinds of neural networks based on how the networks are interconnected: feed-forward neural networks and recurrent neural networks [12]. RBFNs are among the most popular feed-forward networks. Even though RBFNs have three layers, including the input layer, like multilayer perceptrons, they differ from them because in RBFNs the hidden units perform some computation (a minimal sketch of this computation is given at the end of this section). Many researchers have reported successful applications of RBFNs [13, 14, 15].

On the other hand, decision tree algorithms are also among the representative data mining methods. There have been many efforts to build decision trees with better error rates, and to this end many splitting criteria have been invented. For example, one of the representative decision tree algorithms, C4.5 [16], uses an entropy-based measure, while CART [17] uses a purity-based measure (both measures are sketched at the end of this section). C4.5 generates decision trees in a quick-and-dirty manner, while CART spends more time to generate more optimized decision trees. There have also been scalability-related efforts to generate decision trees for large databases with the intention of applying them to data mining; for example, SLIQ [18], SPRINT [19], and PUBLIC [20] are scalable decision tree algorithms. SLIQ saves computing time, especially when a data set contains many continuous attributes, by using a pre-sorting technique in the tree-growth phase. SPRINT is an improved version of SLIQ that addresses the scalability problem by building trees in parallel. PUBLIC tries to save computing time by integrating the pruning and branch-generation steps.

There is also research on sample size [21, 22] and the properties of samples [23], as well as on sampling methods [24]. In [21] the effect of sample size on parameter estimates is discussed for a family of classifier functions. In [22] small-sized samples are preferred for feature selection and error estimation for several classifiers. In [23] the authors showed that class imbalance in training data affects neural network development, especially in the medical domain. In [24] several re-sampling techniques, such as cross-validation and leave-one-out, are tested to see their effect on the performance of neural networks; the authors discovered that the re-sampling techniques yield very different accuracies depending on the feature space and sample size.
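To make the hidden-unit computation mentioned above concrete, the following is a minimal sketch of a forward pass through a Gaussian RBF network. The centers, widths, and output weights here are illustrative placeholders, not values from this paper; in practice they are learned from the training sample.

----------------------------------------------------------------------
import numpy as np

def rbf_forward(x, centers, widths, weights, bias):
    # Each hidden unit j computes a Gaussian activation
    #   phi_j(x) = exp(-||x - c_j||^2 / (2 * s_j^2)),
    # and the output unit takes a weighted sum of the activations.
    dists = np.linalg.norm(centers - x, axis=1)        # ||x - c_j|| for all j
    phi = np.exp(-(dists ** 2) / (2.0 * widths ** 2))  # hidden-layer outputs
    return float(weights @ phi + bias)                 # output-layer sum

# Illustrative network: 3 hidden units over 2-dimensional inputs.
centers = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])  # unit centers
widths = np.array([0.5, 0.5, 1.0])                        # unit widths
weights = np.array([0.3, -0.2, 0.7])                      # output weights
print(rbf_forward(np.array([1.0, 0.5]), centers, widths, weights, bias=0.1))
----------------------------------------------------------------------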
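Likewise, the entropy-based measure used by C4.5 and the purity-based (Gini) measure used by CART can be stated in a few lines. This is a generic sketch of the two impurity measures, not code taken from either system; the class ratio in the example roughly mimics the imbalanced census-income data used later.

----------------------------------------------------------------------
import math
from collections import Counter

def entropy(labels):
    # Entropy impurity, the basis of C4.5-style splitting criteria.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    # Gini (purity-based) impurity, the basis of CART-style criteria.
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

labels = ['-50000'] * 15 + ['50000+'] * 1  # heavily imbalanced class ratio
print(entropy(labels), gini(labels))       # low impurity under both measures
----------------------------------------------------------------------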
3. Suggested method

Because our target data set is very large and we do not want to spend too much time on sampling, the method first builds a decision tree using a fast decision tree generation algorithm such as C4.5. Then we do random sampling based on the structure of the generated decision tree: we sample randomly within each branch of the tree, and the sample size for a branch depends on the number of training objects in that branch.

In other words, to find reasonable sample sizes we use the number of training objects in the terminal nodes of the generated decision tree. Because the number of training examples in a terminal node can be small, we integrate sibling terminal nodes together until the number of training objects reaches some predefined limit, so that each group is big enough for random sampling. The predefined limit should not only be large enough to permit the random sampling, but also small enough to divide the training objects into groups of proper size. The steps of the method are as follows:

----------------------------------------------------------------------
1. Generate a decision tree with a fast algorithm for the whole data set;
2. j := 0;
3. Do
      Do
         Integrate sibling terminal nodes of the decision tree
         in a bottom-up and left-to-right manner;
      While the number of training objects in the integrated node
         < predefined limit;
      j := j + 1;
      Let the training objects be Dj;
   Until there is no node to visit;
   the_number_of_sampling_groups := j;
4. For i := 1 to the_number_of_sampling_groups Do
      Do random sampling of size k for each Di,
         where k = (target sample size) × |Di| / |∑ Di|;
      Let the sample set be Si;
   End do;
5. Integrate all the random samples Si (i = 1 ~ the_number_of_sampling_groups);
----------------------------------------------------------------------

The integrated final samples are used to train RBFNs; a small implementation sketch of these steps follows.
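The following Python sketch illustrates steps 2 through 5 under simplifying assumptions: the terminal nodes are assumed to be given as a flat list in left-to-right order (rather than being merged bottom-up over the tree structure), the rounding of the per-group sample size k is an illustrative choice, and the function and variable names are our own, not part of the original method.

----------------------------------------------------------------------
import random

def group_leaves(leaf_objects, limit):
    # Step 3 (simplified): merge adjacent terminal nodes until each
    # integrated group holds at least `limit` training objects.
    groups, current = [], []
    for objs in leaf_objects:
        current.extend(objs)
        if len(current) >= limit:
            groups.append(current)
            current = []
    if current:                    # leftover objects form the last group
        groups.append(current)
    return groups

def stratified_sample(groups, target_size, seed=0):
    # Steps 4-5: sample each group Di in proportion to |Di|, then integrate.
    rng = random.Random(seed)
    total = sum(len(g) for g in groups)
    sample = []
    for g in groups:
        k = round(target_size * len(g) / total)  # k = target * |Di| / |sum Di|
        sample.extend(rng.sample(g, min(k, len(g))))
    return sample

# Toy usage: five leaves holding 40, 15, 25, 60, and 10 dummy objects.
leaves = [[(i, j) for j in range(n)] for i, n in enumerate([40, 15, 25, 60, 10])]
groups = group_leaves(leaves, limit=30)
print([len(g) for g in groups])            # [40, 40, 60, 10]
print(len(stratified_sample(groups, 50)))  # close to the target size of 50
----------------------------------------------------------------------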
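Step 1 is fastest when continuous attributes are discretized beforehand, as is done in the experiments of the next section following [26]. The sketch below is a simplified recursive entropy-based binary discretizer; it caps the recursion depth instead of using the MDL stopping criterion of the cited technique, so it is only an approximation of that method.

----------------------------------------------------------------------
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def cut_points(values, labels, depth=2):
    # Recursively choose boundaries that minimize weighted class entropy.
    # A fixed depth cap stands in for the MDL stopping rule of [26].
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    if depth == 0 or n < 2 or len(set(labels)) < 2:
        return []
    best_score, best_i = None, None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                         # no boundary between equal values
        left = [l for _, l in pairs[:i]]
        right = [l for _, l in pairs[i:]]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        if best_score is None or score < best_score:
            best_score, best_i = score, i
    if best_i is None:
        return []
    cut = (pairs[best_i - 1][0] + pairs[best_i][0]) / 2.0
    lv, ll = zip(*pairs[:best_i])
    rv, rl = zip(*pairs[best_i:])
    return (cut_points(list(lv), list(ll), depth - 1) + [cut]
            + cut_points(list(rv), list(rl), depth - 1))

vals = [5, 57, 60, 3000, 7000, 10, 45, 8000]
labs = ['-50000'] * 3 + ['50000+'] * 2 + ['-50000'] * 2 + ['50000+']
print(cut_points(vals, labs))  # one boundary separating the two classes
----------------------------------------------------------------------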
4. Experiments

The 'census-income' data set in the UCI machine learning repository [25] is a census data set with 199,523 objects for training and 99,762 objects for testing. It has 41 attributes, 8 of which are continuous-valued. The class probabilities of the class values '-50000' and '50000+' are 93.8% and 6.2%, respectively.

C4.5 [16] was used to generate a decision tree for the whole training data. Because the data set has continuous attributes, entropy-based discretization [26] was performed before tree generation so that the decision tree could be generated rapidly. The generated tree has 1,821 nodes, of which 1,661 are leaves. After the decision tree had been generated, our sampling method was applied with a predefined node-integration limit of 30,000 training examples. Table 1 shows the groups produced by the suggested method; a worked example of the step-4 allocation follows the table. In the table, X = '(x1, x2]' means x1 < X ≤ x2.

Table 1. Groups of objects

Group no. | Property of objects | Number of objects
1 | capital_gains = '(-∞, 57]' & dividends_from_stocks = '(-∞, 0.5]' & weeks_worked_in_year = '(-∞, 0.5]' | 89,427
2 | capital_gains = '(-∞, 57]' & dividends_from_stocks = '(-∞, 0.5]' & weeks_worked_in_year = '(0.5, 50.5]' | 28,617
3 | capital_gains = '(-∞, 57]' & dividends_from_stocks = '(-∞, 0.5]' & weeks_worked_in_year = '(50.5, ∞]' & capital_loses = '(-∞, 1,881.5]' & sex = Female | 25,510
4 | capital_gains = '(-∞, 57]' & dividends_from_stocks = '(-∞, 0.5]' & weeks_worked_in_year = '(50.5, ∞]' & capital_loses = '(-∞, 1,881.5]' & sex = Male | 28,410
5 | capital_gains = '(-∞, 57]' & dividends_from_stocks = '(-∞, 0.5]' & weeks_worked_in_year = '(50.5, ∞]' & capital_loses = '(1,881.5, ∞]' | 1,132
6 | capital_gains = '(-∞, 57]' & dividends_from_stocks = '(0.5, ∞]' | 19,048
7 | capital_gains = '(57, ∞]' | 7,379
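As a worked example of the allocation rule k = (target sample size) × |Di| / |∑ Di| from step 4, the snippet below computes the per-group sample counts for the seven groups of Table 1 with a target sample size of 1,000; the rounding scheme is an illustrative assumption.

----------------------------------------------------------------------
group_sizes = [89427, 28617, 25510, 28410, 1132, 19048, 7379]  # Table 1
total = sum(group_sizes)      # 199,523: the whole training set
target = 1000
ks = [round(target * n / total) for n in group_sizes]
print(ks)       # [448, 143, 128, 142, 6, 95, 37]
print(sum(ks))  # 999, i.e. approximately the target sample size
----------------------------------------------------------------------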
Tables 2 through 5 summarize the accuracy of RBFNs under the conventional and the suggested sampling method for several sample sizes. Sampling was performed four times for each sample size of 1,000, 2,000, 5,000, and 10,000, and each RBFN was evaluated on the test data. The radial basis function used is the Gaussian.

Table 2. Accuracy of RBFNs for the conventional and suggested methods with sample size 1,000

Sampling run | Accuracy, conventional method (%) | Accuracy, suggested method (%)
1 | 93.5857 | 93.5958
2 | 94.1511 | 93.4865
3 | 93.1006 | 93.9035
4 | 93.7802 | 93.5396
Average | 93.6544 | 93.6321

Table 3. Accuracy of RBFNs for the conventional and suggested methods with sample size 2,000

Sampling run | Accuracy, conventional method (%) | Accuracy, suggested method (%)
1 | 93.5476 | 93.4614
2 | 93.6439 | 93.5948
3 | 94.6503 | 93.7992
4 | 93.3412 | 93.6970
Average | 93.79573 | 93.6336

Table 4. Accuracy of RBFNs for the conventional and suggested methods with sample size 5,000

Sampling run | Accuracy, conventional method (%) | Accuracy, suggested method (%)
1 | 93.6870 | 94.0208
2 | 94.1491 | 93.7762
3 | 93.6168 | 94.1621
4 | 93.8534 | 93.5166
Average | 93.82658 | 93.86893

Table 5. Accuracy of RBFNs for the conventional and suggested methods with sample size 10,000

Sampling run | Accuracy, conventional method (%) | Accuracy, suggested method (%)
1 | 93.5707 | 93.9666
2 | 93.7311 | 94.0739
3 | 93.7992 | 93.9015
4 | 93.9596 | 93.9356
Average | 93.76515 | 93.9694

If we compare the accuracies of the RBFNs for each sample size in the tables, the suggested sampling method performs worse than the conventional sampling method for the smaller sample sizes of 1,000 and 2,000, but the trend is reversed for the larger sample sizes of 5,000 and 10,000. In other words, the accuracy under the conventional sampling method increases roughly logarithmically, flattening out as the sample size grows, whereas the suggested method generates RBFNs whose accuracy increases monotonically with sample size. All in all, while the conventional sampling method cannot improve the performance of RBFNs much as the sample size grows, our sampling method improves the performance considerably.

5. Conclusions

Many kinds of artificial neural networks have been reported to be very successful. Among them, radial basis function networks (RBFNs) are reported to have better performance than other neural networks for some domains. Traditionally, researchers in machine learning have not dealt with very large data sets, and the same is true for artificial neural networks. On the other hand, the main concern of researchers in data mining is to find hidden, important knowledge in very large data sets, so most data mining systems based on artificial neural networks rely on some form of sampling. Because most target databases for data mining are very large, a random sampling process over the target databases is needed to train the neural networks. But conventional random sampling might not generate samples that are good for RBFNs: known weak points of RBFNs are poor performance in the presence of irrelevant features and erroneous data, and real-world data often have these characteristics. So a better way of sampling is needed if RBFNs are to be applied to data mining.

This paper presented such a sampling method for large data sets, such as census data, when the target artificial neural network is an RBFN. To train RBFNs on a very large target data set, a sampling method that considers the upper structure of a decision tree was suggested. By dividing the target data set into several groups of different sizes, based on where an object belongs in the decision tree, and randomly picking a different number of objects from each group based on the group's size, the method exploits the structure of decision trees. The census income data set in the UCI machine learning repository was selected for the experiment. The experiments showed that when the sample size is relatively small the conventional sampling method is better, but when the sample size is relatively large the suggested sampling method shows better results.

References

[1] C.M. Bishop, Neural Networks for Pattern Recognition, Oxford University Press, 1995.
[2] D.E. Rumelhart, G.E. Hinton, R.J. Williams, Learning Internal Representations by Error Propagation, in Parallel Distributed Processing, vol. 1, D.E. Rumelhart and J.L. McClelland, Eds., The MIT Press, 1986.
[3] J.J. Hopfield, Neural networks and physical systems with emergent collective computational abilities, Proceedings of the National Academy of Sciences, vol. 79, pp. 2554-2558, 1982.
[4] G.A. Carpenter, S. Grossberg, ART 3: Hierarchical Search Using Chemical Transmitters in Self-Organizing Pattern Recognition Architectures, Neural Networks, vol. 3, pp. 129-152, 1990.
[5] G.E. Hinton, T.J. Sejnowski, D.H. Ackley, Boltzmann Machines: Constraint Satisfaction Networks that Learn, Technical Report CMU-CS-84-119, Carnegie Mellon University, 1984.
[6] K. Fukushima, S. Miyake, T. Ito, Neocognitron: A neural network model for a mechanism of visual pattern recognition, IEEE Transactions on Systems, Man, and Cybernetics, vol. SMC-13, no. 5, pp. 826-834, 1983.
[7] L. Nikolaos, Radial Basis Function Networks to Hybrid Neuro-Genetic RBFNs in Financial Evaluation of Corporations, International Journal of Computers, vol. 2, issue 2, pp. 176-183, 2008.
[8] A. Hofmann, B. Sick, Evolutionary Optimization of Radial Basis Function Networks for Intrusion Detection, Proceedings of the International Joint Conference on Neural Networks, vol. 1, pp. 415-420, 2003.
[9] P. Tan, M. Steinbach, V. Kumar, Introduction to Data Mining, Addison Wesley, 2006.
[10] H. Sug, Sampling Scheme for Better RBF Network, Proceedings of the International Conference on Convergence & Hybrid Information Technology, pp. 413-416, 2009.
[11] D.E. Rumelhart, J.L. McClelland, Eds., Parallel Distributed Processing, vol. 1, The MIT Press, 1986.
[12] P. Tan, M. Steinbach, V. Kumar, Introduction to Data Mining, Addison Wesley, 2006.
[13] G. Bayar, E.I. Konukseven, A.B. Koku, Control of a Differentially Driven Mobile Robot Using Radial Basis Function Based Neural Networks, WSEAS Transactions on Systems and Control, vol. 3, issue 12, pp. 1002-1013, 2008.
[14] A. Esposito, M. Marinaro, D. Oricchio, S. Scarpetta, Approximation of Continuous and Discontinuous Mappings by a Growing Neural RBF-based Algorithm, Neural Networks, vol. 13, no. 6, pp. 651-656, 2000.
[15] O. Buchtala, M. Klimek, B. Sick, Evolutionary Optimization of Radial Basis Function Classifiers for Data Mining Applications, IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 35, no. 5, pp. 928-947, 2005.
[16] J.R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers, 1993.
[17] L. Breiman, J. Friedman, R. Olshen, C. Stone, Classification and Regression Trees, Wadsworth International Group, 1984.
[18] M. Mehta, R. Agrawal, J. Rissanen, SLIQ: A Fast Scalable Classifier for Data Mining, Proceedings of EDBT'96, Avignon, France, 1996.
[19] J. Shafer, R. Agrawal, M. Mehta, SPRINT: A Scalable Parallel Classifier for Data Mining, Proceedings of the International Conference on Very Large Data Bases, Bombay, India, pp. 544-555, 1996.
[20] R. Rastogi, K. Shim, PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning, Data Mining and Knowledge Discovery, vol. 4, no. 4, pp. 315-344, 2000.
[21] K. Fukunaga, R.R. Hayes, Effects of Sample Size in Classifier Design, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 11, no. 8, pp. 873-885, 1989.
[22] S.J. Raudys, A.K. Jain, Small Sample Size Effects in Statistical Pattern Recognition: Recommendations for Practitioners, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 13, no. 3, pp. 252-264, 1991.
[23] M.A. Mazurowski, P.A. Habas, J.M. Zurada, J.Y. Lo, J.A. Baker, G.D. Tourassi, Training neural network classifiers for medical decision making: The effects of imbalanced datasets on classification performance, Neural Networks, vol. 21, issues 2-3, pp. 427-436, 2008.
[24] B. Sahiner, H. Chan, L. Hadjiiski, Classifier performance estimation under the constraint of a finite sample size: Resampling scheme applied to neural network classifiers, Neural Networks, vol. 21, issues 2-3, pp. 476-483, 2008.
[25] D. Newman, UCI KDD Archive [http://kdd.ics.uci.edu], University of California, Irvine, Department of Information and Computer Science, 2005.
[26] H. Liu, F. Hussain, C.L. Tan, M. Dash, Discretization: An Enabling Technique, Data Mining and Knowledge Discovery, vol. 6, no. 4, pp. 393-423, 2002.

Author

Hyontai Sug received a B.S. in Computer Science and Statistics from Pusan National University, Busan, Korea, in 1983, an M.S. in Computer Science from Hankuk University of Foreign Studies, Seoul, Korea, in 1986, and a Ph.D. in Computer Engineering from the University of Florida in 1998. He was a researcher at the Agency for Defense Development, Korea, from 1986 to 1992, and a full-time lecturer at Pusan University of Foreign Studies, Busan, Korea, from 1999 to 2001. He has been an associate professor at Dongseo University, Busan, Korea, since 2001. His research interests include data mining, knowledge engineering, and databases.