Procedia Computer Science 30 (2014) 75 – 80
1st International Conference on Data Science, ICDS 2014
Supervised Discretization for Optimal Prediction
Wenxue Huang (a), Yuanyi Pan (b,∗), Jianhong Wu (c)
(a) Department of Mathematics, Guangzhou University, Guangzhou, Guangdong 510006, China
(b) InferSystems Corp., 20 Queen Street West, Suite 316, Toronto, Ontario, Canada, M5H 3R3
(c) Department of Mathematics and Statistics, York University, Toronto, Ontario, Canada, M3J 1P3
Abstract
When data are high dimensional, with a categorical response variable and mixed-type explanatory variables, a conveniently executable profile usually consists of categorical or categorized variables. This requires converting continuous variables into categorical ones. A supervised discretization algorithm for optimal prediction (with the GK-lambda) is proposed and compared with the supervised discretization for proportional prediction proposed in 1. Tests with data sets from the UCI Machine Learning Repository are presented.
© 2014 The Authors. Published by Elsevier B.V. Open access under CC BY-NC-ND license.
Selection and peer-review under responsibility of the Organizing Committee of ICDS 2014.
Keywords: optimal prediction; proportional prediction; supervised discretization; the GK-lambda; the GK-tau
1. Introduction
In data mining and machine learning, a discretization means a categorization of a continuous variable into certain levels. For example, an individual's income can be leveled as low, medium or high; ages can be grouped in five-year steps. Assume that we work with a categorical response variable and explanatory variables among which some or all are continuous. For the sake of easily executable profiling, or of consistency with descriptive, analytical or averaging-effect oriented proportional prediction, an appropriate discretization is called for. Also, many data mining techniques prefer categorical explanatory variables. The naive Bayes classification model 2, for instance, is applied in many fields (when the explanatory variables are mutually independent) because of its simple and thus quick estimation of conditional probabilities. One of its basic assumptions is that the explanatory variables are all categorical. Another example is the decision tree 3. Each node in a decision tree is a condition that leads to the next node. The variables involved in each node then have to be either categorical or described as a combination of intervals.
However, many real data sets from industrial applications contain continuous variables such as income, age, interest rate, consumption amount, measure of risk, etc. One practical solution to this issue is to treat each distinct value as a member of an appropriate category.
∗ Corresponding author: Yuanyi Pan. Tel.: +1-647-504-0617. E-mail address: [email protected]
doi:10.1016/j.procs.2014.05.383
A natural way to group the distinct values of a continuous variable is to find data-driven cutting points that cut the whole range of the data into intervals. There are two ways to identify the intervals: with or without a response (or target) variable. Grouping a continuous variable with respect to a target (under a given criterion or objective function) is called supervised discretization, while grouping with no link to the response variable is called unsupervised discretization 4.
Unsupervised discretizations are of interest to data projects with a large number of response variables, and there are quite a few unsupervised discretization algorithms. For macro social or economic data, or category product consumption data, an unsupervised discretization algorithm can be based on normal distributions, due to the central limit theorem. Other unsupervised discretization methods include equal interval width and equal frequency intervals 4; a small sketch of these two schemes is given at the end of this paragraph. More sophisticated unsupervised methods require certain quality measures to decide where to cut. One popular measure is the information-theoretical entropy 5,6; the idea is to minimize the entropy in each interval by adjusting the boundaries. Another big family of discretization methods is the clustering technologies 7. Although most of the methods in this family are applied to multi-dimensional cases, the simplest application of k-means 8 can also group a one-dimensional continuous variable into k parts.
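For concreteness, here is a minimal sketch (ours, not from the paper) of the two simple unsupervised schemes mentioned above, equal interval width and equal frequency intervals; the NumPy-based function names are illustrative only.

```python
import numpy as np

def equal_width_cuts(x, k):
    """Return k-1 interior cut points splitting the range of x into k equal-width intervals."""
    lo, hi = float(np.min(x)), float(np.max(x))
    return np.linspace(lo, hi, k + 1)[1:-1]

def equal_frequency_cuts(x, k):
    """Return cut points at empirical quantiles so that the k intervals hold roughly equal
    counts (duplicated quantiles are dropped, which may reduce the number of intervals)."""
    qs = np.linspace(0.0, 1.0, k + 1)[1:-1]
    return np.unique(np.quantile(x, qs))

# Example: bin a skewed variable into 5 levels (interval index 0..4 for each observation).
x = np.random.default_rng(0).exponential(scale=2.0, size=1000)
levels_ew = np.digitize(x, equal_width_cuts(x, 5))
levels_ef = np.digitize(x, equal_frequency_cuts(x, 5))
```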
On the other hand, supervised discretization algorithms tune the boundaries by optimizing each interval's coherence 9 with respect to a target variable under an optimization criterion. An evaluation function is usually applied to measure the discretization's quality. Typical measures include the Chi-square and the conditional entropy. Chi-square based methods include ChiMerge 10, Chi2 11, Khiops 12, etc. Entropy-based methods include the ones in 13, 14, 15, etc. A simpler supervised method is Holte's 1R algorithm 16, which requires nothing but a minimum size for each interval with a maximum number of the preferred class.
Which discretization method to choose depends on the project objective, the computing time complexity, the desired accuracy, the framework the method is applied to, and the understandability 9 and executability of the result. Nevertheless, unsupervised discretization methods are expected to be faster than supervised methods but less accurate in predicting the target. Experimental evidence by Dougherty et al. 4 shows that entropy-based discretization methods may perform quite well overall regarding accuracy.
Rather than an entropy-based discretization method, a Gini-concentration based discretization algorithm associated with the Goodman-Kruskal tau (the GK-tau hereafter) was proposed in 1 to evaluate the cutting result. Most of the time the entropy-based and the Gini-based measures are equivalent; we prefer the Gini because the Gini-based GK-tau measure is more directly readable and interpretable than its entropy counterpart 17.
The GK-tau (see 18, Section 9) is "a normalized conditional Gini concentration, measuring an averaging-effect oriented proportional global-to-global association" 19. Thus it does not necessarily lead to the (maximum-likelihood based) optimal prediction. In other words, this approach focuses more on matching the distribution of the predicted values with that of the real ones.
The GK-tau may therefore not be the best option when the point-hit accuracy is the primary concern. The authors believe that the Goodman-Kruskal λ (the GK-lambda hereafter), rather than the GK-tau, is a better choice under this concern.
It should be noted that although there are many prediction procedures and corresponding ways of measuring prediction accuracy, we focus only on two major methods: prediction by mode and prediction by simulation. For a value x of a categorical explanatory variable with its conditional probabilities, predicting by mode predicts the target as the class with the maximal conditional probability, while predicting by simulation randomly draws the target according to the conditional probabilities. The accuracy measure is the simple match rate, although a distribution distance measure such as the Kullback-Leibler distance 20 can also be used to evaluate the prediction. A short sketch of the two prediction rules follows this paragraph.
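The following sketch is our illustration only (the function names are ours) of the two prediction rules applied to one value of the explanatory variable.

```python
import numpy as np

def predict_by_mode(cond_probs):
    """Return the class index with maximal conditional probability p(Y = i | X = x)."""
    return int(np.argmax(cond_probs))

def predict_by_simulation(cond_probs, rng=None):
    """Randomly draw a class index according to p(Y = i | X = x); on average this matches
    the conditional distribution rather than maximizing the point-hit accuracy."""
    rng = np.random.default_rng() if rng is None else rng
    return int(rng.choice(len(cond_probs), p=cond_probs))

# Example: for one value x with p(Y | X = x) = (0.5, 0.3, 0.2), prediction by mode always
# returns class 0; prediction by simulation returns 0, 1, 2 with probabilities 0.5, 0.3, 0.2.
p = np.array([0.5, 0.3, 0.2])
print(predict_by_mode(p), predict_by_simulation(p))
```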
This paper is organized as follows. Section 2 recalls the definitions of the GK-tau and the GK-lambda and introduces their implementation in a discretization framework. Section 3 compares the two measures experimentally on data sets from the UCI Machine Learning Repository. We present some general discussions about discretization, its applications and future work in the last section.
2. Discretization with the GK-tau and the GK-lambda
Recall that for a categorical explanatory variable X with domain Dmn(X) = {1, 2, ..., n_X} and a categorical target variable Y with domain Dmn(Y) = {1, 2, ..., n_Y}, the GK-tau, denoted by τ(Y|X), is given by

$$
\tau(Y|X) = \frac{\sum_{i=1}^{n_Y}\sum_{j=1}^{n_X} p(Y=i;\,X=j)^2/p(X=j) \;-\; \sum_{i=1}^{n_Y} p(Y=i)^2}{1 - \sum_{i=1}^{n_Y} p(Y=i)^2}
          = \frac{\sum_{i=1}^{n_Y}\sum_{j=1}^{n_X} p(Y=i\,|\,X=j)\,p(Y=i;\,X=j) \;-\; E(p(Y))}{1 - E(p(Y))}
$$
where p(·) is the probability of an event and p(Y = i | X = j) is the conditional probability of Y = i given X = j.
Meanwhile, the GK-lambda, denoted by λ(Y|X), can be written as follows:

$$
\lambda(Y|X) = \frac{\sum_{j=1}^{n_X} \rho_{jm} - \rho_{\cdot m}}{1 - \rho_{\cdot m}}
$$

where

$$
\rho_{jm} = \max_{i \in \{1,2,\ldots,n_Y\}} \{p(X=j;\,Y=i)\}
\qquad\text{and}\qquad
\rho_{\cdot m} = \max_{i \in \{1,2,\ldots,n_Y\}} \{p(Y=i)\}.
$$
Given that the theoretical simple match rate for predicting by mode without the information of X is ρ_{·m}, λ(Y|X) is the error reduction rate (or, equivalently, the accuracy lift rate) provided by the information of X over the marginal information of Y for prediction by mode. Thus it aims to maximize the prediction accuracy under prediction by mode. Similarly, E(p(Y)) is the theoretical prediction accuracy for prediction by simulation, and τ(Y|X) is the corresponding error reduction rate (or, equivalently, accuracy lift rate).
Please refer to 19,21 for more details about τ(Y|X). Further discussion regarding λ(Y|X) can be found in 18. A sketch of both measures computed from a joint frequency table is given below.
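The following sketch (our illustration; the function names are ours) computes both measures from a joint frequency table whose rows index the levels of X and whose columns index the classes of Y.

```python
import numpy as np

def gk_tau(joint):
    """Goodman-Kruskal tau(Y|X) from a joint frequency table joint[j, i] = #(X=j, Y=i)."""
    p = joint / joint.sum()
    px = p.sum(axis=1, keepdims=True)   # marginal p(X = j), as a column vector
    py = p.sum(axis=0)                  # marginal p(Y = i)
    e_py = float(np.sum(py ** 2))       # E(p(Y))
    with np.errstate(divide="ignore", invalid="ignore"):
        num = np.where(px > 0, p ** 2 / px, 0.0).sum()
    return (num - e_py) / (1.0 - e_py)

def gk_lambda(joint):
    """Goodman-Kruskal lambda(Y|X) from the same joint frequency table."""
    p = joint / joint.sum()
    rho_jm = p.max(axis=1).sum()        # sum_j max_i p(X = j; Y = i)
    rho_m = p.sum(axis=0).max()         # max_i p(Y = i)
    return (rho_jm - rho_m) / (1.0 - rho_m)

# Example: 3 levels of X (rows) against 2 classes of Y (columns), raw counts.
counts = np.array([[30., 10.], [5., 25.], [20., 10.]])
print(gk_tau(counts), gk_lambda(counts))
```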
Now consider a data set with a continuous variable X and a categorical nominal variable Y as defined above. Suppose C_k = {c_1, ..., c_k} is a set of distinct real numbers with c_1 < c_2 < ... < c_k. Then C_k can be used to cut X into at most k + 1 intervals: (−∞, c_1], (c_1, c_2], ..., (c_k, +∞), and τ for a given cutting C_k can be defined as

$$
\tau(Y|X(C_k)) = \frac{\sum_{i=1}^{n_Y}\sum_{j=1}^{k+1} p(Y=i;\, c_{j-1} < X \le c_j)^2 / p(c_{j-1} < X \le c_j) \;-\; \sum_{i=1}^{n_Y} p(Y=i)^2}{1 - \sum_{i=1}^{n_Y} p(Y=i)^2}
$$

where c_0 = −∞ and c_{k+1} = +∞.
The corresponding formula for λ is as follows:

$$
\lambda(Y|X(C_k)) = \frac{\sum_{j=1}^{k+1} \max_{i \in \{1,2,\ldots,n_Y\}} \{p(c_{j-1} < X \le c_j;\, Y=i)\} \;-\; \max_{i \in \{1,2,\ldots,n_Y\}} \{p(Y=i)\}}{1 - \max_{i \in \{1,2,\ldots,n_Y\}} \{p(Y=i)\}}.
$$
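As an illustration only (not the authors' implementation), both interval versions can be evaluated by binning X at the cut points and applying a table-level measure such as the gk_tau / gk_lambda sketches given earlier; the helper names below are hypothetical.

```python
import numpy as np

def joint_table_for_cuts(x, y, cuts):
    """Bin the continuous x at the cut points (right-closed intervals (c_{j-1}, c_j], as in
    the text) and cross-tabulate the interval index against the classes of y."""
    bins = np.sort(np.asarray(cuts, dtype=float))
    labels = np.digitize(x, bins, right=True)       # interval index j = 0..len(cuts)
    classes, y_idx = np.unique(y, return_inverse=True)
    joint = np.zeros((len(bins) + 1, len(classes)))
    np.add.at(joint, (labels, y_idx), 1.0)          # accumulate joint counts
    return joint

def measure_for_cuts(x, y, cuts, table_measure):
    """tau(Y|X(C_k)) or lambda(Y|X(C_k)), where table_measure is a function of a joint
    frequency table, e.g. the gk_tau / gk_lambda sketches above."""
    return table_measure(joint_table_for_cuts(x, y, cuts))
```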
A stepwise merging method similar to the greedy searching scheme suggested in 1 is described below to select the cutting points in X; a short code sketch follows the steps.
1. Create the initial cutting points C_K for X by an unsupervised discretization method;
2. Select all the cutting points as the boundaries, denoted B_K;
3. Loop over the following steps until the stopping condition is met:
   (a) Suppose B_m is the current set of boundaries;
   (b) If the number of intervals is no larger than θ_b, where θ_b is the predefined maximum number of intervals, stop the loop;
   (c) Otherwise, generate B_{m−1} by removing the boundary b_k from B_m such that b_k = arg max_{b ∈ B_m} τ(Y|X(B_m \ {b})) if τ is the measure, or b_k = arg max_{b ∈ B_m} λ(Y|X(B_m \ {b})) if λ is the measure.
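Below is a compact sketch of the merging loop, reflecting our reading of the steps above (in particular, the assumption that the loop stops once the number of intervals has been reduced to θ_b); the score argument and the function name are ours.

```python
def merge_discretize(initial_cuts, max_intervals, score):
    """Greedy merging: starting from the unsupervised cut points, repeatedly drop the
    boundary whose removal leaves the largest measure, until at most max_intervals
    intervals remain. score(cuts) is assumed to return tau(Y|X(cuts)) or lambda(Y|X(cuts)),
    e.g. via the measure_for_cuts sketch above. Ties are broken by the middle candidate,
    as discussed in the text that follows."""
    cuts = sorted(initial_cuts)
    while cuts and len(cuts) + 1 > max_intervals:
        scores = [score(cuts[:i] + cuts[i + 1:]) for i in range(len(cuts))]
        best = max(scores)
        tied = [i for i, s in enumerate(scores) if s == best]
        cuts.pop(tied[len(tied) // 2])   # remove the middle of the tied best candidates
    return cuts
```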
Basically, this scheme checks all the available cutting points, finds the one whose removal leaves the remaining cutting points with the largest measure, τ or λ, and stops only when the maximum number of intervals is reached. It is apparently not a fancy approach, but it has been widely used in various discretization algorithms according to 9.
One technical issue regarding this scheme is how to select among multiple cutting points that yield the same measure. For the sake of convenience, we choose the one in the middle. For example, if 1, 5 and 20 have the same λ, then 5 is chosen as the cutting point for the next round. A consequence of this work-around is that the final discretization may not have the maximum measure, so the prediction based on the discretization may not be as good as expected.
This is also a reason why we use a merging rather than a splitting scheme as proposed in 1. Since λ looks at only one probability for a given explanatory value while τ looks at all of them, the λ-based search is less sensitive to changes of the distribution than the τ-based search. Hence a stepwise λ-based search has a greater chance to misstep during the process and thus fail to find the discretization with the maximum λ. By using a merging scheme, this chance is expected to be minimized.
3. Experiment
The major goal of this experiment is to show the difference between the λ-based discretization and the τ-based discretization. As discussed above, the λ-based discretization is supposed to have better prediction accuracy than the τ-based discretization when the prediction method is prediction by mode, while the latter works better under prediction by simulation. Usually such a statement is supported by an experiment on learning data sets and test data sets: the learning sets are used to generate the discretization criteria, i.e., the boundaries for the predictors; the same variables in the test sets are discretized by these boundaries and the target variables are predicted; the comparisons are then illustrated by the prediction accuracy lift. But a discretization of a single variable does not need such complexity, since λ and τ themselves already give the theoretical accuracy lift under the two prediction methods, per the aforementioned discussion. If λ after λ-based discretization is greater than λ after τ-based discretization, the prediction accuracy lift by mode for the former is expected to be better than for the latter. Similarly, if τ after τ-based discretization is greater than τ after λ-based discretization, the prediction accuracy lift by simulation for the former is expected to be better than for the latter.
The first data set, Covertype, is provided by J. A. Blackard et al. 22 from the UCI Machine Learning Repository. It has 581,012 rows and 10 numerical variables, and the target variable has 7 classes. With the initial number of intervals for the unsupervised discretization set to 100, the unsupervised discretization method based on equal frequencies, and the final number of intervals for the supervised discretization set to 5, we obtain the results shown in Table 1.
Table 1: Results for Covertype (ρ·m = 0.4876, E(p(Y)) = 0.3769)

Numerical variable                    λ after λ discret.   λ after τ discret.   τ after λ discret.   τ after τ discret.
Slope                                 0.0005               0                    0.3811               0.3848
Vertical Distance To Hydrology        0.0076               0                    0.3790               0.3803
Elevation                             0.3601               0.3510               0.5231               0.5309
Hillshade 3pm                         0                    0                    0.3791               0.3805
Hillshade Noon                        0.0016               0                    0.3800               0.3824
Hillshade 9am                         0.0006               0                    0.3799               0.3826
Horizontal Distance To Roadways       0.0104               0                    0.3836               0.3913
Aspect                                0.0074               0                    0.3797               0.3821
Horizontal Distance To Hydrology      0.0077               0.0021               0.3794               0.3805
Horizontal Distance To Fire Points    0.0163               0.0085               0.3870               0.3919
All numerical variables but one in this example have a higher λ after λ-based discretization; all numerical variables have a higher τ after τ-based discretization.
Seven other data sets from UCI are chosen to further support the previous discussion. The discretization parameters are the same as for Covertype. The results are listed in Table 2; the calculation details are available upon request to the authors. Please note that

• Case 1: λ is greater after λ-based discretization;
• Case 2: λ is smaller after λ-based discretization;
• Case 3: τ is greater after τ-based discretization;
• Case 4: τ is smaller after τ-based discretization.

The number of variables for which both schemes share the same λ is the number of numerical variables minus the counts of Case 1 and Case 2; the same holds for τ. For example, for Glass, with 9 numerical variables, 5 in Case 1 and 0 in Case 2, both schemes share the same λ on 9 − 5 − 0 = 4 variables.
Table 2: Results for other data sets from UCI

Source          File              Rows   Classes   ρ·m      E(p(Y))   Vars.   Case 1   Case 2   Case 3   Case 4
Glass 23        NA                214    7         0.3551   0.2633    9       5        0        5        1
Image 24        NA                2310   7         0.1429   0.1429    19      12       2        14       2
Mfeat 25        mfeat-fou         2000   10        0.1      0.1       76      70       3        75       0
Page-blocks 26  NA                5473   5         0.8977   0.8102    10      4        0        9        1
Thyroid 27      thyroid0387       9172   5         0.5989   0.4385    7       3        0        7        0
Waveform 28     waveform-+noise   5000   3         0.3384   0.3334    40      39       1        39       1
weightlift 29   NA                4024   5         0.3405   0.2866    50      43       4        45       4
When the file information is NA, it means there is only one file from the corresponding source. Only the numerical variables are tested. The difference between an integer numerical variable and a continuous numerical variable is not a concern in this experiment.
Table 2 suggests that, statistically, λ-based discretization is expected to give a better accuracy lift for prediction by mode and τ-based discretization is expected to give a better accuracy lift for prediction by simulation, although there are still negative cases (Cases 2 and 4) in each example. The reason for these negative cases is the statistical uncertainty introduced by the stepwise discretization scheme. These results also support the previous statement that the τ-based discretization is more sensitive to changes of the distribution, since the Case 3 counts are greater than the Case 1 counts on average.
4. Discussion and future work
We present a new supervised discretization scheme for optimal prediction, based on the GK-lambda. Like the GK-tau, the global-to-global association measure we used in 1, it is more interpretable than the entropy measure. The difference between the GK-lambda and the GK-tau is that the GK-lambda looks at the prediction accuracy lift while the GK-tau tries to match the target variable's distribution. Our experiment shows that the GK-lambda gives a better accuracy lift for prediction by mode and the GK-tau works better for prediction by simulation. The experiment also shows that the GK-tau is more sensitive to changes of the distribution and thus has fewer negative cases against its strength.
Ongoing work is to find a discretization that minimizes the uncertainty from the stepwise discretization scheme. Other work may relate to the performance of both measures in feature selection and to how they work under other prediction evaluation criteria, e.g., lift-curve based measures, the distribution of the predicted target, etc.
References
1. W. Huang, Y. Pan, J. Wu, Supervised discretization with GK-τ, Procedia Computer Science 17 (2013) 114–120.
2. I. Rish, An empirical study of the naive bayes classifier, in: IJCAI 2001 workshop on empirical methods in artificial intelligence, Vol. 3,
2001, pp. 41–46.
3. S. Safavian, D. Landgrebe, A survey of decision tree classifier methodology, Systems, Man and Cybernetics, IEEE Transactions on 21 (3)
(1991) 660–674.
4. J. Dougherty, R. Kohavi, M. Sahami, Supervised and unsupervised discretization of continuous features, in: Machine Learning: Proceedings
of the Twelfth International Conference, Morgan Kaufmann Publishers, Inc., 1995, pp. 194–202.
5. M. R. Chmielewski, J. W. Grzymala-Busse, Global discretization of continuous attributes as preprocessing for machine learning, International
journal of approximate reasoning 15 (4) (1996) 319–331.
6. U. Fayyad, K. Irani, Multi-interval discretization of continuous-valued attributes for classification learning, Proceedings of the International
Joint Conference on Uncertainty in AI.
7. G. Gan, C. Ma, J. Wu, Data clustering: Theory, algorithms, and applications (asa-siam series on statistics and applied probability). 2007,
Society for Industrial & Applied Mathematics, USA.
8. J. MacQueen, Some methods for classification and analysis of multivariate observations, in: Proceedings of the fifth Berkeley symposium on
mathematical statistics and probability, California, USA, 1967, pp. 281–297.
9. S. Kotsiantis, D. Kanellopoulos, Discretization techniques: A recent survey, GESTS International Transactions on Computer Science and
Engineering 32 (1) (2006) 47–58.
10. R. Kerber, Chimerge: Discretization of numeric attributes, in: Proceedings of the tenth national conference on Artificial intelligence, AAAI
Press, 1992, pp. 123–128.
11. H. Liu, R. Setiono, Chi2: Feature selection and discretization of numeric attributes, in: In Proceedings of the Seventh International Conference
on Tools with Artificial Intelligence, IEEE, 1995, pp. 388–391.
12. M. Boulle, Khiops: A statistical discretization method of continuous attributes, Machine Learning 55 (1) (2004) 53–69.
13. D. Chiu, B. Cheung, A. Wong, Information synthesis based on hierarchical maximum entropy discretization, Journal of Experimental &
Theoretical Artificial Intelligence 2 (2) (1990) 117–129.
14. J. Catlett, On changing continuous attributes into ordered discrete attributes, in: Machine Learning – WSL-91, Springer, 1991, pp. 164–178.
15. K. Ting, Discretization of continuous-valued attributes and instance-based learning, Basser Department of Computer Science, University of
Sydney, 1994.
16. R. Holte, Very simple classification rules perform well on most commonly used datasets, Machine learning 11 (1) (1993) 63–90.
17. W. Huang, M. Vainder, Dependence degree and feature selection for categorical data, Workshop on Data Mining Methodology and Applications at The Fields Institute.
18. L. Goodman, W. Kruskal, Measures of association for cross classifications, Journal of the American Statistical Association 49 (268) (1954) 732–764.
19. W. Huang, Y. Shi, X. Wang, Nominal association vector and matrix, arXiv preprint arXiv:1109.2553.
20. S. Kullback, R. Leibler, On information and sufficiency, Ann. Math. Stat. 22 (1951) 79–86.
21. C. J. Lloyd, Statistical analysis of categorical data, A Wiley-Interscience publication, Wiley, New York, NY, USA, 1999.
22. J. A. Blackard, D. J. Dean, C. W. Anderson, UCI machine learning repository, http://archive.ics.uci.edu/ml/datasets/
Covertype (1998).
23. B. German, UCI machine learning repository, http://archive.ics.uci.edu/ml/datasets/Glass+Identification (1987).
24. C. Brodley, UCI machine learning repository, http://archive.ics.uci.edu/ml/datasets/Image+Segmentation (1990).
25. R. P. Duin, UCI machine learning repository, http://archive.ics.uci.edu/ml/datasets/Multiple+Features (1999).
26. D. Malerba, UCI machine learning repository, http://archive.ics.uci.edu/ml/datasets/Page+Blocks+Classification (1995).
27. R. Quinlan, UCI machine learning repository, http://archive.ics.uci.edu/ml/datasets/Thyroid+Disease (1987).
28. L. Breiman, J. Friedman, R. Olshen, C. Stone, UCI machine learning repository, http://archive.ics.uci.edu/ml/datasets/Waveform+Database+Generator+%28Version+1%29 (1988).
29. W. Ugulino, E. Velloso, UCI machine learning repository, http://archive.ics.uci.edu/ml/datasets/Weight+Lifting+
Exercises+monitored+with+Inertial+Measurement+Units (2013).