Procedia Computer Science 30 (2014) 75 – 80
1st International Conference on Data Science, ICDS 2014
Supervised Discretization for Optimal Prediction
Wenxue Huang (a), Yuanyi Pan (b,∗), Jianhong Wu (c)
(a) Department of Mathematics, Guangzhou University, Guangzhou, Guangdong 510006, China
(b) InferSystems Corp., 20 Queen Street West, Suite 316, Toronto, Ontario, Canada, M5H 3R3
(c) Department of Mathematics and Statistics, York University, Toronto, Ontario, Canada, M3J 1P3
Abstract
When data are high dimensional, with a categorical response variable and mixed-type explanatory variables, a conveniently executable profile usually consists of categorical or categorized variables. This requires converting continuous variables into categorical ones. A supervised discretization algorithm for optimal prediction (with the GK-lambda) is proposed and compared with the supervised discretization for proportional prediction proposed in 1. Tests with data sets from the UCI Machine Learning Repository are presented.
© 2014 The Authors. Published by Elsevier B.V. Open access under CC BY-NC-ND license.
Selection and peer-review under responsibility of the Organizing Committee of ICDS 2014.
Keywords: optimal prediction; proportional prediction; supervised discretization; the GK-lambda; the GK-tau
1. Introduction
In data mining and machine learning, a discretization means a categorization of a continuous variable into certain levels. For example, an individual's income can be leveled as low, medium or high; ages can be grouped in five-year steps. Assume that we work with a categorical response variable and explanatory variables among which some or all are continuous. For the sake of easily executable profiling, or of consistency with descriptive, analytical or averaging-effect oriented proportional prediction, an appropriate discretization is called for. Also, many data mining techniques prefer categorical explanatory variables. The naive Bayes classification model 2, for instance, is applied in many fields (when the explanatory variables are mutually independent) because of its simple and thus quick estimation of conditional probabilities. One of its basic assumptions is that the explanatory variables are all categorical. Another example is the decision tree 3. Each node in a decision tree is a condition that leads to the next node. The variables involved in each node then have to be either categorical or described as a combination of intervals.
However, many real data sets from industrial applications contain continuous variables such as income, age, interest rate, consumption amount, measure of risk, etc. One practical solution to this issue is to treat each distinct value as a member of an appropriate category.
∗ Corresponding author: Yuanyi Pan. Tel.: +1-647-504-0617. E-mail address: [email protected]
doi:10.1016/j.procs.2014.05.383
A natural way to group the distinct values of a continuous variable is to find data-driven cutting points that cut the whole range of the data into intervals. There are two ways to identify the intervals: with or without a response (or target) variable. Grouping a continuous variable with respect to a target (under a given criterion or objective function) is called supervised discretization, while grouping with no link to the response variable is called unsupervised discretization 4.
Unsupervised discretizations are of interest to data projects with a large number of response variables, and there are quite a few unsupervised discretization algorithms. For macro social or economic data, or category product consumption data, an unsupervised discretization algorithm can be based on normal distributions, due to the central limit theorem. Other unsupervised discretization methods include equal interval width and equal frequency intervals 4; a small sketch of these two schemes is given at the end of this paragraph. More sophisticated unsupervised methods require certain quality measures to decide where to cut. One popular measure is the information-theoretical entropy 5,6; the idea is to minimize the entropy in each interval by adjusting the boundaries. Another big family of discretization methods is the clustering technologies 7. Although most of the methods in this family are applied to multi-dimensional cases, the simplest application of k-means 8 can also group a one-dimensional continuous variable into k parts.
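For concreteness, here is a minimal sketch (ours, not from the paper) of the two simple unsupervised schemes mentioned above, equal interval width and equal frequency intervals; the NumPy-based function names are illustrative only.

```python
import numpy as np

def equal_width_cuts(x, k):
    """Return k-1 interior cut points splitting the range of x into k equal-width intervals."""
    lo, hi = float(np.min(x)), float(np.max(x))
    return np.linspace(lo, hi, k + 1)[1:-1]

def equal_frequency_cuts(x, k):
    """Return cut points at empirical quantiles so that the k intervals hold roughly equal
    counts (duplicated quantiles are dropped, which may reduce the number of intervals)."""
    qs = np.linspace(0.0, 1.0, k + 1)[1:-1]
    return np.unique(np.quantile(x, qs))

# Example: bin a skewed variable into 5 levels (interval index 0..4 for each observation).
x = np.random.default_rng(0).exponential(scale=2.0, size=1000)
levels_ew = np.digitize(x, equal_width_cuts(x, 5))
levels_ef = np.digitize(x, equal_frequency_cuts(x, 5))
```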
On the other hand, supervised discretization algorithms tune the boundaries by optimizing each interval's coherence 9 with respect to a target variable under an optimization criterion. An evaluation function is usually applied to measure the discretization's quality. Typical measures include the Chi-square and the conditional entropy. Chi-square based methods include ChiMerge 10, Chi2 11, Khiops 12, etc. Entropy-based methods include the ones in 13, 14, 15, etc. A simpler supervised method is Holte's 1R algorithm 16, which requires nothing but a minimum size for each interval with a maximum number of the preferred class.
Which discretization method to choose depends on the project objective, the computing time complexity, the desired accuracy, the framework the method is applied to, and the understandability 9 and executability of the result. Nevertheless, unsupervised discretization methods are expected to be faster than supervised methods but less accurate in predicting the target. Experimental evidence by Dougherty et al. 4 shows that entropy-based discretization methods may perform quite well overall regarding accuracy.
Rather than an entropy-based discretization method, a Gini-concentration based discretization algorithm associated with the Goodman-Kruskal tau (the GK-tau hereafter) was proposed in 1 to evaluate the cutting result. Most of the time the entropy-based and the Gini-based measures are equivalent; we prefer the Gini because the Gini-based GK-tau measure is more directly readable and interpretable than its entropy counterpart 17.
The GK-tau (see 18, Section 9) is "a normalized conditional Gini concentration, measuring an averaging-effect oriented proportional global-to-global association" 19. Thus it does not necessarily lead to the (maximum-likelihood based) optimal prediction. In other words, this approach focuses more on matching the distribution of the predicted values with that of the real ones.
The GK-tau may therefore not be the best option when the point-hit accuracy is the primary concern. The authors believe that the Goodman-Kruskal λ (the GK-lambda hereafter), rather than the GK-tau, is a better choice under this concern.
It should be noted that although there are many prediction procedures and corresponding ways of measuring prediction accuracy, we focus only on two major methods: prediction by mode and prediction by simulation. For a value x of a categorical explanatory variable with its conditional probabilities, predicting by mode predicts the target as the class with the maximal conditional probability, while predicting by simulation randomly draws the target according to the conditional probabilities. The accuracy measure is the simple match rate, although a distribution distance measure such as the Kullback-Leibler distance 20 can also be used to evaluate the prediction. A short sketch of the two prediction rules follows this paragraph.
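The following sketch is our illustration only (the function names are ours) of the two prediction rules applied to one value of the explanatory variable.

```python
import numpy as np

def predict_by_mode(cond_probs):
    """Return the class index with maximal conditional probability p(Y = i | X = x)."""
    return int(np.argmax(cond_probs))

def predict_by_simulation(cond_probs, rng=None):
    """Randomly draw a class index according to p(Y = i | X = x); on average this matches
    the conditional distribution rather than maximizing the point-hit accuracy."""
    rng = np.random.default_rng() if rng is None else rng
    return int(rng.choice(len(cond_probs), p=cond_probs))

# Example: for one value x with p(Y | X = x) = (0.5, 0.3, 0.2), prediction by mode always
# returns class 0; prediction by simulation returns 0, 1, 2 with probabilities 0.5, 0.3, 0.2.
p = np.array([0.5, 0.3, 0.2])
print(predict_by_mode(p), predict_by_simulation(p))
```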
This paper is organized as follows. Section 2 recalls the definitions of the GK-tau and the GK-lambda and introduces their implementation in a discretization framework. Section 3 compares the two measures experimentally on data sets from the UCI Machine Learning Repository. We present some general discussions about discretization, its applications and future work in the last section.
2. Discretization with the GK-tau and the GK-lambda
Recall that for a categorical explanatory variable X with domain Dmn(X) = {1, 2, ..., n_X} and a categorical target variable Y with domain Dmn(Y) = {1, 2, ..., n_Y}, the GK-tau, denoted by τ(Y|X), is given by

$$
\tau(Y|X) = \frac{\sum_{i=1}^{n_Y}\sum_{j=1}^{n_X} p(Y=i;\,X=j)^2/p(X=j) \;-\; \sum_{i=1}^{n_Y} p(Y=i)^2}{1 - \sum_{i=1}^{n_Y} p(Y=i)^2}
          = \frac{\sum_{i=1}^{n_Y}\sum_{j=1}^{n_X} p(Y=i\,|\,X=j)\,p(Y=i;\,X=j) \;-\; E(p(Y))}{1 - E(p(Y))}
$$
where p(·) is the probability of an event and p(Y = i | X = j) is the conditional probability of Y = i given X = j.
Meanwhile, the GK-lambda, denoted by λ(Y|X), can be written as follows:

$$
\lambda(Y|X) = \frac{\sum_{j=1}^{n_X} \rho_{jm} - \rho_{\cdot m}}{1 - \rho_{\cdot m}}
$$

where

$$
\rho_{jm} = \max_{i \in \{1,2,\ldots,n_Y\}} \{p(X=j;\,Y=i)\}
\qquad\text{and}\qquad
\rho_{\cdot m} = \max_{i \in \{1,2,\ldots,n_Y\}} \{p(Y=i)\}.
$$
Given that the theoretical simple match rate for predicting by mode without the information of X is ρ_{·m}, λ(Y|X) is the error reduction rate (or, equivalently, the accuracy lift rate) provided by the information of X over the marginal information of Y for prediction by mode. Thus it aims to maximize the prediction accuracy under prediction by mode. Similarly, E(p(Y)) is the theoretical prediction accuracy for prediction by simulation, and τ(Y|X) is the corresponding error reduction rate (or, equivalently, accuracy lift rate).
Please refer to 19,21 for more details about τ(Y|X). Further discussion regarding λ(Y|X) can be found in 18. A sketch of both measures computed from a joint frequency table is given below.
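The following sketch (our illustration; the function names are ours) computes both measures from a joint frequency table whose rows index the levels of X and whose columns index the classes of Y.

```python
import numpy as np

def gk_tau(joint):
    """Goodman-Kruskal tau(Y|X) from a joint frequency table joint[j, i] = #(X=j, Y=i)."""
    p = joint / joint.sum()
    px = p.sum(axis=1, keepdims=True)   # marginal p(X = j), as a column vector
    py = p.sum(axis=0)                  # marginal p(Y = i)
    e_py = float(np.sum(py ** 2))       # E(p(Y))
    with np.errstate(divide="ignore", invalid="ignore"):
        num = np.where(px > 0, p ** 2 / px, 0.0).sum()
    return (num - e_py) / (1.0 - e_py)

def gk_lambda(joint):
    """Goodman-Kruskal lambda(Y|X) from the same joint frequency table."""
    p = joint / joint.sum()
    rho_jm = p.max(axis=1).sum()        # sum_j max_i p(X = j; Y = i)
    rho_m = p.sum(axis=0).max()         # max_i p(Y = i)
    return (rho_jm - rho_m) / (1.0 - rho_m)

# Example: 3 levels of X (rows) against 2 classes of Y (columns), raw counts.
counts = np.array([[30., 10.], [5., 25.], [20., 10.]])
print(gk_tau(counts), gk_lambda(counts))
```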
Now consider a data set with a continuous variable X and a categorical nominal variable Y as defined above. Suppose C_k = {c_1, ..., c_k} is a set of distinct real numbers with c_1 < c_2 < ... < c_k. Then C_k can be used to cut X into at most k + 1 intervals: (−∞, c_1], (c_1, c_2], ..., (c_k, +∞), and τ for a given cutting C_k can be defined as

$$
\tau(Y|X(C_k)) = \frac{\sum_{i=1}^{n_Y}\sum_{j=1}^{k+1} p(Y=i;\, c_{j-1} < X \le c_j)^2 / p(c_{j-1} < X \le c_j) \;-\; \sum_{i=1}^{n_Y} p(Y=i)^2}{1 - \sum_{i=1}^{n_Y} p(Y=i)^2}
$$

where c_0 = −∞ and c_{k+1} = +∞.
The corresponding formula for λ is as follows:

$$
\lambda(Y|X(C_k)) = \frac{\sum_{j=1}^{k+1} \max_{i \in \{1,2,\ldots,n_Y\}} \{p(c_{j-1} < X \le c_j;\, Y=i)\} \;-\; \max_{i \in \{1,2,\ldots,n_Y\}} \{p(Y=i)\}}{1 - \max_{i \in \{1,2,\ldots,n_Y\}} \{p(Y=i)\}}.
$$
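As an illustration only (not the authors' implementation), both interval versions can be evaluated by binning X at the cut points and applying a table-level measure such as the gk_tau / gk_lambda sketches given earlier; the helper names below are hypothetical.

```python
import numpy as np

def joint_table_for_cuts(x, y, cuts):
    """Bin the continuous x at the cut points (right-closed intervals (c_{j-1}, c_j], as in
    the text) and cross-tabulate the interval index against the classes of y."""
    bins = np.sort(np.asarray(cuts, dtype=float))
    labels = np.digitize(x, bins, right=True)       # interval index j = 0..len(cuts)
    classes, y_idx = np.unique(y, return_inverse=True)
    joint = np.zeros((len(bins) + 1, len(classes)))
    np.add.at(joint, (labels, y_idx), 1.0)          # accumulate joint counts
    return joint

def measure_for_cuts(x, y, cuts, table_measure):
    """tau(Y|X(C_k)) or lambda(Y|X(C_k)), where table_measure is a function of a joint
    frequency table, e.g. the gk_tau / gk_lambda sketches above."""
    return table_measure(joint_table_for_cuts(x, y, cuts))
```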
A stepwise merging method similar to the greedy searching scheme suggested in 1 is described below to select the cutting points in X; a short code sketch follows the steps.
1. Create the initial cutting points C_K for X by an unsupervised discretization method;
2. Select all the cutting points as the boundaries, denoted B_K;
3. Loop over the following steps until the stopping condition is met:
   (a) Suppose B_m is the current set of boundaries;
   (b) If the number of intervals is no larger than θ_b, where θ_b is the predefined maximum number of intervals, stop the loop;
   (c) Otherwise, generate B_{m−1} by removing the boundary b_k from B_m such that b_k = arg max_{b ∈ B_m} τ(Y|X(B_m \ {b})) if τ is the measure, or b_k = arg max_{b ∈ B_m} λ(Y|X(B_m \ {b})) if λ is the measure.
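Below is a compact sketch of the merging loop, reflecting our reading of the steps above (in particular, the assumption that the loop stops once the number of intervals has been reduced to θ_b); the score argument and the function name are ours.

```python
def merge_discretize(initial_cuts, max_intervals, score):
    """Greedy merging: starting from the unsupervised cut points, repeatedly drop the
    boundary whose removal leaves the largest measure, until at most max_intervals
    intervals remain. score(cuts) is assumed to return tau(Y|X(cuts)) or lambda(Y|X(cuts)),
    e.g. via the measure_for_cuts sketch above. Ties are broken by the middle candidate,
    as discussed in the text that follows."""
    cuts = sorted(initial_cuts)
    while cuts and len(cuts) + 1 > max_intervals:
        scores = [score(cuts[:i] + cuts[i + 1:]) for i in range(len(cuts))]
        best = max(scores)
        tied = [i for i, s in enumerate(scores) if s == best]
        cuts.pop(tied[len(tied) // 2])   # remove the middle of the tied best candidates
    return cuts
```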
Basically, this scheme checks all the available cutting points, finds the one whose removal leaves the remaining cutting points with the largest measure, τ or λ, and stops only when the maximum number of intervals is reached. It is apparently not a fancy approach, but it has been widely used in various discretization algorithms according to 9.
One technical issue regarding this scheme is how to select among multiple cutting points that yield the same measure. For the sake of convenience, we choose the one in the middle. For example, if 1, 5 and 20 have the same λ, then 5 is chosen as the cutting point for the next round. A consequence of this work-around is that the final discretization may not have the maximum measure, so the prediction based on the discretization may not be as good as expected.
This is also a reason why we use a merging rather than a splitting scheme as proposed in 1. Since λ looks at only one probability for a given explanatory value while τ looks at all of them, the λ-based search is less sensitive to changes of the distribution than the τ-based search. Hence a stepwise λ-based search has a greater chance to misstep during the process and thus fail to find the discretization with the maximum λ. By using a merging scheme, this chance is expected to be minimized.
3. Experiment
The major goal of this experiment is to show the difference between the λ-based discretization and the τ-based discretization. As discussed above, the λ-based discretization is supposed to have better prediction accuracy than the τ-based discretization when the prediction method is prediction by mode, while the latter works better under prediction by simulation. Usually such a statement is supported by an experiment on learning data sets and test data sets: the learning sets are used to generate the discretization criteria, i.e., the boundaries for the predictors; the same variables in the test sets are discretized by these boundaries and the target variables are predicted; the comparisons are then illustrated by the prediction accuracy lift. But a discretization of a single variable does not need such complexity, since λ and τ themselves already give the theoretical accuracy lift under the two prediction methods, per the aforementioned discussion. If λ after λ-based discretization is greater than λ after τ-based discretization, the prediction accuracy lift by mode for the former is expected to be better than for the latter. Similarly, if τ after τ-based discretization is greater than τ after λ-based discretization, the prediction accuracy lift by simulation for the former is expected to be better than for the latter.
The first data set, Covertype, is provided by J. A. Blackard et al. 22 from the UCI Machine Learning Repository. It has 581,012 rows and 10 numerical variables, and the target variable has 7 classes. With the initial number of intervals for the unsupervised discretization set to 100, the unsupervised discretization method based on equal frequencies, and the final number of intervals for the supervised discretization set to 5, we obtain the results shown in Table 1.
Table 1: Results for Covertype (ρ·m = 0.4876, E(p(Y)) = 0.3769)

Numerical variable                    λ after λ discret.   λ after τ discret.   τ after λ discret.   τ after τ discret.
Slope                                 0.0005               0                    0.3811               0.3848
Vertical Distance To Hydrology        0.0076               0                    0.3790               0.3803
Elevation                             0.3601               0.3510               0.5231               0.5309
Hillshade 3pm                         0                    0                    0.3791               0.3805
Hillshade Noon                        0.0016               0                    0.3800               0.3824
Hillshade 9am                         0.0006               0                    0.3799               0.3826
Horizontal Distance To Roadways       0.0104               0                    0.3836               0.3913
Aspect                                0.0074               0                    0.3797               0.3821
Horizontal Distance To Hydrology      0.0077               0.0021               0.3794               0.3805
Horizontal Distance To Fire Points    0.0163               0.0085               0.3870               0.3919
All numerical variables but one in this example have a higher λ after λ-based discretization; all numerical variables have a higher τ after τ-based discretization.
Seven other data sets from UCI are chosen to further support the previous discussion. The discretization parameters are the same as for Covertype. The results are listed in Table 2; the calculation details are available upon request to the authors. Please note that

• Case 1: λ is greater after λ-based discretization;
• Case 2: λ is smaller after λ-based discretization;
• Case 3: τ is greater after τ-based discretization;
• Case 4: τ is smaller after τ-based discretization.

The number of variables for which both schemes share the same λ is the number of numerical variables minus the counts of Case 1 and Case 2; the same holds for τ. For example, for Glass, with 9 numerical variables, 5 in Case 1 and 0 in Case 2, both schemes share the same λ on 9 − 5 − 0 = 4 variables.
Table 2: Results for other data sets from UCI

Source          File              Rows   Classes   ρ·m      E(p(Y))   Vars.   Case 1   Case 2   Case 3   Case 4
Glass 23        NA                214    7         0.3551   0.2633    9       5        0        5        1
Image 24        NA                2310   7         0.1429   0.1429    19      12       2        14       2
Mfeat 25        mfeat-fou         2000   10        0.1      0.1       76      70       3        75       0
Page-blocks 26  NA                5473   5         0.8977   0.8102    10      4        0        9        1
Thyroid 27      thyroid0387       9172   5         0.5989   0.4385    7       3        0        7        0
Waveform 28     waveform-+noise   5000   3         0.3384   0.3334    40      39       1        39       1
weightlift 29   NA                4024   5         0.3405   0.2866    50      43       4        45       4
When the file information is NA, it means there is only one file from the corresponding source. Only the numerical variables are tested. The difference between an integer numerical variable and a continuous numerical variable is not a concern in this experiment.
Table 2 suggests that, statistically, λ-based discretization is expected to give a better accuracy lift for prediction by mode and τ-based discretization is expected to give a better accuracy lift for prediction by simulation, although there are still negative cases (Cases 2 and 4) in each example. The reason for these negative cases is the statistical uncertainty introduced by the stepwise discretization scheme. These results also support the previous statement that the τ-based discretization is more sensitive to changes of the distribution, since the Case 3 counts are greater than the Case 1 counts on average.
4. Discussion and future work
We present a new supervised discretization scheme for optimal prediction, based on the GK-lambda. Like the GK-tau, the global-to-global association measure we used in 1, it is more interpretable than the entropy measure. The difference between the GK-lambda and the GK-tau is that the GK-lambda looks at the prediction accuracy lift while the GK-tau tries to match the target variable's distribution. Our experiment shows that the GK-lambda gives a better accuracy lift for prediction by mode and the GK-tau works better for prediction by simulation. The experiment also shows that the GK-tau is more sensitive to changes of the distribution and thus has fewer negative cases against its strength.
Ongoing work is to find a discretization that minimizes the uncertainty from the stepwise discretization scheme. Other work may relate to the performance of both measures in feature selection and to how they work under other prediction evaluation criteria, e.g., lift-curve based measures, the distribution of the predicted target, etc.
References
1. W. Huang, Y. Pan, J. Wu, Supervised discretization with GK-τ, Procedia Computer Science 17 (2013) 114–120.
2. I. Rish, An empirical study of the naive bayes classifier, in: IJCAI 2001 workshop on empirical methods in artificial intelligence, Vol. 3,
2001, pp. 41–46.
3. S. Safavian, D. Landgrebe, A survey of decision tree classifier methodology, Systems, Man and Cybernetics, IEEE Transactions on 21 (3)
(1991) 660–674.
4. J. Dougherty, R. Kohavi, M. Sahami, Supervised and unsupervised discretization of continuous features, in: Machine Learning: Proceedings
of the Twelfth International Conference, Morgan Kaufmann Publishers, Inc., 1995, pp. 194–202.
5. M. R. Chmielewski, J. W. Grzymala-Busse, Global discretization of continuous attributes as preprocessing for machine learning, International
journal of approximate reasoning 15 (4) (1996) 319–331.
6. U. Fayyad, K. Irani, Multi-interval discretization of continuous-valued attributes for classification learning, Proceedings of the International
Joint Conference on Uncertainty in AI.
7. G. Gan, C. Ma, J. Wu, Data clustering: Theory, algorithms, and applications (asa-siam series on statistics and applied probability). 2007,
Society for Industrial & Applied Mathematics, USA.
8. J. MacQueen, Some methods for classification and analysis of multivariate observations, in: Proceedings of the fifth Berkeley symposium on
mathematical statistics and probability, California, USA, 1967, pp. 281–297.
9. S. Kotsiantis, D. Kanellopoulos, Discretization techniques: A recent survey, GESTS International Transactions on Computer Science and
Engineering 32 (1) (2006) 47–58.
10. R. Kerber, Chimerge: Discretization of numeric attributes, in: Proceedings of the tenth national conference on Artificial intelligence, AAAI
Press, 1992, pp. 123–128.
11. H. Liu, R. Setiono, Chi2: Feature selection and discretization of numeric attributes, in: In Proceedings of the Seventh International Conference
on Tools with Artificial Intelligence, IEEE, 1995, pp. 388–391.
12. M. Boulle, Khiops: A statistical discretization method of continuous attributes, Machine Learning 55 (1) (2004) 53–69.
13. D. Chiu, B. Cheung, A. Wong, Information synthesis based on hierarchical maximum entropy discretization, Journal of Experimental &
Theoretical Artificial Intelligence 2 (2) (1990) 117–129.
14. J. Catlett, On changing continuous attributes into ordered discrete attributes, in: Machine Learning – WSL-91, Springer, 1991, pp. 164–178.
15. K. Ting, Discretization of continuous-valued attributes and instance-based learning, Basser Department of Computer Science, University of
Sydney, 1994.
16. R. Holte, Very simple classification rules perform well on most commonly used datasets, Machine learning 11 (1) (1993) 63–90.
17. W. Huang, M. Vainder, Dependence degree and feature selection for categorical data, Workshop on Data Mining Methodology and Applications at The Fields Institute.
18. L. Goodman, W. Kruskal, Measures of association for cross classifications, Journal of the American Statistical Association 49 (268) (1954) 732–764.
19. W. Huang, Y. Shi, X. Wang, Nominal association vector and matrix, arXiv preprint arXiv:1109.2553.
20. S. Kullback, R. Leibler, On information and sufficiency, Ann. Math. Stat. 22 (1951) 79–86.
21. C. J. Lloyd, Statistical analysis of categorical data, A Wiley-Interscience publication, Wiley, New York, NY, USA, 1999.
22. J. A. Blackard, D. J. Dean, C. W. Anderson, UCI machine learning repository, http://archive.ics.uci.edu/ml/datasets/
Covertype (1998).
23. B. German, UCI machine learning repository, http://archive.ics.uci.edu/ml/datasets/Glass+Identification (1987).
24. C. Brodley, UCI machine learning repository, http://archive.ics.uci.edu/ml/datasets/Image+Segmentation (1990).
25. R. P. Duin, UCI machine learning repository, http://archive.ics.uci.edu/ml/datasets/Multiple+Features (1999).
26. D. Malerba, UCI machine learning repository, http://archive.ics.uci.edu/ml/datasets/Page+Blocks+Classification (1995).
27. R. Quinlan, UCI machine learning repository, http://archive.ics.uci.edu/ml/datasets/Thyroid+Disease (1987).
28. L. Breiman, J. Friedman, R. Olshen, C. Stone, UCI machine learning repository, http://archive.ics.uci.edu/ml/datasets/Waveform+Database+Generator+%28Version+1%29 (1988).
29. W. Ugulino, E. Velloso, UCI machine learning repository, http://archive.ics.uci.edu/ml/datasets/Weight+Lifting+
Exercises+monitored+with+Inertial+Measurement+Units (2013).