NESUG 17 Posters

A SAS® Macro for Naïve Bayes Classification
Vadim Pliner, Verizon Wireless, Orangeburg, NY

ABSTRACT
Supervised classification, also known as pattern recognition, discrimination, or supervised learning, consists of assigning new cases to one of a set of pre-defined classes, given a sample of cases for which the true classes are known. The Naïve Bayes (NB) technique of supervised classification has become increasingly popular in recent years. Despite its unrealistic assumption that features are independent of each other given class membership, the NB classifier is remarkably successful in practice even when that assumption is violated. This paper presents a SAS® macro that performs NB classification and discusses several issues related to its implementation.

INTRODUCTION
Let X1,…, Xm denote the features (attributes), let Y be the class number, and let C be the number of classes. The problem consists of assigning the case (x1,…, xm) to the class c that maximizes P(Y=c | X1=x1,…, Xm=xm) over c=1,…, C. Applying Bayes' rule gives

   P(Y=c | X1=x1,…, Xm=xm) = P(X1=x1,…, Xm=xm | Y=c) P(Y=c) / P(X1=x1,…, Xm=xm).

The denominator is invariant across classes and can therefore be ignored. Under NB's assumption of conditional independence, P(X1=x1,…, Xm=xm | Y=c) is replaced by the product ∏_{i=1}^{m} P(Xi=xi | Y=c), and NB classification reduces the original problem to that of finding the class

   c* = argmax_{c=1,…,C} P(Y=c) ∏_{i=1}^{m} P(Xi=xi | Y=c)     (1)

Although the assumption that features are independent of each other given the class is often violated in the real world, the NB classifier is remarkably successful in practice [6,7,10,15] as well as in contests of predictive modeling programs. In the CoIL Challenge 2000 competition, both the winning entry and the runner-up used the NB approach [5], and in KDD-Cup 97, two of the top three contestants were also based on NB [14]. A number of studies have been conducted to find out why and when NB works well [3,15,16].
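As a language-agnostic illustration of the decision rule (1) — not part of the paper's SAS implementation, which is given below — the following Python sketch estimates the probabilities by frequency counts and picks the argmax class. The tiny two-feature training sample is hypothetical.

```python
from collections import Counter, defaultdict

def nb_train(rows, labels):
    """Estimate P(Y=c) and P(Xi=xi | Y=c) by frequency counts."""
    n = len(labels)
    class_count = Counter(labels)                       # #(Y=c)
    prior = {c: class_count[c] / n for c in class_count}
    cell = defaultdict(int)                             # #(Xi=xi & Y=c)
    for row, c in zip(rows, labels):
        for i, xi in enumerate(row):
            cell[(i, xi, c)] += 1
    cond = {k: v / class_count[k[2]] for k, v in cell.items()}
    return prior, cond

def nb_classify(row, prior, cond):
    """Return c* = argmax_c P(Y=c) * prod_i P(Xi=xi | Y=c), as in (1)."""
    best_c, best_score = None, -1.0
    for c in prior:
        score = prior[c]
        for i, xi in enumerate(row):
            # A feature value never seen with class c gets probability 0
            # here; handling such zero counts is discussed later in the paper.
            score *= cond.get((i, xi, c), 0.0)
        if score > best_score:
            best_c, best_score = c, score
    return best_c

# Hypothetical training sample: two features, classes 1 and 2
rows = [("sunny", "hot"), ("sunny", "mild"), ("rain", "mild"), ("rain", "cool")]
labels = [1, 1, 2, 2]
prior, cond = nb_train(rows, labels)
```

With this sample, nb_classify(("sunny", "mild"), prior, cond) returns class 1, since P(sunny | Y=1)=1 while P(sunny | Y=2)=0. A production version would also need the discretization, zero-count smoothing, and log-domain tricks discussed in the rest of the paper.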
IMPLEMENTATION OF THE NAÏVE-BAYES APPROACH
When all features are discrete (categorical), the probabilities in (1) can be estimated with frequency counts. In other words, P(Xi=xi | Y=c) is estimated as

   #(Xi=xi & Y=c) / #(Y=c),

where #() denotes the number of cases in the training data set satisfying the condition in parentheses. Although the implementation of NB is straightforward, a few issues should be addressed.

1. Continuous features. When continuous features are present, there is empirical and theoretical evidence that discretizing them before applying NB is effective [4,8,18]. The alternative approach would rely on assumptions about the form of the quantitative features' probability distributions, which are usually unknown for real-world data. A number of publications [4,8,12,17] have described and compared various methods of discretizing continuous features for NB classifiers and for other methods developed in the machine learning community. One of the simplest discretization methods, Equal Frequency Discretization (EFD), divides the sorted values of a continuous variable into k bins so that each bin contains approximately the same number of adjacent values (n/k), where k is a predefined parameter. Although it may be deemed simplistic, this method is widely used and works surprisingly well for NB classifiers [8,17]. EFD is easy to implement in SAS. For example, to discretize two variables x and z from data set mydata with k=10 (in this case, the EFD method is usually referred to as "ten-bin"):

   proc rank data=mydata groups=10 out=newdata;
      var x z;
      ranks decile_x decile_z;
   run;

Here decile_x and decile_z are the names of the new categorical variables.

2. Zero counts. The case when a class and a feature value never occur together in the training set, i.e., #(Xi=xi & Y=c)=0, creates a problem: assigning a probability of zero to one of the terms P(Xi=xi | Y=c) causes the whole expression (1) to evaluate to zero for class c and rules that class out. The problem is especially severe when features have many values and the distribution is sparse; in that case several classes can receive a probability of zero. Several methods of handling this issue have been proposed. The zero probability can be replaced by a small constant, such as 0.5/n or P(Y=c)/n, where n is the number of observations in the training set [2,10]. Another approach is to apply the Laplace correction [1,11,13]. We used the first approach, which in our notation amounts to setting #(Xi=xi & Y=c) to 0.5.

3. Missing values. In some applications, values are missing not at random and can be meaningful. Therefore, in the macro below, missing values are treated as a separate category. If missing values should not form a separate category, they have to be handled before applying the macro, either by imputing them or by excluding the cases that contain them.

THE %NB MACRO
The text of the macro for Naïve Bayes classification is shown below. It has five parameters:
train  - the training data set (contains classified cases);
score  - the data set containing the cases to be classified;
nclass - the number of classes (C);
target - the name of the variable in the 'train' data set that holds the class number (Y); 'target' is assumed to be a numeric variable with values 1, 2, ... for classes 1, 2, and so on; if it is not, it has to be recoded before running the macro;
inputs - the list of feature names (X1,…, Xm).

Lines 2-23 check whether all the macro parameters are specified. Lines 24-32 check whether the data sets 'train' and 'score' exist. Lines 33-36 compute the number of features specified in 'inputs' (m). Lines 37-46 calculate the prior probabilities and counts for all classes, i.e., P(Y=c) and #(Y=c) for all c=1,…,C.
The results are stored in the macro variables Prior1,…, PriorC and Count1,…, CountC, respectively. Lines 47-71 compute the conditional probabilities in (1). Lines 72-86 contain the last data step, which produces the final NB classification. Instead of the multiplications in (1), it sums logarithms to obtain the logarithm of the product in (1), avoiding the potential floating-point underflow that can result from multiplying many probabilities between 0 and 1. A new variable _class_ (the class number) is added to the data set &score.

 1  %macro NB(train=,score=,nclass=,target=,inputs=);
 2  %let error=0;
 3  %if %length(&train) = 0 %then %do;
 4    %put ERROR: Value for macro parameter TRAIN is missing;
 5    %let error=1;
 6  %end;
 7  %if %length(&score) = 0 %then %do;
 8    %put ERROR: Value for macro parameter SCORE is missing;
 9    %let error=1;
10  %end;
11  %if %length(&nclass) = 0 %then %do;
12    %put ERROR: Value for macro parameter NCLASS is missing;
13    %let error=1;
14  %end;
15  %if %length(&target) = 0 %then %do;
16    %put ERROR: Value for macro parameter TARGET is missing;
17    %let error=1;
18  %end;
19  %if %length(&inputs) = 0 %then %do;
20    %put ERROR: Value for macro parameter INPUTS is missing;
21    %let error=1;
22  %end;
23  %if &error=1 %then %goto finish;
24  %if %sysfunc(exist(&train)) = 0 %then %do;
25    %put ERROR: data set &train does not exist;
26    %let error=1;
27  %end;
28  %if %sysfunc(exist(&score)) = 0 %then %do;
29    %put ERROR: data set &score does not exist;
30    %let error=1;
31  %end;
32  %if &error=1 %then %goto finish;
33  %let nvar=0;
34  %do %while (%length(%scan(&inputs,&nvar+1))>0);
35    %let nvar=%eval(&nvar+1);
36  %end;
37  proc freq data=&train noprint;
38    tables &target / out=_priors_;
39  run;
40  %do k=1 %to &nclass;
41  proc sql noprint;
42    select percent, count into :Prior&k,
43           :Count&k
44    from _priors_ where &target=&k;
45  quit;
46  %end;  %* k;
47  %do i=1 %to &nvar;
48  %let var=%scan(&inputs,&i);
49  %do j=1 %to &nclass;
50  proc freq data=&train noprint;
51    tables &var / out=_&var.&j (drop=count) missing;
52    where &target=&j;
53  run;
54  %end;  %* j;
55  data _&var;
56    merge %do k=1 %to &nclass;
57          _&var.&k (rename=(percent=percent&k))
58          %end; ;
59    by &var;
60    %do k=1 %to &nclass;
61      if percent&k=. then percent&k=0;
62    %end;
63  run;
64  proc sql;
65    create table &score as
66    select a.* %do k=1 %to &nclass;
67           , b.percent&k as percent&k._&var %end;
68    from &score as a left join _&var as b
69    on a.&var=b.&var;
70  quit;
71  %end;  %* i;
72  data &score (drop=L product maxprob
73               %do i=1 %to &nclass; percent&i._: %end;);
74    set &score;
75    maxprob=-1e300;  /* lower than any possible log score */
76    %do k=1 %to &nclass;
77    array vars&k (&nvar) %do i=1 %to &nvar;
78          percent&k._%scan(&inputs,&i) %end; ;
79    product=log(&&Prior&k);
80    do L=1 to &nvar;
81      if vars&k(L)>0 then product=product+log(vars&k(L));
82      else product=product+log(0.5)-log(&&Count&k);
83    end;
84    if product>maxprob then do; maxprob=product; _class_=&k; end;
85    %end;  %* k;
86  run;
87  %finish: ;
88  %mend NB;

MEASURING THE QUALITY OF CLASSIFICATION
After running the %NB macro on a data set for which the true classes are known (e.g., for testing or validation), we may want to print some measure of concordance between the NB classification and the true classes. To get the misclassification matrix plus some statistics, we can simply use PROC FREQ as shown below.

   proc freq data=mydata;
      tables class*_class_ / chisq;
   run;

The variables class and _class_ above hold the actual and NB classifications, respectively. The following macro can be used to obtain the misclassification rate.
   %macro misclass_rate(score=, target=);
   proc freq data=&score;
      tables &target*_class_ / out=_freq_out_ noprint;
   run;
   data _NULL_;
      file print;
      set _freq_out_ end=last;
      if &target^=_class_ then misclass+percent;
      if last then put "Misclassification rate is " misclass F5.2 "%";
   run;
   %mend;

The parameters 'score' and 'target' (the variable containing the actual class numbers) are the same as in the %NB macro.

REFERENCES
1. B. Cestnik. Estimating probabilities: A crucial task in machine learning. In L.C. Aiello (ed.), Proceedings of the 9th European Conference on Artificial Intelligence, 1990, pp. 147-149.
2. P. Clark, T. Niblett. The CN2 induction algorithm. Machine Learning, 1989, 3(4), 261-283.
3. P. Domingos, M. Pazzani. On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, 1997, 29, 103-130.
4. J. Dougherty, R. Kohavi, M. Sahami. Supervised and unsupervised discretization of continuous features. In A. Prieditis & S. Russell (eds.), Machine Learning: Proceedings of the 12th International Conference, 1995, pp. 194-202.
5. C. Elkan. Magical thinking in data mining: lessons from CoIL Challenge 2000. In Proceedings of the 7th International Conference on Knowledge Discovery and Data Mining (KDD'01), 2001.
6. J.H. Friedman. On bias, variance, 0/1-loss, and the curse of dimensionality. Data Mining and Knowledge Discovery, 1997, 1(1), 55-77.
7. D.J. Hand, K. Yu. Idiot's Bayes - not so stupid after all? International Statistical Review, 2001, 69, 385-398.
8. C.-N. Hsu, H.-J. Huang, T.-T. Wong. Why discretization works for naïve Bayesian classifiers. In 17th International Conference on Machine Learning (ICML-2000), 2000.
9. R. Kohavi, D. Sommerfield. Feature subset selection using the wrapper model: Overfitting and dynamic search space topology. In The First International Conference on Knowledge Discovery and Data Mining, 1995, pp. 192-197.
10. R. Kohavi, D. Sommerfield, J. Dougherty. Data mining using MLC++: A machine learning library in C++. International Journal on Artificial Intelligence Tools, 1997, 6(4), 537-566.
11. R. Kohavi, B. Becker, D. Sommerfield. Improving simple Bayes. In The 9th European Conference on Machine Learning, Poster Papers, 1997.
12. R. Kohavi, M. Sahami. Error-based and entropy-based discretization of continuous features. In Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, 1996, pp. 114-119.
13. T. Niblett. Constructing decision trees in noisy domains. In Proceedings of the 2nd European Working Session on Learning, Bled, Yugoslavia, 1987, pp. 67-78.
14. I. Parsa. KDD-Cup 1997 presentation, http://www.kdnuggets.com/news/97/n25.html#item2.
15. I. Rish. An empirical study of the naïve Bayes classifier. Technical Report RC22230, IBM T.J. Watson Research Center, 2001.
16. I. Rish, J. Hellerstein, J. Thathachar. An analysis of data characteristics that affect naïve Bayes performance. Technical Report RC21993, IBM T.J. Watson Research Center, 2001.
17. Y. Yang, G.I. Webb. A comparative study of discretization methods for naïve Bayes classifiers. In Proceedings of PKAW 2002, The 2002 Pacific Rim Knowledge Acquisition Workshop, Tokyo, Japan, pp. 159-173.
18. Y. Yang, G.I. Webb. On why discretization works for naïve-Bayes classifiers. In Proceedings of the 16th Australian Conference on AI (AI 2003), Lecture Notes in Artificial Intelligence, v. 2903, pp. 440-452.

ACKNOWLEDGEMENTS
SAS® and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.

CONTACT INFORMATION
Vadim Pliner
Verizon Wireless
2000 Corporate Drive
Orangeburg, NY 10962
Email: [email protected]