NESUG 17
Posters
A SAS® Macro for Naïve Bayes Classification
Vadim Pliner, Verizon Wireless, Orangeburg, NY
ABSTRACT
Supervised classification, also known as pattern recognition, discrimination, or
supervised learning, consists of assigning new cases to one of a set of predefined classes,
given a sample of cases for which the true classes are known. The Naïve Bayes (NB)
technique of supervised classification has become increasingly popular in recent years.
Despite its unrealistic assumption that features are independent of each other given
class membership, the NB classifier is remarkably successful in practice even when that
assumption is violated. This paper presents a SAS® macro that performs
NB classification and discusses several issues related to its implementation.
INTRODUCTION
Let X1,…, Xm denote our features (attributes), let Y be the class number, and let C be the
number of classes. The problem consists of assigning the case (x1,…, xm) to the class c
maximizing P(Y=c| X1=x1,…, Xm=xm) over c=1,…, C. Applying Bayes’ rule gives
P(Y=c| X1=x1,…, Xm=xm) = P(X1=x1,…, Xm=xm | Y=c)P(Y=c) / P(X1=x1,…, Xm=xm).
The denominator is invariant across classes and can therefore be ignored. Under the
NB assumption of conditional independence, P(X1=x1,…, Xm=xm | Y=c) is replaced by
the product ∏_{i=1,…,m} P(Xi=xi | Y=c), and the NB classification reduces the original
problem to that of finding the class

c* = argmax_{c=1,…,C} P(Y=c) ∏_{i=1,…,m} P(Xi=xi | Y=c)    (1)
Although the assumption that features are independent of each other given the class is
often violated in the real world, the NB classifier is remarkably successful in practice
[6,7,10,15] as well as in predictive modeling contests. In the CoIL Challenge
2000 competition, the winning entry and the runner-up both used the NB approach [5],
and in KDD-Cup 97, two of the top three contestants were also based on NB [14]. A
number of studies were conducted to find out why and when NB works well [3,15,16].
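As an illustration of formula (1) with frequency-count estimates, here is a minimal Python sketch; the toy data, feature values, and function name are invented for this example, and the paper's actual implementation is the SAS macro presented below.

```python
from collections import Counter

# Toy training data: (feature tuple, class number); all values invented
train = [(("sunny", "hot"), 1), (("sunny", "mild"), 1),
         (("rainy", "mild"), 2), (("rainy", "hot"), 2),
         (("rainy", "mild"), 2)]

n = len(train)
class_counts = Counter(y for _, y in train)          # #(Y=c)
cond_counts = Counter()                              # #(Xi=xi & Y=c)
for x, y in train:
    for i, value in enumerate(x):
        cond_counts[(i, value, y)] += 1

def classify(x):
    """Return argmax over c of P(Y=c) * prod_i P(Xi=xi | Y=c), as in (1)."""
    best_c, best_score = None, -1.0
    for c, nc in class_counts.items():
        score = nc / n                                # prior P(Y=c)
        for i, value in enumerate(x):
            score *= cond_counts[(i, value, c)] / nc  # P(Xi=xi | Y=c)
        if score > best_score:
            best_c, best_score = c, score
    return best_c

print(classify(("sunny", "hot")))   # -> 1
print(classify(("rainy", "mild")))  # -> 2
```

Note that an unseen feature/class combination drives a class's score straight to zero; handling that is one of the implementation issues discussed below.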
IMPLEMENTATION OF NAÏVE-BAYES APPROACH
When all features are discrete (categorical), estimating the probabilities in (1) can be
done using frequency counts. In other words, P(Xi=xi | Y=c) is estimated as
#(Xi=xi & Y=c) / #(Y=c), where #() denotes the number of cases in the training data set
satisfying the condition in parentheses. Although the implementation of NB is
straightforward, there are a few issues that should be addressed.
1. When continuous features are present, there is empirical and theoretical evidence that
their discretization before applying NB is effective [4,8,18]. The alternative approach
relies on assumptions about the form of the quantitative features’ probability
distributions, which are usually unknown for real-world data. A number of publications
[4,8,12,17] described and compared various methods of discretization of continuous
features for NB classifiers and for other methods developed in the machine learning
community. One of the simplest discretization methods, Equal Frequency Discretization
(EFD), divides the sorted values of a continuous variable into k bins so that each bin
contains approximately the same number of adjacent values (n/k), where k is a predefined
parameter. Although it may be deemed simplistic, this method is often used and works
surprisingly well for NB classifiers [8,17]. EFD can be easily implemented in SAS. For
example, if we want to discretize two variables x and z from data set mydata and k=10 (in
this case, the EFD method is usually referred to as “ten-bin”), this can be done as follows.
proc rank data=mydata groups=10 out=newdata;
var x z;
ranks decile_x decile_z;
run;
decile_x and decile_z above are the names of the new categorical variables.
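The same ten-bin idea can also be sketched outside SAS. The hypothetical Python fragment below assigns bin indices by rank; it is only an approximation of what PROC RANK with GROUPS= does (tie handling, in particular, differs).

```python
def efd_bins(values, k=10):
    """Equal Frequency Discretization: return a bin index 0..k-1 for each
    value, so that each bin holds roughly len(values)/k adjacent sorted
    values. A sketch only; PROC RANK's GROUPS= option handles ties and
    group boundaries in its own documented way."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    bins = [0] * n
    for rank, i in enumerate(order):
        bins[i] = rank * k // n
    return bins

x = [15, 3, 9, 27, 1, 22, 8, 30, 12, 5]
print(efd_bins(x, k=5))  # -> [3, 0, 2, 4, 0, 3, 1, 4, 2, 1]
```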
2. The case when a class and a feature value never occur together in the training set, i.e.
#(Xi=xi & Y=c)=0, creates a problem: assigning a probability of zero to one of
the terms P(Xi=xi | Y=c) causes the whole expression (1) to evaluate to zero for class
c and rules it out. The problem is especially severe when features have many values and
their distribution is sparse, since a few classes can then get a probability of zero. Several
methods to handle this issue have been proposed. The zero probability can be replaced by
a small constant, such as 0.5/n or P(Y=c)/n, where n is the number of observations in the
training set [2,10]. Another approach is to apply the Laplace correction [1,11,13]. We
set #(Xi=xi & Y=c) to 0.5, i.e. the zero probability estimate is replaced by 0.5/#(Y=c).
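The effect of this replacement can be shown with a small numeric sketch (hypothetical counts; Python is used here only for the arithmetic):

```python
n_c = 40      # #(Y=c): cases of class c in the training set (hypothetical)
joint = 0     # #(Xi=xi & Y=c): this value and class never co-occur

# The raw frequency estimate zeroes out class c in (1) entirely:
p_raw = joint / n_c
print(p_raw)            # -> 0.0

# Setting the zero count to 0.5 keeps the class in contention:
p_smoothed = 0.5 / n_c
print(p_smoothed)       # -> 0.0125
```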
3. Missing values. In some applications, values are missing not at random and can be
meaningful. Therefore, in the macro below, missing values are treated as a separate
category. If one does not want to treat missing values as a separate category, they should
be handled prior to applying this macro, either by imputing the missing values or by
excluding the cases where they are present.
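Treating a missing value as just another category amounts to counting it like any other level; a short illustrative Python sketch (the data and the "MISSING" label are hypothetical):

```python
from collections import Counter

# Hypothetical feature column; None marks a missing value
x = ["a", None, "b", "a", None, "a"]

# Give missing its own level so it is counted like any other category
levels = Counter("MISSING" if v is None else v for v in x)
print(levels["MISSING"], levels["a"], levels["b"])  # -> 2 3 1
```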
THE %NB MACRO
The text of the macro for Naïve-Bayes classification is shown below. It has five
parameters:
train - training data set (contains classified cases);
score - data set containing cases to be classified;
nclass - number of classes (C);
target - name of the variable in the ‘train’ data set that holds the class number (Y); ‘target’
is assumed to be a numeric variable with values 1, 2,… for classes 1, 2, and so on; if it is
not, it has to be recoded before running the macro;
inputs – the list of features’ names (X1,…, Xm).
The macro proceeds as follows. It first checks that all the macro parameters are
specified and that the data sets ‘train’ and ‘score’ exist, and computes the number of
features listed in ‘inputs’ (m). It then calculates the prior probabilities and counts for all
classes, i.e. P(Y=c) and #(Y=c) for c=1,…,C; the results are stored in the macro variables
Prior1,…, PriorC and Count1,…, CountC, respectively. Next it computes the conditional
probabilities in (1). Finally, the last data step combines these quantities to obtain the final
NB classification. Instead of multiplying the probabilities, it sums their logarithms, which
avoids a potential floating-point underflow from multiplying many probabilities that lie
between 0 and 1. A new variable _class_ (the assigned class number) is added to the data
set &score.
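The underflow that the log-sum trick avoids is easy to demonstrate; the following Python sketch uses made-up probabilities:

```python
import math

# 100 features, each with a small conditional probability (made-up numbers)
probs = [1e-4] * 100

product = 1.0
for p in probs:
    product *= p
print(product)       # -> 0.0  (1e-400 underflows double precision)

# Summing logarithms keeps the quantity finite and comparable across classes:
log_product = sum(math.log(p) for p in probs)
print(round(log_product, 2))   # -> -921.03
```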
%macro NB(train=,score=,nclass=,target=,inputs=);
%let error=0;
%if %length(&train) = 0 %then %do;
%put ERROR: Value for macro parameter TRAIN is missing ;
%let error=1;
%end;
%if %length(&score) = 0 %then %do;
%put ERROR: Value for macro parameter SCORE is missing ;
%let error=1;
%end;
%if %length(&nclass) = 0 %then %do;
%put ERROR: Value for macro parameter NCLASS is missing ;
%let error=1;
%end;
%if %length(&target) = 0 %then %do;
%put ERROR: Value for macro parameter TARGET is missing ;
%let error=1;
%end;
%if %length(&inputs) = 0 %then %do;
%put ERROR: Value for macro parameter INPUTS is missing ;
%let error=1;
%end;
%if &error=1 %then %goto finish;
%if %sysfunc(exist(&train)) = 0 %then %do;
%put ERROR: data set &train does not exist ;
%let error=1;
%end;
%if %sysfunc(exist(&score)) = 0 %then %do;
%put ERROR: data set &score does not exist ;
%let error=1;
%end;
%if &error=1 %then %goto finish;
%LET nvar=0;
%do %while (%length(%scan(&inputs,&nvar+1))>0);
%LET nvar=%eval(&nvar+1);
%end;
proc freq data=&train noprint;
tables &target / out=_priors_ ;
run;
%do k=1 %to &nclass;
proc sql noprint;
select percent, count into :Prior&k, :Count&k
from _priors_
where &target=&k;
quit;
%end;
%do i=1 %to &nvar;
%LET var=%scan(&inputs,&i);
%do j=1 %to &nclass;
proc freq data=&train noprint;
tables &var / out=_&var.&j (drop=count) missing;
where &target=&j;
run;
%end;
data _&var ;
merge %do k=1 %to &nclass;
_&var.&k (rename=(percent=percent&k))
%end; ;
by &var;
%do k=1 %to &nclass; if percent&k=. then percent&k=0; %end;
run;
proc sql;
create table &score AS
select a.*
%do k=1 %to &nclass;
, b.percent&k as percent&K._&var
%end;
from &score as a left join _&var as b
on a.&var=b.&var;
quit;
%end;
data &score (drop=L product maxprob
%do i=1 %to &nclass; percent&i._: %end;);
set &score;
maxprob=.; /* missing, so the first class always initializes it */
%do k=1 %to &nclass;
array vars&k (&nvar)
%do i=1 %to &nvar; percent&k._%scan(&inputs,&i) %end; ;
/* sum logarithms instead of multiplying to avoid floating-point underflow */
product=log(&&Prior&k);
do L=1 to &nvar;
/* a zero or unseen cell gets #(Xi=xi & Y=c)=0.5; PROC FREQ percents are
on a 0-100 scale, hence the numerator 100*0.5=50 */
if vars&k(L)>0 then product=product+log(vars&k(L));
else product=product+log(50)-log(&&count&k);
end;
if maxprob=. or product>maxprob then do; maxprob=product; _class_=&k; end;
%end;
run;
%finish: ;
%mend NB;
MEASURING THE QUALITY OF CLASSIFICATION
After running the %NB macro on a data set where the true classes are known (e.g., for the
purposes of testing or validation), we may want to print out some measure(s) of
concordance between the NB classification and the true classes. To get the
misclassification matrix plus some statistics, we can simply use PROC FREQ as shown below.
proc freq data=mydata;
tables class*_class_ / chisq;
run;
The variables class and _class_ above hold the actual and NB classifications, respectively.
The following macro can be used to obtain the misclassification rate.
%macro misclass_rate(score=, target=);
proc freq data=&score;
tables &target*_class_ / out=_freq_out_ noprint;
run;
data _NULL_;
file print;
set _freq_out_ end=last;
if &target^=_class_ then misclass+percent;
if last then put "Misclassification rate is " misclass F5.2 "%";
run;
%mend;
The parameters ‘score’ and ‘target’ (variable containing the actual class numbers) above
are the same as in the %NB macro.
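For comparison, the same quantities can be computed outside SAS in a few lines; this Python sketch uses hypothetical class labels:

```python
from collections import Counter

actual    = [1, 1, 2, 2, 2, 1, 2, 1]   # true classes (hypothetical)
predicted = [1, 2, 2, 2, 1, 1, 2, 1]   # NB assignments (_class_)

# Cross-tabulation, analogous to tables class*_class_ in PROC FREQ
confusion = Counter(zip(actual, predicted))

misclass_rate = sum(a != p for a, p in zip(actual, predicted)) / len(actual)
print(f"Misclassification rate is {100 * misclass_rate:.2f}%")  # -> 25.00%
```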
REFERENCES
1. B.Cestnik. Estimating probabilities: A crucial task in machine learning, in
L.C.Aiello (ed.). Proceedings of the 9th European Conference on Artificial
Intelligence, 1990, pp.147-149.
2. P.Clark, T.Niblett. The CN2 induction algorithm, Machine Learning, 1989, 3(4),
261-283.
3. P.Domingos, M.Pazzani. On the optimality of the simple Bayesian classifier
under zero-one loss. Machine Learning, 29, 103-130, 1997.
4. J.Dougherty, R.Kohavi, M.Sahami. Supervised and unsupervised discretization of
continuous features. In A.Prieditis & S.Russell (eds). Machine Learning:
Proceedings of the 12th International conference, 1995, pp.194-202.
5. C.Elkan. Magical thinking in data mining: lessons from CoIL Challenge 2000.
Proceedings of the 7th International Conference on Knowledge Discovery and
Data Mining (KDD’ 01).
6. J.H.Friedman. On bias, variance, 0/1-loss, and the curse of dimensionality. Data
Mining and Knowledge Discovery, 1997, 1(1), 55-77.
7. D.J.Hand, K.Yu. Idiot’s Bayes – not so stupid after all? International Statistical
Review, 2001, 69, 385-398.
8. C.-N.Hsu, H.-J. Huang, T.-T.Wong. Why discretization works for naïve Bayesian
classifiers. 17th International Conference on Machine Learning (ICML-2000),
2000.
9. R.Kohavi, D.Sommerfield. Feature subset selection using the wrapper model:
Overfitting and dynamic search space topology, in The First International
Conference on Knowledge Discovery and Data mining, 1995, pp. 192-197.
10. R.Kohavi, D.Sommerfield, J.Dougherty. Data mining using MLC++: A machine
learning library in C++, International Journal on Artificial Intelligence Tools,
1997, 6(4), 537-566.
11. R.Kohavi, B.Becker, D.Sommerfield. Improving simple Bayes, in The 9th
European Conference on Machine Learning, Poster Papers, 1997.
12. R.Kohavi, M.Sahami. Error-based and entropy-based discretization of continuous
features, in Proceedings of the 2nd international conference on knowledge
discovery and data mining, 1996, pp.114-119.
13. T.Niblett. Constructing decision trees in noisy domains. Proceedings of the 2nd
European Working Session on Learning, Bled, Yugoslavia, 1987, pp.67-78.
14. I.Parsa. KDD-Cup 1997 presentation,
http://www.kdnuggets.com/news/97/n25.html#item2.
15. I.Rish. An empirical study of the naïve Bayes classifier, Technical Report
RC22230, IBM T.J. Watson Research Center, 2001.
16. I.Rish, J.Hellerstein, J.Thathachar. An analysis of data characteristics that affect
naïve Bayes performance, Technical Report RC21993, IBM T.J. Watson
Research Center, 2001.
17. Y.Yang, G.I.Webb. A comparative study of discretization methods for naïve Bayes
classifiers. In Proceedings of PKAW 2002, The 2002 Pacific Rim
Knowledge Acquisition Workshop, Tokyo, Japan, pp.159-173.
18. Y.Yang, G.I.Webb. On Why Discretization Works for Naive-Bayes Classifiers. In
Proceedings of the 16th Australian Conference on AI (AI 2003) Lecture Notes in
Artificial Intelligence, v. 2903, pp 440-452.
ACKNOWLEDGEMENTS
SAS® and all other SAS Institute Inc. product or service names are registered trademarks
or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA
registration.
CONTACT INFORMATION
Vadim Pliner
Verizon Wireless
2000 Corporate Drive
Orangeburg, NY 10962
Email: [email protected]