Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
CHAPTER Classification 4 . 1 INTRODUCTION 4 . 2 STATISTICAL-BASEDALGORITHMS 4 . 3 DISTANCE-BASEDATGORITHMS 4 . 4 DECISIONTRE€-BASED ALGORITHMS 4 . 5 NEURAL NETWORK-BASEDALGORlTHMS 4 . 6 RULE-BASEDATGORITHMS 4 . 7 COMBINING TECHNIQUES 4.8 REVIEW QUESTIONS 4 . 1 INTRODUCTION Classification is perhaps the most familiar and most popular data mining technique. Examples of classification applications include image and pattern recognition, medical diagnosis, loan approval, detecting faults in industry applications, and classifying financial market trends. Estimation and prediction may be viewed as types of classification. When someone estimates your age or guessesthe number of marbles in a jar, these are actually classification problems. Prediction can be thought of as classifying an attribute value into one of a set of possible classes. It is often viewed as forecasting a continuous value, while classification forecasts a discrete value. Example 1.1 in Chapter 1 illustrates the use of classification for credit card purchases. The use of decisibn trees and neural networks (NNs) to classify people according to their height was illustrated in chapter 3. Before the use of current data mining techniques, classification was frequently performed by simply applying knowledge of the data. This is illustrated in Exarnple 4.1. 2 Classification 73 Chapter4 Data Mining-lntroductory and AdvancedTopics As discussedin [KLR+98]. there are three basic methods used to solve the classification Problern: EXAMPLE 4.I o Specifying boundaries. Here classification is performed by dividing the input space of potential database tuples into regions where each region is associatedwith one class. Tezrcirersclassify students as A. B. C. D. or F based on their marks. By usirrg sinrple bounclaries(60, 70, 80. 90). the folloq,'ingclassificationis possible: 90 ( mark 80 < rnark < 90 70<mark<80 60 ( rnark < 70 mark < 60 For any given class' Ci, P(f1 | o lJsing probability distributions. C,) is the PDF for the class evaluated at one point, f6.t If a probability of occurrence for each class, P(Ct) is known (perhaps determined by a domain expert), then P(Ci)P(t, I Ci) is used to estimate the probability that t; is in class Ci. A B C D F o Using posterior probabitities. Given a data value t1, we would like to determine the probability that t1 is in a class C1. This is denoted by P(Ci I t1) and is called the posterior probability. One classification approach would be to determine the posterior probability for each class and then assign ti to the class with the highest probability. All approachesto performing classification assurnesome knowledge of thc data. Oftcn a training set is used to develop the specific parameters reqrrircrl bv the technique. Train.i.ng dala consist of sample input data as well as the classification assignment for the data. Domain experts may also be used to assistin the process. The naive divisions used in Example 4.1 as well as decision tree techniques are examples of the first modeling approach. Neural networks fall into the third category. The classification oroblem is stated as shown in Definition 4.1: D n p r l t r r r o N 4 . 1 . G i v e n a d a t a b a s eD : { h , t 2 . . . . . t , . } o f t u p l e s ( i t e r n s .r e c o r d s )a n d a s e t o f c l a s s e sC - - { C r , . . . . C - } . t h e classification problem is to define a mapping f : D - C where each li is assigned to one class. A class, C3, contains precisely those tuples r r r a p p e dt o i t ; t h a t i s . C i : { t i | / ( t i ) : C j . 
I 1 i l n . a n d l i e D } . Oru definition r.iews classification as a mapping from the database to the set of classes. Note that the classesare predefined, are nonoverlapping, and partition the entire database. Each tuple in the database is assigned to cxactly one class. Tire classesthat exist for a classification problem are indeerd e.qtLi,ualenceclasses. In actuality. the problem usually is implenrented in tu,.ophases: 1. Create a specific model by evaluating the training data. This step has as input the training data (including defined classification for each tuple) and as output a definition of the model developed. The rnoclelcreated classifiesthe training data as accurately as possible. 2. Applv the model developed in step 1 by classifying tuples from the target database. Although the second step actually cloesthe cla.ssification(according to the ciefirritionin Defiriition 4.1), rnost researchhas been applied to step 1. Step 2 is oft err st r-aielrtfor\&'ar-d. l0 9 8 'l ClassA 6 10 9 8 7 6 5 A A 3 2 I 0 J 0r23456'78 ClassB ClassC (a) Definition of classes z I 0 (b) Sample database to classify 'il x x x x 0 I (c) Database classified FIGURE4.1: Classificationproblem' Suppose we are given that a database consists of tuples of the form t : (r,g) where 0 < r 3 8 and 0 < y < I0. Figure 4.1 illustrates the classification problem. Figure 4.1(a) shows the predefined classesby dividing the reference space, Figure 4.1(b) provides sample input data, and Figure 4.1(c) sliows the classification ofthe data based on the defined classes. A major issueassociatedwith classification is that of overfitting. If the classification strategy fits the training data exactly it may not be applicable to a broader population of data. For example, suppose that the training 1In this discussion each tuple in the database is assumed to consist ofa single value rather than a set of vralues. Chapter 4 Topics Data Mining-lntroductory and Advanced in this case, fitting the data data has erroneous or noisy data. certainly exactlY is not desired' to performing classifiIn the following sections, various approaches data to be used throughout this cation are examinei t't'tu 4'1 contains chaptertoillustratethevarioustechniques.Thisexampleassumesthatthe problemistoclassifyadultsasshort'medium'ortall'Table4'llistsheight table show two classifications that in meters. The tastiwo columns of this Output2' respectively' The Outputl could be *ua"' tu'U"i"d Outputl and shown below: classification uses the simple divisions 2m<Height --2m l.TmcHeight ( 1.7m Height Tall Medium Short TheoutputZresultsrequireamuchmorecomplicatedsetofdivisionsusing both height and gender attributes' based on the cate' In this chapter we examine classification algorithms gorization*,u",'_i.'Figure4.2.Statisticalalgorithmsarebaseddirectlyon Distance.based algorithms use similarity the use of statistical intrmation. ordistancemeasurestoperformtheclassification,DecisiontreeandNN approachesusethesestructurestoperformtheclassification.Rule.based rules to perform the classification' classification utgoriiir*, generate iflth"n TABLE4.1: Data for Height Classification Name Kristina Jim Maggie Martha Stephanie Bob Kathy Dave Worth Steven Debbie Todd Kim Amy Wynette Gender F M F F F M F M M M F M F F F Height Outputl Output2 1 . 6m 2m 1 . 9m 1.88m 1 . 7m 1.85m 1 . 6m 1 . 7m . 2.2m 2 . 1m 1 . 8m 1 . 9 5m 1 . 9m 1 . 
8m 1.75m Short Tall Medium Medium Short Medium Short Short TaIl TaIl Medium Medium Medium Medium Medium Medium Medium Tall TalI Medium Medium Medium Medium TaIl Tall Medium Medium Tall Medium Medium Statistical Distance DT NN Classification 75 Rules FIGURE4.2: Classification algorithm categorization. 4.1.1 lssues in Classification Missing data values cause problems during both the Missing Data. training phase and to the classification process itself. Missing values in the training data must be handled and may produce an inaccurate result. Missing data in a tuple to be classified must be able to be handled by the resulting classification scheme. There are many approaches to handling missing data: r Ignore the missing data. r Assume a ralue for the missing data. This may be determined by using some method to predict what the value could be. o Assume a special value for the missing data. This means that the value of missing data is taken to be a specific value all of its own. Notice the similarity between missing data in the classification problem and that of nulls in traditional databases. Measuring Performance. Table 4.1 shows two different classification results using two different classification tools. Determining which is best depends on the interpretation of the problem by users. The performance of classification algorithms is usually examined by evaluating the accuracy of the classification. However, since classification is often afuzzy problem, the correct answer may depend on the user. Tladitional algorithm evaluation approaches such as determining the space and time overhead can be used, but these approaches are usually secondary. Classification accuracy is usually calculated by determining the percentage of tuples placed in the correct class. This ignores the fact that there also may be a cost associated with an incorrect assignment to the wrong class. This perhaps should also be determined. An OC (operating characterist'i.c) curae or ROC (recei'uer operating characteristi,c) curue or ROC (relatiue operat'ing characterist'ic) curue shows the relationship between false positives and true positives. An OC curve was originally used in the communications area to examine false alarm rates. It has also been used in information retrieval to examine fallout (percentage of retrieved that are not relevant) versus recall (percentage of retrieved that are relevant). In the o(J curve the horizontal axis has the percentage of false positives and the ver'tical axis has the percentage of true positives for a database sample. At the beginning of evaluating a sample, there are none of either category, while at the end there are 100 percent 76 Data Mining-lntroductory and AdvancedTopics Chapter4 of each. when evaluating the results for a specific sample, the curve looks like a jagged stair-step. as seen in Figure 4.3. as each new tuple is either a false positive or a true positive. A rnore smoothed version of the OC curve can also be obtained. 1000/0 the database D and the output'"alues represent the classes.Regressioncan be used to solve classification problems, but it can also be used for other applications such as forecasting. Although not explicitly described in this text. regressioncan be performed using manv different types of techniques, including NNs. In actuaiity, regressiontakes a set of data and fits the data to a formula. Looking at Figure 3.3 in Chapter 3. 
we seethat a simple l,,inear.regres_ ,sion problem can be thought of as estimating the formula for a straight line (in a two-dimensional space). This can be equated to partitioning the data into two classes.With the banking exarnple, these would be to approve or reject a loarr application. The straiglrt line is the break-even point or the division between the two classes. a<o/ 9 )U"/o - In chapter 2. rve briefly introduced ii^ear regressionusing the formula 25% a:colc1:t11..'*cnrn " 25o/" 50o/" 75o/o False positives 10O% FIGURE4.3: Operating characteristiccurve IABLE 4.2: Confusion lVlatrix Actual Nlembership Short Medium Tall Assignment Short Nledium Tall 0 0 0 4 0 3 2 tr 1 A confusion matrix illustrates the accuracy of the solution to a classification problem. Given m classes,a confusion matrir is an m x rn matrix where entry ci,i indicates the number of tuples fronr D that were assigned to class Ci but where the correct class is C1. Obviously, the best solutions will have only zero values outside the diagonal. Table 4.2 shows a confusion matrix for the heiglrt example in Table 4.1 where the outputl assignment is assumed to be correct and the output2 assignment is what is actually made. 1.2 1.2.1 Classification 77 STATISTICAL-BASED ALGORITHMS Regression Regression problerns deal with estimation of an output value based on input values. When used for classification, the input values are values frosl (4.1) By determining the regression coefficients cotc!,.. . . c, the relationship between the output parameter. g, and the input parameters, 11, . . . , lLn can be estimated. AII high school algebra students are familiar with deterrnining the formula for a straight line, g : mn * b, given two points in the rg plane. They are determining the regressioncoefficients m ancl b. Here the two points represent the training data. Adrnittedly. Exarnple 3.5 is an extremely simple proble*r. However, it illustrates how we all use the basic classification or prediction techniques frequently. Figure 4.4 illustrates the more general use of linear regression with one input value. Here there is a sample of data that we wish to model (shown by the scatter dots) using a linear model. The line generated by the linear regressiontechnique is shown in the figure. Notice, however, that the actual data points do not fit the linear rlodel exactly, Thus, this model is an estimate of what the actual input-output relationship is. we can use the generated linear rnodel to predict an output value given an input value, but unlike that for Example 3.5, the prediction is an estimate rather than the actual output value. If we attempt to fit data that are not linear to a linear model, the results will be a poor model of the data, as illustrated by Figure 4.4. There are many reasons wh;r the linear regression model may not be used to estimate output data. one is that the data do not fit a linear model. It is possible, however. that the data generally do actually represent a linear model, but the linear model generated is poor because noise or outliers exist in the data.. Noise is erroneous data. Outliers are data values that are exceptions to the usual and expected data. Example 4.2 illustrates outliers. In these casesthe observable data may actually be described by the following: a : c o t c 1 ; r 1 + . . . + c n . l : n+ e (4.2) Chapter 4 78 Data Mining-lntroductory and AdvancedTopics Here e is a random error with a mean of 0. 
As with point estimation, we can estimate the accuracy of the fit of a linear regression model to the actual data using a mean squared error function. Classification Tg finds coefficientsc6,cr so that the squared error is minimized for the set of observable values. The sum of the squares of the errors is &fr L:Dr| : L @ n - c o- c t r r i ) 2 ,-l (4.4) ;-1 Taking the partial derivatives (with respect to the coefficients)and setting equal to zero, we can obtain the least squares est'imates for the coefficients, co and ci. Regression can be used to perform classification using two different approaches: 1. Division: 2. Prediction: value. 4,2 suppose that a graduate level abstract algebra class has 100 students. Sahana consistently outperforms the other students on exams. On the final exam, Sahana gets a grade of 99. The next highest grade is 75, with the range of grades being between 5 and 99. Sahana clearly is resented by the other students in the class because she does not perform at the sarne level they do. She "ruins the curve." If we were to try to fit a model to the grades, this one outlier grade would cause problems because any model that attempted to include it would then not accurately model the remaining data. We illustrate the process using a simple linear regtession formula and assuming ,k points in our training sample. We thus have the following ,t formulas: y i : c o t c l r v I e i , ' i: 1 , . . . , k Formulas are generated to predict the output class Tire first case views the data as plotted in an n-dimensional space without any explicit class values shown. Through regression, the space is divided into regions-one per class. With the second approach, a value for each class is included in the graph. Using regression, the formula for a line to predict class values is generated. FIGURE4.4: Example of poor fit for linear regression. EXAMPLE The data are divided into regions based on class. (4.3) With a simple linear regression, given an observable value (rvi,y1)' ea is the error, and thus the squared error technique introduced in Chapter 3 can be used to indicate the error. To minimize the error, a method of least squares is used to rninimize the least squared error. This approach Exarnple 4.3 illustrates the division process, while Exarnple 4.4 illustrates the prediction process using the data from Table 4.1. For simplicity, we a^ssumethe training data include only data for short and medium people and that the classification is performed using the Outputl column values. If you extend this example to all three classes,you will see that it is a nontrivial task to use linear regressionfor classification. It also will become obvious that the result may be quite poor. EXAMPLE 4.3 By looking at the data in the Outputl columrr from Table 4.1 and the basic understanding that the class to which a person is assigned is based only on the numeric value of his or her height, in this example we apply the liriear regressionconcept to determine how to distinguish between the short and mediuur classes. Figure 4.5(a) shows the points under consideration. We thus have the linear regression formula of A - co * e. This implies that we are actually going to be finding the value for cs that best partitions the height numeric values into those that are short and those that are medium. Looking at the data in Table 4.1. we see that only 12 of the 15 entries can be used to differentiate between short and medium persons. We thus obtain the following values for y; in our training data: { 1 . 6 , 1 . 9 ,1 . 
8 8 ,1 . 7 ,1 . 8 5 ,1 . 6 ,1 . 7 ,1 . g ,1 . 9 5 ,1 . 9 ,1 . 9 ,1 . ? 5 } .W e w i s h t o D r i u i m i z e t2 r:L{ i=t t2 :D@, i:l "d2 )ataMining-lntroductory and AdvancedTopics Chapter 4 get Taking the derivative with respect to c6 and setting equal to zero u€ o c o€ 1 o 1 Classification 81 ooqDoo 12 L2 y53.63325.816 Medium -2Dur +l2cs: s i:l i= 1 6 a o Eo.s I o,s "il "il F',f: I F.l,:-| Medium 'n[3] o -'n[- 3 I v = L'186 llu :lU 0 0 1.6 1.8 2 2.2 2.4 Height (a)Shortandmediumheightswithclasses (b) Division FIGURE4.5: Classification using division for Example 4'3' Solving for cq we find that t2 ,: 1.786 we thus have the division between short and medium persons as being determined by g: 1.786.as seenin Figure 4'5(b)' 2 2.2 Height (b) Prediction 2.4 L2 _ co_ cfiu)2 ,2,=\fu; I i: I i:t Taking the partial derivative with respect to cs and setting equal to zero we get Ar # :" t2 L2 -2Lat +lz"o + I i=l i:l t2 2 c 1 r 1 iI: i=l To simplify the notation, in the rest of the example we drop the rauge values for the summation becauseall are the same. solving for cs, we find that: cb: Lat ^: =?: 1.8 We thus wish to minimize l2 \.a 1.6 FIGURE4.6; Classification using prediction for Example 4.4. Short (a) Short and mediumheights Short Lvo -D"r*tn T2 Now taking the partial of tr with respect to c1, substituting the value for cs, and setting equal to zero we obtain aL ^ \-r d"r:2 L\ai - co- clxY)(-rY) : Q Solving for c1, we finally have EXAMPLE D''o Iro 4.4 we now look at predicting the class using the short and medium data as input and looking at the outputl classification. The data are the same as thtse in Example 4.3 except that we now look at the classesas indicated in the training data. Since regression assumesnumeric data. we asslme that the value for the short class is 0 and the value for the medium cla"ssis 1' F i g u r e 4 . 6 ( a ) s h o w st h e d a t a f o r t h i s e x a m p l e : { ( 1 . 6 , 0 ) , ( 1 ' 9 , 1 ) , ( 1 ' 8 8 , 1 ) , ( 1 : 70, ) , 1 i . e sr,; , ( 1 . 6 , 0 ) ,( 1 . 7 , 0 ) ,( 1 . 81, ) , ( 1 . 9 51.) , ( 1 . 91, ) , ( 1 . 8i,) ' (1.75, 1)). In this casewe are using the regressionformula with one variable: A:Co*c1r1*e !i'?n) \4'e can now solve for cs and c' . Using the data from the 12 points in the training data, we have ! ru = 2I.48, Dyo : 8, D(rr&i) : 14.g3, and : 38.42. Thus, we get c1 : 3.63 and cs : ::5.916. The prediction D(t?,) for the class value is thus 9:-5.816*3.6321 This line is plotted in Figure 4.6(b). Data Mining-lntroductory and AdvancedTopics Chapter4 predicts the class value is generated' This In Example 4.4 a line that three but il also could have been done for all wa"sdone for two t1*'"'' obvious is membership where class classes. Unlike tft" Jil'lrlo" approach occurs' with prediction the class point o t"ttittt *itttit' based on the region In obvious' Here we predict a class value' to which a point belongs is less 0.9 0.8 .g 0.-s J landlessthan0.Tlrus,thevcertainlycanrtotbetrsedastheproba bilityof comrnonly Usedregressiontechnique occurrenceofthe target class. Another line' Instead of fitting the data to a straight is called log'isttc ,"g'-""io'' 4.7. 
Figure in such as is illustrated logistic regression,r".-^ iogi*ic curve ls The formula for a univariate logistic curve (4.6) l+€('r'+'1rr) 0 and 1 so it can be interpreted The togistic curve gives a value between with linear tT::t::1": As as the probability of class mernbership' :il is desired' To perform th classes be used when classification into two be applied to obtain the logisti regression, the logarithmic function can functiort '*" (fi) :711r t'1x'1 & Q . n" ' 4 0.3 0.2 \ \ \ \ \ \ \ \ \ 0.I (4.5) w h e r e f i i s t h e f r r n c t i o r r b e i n g u s e d t o t r a n s f o r m t l r e p r e d i c t o r ' I rtechniques' rtlriscase Linear regressiou in" ,"g.L*i"n is called nonlinear regress'ion' to most cornplex data nrining while easy to understani' are not aiplicable with nonnumeric data' They also applications. They <1onot rvork well nraketheasstrmptionthattherelationshiplbetweent}reinputvalueandthe rnay not be the case' output value is lirrear, which of course becausethe data may not fit Linear regressionis not always appropriate line values can be greater than tr straight line, but olro tr".rr,r" itu,truig1rt t(co*crrl) ---"-- \\ \ i u.o Ifthepre<lictorsintlrelirrearregressionfurrctionaremodifiedbysome the model looks like function (square. square root, etc')' tlien tn: - r ) l l + e \ p ( l+ r l ) \lrptt !'xp(l - ()/(l + trp(l - r))- - 0.7 F i g u r e 4 . 6 ( b ) t h e c l a s s " v a l u e i s p r e d i c t e d b a s e c i o n t h e h e i gmembership l r t v a l u e a l o nise . ltowever' the class Since the prediction liue is continuous' the prediction for--ayalu3 is 0'4' what not always on.'io"u' For example' if woul<litsclassbe?Wecandeternrirrethecla..sllysplittirrgtlreline.Soa heightisintheslrortclassifitspredictionvalueislessthan0.5anditisin t h e r r r e d i u n r c l a s s i f i t s v a l u e i s g r e a t e r t l r a r r 0 . 5 . I r t E x a m p lbetween e 4 . 4 t h ethe value is 1'74' Thus' this is really the division of 11 where y:0S short class and the ttrediutrt ciass' y : c o + , f r ( r r+) " ' + f " ( r " ) Classification 83 (4.7) Herepistlreprobabilityofbeirrgirrtlrec}assarrdl-llistlreprobabilitythat ftrr c9 ancl r'l1that rnaxirnize it is not. However, tt".'p-""rr"arooses,"alues valtles' given the probabilitv of otrserving the FIGURE4.7: Logistic curve. 4.2.2 Bayesian Classification Assurning that the contributiou b1' all attribrrtes arc irrdependent and that each contributes equally to the classific'atiouproblenr. rr,sinrplc classification sclrenrecalled naiue Buges classification has been proposed that is based on Brryes rule of conclitional protru,bility as statecl irr Definitiou 3.1. This approach was briefly outliued in Cirapter 3. 81' analy'zing the contribution of each "indepenclent" attribute. ti conditional probabilitt' is cletermined. A classification is nrade by corubirring the irnpact that the clifferent attributes have on the prediction to be tnatle. Tlier aprproac'his ca.llecl"naive" because it assunresthe independence betlreen the various attribute values. Givert a data value a;; the probabilitv thrrt a relatecl tuplc. li. is irr class C, is describedbv P(C'; lrr;). Trairring cl:rtacan be usccito rleterrrriueP(t,), P(ri I C,). arrcl P(Ci) Fr<lrnthese values. Baves theorenr alllowsus to e s t i n i a t et h e p o s t e r i o rp l o b a b i l i t v P ( C , l . r ' ; ) a r r c lt h e n P ( C ; l t i ) . 
Giverra trairiirrgsert.the rraiveBal.esalgoritlinr first estinratestht'prior probabilitl P(C;) for each class lty counting horn ofteu etrcli class occurs in thc tr:rinirrg clata. For euch attributc. :r:,. thc rrrunberrof occurrerrces of erachattliltrtte value .r'; c:rn be c:orrutcdto tlett'rnritrcP(.r,). Sirrrilarlr'. the plobabilitl P(r', ) C;) can bc cstirnated by t--ountingholv ofterr eaclr I'ahte r,rcc:ttLs in tire classin the trairring <lata. Nofc that \\.eilt'c lookirrg at attriliute valttes tiere. A tuple irr the trairring rlat:r nrzn'ir:rvt'rrr:rr11. cliff<.r't'nt attributes. each with tnatrt-valtrcs.This rrrustlrt' ilonc frx all tittrilmtes arrd all ralues of attrilrutes. \Ve therr use theseclerivcrllrrolrabiliticswiren a rre'r,r. tuple rnust be classified. Tllt i. whv naive Bal't-.sclassilicatiorir,anbe vierw'ed as troth a descriptive and a predictive type of aigoritiurr. Ihe probabilities Chapter4 Data Mining-lntroductory and AdvancedTopics are descriptive and are then used to predict the class membership for a target tuple. When classifving a target tuple, the conditional and prior probabilities generated frorn the training set are used to make the prediction. This is done by combining the effects of the different attribute values from the tuple. Supposethat tuple t.; has p independentattribute values {rur,rnz,. '. 'rip} From the descriptive phase, we know P(r* | Ci), for each class C, and attribute ri;,. We then estimate P(ti lCi) bV P(tilcr):fle@t"lc,) (4.8) A:1 TABLE4.3: Probabilities Associated rvith Attributes Attribute Gender Height Value \T Count Probabilities Short l\Iedium TalI 1 2 6 0 0 3 0 0 0 0 0 1 2 F 3 ( 0 ,1 . 6 1 2 (1.6,1.71 2 ( 1 . 71, . 8 1 0 (i.8,1.91 0 (1.9,2l 0 (2, co) At this point in the algoritlim, we then have the needed prior probabilities P(C) for each classand the conditional probability P(ti I C1). To calculate P(ti), we can estimate the likelihood that t6 is in each class. This can be done by finding the likelihood that this tuple is in each classand then adding all these values. The probability that ti is in a class is the product of the conditional probabilities for each attribute value. The posterior probability P(Ci I ti) is then found for each class. The classwith the highest probability is the one chosen for the tuple. Example 4.5 illustrates the use of naive Baves classification. Classification 85 0 .f ^ 1 0 Short Medium tl4 314 218 618 313 013 2/4 00 00 318 418 r/8 0 o o rl3 2/3 , /A 0 0 0 0 Tall Combining these, we get Likelihood of being short : Likelihood of being mediurn : Likelihood of being tall : 0 x 0.267 : 0 0.031 x 0.533 : 0.0166 0.33 x 0.2 : 0.066 ( 4.e) (4.10) (4.11) We estimate P(t) by sunrming up these individual likelihood values since I will be either short or medium or tall: P(t) :0 + 0.0166+ 0.066: 0.0826 14.r2) Finally, we obtain the actual probabilitiesof eachevent: EXAMPLE 4.5 P ( s h o r tl t ) Using the Outputl classification results for Table 4.1. there are four tuples classified as short, eight a; medium. and three as tall. To facilitate classification, we divide the height attribute into six ranges: ( 0 ,1 . 6 1( 1, . 61, . 7 1( 1, . 71, . 8 1( 1. . 81, . 9 1( 1, . 92, . 0 1( 2, . 0m ,) Table 4.3 shows the counts and subsequentprobabilities associatedwith the attributes. With these training data, we estimate the prior probabilities: P(short) : 4lI5 : 0.267,P(medium) : 8/15 : 0.533, and P(tall) :3115:0.2 We use these valuesto classify a new tuple. For example. supposewe wish to classify 1: (Adam,.1'l.1,1.95 m) . 
By using these values and the associated probabilities of gender and heiglrt, we obtain the following estimates: P(f lshort): P ( f l m e d i u m ): P(t I tall) : ll4 x 0:0 2 1 8x 1 / 8 : 0 . 0 3 1 3l3x 1/3: 0.333 P ( m e d i u ml i ) P ( t a l rl r ) : 0 x 0.0267 0.0826 x 0.533 0.031 _-il., 0.0826 0.333x 0.2 : 0.799 0.0826 _:ll (4.13) (4.14) (4.15) Therefore, based on these probabilities, we classifl' the new tuple as tall becauseit has the highest probability. The naive Bayes approach has severa.l advantages. First, it is easv to use. Second, unlike other classificatiou approaches, only one scan of the training data is required. The naive Bayes approach can easily handle missing values by sipply omitting that probability when calculating the likelihoods of membership in each class. In cases where there are sirnple relationships, the technique often does yield good results. Although the naive Bayes approach is straightforward to use, it does not always yield satisfactory results. First. the attributes usually are not independent. we could use a subset of the attributes by ignoring any rhat are dependent on others. The technique does not handle continuous data. Dividing the continuous values into ranges coulcl be used to solve this probIem, but the division of the domain into ranges is not an easy task, and how this is done can certainly impact the results. Chapter4 ALGORITHMS DISTANCE.BASED Alcomtnu Each itelr that is rnapped to the same cla.ssmay be thought of as rnore similar to the othel itcrus in that class than it is to the items found in other classes. Therefore, similarity (or distance) measuresmay be used to identify the "alikeness" of different items in the database. 4.1 Input: //Centers for each class c1,...,c7t t / /Irrput tuP16 to classifY Output: c / /Class to which t is assigaed Sinpl e distance-based al gori thm dist : oo; fori::1tonrdo i f dis(ci, t) < aist, then c: ii dist : dist(c4, t); Using a similartty rneasurefor classification where the classesare preclefinedis somewhat sinrplcr than using a similarity measure for clustering where the cla^sses are not known in advance. Again, think of the IR example. Each IR. query provides the classdefinition in the form of the IR query itself. So the cla,ssificationproblenr then becomes one of determining siurilarity not among all tuples in the database but between each tuple and the query. This makes the problern arr O(n) ploblen rather rhan an O(n2) problem. Figure 4.8 illustrates the use of this approach to perform classification using the data found in Figure 4.1. The three Iarge dark circles are the class representatives for the three classes. The dashed lines show the distance from each item to the closest center. Simple Approach Usirrg the IR approach, if we have a representative of each class, we can perforrn classificationby assigningeach tuple to the class to which it is most similar. We assume here that each tuple, 11,in the database is defined as a vector (trt,tir,. . . ,tix) of nunreric values. Likewise, we assume that each class C7 is defined by a tuple (Cir,Ciz,...,Cid of numeric values. The classification problem is then restated in Definition 4.2. t0 *-------9u -::---''' I 8 _- - - - - :=f'===- - 1a 7 DprrNlrrou 4.2. Given a databaseD : {tr,tz,. . . . ,fr} of tuples where each tuple tt. : (tn,ti2,. .. , i;6) contains numeric values and a set of classeC s : { C t , . . . , C ^ } w h e r e e a c hc l a s sC i : ( C i t , C 1 z , . . . 
, C i , r ) has numeric values, the classification problem is to assign each l; to the classCi such that (t1,C;) > sim(i;,Ct)ye € C where e * Ci. 6 5 x\ x i\ J To calculate these similarity measures, the representative vector for eachclassmust be determined. Referring to the three classesin Figure 4.1(a), we can determine a representative for each class by calculating the center of each region. Thus class A is representedby \4,7.5), class B by (2,2.8), and class C by (6,2.5). A simple classification technique, then, would be to place each item in the class where it is most similar (closest) to the center of that class. The representative for the class may be found in other ways. For example, in pattern recognition problems, a predefined pattern can be used to represent each class. once a similarity mea"sureis defined, each itenr to be classifiedwill be cornpared to each predefined pattern. The item will be placed in the class with the largest similarity value. Algorithm 4.1 illustrates a straightforward distance-based approach assuming that each class, q, is represented by its center or centroid. In the algorithm we use c; to be the center for its class. since each ttrple must be compared to the center for a class and there are a fixed (usually small) number of classes. the complexitv to classify one tuple is O(n). Classification 87 C "]s X - . x 2 I ClassB o- 0L FIGURE4.8: Classification using simple distance algorithm. 4.3.2 K Nearest Neighbors One common classification scheme based on the use of distance measures is that of the K nearest neighbors (KNN). The KNN technique assumes that the entire training set includes not only bhe daba in the set but also the desired classification for each item. In effect, the training data become the model. When a classification is to be made for a new item, ibs distance to each item in the training set must be determined. Only the K closest entries in the training set are considered further. The new item is then placed in the class that contains the most iterns from this set of K closest items. Figure 4.9 illustrates the process used by KNN. Here the points f8 Chapter 4 Data Mining-lntroductory and AdvancedTopics 10 x x lt c: x----l begin w : l t /- { u } ; irl: l'U {d}; end //Find class for classification class to wbich the most u € iV are classified; Example 4.6 illustrates this technique using the sample data from Table 4.1. The KNN technique is extremely sensitive to the value of K. A rule of thumb is that 1( < @ [KLR+98]. For this example, that value is 3.46. Commercial algorithms often use a default value of 10. x EXAMPLE FIGURE4.9: Classification using KNN. in the training set are shown and K : 3. The three closest items in the training set are shown; t will be placed in the class to which most of these are members. Algorithm 4.2 outlines the use of the KNN algorithm. We use ? to rep resent the training data. Since each tuple to be classifiedmust be cornpared to each element in the training data, if there are q elements in the training set, this is O(q). Given n elements to be classified, this becomes an O(nq) problem. Given that the training data are of a constant size (altho perhaps quite large), this can then be viewed as an O(n) problem. 
data //Training //Nunber of neighbors tuple to classify //Itput / /Class to whicb t is assigued KNN algorithn: to classify tuple using KNN //Algoritbn N: 0 ; for /lFilid set of neighbors, each d€ T do i f I r\rl< K, then trl: IvU {d}; else if 4.6 Using the sample data from Table 4.1 and the Outputl classification as the training set output value, we classify the tuple (Pat, F, 1.6). Only the height is used for distance calculation so that both the Euclidean and Manhattan distance measures yield the same results; that is, the distance is simply the absolute value of the difference between the values. Suppose that K : 5 is given. We then have that the K nearest neighbors to the input tuple are {(Kristina,4 1.6),(Kathy,4 1.6), (Stephanie,F,1.7), (Dave, tr[,L7], (Wynette, 41.75)]. Of these five items, four are classified as short and one as medium. Thus, the KNN will classify Pat as short. 4.4 D E C I S I O N T R E E - B A S E D A L G O R I T H M S The decision tree approach is most useful in classification problems. With this technique, a tree is constructed to model the classification process. Once the tree is built, it is applied to each tuple in the database and results in a classification for that tuple. There are two basic steps in the technique: building the tree and applying the tree to the database. Most research has focused on how to build effective trees as the application process is straightforward. Ar,conrrnrvr 4.2 Input: T K t 0utput: c Classification 89 JV, for t I u € i V s u c h t h a t s i n ( t , u ) < s i n ( t , d ), t h e n The decision tree approach to classificationis to divide the searchspace into rectangular regions. A tuple is classified based on the region into which it falls. A definition for'a decision tree used in classification is contained in Definition 4.3. There are alternative definitionsl for example, in a binary DT the nodes could be labeled with the predicates themselvesand each arc would be Iabeled with yes or no (like in the "Twenty Questions" game). DorrNrrroN 4.3. Given a database D : {tr,. . . ,tn} where f2 : (tt,. . ., til) and the database schemacontains the following attributes s : {Ct,...,C*}. A { A r , A z , . . . , A n } . A l s o g i v e ni s a s e t o f c l a s s e C decision tree (DT) or classification tree is a tree associatedwith D that has the following properties: Chapter4 Data Mining-lntroductory and AdvancedTopics o Each internal node is labeled with an attribute, Ai. r Each arc is labeled with a predicate that can be applied to the attribute associatedwith the parent. o Each leaf node is labeled with a class, C3. Solving the classification problem using decision trees is a twostep process: 1. Decision tree induction: Construct a DT using training data. 2. For each fi e D, apply the DT to determine its class. Based on our definition of the classification problem, Definition 4.1, the constructed DT representsthe logic neededto perform the mapping. Thus, it irnplicitly defines the mapping. Using the DT shown in Figure 3.5 from Chapter 3, the classification of the sample data found in Table 4.1 is that shown in the column labeled Output2. A different DT could yield a different classification. Since the application of a given tuple to a DT is relatively straightforward, we do not consider the second part of the problem further. Instead, we focus on algorithms to construct decision trees. Several algorithms are surveyed in the following subsections. There are many advantages to the use of DTs for classification. 
DTs certainly are easy to use and efficient. Rules can be generated that are easy to interpret and understand. They scale well for large databases because the tree size is independent of the database size. Each tuple in the database must be filtered through the tree. This takes time proportional to the height of the tree, which is fixed. tees can be constructed for data with many attributes. Ar,conruru 4.3 Input: D ,/,/Training data Output: T //Decision tree D T B u il d a l g o r i t h m : algorithro to i]lustrate //Simplistj.c to buildiug DT naive approach T: A; Deterroine best splitting criterion; T: Create root node node atrd label with splitting attribute; T: Add arc to root node for each split predi-cate and 1abe1; for each arc do D: Database created by applying spli.tting predicate to D; if stopping point reached for this path, then d: Create leaf node and label with appropriate claas; Classification .91 el se 1{ : DTBuild(D); I: Add I' to arc; Disadvantages also exist for DT algorithms. First, they do not easily handle continuous data. These attribute domains must be divided into categories to be handled. The approach used is that the domain space is divided into rectangular regions [such as is seen in Figure 4.1(a)]. Not all classification problems are of this type. The division shown by the sinrple loan classification problem in Figure 2.4(a) in Chapter 2 cannot be handled by DTs. Handling missing data is difficult because correct branches in the tree could not be taken. Since the DT is constructed from the training data. overfitting may occur. This can be overcome via tree pruling. Finally, correlations among attributes in the database are ignored by the DT process. The major factor in the performance of the DT building algorithm is the size of the training set. The following issues are faced by most DT algorithms: o Choosing splitting attributes: Which attributes to use for splitting attributes impacts the performance applying the built DT. Some attributes are better than others. In the data shown in Table 4.1, the name attribute definitely should not be used and the gender rnay or may not be used. The choice of attribute involves not only an examination of the data in the training set but also the informed input of domain experts. o Ordering of splitting attributes: The order in which the attributes are chosen is also important. In Figure a.10(a) the gender attribute is chosen first. Alternatively, the height attribute could be chosen first. As seen in Figure 4.10(b), in this case the height attribute must be examined a second time, requiring unnecessary comparisons. r Splits: Associated with the ordering of the attributes is the number of splits to take. With some attributes, the domain is small, so the number of splits is obvious based on the domain (as with the gender attribute). However, if the domain is continuous or has a large number of values, the number of splits to use is not easily determined. o Tbee structure: To improve the performance of applying the tree for classification, a balanced tree with the fewest levels is desirable. However, in this case, more complicated comparisons with multiway branching fseeFigure 10(c)] may be needed. Some algorithms build only binary trees. r Stopping criteria: The creation of the tree definitely stops when the training data are perfectly classified. There rnay be situations when stopping earlier would be desirable to prevent the creation of larger trees. 
This is a trade-off between accuracy of crassification 92 Data Mining-lntroductory and AdvancedTopics Chapter4 and performance. In addition, stopping earlier mav be per to prevent overfitting. It is even conceivable that more levels needed would be created in a tree if it is known that there are d distributions not representedin the training data. o Training data: The structure of the DT created depends on training data. If the training data set is too small. then the tree might not be specific enough to work properly with the general data. If the training data set is too large, then the tree may overfit. o Pruning: Once a tree is constructed, some modifications to the might be needed to improve the performance of the tree during classification phase. The pruning phase might remove redundant parisons or remove subtrees to achieve better performance. In the following subsectionswe examine several popular DT approaches. Height Gender Short Gender Heighl Height <r32lt\8m .''Z'fy Short Medium .=t.8m,/ /\ Tall Short Medium Tall Medium n \.t.9In < 1 . 5m Tall Short .3 1.5 1.82 (a) Balanced tree (b) Deep tree >2m >=1.3 r < 1 . 5m / .5m >1.8 .8m Medium Gender =F,/ \=r'a /\ Medium Short 1.3 1.5 |.8 2.0 (c) Bushytree FIGURE4.10: Comparing decision trees. 4.4.1 lD3 The ID3 technique to building a decision tree is based on information theory and attempts to minimize the expected number of comparisons. The basic idea of the induction algorithm is to ask questions whose answersproyide the most information. This is similar to the intuitive approach taken by adults when plaf ing the "Twenty Questions" game. The first question an adult might a^skcould be "Is the thing alive?" while a child might ask ''Is it my Daddy?" The first question divides the search space into two Iarge search domains, while the second performs Iittle division of the space. The basic strategy used by ID3 is to choose splitting attributes with the highest information gain first. The amount of information associatedwith z,n attribute value is reiated to the probability of oecurrence. Looking at the "Twenty Questions" example, the child's question divides the search space into two sets. One set (Daddy) has an infinitesimal probability associated with it and the other set is almost certain, while the question the aduit makes divides the search space into two subsets with almost equal orobabilitv of occurrins. Tall ,"-\ Height Classification 93 <=1.5m Short Medium The concept used to quantify information is called entropy. Entropy is used to measure the amount of uncertainty or surprise or randomness in a set of data. Certainly, when all data in a set belong to a single class, there is no uncertainty. In this case the entropy is zero. The objective of decision tree classification is to iterativeiy partition the given data set into subsetswhere all elements in each final subset belong to the same class. In Figure 4.11(a, b, and c) will help to explain the concept. Figure 4.11(a) shorvs log(l/p) as the probability p ranges from 0 to 1. This intuitively sholvs the amount of surprise based on the probability. When p : 1, there is no surprise. This means that if an event has a probability of 1 and you are told that the event occurred, you would not be surprised. As p - 0. the surprise increases. When we deal with a divide and conquer approach such as that used with decision trees, the division results in multipie probabilities whose sum is 1. In the "Twenty Questions" game, the P(Daddy) < P(-Daddy) and P(Daddy) * P(-pu66y) : 1. 
To measure the information associatedwith this division, we must be able to combine the information associated with both events. That is, we must be able to calculate the averageinformation associatedwith the division. This can be performed by adding the two values together and taking ilto account the probability that each occurs. Figure 4.11(b) shows the function plog(Ild, tvhich is the expected information based on probability of an event. To determine the expected information associatedwith two events, we add the individnal vaiuestogether. This function plog(lld + (1 - p) Iog(1/(1 - p)) is plotted in Figure a.11(c). Note that the maximurn occurs when the two probabilities are equal. This supports our intuitive idea that the more sophisticated questions posed by the adult are better than those posed by the child. 4 0.2 0.4 0.6 0.8 (a) log(l/p) u.) 0.45 0.4 0.35 0.3 0.25 0.2 0.15 0.1 0.05 0 Classification 95 Chapter4 Data Mining-lntroductory and AdvancedTopics Heighr <=1.7m A <=t.95m/ \ > 1 . 7m <=1.95m >t..l5n /\ t"" (a)original,r.J"o'u* o.2 0.4 0.6 0.8 (b)p log(1/p) 0 0.2 0.4 0.6 0.8 (c)H(p,1 - p) FIGURE4.11:Entropy. Drrnrrrrorlr 4.4. Given probabilities pt,pz, . entropy is defined as where!i-, pi : l, s (4.r6) Given a database state, D. fI(D) finds the arnount of order (or lack thereof) irr that state. When that state is split into s rrew states S : w e c a n a g a i n l o o k a t t h e e n t r o p y o f t h o s es t a t e s . E a c h {Dt,Dr,...,D"}, step in ID3 choosesthe state that orders splitting the most. A database is completely ordered if all tuples in it are in the sanre class. ID3 chooses the splitting attribute with the highest gain in information. where gain is defined as the differencebetween how much information is neededto rnake a correct classification before the split versus how much information is needed after the split. Certairily, the split should reduce the infomiation needed by the largest amount. This is caiculated b1' rieterrnirring the differences between the entropies of the original dataset and the weighted sum of the entropies from each of the subdivided datasets. The entropies of tire split datasets are weighted by the fraction of the dataset beirrg placed in that division. The ID3 algorithm calculates the qairt of a particular split by the follorving fcrrmula: _j_ G a i r r ( DS. ) : H ( D ) - L P(n,)II(D,) FIGURE4.12: Classification problem. EXAMPLE The formal definition of entropy is shown iu Definitiorr 4.4. The value entropy is between 0 and 1 and reachesa rnaximurn when the probabilit are all the sanre. (r.17) Exanrple 4.7 arrd assciciatedFigure 4.12 ilhrstrate this processrrsing the heigirt exantple. In this exarnple. six clivisions of the possible r.arrgesof heigirts are uscd. This division into ranges is neetled u'hen the donrain of aIr attriltute is contiurtortsor (as in this case) consists of rnany possible valtres. While the chtiice of these divisions is sonrewhat arbitrary, a donrain cxpert shorrldbe able to perfornr the task. (b) Optimizedtree 4.7 The beginning state of the training data in Table 4.1 (with the Outputl is that (4115) are short, (8/15) are medium, and (3/15) are clrnssification) tall. Thus. the entropy of the starting set is allt5rog(t\14)+ 8/1slog(15/8) + 3lrs log(15/3): 0.4384 Choosing the gender as the splitting attribute, there are nine tuples that are F and six that arc M. 
The entropy of the subset that are ,F is 3/s log(9/3)+ 6/s log(s/6): o.2764 (4.18) whereas that for the ,4/ subset is t / 6 l o 9 ( 6 / 1 ) + 2 1 6 I o 9 ( 6 1 2 )+ 3 l 6 l o 9 ( 6 / 3 ) : 0 . 4 3 e 2 (1.1e) The ID3 algorithm must determine what the gain in information is by using this split. To do this. we calculate the weighted sutn of these last two entropies to get :0.34152 ((e/15)0.2764) + ((6/15)0.43e2) (4.20) Tlie gain in entropy by usingthe genderattribute is tlius - 0.34152: 0.09688 0..1384 (4.21) Looking at the heigirt attribute, we have two tuples that are 1.6, two are 1.7, otre is 1.75,two are 1.8. oue is 1.85,orreis 1.88.two are 1.9, one is 1.911, one is 2, one is 2.1, and one is 2.2. Deterruining the split values for heiglrt is not easy. Even though the training dataset has these 11 values. we krrow that there will be manv lrrore. Just as rvith corrtirruousdata. we divide into ranges: ( 0 . 1 . 6(11,. 6 . 1 . (7r1. 7, 1, . 8 1 ( r.. 8 , 1 . e( 1 ] .. e . 2 . (021..0c.c ) T h e r e a r e 2 t u p l e s i n t h e f i r s t d i v i s i o nw i t h e n t r o p y ( 2 1 2 ( 0 )+ 0 + 0 ) : 0 , 2 in (1.6, 1.71with entropy (2/2(O) + 0 + 0) : 0, 3 in (1.7, 1.81with entrop;. DataMining-lntroductoryandAdvanced Topics Chapter4 ( 0 + 3 / 3 ( 0 ) + 0 ) : 0 , 4 i n ( 1 . 8 ,1 . 9 w 1 i t h e n t r o p (y0 + 4 / 4 ( 0 )+ 0 ) : 0 , 2 i n : 0.301,andtwo in the (1.9,2.01with entropy(0+ 1/2(0.301)+ 1/2(0.301)) last with entropy (0 + 0 + 212(0)): 0. All of thesestatesare completely orderedand thus an entropy of 0 exceptfor the (1.9,2.0]state. The gain in entropy by usingthe height attribute is thus - 2/15(0.30r) : 0.3983 0.4384 Classificatioi gl Splitting: The ID3 approach favors attributes with rnany divisions and thus rnay lead to overfitting. In the extreme, an attribute that has a unique value for each tuple in the training set wr_ruldbe the best because there would be only one tuple (and thus oue class) for each division. An improvement can be made by taking into account the cardinality of each division. This approach usesthe GainRatio as opposed to Gain. The GainRatio is defined as (4.22) Thus, this has the greater gain, and we choosethis over gender as the first splitting attribute. Within this division there are two males, one medium and one tall. This has occurred because this grouping was too large. A further subdivision on height is needed, and this generatesthe DT seen in Figure a.l2(a). GainRatio(D. ^9): ,(w #) Gain(D,^9) (4.23) For splitting purposes, c4.5 usesthe largest GainRatio that ensuresa larger than average information gain. This is to compensate for the fact that the GainRatio value is skervedtoward splits where the size of one subset is close to that of the starting one. Example 4.8 shows the calculation of GainRatio for the first split in Example 4.7. Figure a.l2(a) illustrates a problem in that the tree has multiple splits with identical results. In addition, there is a subdivision of range (1.9,2.01. Figure 4.12(b) shows an optimized version of the tree. t.2 c4.5 EXAMPLE 4.8 The decision tree algorithm C4.5 improves ID3 in the following ways: To calculate the GainRatio for the gerrder split, we first find the entropy associatedwith the split ignoring classes o Missing data: When the decision tree is built, missing data are simply ignored. That is, the gain ratio is calculated by looking only at the other records that have a value for that attribute. 
To classify a record with a missing attribute value, the value for that item can be predicted based on what is known about the attribute values for the other records. "(* (4.24) (4.25) The entropy for the split on height (ignoring classes)is ,(+** +*) c4.5: r Rules: c4.5 allows classification via either decision trees or rures gerrerated from them. In addition, some techniques to simplify complex rules are proposed. One approach is to replace the left-hand side of a rule by a sinrpler version if all records in the training set are treated identically. An "otherwise" type of lule carr be used to indicate what should be done if no other rules apply. :02e2 0.09688 : 0.332 0.292 There are two primary pruning strategies proposed in - With subtree replacemenl, a subtree is replaced by a leaf node if this replacement results in an error rate close to that of the original tree. Subtree replacement works from the bottom of the tree up to the root. - Another pruning strategy, called subtree ra'ising, replaces a subtree by its most used subtree. Here a subtree is raised from its current location to a node higher up in the tree. Again, we rnust determine the increase in error rate for this replacernent. :*'o*(f) .**(f) This gives the GainRatio value for the gender attribute as o Continuous data: The basic idea is to divide the data into ranges based on the attribute values for that item that are found in the training sample. o Pruning: *) 4.4.3 (4.26) CART Classification and regression trees (CART) is atechnique that generates a binary decision tree. As with ID3, entropy is used as a rneasureto choose the best splitting attribute and criterion. unlike IDJ, however. where a child is created for each sr-rbcategory,only two children are created. T6e splitting is perfornred around what is cleterrninedto be the best split point. At each step. an exhaustive searchis used to deterrnine the best split. where "best" is defined by o(s/t) :ZPrPn\l j:\ e { 4 | t i l - P ( c , I t p 11 (4.27) Chapter4 Data Mining-lntroductory and AdvancedTopics This formula is evaluated at the current node, t, and for each possible splitting attribute and criterion, s. Here tr and -R are used to indicate the teft ana right subtrees of the current node in the tree. Pt, Pn are the probability that a tuple in the training set will be on the left or right side of the tree. This is defined* #ffi?*i*##. we assumethat the right branch is taken on equality. p(.Ci I t1) or P(C1 | tn) is the probability that a tuple is in tliis class, Ci, and in the left or riglrt subtree' This is defined ., ltuplesof clm1l-I.:g!11sgl. At each step. orrly one criterion is chosen as tne target nodel it"[ieiAt-tr" as the besi over all p6ssible criteria. Example 4.9 shows its use with the height example with Outputl results. EXAMPLE 4.9 The first step is to determine the split attribute and criterion for the first split. we again assume that there are six subranges to consider with the height attribute. Using these rallges, we have the potentiai split values of 1.6,1.7,1.8,1.9,2.0. We thus have a choice of six split points, which yield the following goodness mea"sures: i D ( G e n d e r ): 2 ( 6 / 1 5 ) ( 9 1 1 5 ) ( 2 1 1+5 4 l L 5 + 3 / 1 5 ) : 0 ' 2 2 4 o(1.6) : 0 o ( 1 . 7 ): 2 ( 2 1 r 5 ) ( r 3 l 1 5+) (801 r 5+ 3 / i 5 ) : 0 . 1 6 e 5 3l15): 0'385 r 51 1 + o ( 1 . 
8 ): 2 ( 5 1 : 5 ) O 0 l 1 5 ) ( 1+1 6 + 2l15+ 3/15): 0'256 o(1.e) : 2(sl15)(61r5)(4115 + 8/15+ 3/15)- 0'32 o(2.0) : 2(121r5)(31r5)(41t5 (4'28) u.29) (4.30) (4'31) (4'32) (4'33) The largest of these is tire split at 1.8. The remainder of this example left as an exercise. Since gender is really unordered, we assurneM < F. As illustrated with the gender attribute, CART forces that an orderi of the attributes be used. cART handles missing data by simply ignori that record in calculating the goodnessof a split on that attribute. tree stops growing when no split will improve the performance. Note t even though it is the best for the training data, it may not be the for all possible data to be added in the future. The CART algorithm contains a pruning strategy. which we will not discuss here but which be fouud in IKLR+981. 1.4.4 Scalable DT Techniques We briefly examine sonte DT techniques that address creation of DTs for scalability datasets. The SPRINT (Scalable PaRallelizabte INducti,on of decision Trees) algorithm addlesses the scalability issue by ensuring that the CART Classificatidn 99 technique can be applied regardless of availability of main memory. In addition, it can be easily parallelized. With SPRINT, a gini index is used to find the best split. Here gin'i for a database D is defined as gini(D): t -Do? (4.34) where pi is the frequency of class C, in D. The goodness of a split of D into subsets D1 and D2 is defined bY : gini"o11,(D) + P(gi,'i(Dr)) f {ri"i{,,)) (4.35) The split with the best gini value is chosen. Unlike the earlier approaches, SPRINT does not need to sort the data by goodness value at each node during the DT induction process. Witir contiuuous data, the split point is chosen to be the midpoint of every pair of consecutive values from the training set. By maintaining aggregate metadata concerning database attributes, the RainForest approach allows a choice of split attribute without needing a trainirrg set. For each node of a DT, a table called the att'ri'b'ute-ualue class (AVC ) Iabel grouyt is used. The table surnrnarizesfor au attribute the count of entries per class or attribute value grouping. Thus, the AVC table summarizes the information needed to determine splitting attributes. The size of the table is not propoltional to the size of the database or training set, but rather to the product of the rrumber of classes.unique attribute values, and potential splitting attributes. This reduction in size (for large training sets) facilitates the scaling of DT induction algorithms to extremely large training sets. During the tree-building phase, the training data are scanned, the AVC is built, and the best splittirrg attribute is chosen. The algorithm continues by splitting the training data and constructing the AVC for the next node. 4.5 NEURAL NETWORK-BASED ALGORITHMS S/ith neural networks (NNs). just as n'ith decision trees. a model representing how to classify anv given database tuple is constructed. The activation functions typicallv are sigmoidai. When a tuple must ber classifiecl.certain attribute values from that tuple are irrput into the directed graph at the corresponding source nodes. There often is one sink node f<rr each class. The output value that is gerreratetl indicates the probability that the corresponding input tuple belongs to that class. I'he tuple will then be assignedto the class with the highest probability of rlernbersirip. The learning process rnodifies the labeling of the arcs to l)etter classify tuples. 
By maintaining aggregate metadata concerning database attributes, the RainForest approach allows a choice of split attribute without needing a training set. For each node of a DT, a table called the attribute-value class (AVC) label group is used. The table summarizes, for an attribute, the count of entries per class or attribute value grouping. Thus, the AVC table summarizes the information needed to determine splitting attributes. The size of the table is not proportional to the size of the database or training set, but rather to the product of the number of classes, the number of unique attribute values, and the number of potential splitting attributes. This reduction in size (for large training sets) facilitates the scaling of DT induction algorithms to extremely large training sets. During the tree-building phase, the training data are scanned, the AVC is built, and the best splitting attribute is chosen. The algorithm continues by splitting the training data and constructing the AVC for the next node.

4.5 NEURAL NETWORK-BASED ALGORITHMS

With neural networks (NNs), just as with decision trees, a model representing how to classify any given database tuple is constructed. The activation functions typically are sigmoidal. When a tuple must be classified, certain attribute values from that tuple are input into the directed graph at the corresponding source nodes. There often is one sink node for each class. The output value that is generated indicates the probability that the corresponding input tuple belongs to that class. The tuple will then be assigned to the class with the highest probability of membership. The learning process modifies the labeling of the arcs to better classify tuples. Given a starting structure and a value for all the labels in the graph, as each tuple in the training set is sent through the network, the projected classification made by the graph can be compared with the actual classification. Based on the accuracy of the prediction, various labelings in the graph can change. This learning process continues with all the training data or until the classification accuracy is adequate.

Solving a classification problem using NNs involves several steps:

1. Determine the number of output nodes as well as what attributes should be used as input. The number of hidden layers (between the source and the sink nodes) also must be decided. This step is performed by a domain expert.
2. Determine the weights (labels) and functions to be used for the graph.
3. For each tuple in the training set, propagate it through the network and evaluate the output prediction against the actual result. If the prediction is accurate, adjust the labels to ensure that this prediction has a higher output weight the next time. If the prediction is not correct, adjust the weights to provide a lower output value for this class.
4. For each tuple t_i ∈ D, propagate t_i through the network and make the appropriate classification.

There are many advantages to the use of NNs for classification:

• NNs are more robust than DTs because of the weights.
• The NN improves its performance by learning. This may continue even after the training set has been applied.
• The use of NNs can be parallelized for better performance.
• There is a low error rate and thus a high degree of accuracy once the appropriate training has been performed.
• NNs are more robust than DTs in noisy environments.

There are also many issues to be examined:

• Attributes (number of source nodes): This is the same issue as determining which attributes to use as splitting attributes.
• Number of hidden layers: In the simplest case, there is only one hidden layer.
• Number of hidden nodes: Choosing the best number of hidden nodes per hidden layer is one of the most difficult problems when using NNs. There have been many empirical and theoretical studies attempting to answer this question. The answer depends on the structure of the NN, the types of activation functions, the training algorithm, and the problem being solved. If too few hidden nodes are used, the target function may not be learned (underfitting). If too many nodes are used, overfitting may occur. Rules of thumb are often given that are based on the size of the training set.
• Training data: As with DTs, with too much training data the NN may suffer from overfitting, while with too little it may not be able to classify accurately enough.
• Number of sinks: Although it is usually assumed that the number of output nodes is the same as the number of classes, this is not always the case. For example, with two classes there could be only one output node, with the resulting value being the probability of being in the associated class. Subtracting this value from one would give the probability of being in the second class.
• Interconnections: In the simplest case, each node is connected to all nodes in the next level.
• Weights: The weight assigned to an arc indicates the relative weight between those two nodes. Initial weights are usually assumed to be small positive numbers and are assigned randomly.
• Activation functions: Many different types of activation functions can be used.
• Learning technique: The technique for adjusting the weights is called the learning technique. Although many approaches can be used, the most common approach is some form of backpropagation, which is discussed in a subsequent subsection.
• Stop: The learning may stop when all the training tuples have propagated through the network or may be based on time or error rate.
• Testing
• Verification
Conversely, NNs have many disadvantages:

• NNs are difficult to understand. Nontechnical users may have difficulty understanding how NNs work. While it is easy to explain decision trees, NNs are much more difficult to understand.
• Generating rules from NNs is not straightforward.
• Input attribute values must be numeric.
• As with DTs, overfitting may result.
• The learning phase may fail to converge.
• NNs may be quite expensive to use.

4.5.1 Propagation

The normal approach used for processing is called propagation. Given a tuple of values input to the NN, X = ⟨x1, ..., xh⟩, one value is input at each node in the input layer. Then the summation and activation functions are applied at each node, with an output value created for each output arc from that node. These values are in turn sent to the subsequent nodes. This process continues until a tuple of output values, Y = ⟨y1, ..., ym⟩, is produced from the nodes in the output layer. The process of propagation is shown in Algorithm 4.4 using a neural network with one hidden layer. Here a hyperbolic tangent activation function is used for the nodes in the hidden layer, while a sigmoid function is used for the nodes in the output layer. We assume that the constant c in the activation function has been provided. We also use k to be the number of edges coming into a node.

ALGORITHM 4.4
Input:
  N                    // neural network
  X = ⟨x1, ..., xh⟩    // input tuple consisting of values for the input attributes only
Output:
  Y = ⟨y1, ..., ym⟩    // tuple consisting of the output values from the NN
Propagation algorithm:   // illustrates propagation of a tuple through a NN
  for each node i in the input layer do
    output x_i on each output arc from i;
  for each hidden layer do
    for each node i do
      S_i = Σ_{j=1}^{k} (w_ji x_ji);
      for each output arc from i do
        output (1 − e^(−c S_i)) / (1 + e^(−c S_i));
  for each node i in the output layer do
    S_i = Σ_{j=1}^{k} (w_ji x_ji);
    output y_i = 1 / (1 + e^(−c S_i));
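The following Python sketch mirrors the structure of Algorithm 4.4 for a network with one hidden layer, with the hyperbolic tangent form at the hidden layer and the sigmoid at the output layer. The weight matrices and the input tuple are invented for illustration; they are not the height network of the examples.

import math

def propagate(x, w_hidden, w_output, c=1.0):
    """One-hidden-layer propagation in the style of Algorithm 4.4."""
    def hidden_act(s):
        # hyperbolic tangent form used at the hidden layer
        return (1 - math.exp(-c * s)) / (1 + math.exp(-c * s))
    def output_act(s):
        # sigmoid used at the output layer
        return 1 / (1 + math.exp(-c * s))
    hidden = [hidden_act(sum(w * xi for w, xi in zip(weights, x))) for weights in w_hidden]
    return [output_act(sum(w * h for w, h in zip(weights, hidden))) for weights in w_output]

# hypothetical weights: 2 inputs -> 3 hidden nodes -> 3 output nodes (one per class)
w_hidden = [[0.2, -0.1], [0.4, 0.3], [-0.5, 0.2]]
w_output = [[0.1, 0.7, -0.3], [0.5, -0.2, 0.4], [-0.4, 0.3, 0.6]]
print(propagate([0.0, 1.95], w_hidden, w_output))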
A simple application of propagation is shown in Example 4.10 for the height data. Here the classification performed is the same as that seen with the decision tree in Figure 4.10(d).

EXAMPLE 4.10

Figure 4.13 shows a very simple NN used to classify university students as short, medium, or tall. There are two input nodes, one for the gender data and one for the height data. There are three output nodes, each associated with one class and using a simple threshold activation function. Activation function f3 is associated with the short class, f4 is associated with the medium class, and f5 is associated with the tall class. In this case, the weight of each arc from the height node is 1, while the weights on the gender arcs are 0. This implies that in this case the gender values are ignored. The plots for the graphs of the three activation functions are shown.

FIGURE 4.13: Example propagation for tall data.

4.5.2 NN Supervised Learning

The NN starting state is modified based on feedback of its performance with the data in the training set. This type of learning is referred to as supervised because it is known a priori what the desired output should be. Unsupervised learning can also be performed if the output is not known. With unsupervised approaches, no external teacher set is used. A training set may be provided, but no labeling of the desired outcome is included. In this case, similarities and differences between different tuples in the training set are uncovered. In this chapter, we examine supervised learning.

Supervised learning in an NN is the process of adjusting the arc weights based on its performance with a tuple from the training set. The behavior of the training data is known a priori and thus can be used to fine-tune the network for better behavior in future similar situations. Thus, the training set can be used as a "teacher" during the training process. The output from the network is compared to this known desired behavior. Algorithm 4.5 outlines the steps required.

ALGORITHM 4.5
Input:
  N    // starting neural network
  X    // input tuple from training set
  D    // output tuple desired
Output:
  N    // improved neural network
SupLearn algorithm:   // simplistic approach to NN learning
  Propagate X through N producing output Y;
  Calculate the error by comparing D to Y;
  Update the weights on the arcs in N to reduce the error;

One potential problem with supervised learning is that the error may not be continually reduced. It would, of course, be hoped that each iteration in the learning process reduces the error so that it is ultimately below an acceptable level. However, this is not always the case. This may be due to the error calculation technique or to the approach used for modifying the weights. This is actually the general problem of NNs: they do not guarantee convergence or optimality.

Backpropagation is a learning technique that adjusts weights in the NN by propagating weight changes backward from the sink to the source nodes. Backpropagation is the most well known form of learning because it is easy to understand and generally applicable. Backpropagation can be thought of as a generalized delta rule approach. Notice that this algorithm must be associated with a means to calculate the error as well as some technique to adjust the weights. Many techniques have been proposed to calculate the error. Assuming that the output from node i is y_i but should be d_i, the error produced from a node in any layer can be found by

| y_i − d_i |    (4.36)

The mean squared error (MSE) is found by

(y_i − d_i)² / 2    (4.37)

This MSE can then be used to find a total error over all nodes in the network or over only the output nodes. In the following discussion, the assumption is made that only the final output of the NN is known for a tuple in the training data. Thus, the total MSE error over all m output nodes in the NN is

Σ_{i=1}^{m} (y_i − d_i)² / m    (4.38)

This formula could be expanded over all tuples in the training set to see the total error over all of them. Thus, an error can be calculated for a specific test tuple or for the total set of all entries.
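As a quick worked check of Equation (4.38), the small sketch below computes the total MSE over the m output nodes for one made-up output tuple and desired tuple.

def mse(outputs, desired):
    """Total MSE over the m output nodes: sum_i (y_i - d_i)^2 / m  (Equation 4.38)."""
    m = len(outputs)
    return sum((y - d) ** 2 for y, d in zip(outputs, desired)) / m

print(mse([0.2, 0.7, 0.1], [0.0, 1.0, 0.0]))   # hypothetical network output vs. desired values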
The Hebb and delta rules are approaches to change the weight on an input arc to a node based on the knowledge that the output value from that node is incorrect. With both techniques, a learning rule is used to modify the input weights. Suppose for a given node, i, the input weights are represented as a tuple ⟨w_1i, ..., w_ki⟩, while the input and output values are ⟨x_1i, ..., x_ki⟩ and y_i, respectively. The objective of a learning technique is to change the weights based on the output obtained for a specific input tuple. The change in weights using the Hebb rule is represented by the following rule:

Δw_ji = c x_ji y_i    (4.39)

Here c is a constant often called the learning rate. A rule of thumb is that c = 1 / (number of entries in the training set).

A variation of this approach, called the delta rule, examines not only the output value y_i but also the desired value d_i for the output. In this case the change in weight is found by the rule:

Δw_ji = c x_ji (d_i − y_i)    (4.40)

The nice feature of the delta rule is that it minimizes the error d_i − y_i at each node.

Figure 4.14 shows the structure and use of one node, j, in a neural network graph. The basic node structure is shown in part (a). Here the representative input arc has a weight of w_?j, where ? is used to show that the input to node j is coming from another node, shown here as ?. Of course, there probably are multiple input arcs to a node. The output weight is similarly labeled w_j?. During propagation, data values input at the input layer flow through the network, with final values coming out of the network at the output layer. The propagation technique is shown in part (b) of Figure 4.14. Here the smaller dashed arrow underneath the regular graph arc shows the input value x_?j flowing into node j. The activation function f_j is applied to all the input values and weights, with output values resulting. There is an associated input function that is applied to the input values and weights before applying the activation function. This input function is typically a weighted sum of the input values. Here y_j? shows the output value flowing (propagating) from node j to the next node. Thus, propagation occurs by applying the activation function at each node, which then places the output value on the arc to be sent as input to the next nodes. In most cases, the activation function produces only one output value that is propagated to the set of connected nodes. The NN can be used for classification and/or learning. During the classification process, only propagation occurs. However, when learning is used after the output of the classification occurs, a comparison to the known classification is used to determine how to change the weights in the graph. In the simplest types of learning, learning progresses from the output layer backward to the input layer. Weights are changed based on the changes that were made in weights in subsequent arcs. This backward learning process is called backpropagation and is illustrated in Figure 4.14(c). Weight w_j? is modified to become w_j? + Δw_j?. A learning rule is applied to this Δw_j? to determine the change at the next higher level, Δw_?j.

FIGURE 4.14: Neural network usage. (a) Node j in NN; (b) Propagation at node j; (c) Backpropagation at node j.
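A minimal Python sketch of the two weight-update rules just described, Equations (4.39) and (4.40), follows. The learning rate, input values, and outputs below are invented for illustration.

def hebb_update(weights, inputs, y, c=0.1):
    """Hebb rule: delta_w_ji = c * x_ji * y_i, applied to every input arc of node i."""
    return [w + c * x * y for w, x in zip(weights, inputs)]

def delta_update(weights, inputs, y, d, c=0.1):
    """Delta rule: delta_w_ji = c * x_ji * (d_i - y_i), which works to reduce the error d_i - y_i."""
    return [w + c * x * (d - y) for w, x in zip(weights, inputs)]

w = [0.5, -0.3]                                  # hypothetical weights on two input arcs
print(hebb_update(w, [1.0, 0.6], y=0.8))
print(delta_update(w, [1.0, 0.6], y=0.8, d=1.0))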
A simple version of the backpropagation algorithm is shown in Algorithm 4.6. The MSE is used to calculate the error. Each tuple in the training set is input to this algorithm. The last step of the algorithm uses gradient descent as the technique to modify the weights in the graph. The basic idea of gradient descent is to find the set of weights that minimizes the MSE. ∂E/∂w_ji gives the slope (or gradient) of the error function for one weight, and we wish to find the weight where this slope is zero. Figure 4.15 and Algorithm 4.7 illustrate the concept. The stated algorithm assumes only one hidden layer. More hidden layers would be handled in the same manner, with the error propagated backward.

ALGORITHM 4.6
Input:
  N                    // starting neural network
  X = ⟨x1, ..., xh⟩    // input tuple from training set
  D = ⟨d1, ..., dm⟩    // output tuple desired
Output:
  N                    // improved neural network
Backpropagation algorithm:   // illustrates backpropagation
  Propagation(N, X);
  E = 1/2 Σ_{i=1}^{m} (d_i − y_i)²;
  Gradient(N, E);

FIGURE 4.15: Gradient descent.

Figure 4.16 shows the structure we use to discuss the gradient descent algorithm. Here node i is at the output layer and node j is at the hidden layer just before it; y_i is the output of i and y_j is the output of j. The learning function in the gradient descent technique is based on using the following value for delta at the output layer:

Δw_ji = −η ∂E/∂w_ji = −η (∂E/∂y_i) (∂y_i/∂S_i) (∂S_i/∂w_ji)    (4.41)

In this equation, η is referred to as the learning parameter. It typically is found in the range (0, 1), although it may be larger. This value determines how fast the algorithm learns. Applying a learning rule back through multiple layers in the network may be difficult. Doing this for the hidden layers is not as easy as doing it for the output layer. Overall, however, we are trying to minimize the error at the output nodes, not at each node in the network. Thus, the approach that is used is to propagate the output errors backward through the network.

FIGURE 4.16: Nodes for gradient descent.

ALGORITHM 4.7
Input:
  N    // starting neural network
  E    // error found from the backpropagation algorithm
Output:
  N    // improved neural network
Gradient algorithm:   // illustrates incremental gradient descent
  for each node i in the output layer do
    for each node j input to i do
      Δw_ji = η (d_i − y_i) y_j (1 − y_i) y_i;
      w_ji = w_ji + Δw_ji;
  layer = previous layer;
  for each node j in this layer do
    for each node k input to j do
      Δw_kj = η y_k y_j (1 − y_j) Σ_m (d_m − y_m) w_jm (1 − y_m) y_m;
      w_kj = w_kj + Δw_kj;

This algorithm changes weights by working backward from the output layer to the input layer. There are two basic versions of this algorithm. With the batch or offline approach, the weights are changed once after all tuples in the training set are applied and a total MSE is found. With the incremental or online approach, the weights are changed after each tuple in the training set is applied. The incremental technique is usually preferred because it requires less space and may actually examine more potential solutions (weights), thus leading to a better solution.
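To make the incremental update in Algorithm 4.7 concrete, the sketch below applies the output-layer rule Δw_ji = η (d_i − y_i) y_j (1 − y_i) y_i for a single sigmoid output node. The numeric values are invented; the hidden-layer weights would be updated by the second loop of the algorithm in the same style.

def output_weight_update(w_ji, y_j, y_i, d_i, eta=0.1):
    """Incremental gradient descent step for a weight from hidden node j to sigmoid output node i:
    w_ji = w_ji + eta * (d_i - y_i) * y_j * (1 - y_i) * y_i."""
    return w_ji + eta * (d_i - y_i) * y_j * (1 - y_i) * y_i

print(output_weight_update(w_ji=0.4, y_j=0.9, y_i=0.3, d_i=1.0))   # hypothetical values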
A simple perceptron can be used to classify into two classes.Using a unipolar activation function, an output of 1 would be used to classify into one class, while an output of 0 would be used to pass in the other class. Example 4.11 illustrates this. EXAMPLE Classification lll using an NN with only one hidden layer. In this case, the NN is to have one input node for each attribute input, and given n input attributes the hidden layer should have (2n * 1) nodes, each with input from each of the input nodes. The output layer has one rrode for each desired output value. 4.6 RUtE-BASED ALGORITHMS One straightforward way to perform classification is to generate if then rules that cover all cases. For example, we could have the foilowin! rules to determine classification of grades: 4.11 Figure 4.18(a) shows a perceptron with two inputs and a bias input. The three weights are 3,2, and - 6, respectively. The activation function /a is thus applied to the value S : 3rr f 2r2 - 6. Using a simple unipolar step activation function, we get /.:il :lf"iJ." ) If 90 < grade, then class : If 80 < grade and grade ( g0, then class : B If 70 < grade and grade ( 80,then class : If 60 < grade and grade ( 70, then class : D If grade < 60, then class : F (4.55) ,4 C A classi,fication rule, , : (o,c), consists of the if or anteced,ent, part a, and the then or consequent portion, c. The antecedentcontains a predicate that can be evaluated as true or false against each tuple in the database (and obviously in the training data). These rules relate directly to the corresponding DT that could be created. A DT can always be used to generate rules, but they are not equivalent. There are differentes between rules and trees: x2 r The tree has an implied order in which the splitting is performed. Rules have no order. (a) Classificationperceptron (b) Classification problem o A tree is created based on Iooking at all classes. when generating rules, orrly one class must be examirred at a time. FIGURE4.18: Perceptron classification example. An alternative way to view this classification problern is shown in Figure 4.18(b). Here 11 is shown on the horizontal axis and 12 is shown on the vertical axis. The area of the plane to the iight of the line rz : 3-3 /2rr representsone class and the rest of the plane representsthe other class. The simple feedforward NN that was introduced in Chapter 3 is actually called a mult'ilayer perceptron (MLP). An Iv,ILP is a network of perceptrons. Figure 3.6 showed an MLP used for classifying the height example given in Table 4.1. The neurons are placed in layers with outputs always flowing toward the output layer. If only one layer exists, it is called a perceptron. If multiple layers exist, it is an MLP. In the 1950s a Russian rnathematician, Andrey Kolmogorov, proved that an MLP needs no more than two hidden layers. Kolmogorou's theorem states that a mapping between two sets of nunrbers can be perfornred There are algorithms that generate rules from trees as well as algorithms that generate rules without first creating DTs. 4.6.1 Generating Rules from a DT The orocess to generate a rule from a DT is straightforward and is outlined in Algorithm 4.8. This algorithm will generate a rule for each leaf node in the decision tree. All rgles with the same consequent courd be combined together by ORing the antecedentsof the sirnpler rules. Ar,coRrrnv 4.8 flrDut ! 
4.6 RULE-BASED ALGORITHMS

One straightforward way to perform classification is to generate if-then rules that cover all cases. For example, we could have the following rules to determine the classification of grades:

If 90 ≤ grade, then class = A
If 80 ≤ grade and grade < 90, then class = B
If 70 ≤ grade and grade < 80, then class = C
If 60 ≤ grade and grade < 70, then class = D
If grade < 60, then class = F    (4.55)

A classification rule, r = ⟨a, c⟩, consists of the if or antecedent part, a, and the then or consequent portion, c. The antecedent contains a predicate that can be evaluated as true or false against each tuple in the database (and obviously against the tuples in the training data). These rules relate directly to the corresponding DT that could be created. A DT can always be used to generate rules, but they are not equivalent. There are differences between rules and trees:

• The tree has an implied order in which the splitting is performed. Rules have no order.
• A tree is created based on looking at all classes. When generating rules, only one class must be examined at a time.

There are algorithms that generate rules from trees as well as algorithms that generate rules without first creating DTs.

4.6.1 Generating Rules from a DT

The process to generate a rule from a DT is straightforward and is outlined in Algorithm 4.8. This algorithm will generate a rule for each leaf node in the decision tree. All rules with the same consequent could be combined together by ORing the antecedents of the simpler rules.

ALGORITHM 4.8
Input:
  T    // decision tree
Output:
  R    // rules
Gen algorithm:   // illustrates a simple approach to generating classification rules from a DT
  R = ∅
  for each path from the root to a leaf in T do
    a = True
    for each non-leaf node do
      a = a ∧ (label of node combined with label of incident outgoing arc)
    c = label of leaf node
    R = R ∪ {r = ⟨a, c⟩}

Using this algorithm, the following rules are generated for the DT in Figure 4.12(a):

{⟨(Height ≤ 1.6 m), Short⟩,
 ⟨((Height > 1.6 m) ∧ (Height ≤ 1.7 m)), Short⟩,
 ⟨((Height > 1.7 m) ∧ (Height ≤ 1.8 m)), Medium⟩,
 ⟨((Height > 1.8 m) ∧ (Height ≤ 1.9 m)), Medium⟩,
 ⟨((Height > 1.9 m) ∧ (Height ≤ 2 m) ∧ (Height ≤ 1.95 m)), Medium⟩,
 ⟨((Height > 1.9 m) ∧ (Height ≤ 2 m) ∧ (Height > 1.95 m)), Tall⟩,
 ⟨(Height > 2 m), Tall⟩}

An optimized version of these rules is then:

{⟨(Height ≤ 1.7 m), Short⟩,
 ⟨((Height > 1.7 m) ∧ (Height ≤ 1.95 m)), Medium⟩,
 ⟨(Height > 1.95 m), Tall⟩}

4.6.2 Generating Rules from a Neural Net

To increase the understanding of an NN, classification rules may be derived from it. While the source NN may still be used for classification, the derived rules can be used to verify or interpret the network. The problem is that the rules do not explicitly exist: they are buried in the structure of the graph itself. In addition, if learning is still occurring, the rules themselves are dynamic. The rules generated tend both to be more concise and to have a lower error rate than rules used with DTs. The basic idea of the RX algorithm is to cluster output values with the associated hidden nodes and input. A major problem with rule extraction is the potential size of the rule set. For example, if you have a node with n inputs, each having 5 values, there are 5^n different input combinations to this one node alone. These patterns would all have to be accounted for when constructing rules. To overcome this problem, and that of having continuous ranges of output values from nodes, the output values for both the hidden and output layers are first discretized. This is accomplished by clustering the values and dividing continuous values into disjoint ranges. The rule extraction algorithm, RX, shown in Algorithm 4.9, is derived from [LSL95].

ALGORITHM 4.9
Input:
  D    // training data
  N    // initial neural network
Output:
  R    // derived rules
RX algorithm:   // rule extraction algorithm to extract rules from an NN
  cluster output node activation values;
  cluster hidden node activation values;
  generate rules that describe the output values in terms of the hidden activation values;
  generate rules that describe the hidden output values in terms of the inputs;
  combine the two sets of rules.

4.6.3 Generating Rules Without a DT or NN

These techniques are sometimes called covering algorithms because they attempt to generate rules that exactly cover a specific class [WF00]. Tree algorithms work in a top-down, divide-and-conquer approach, but this need not be the case for covering algorithms. They generate the best rule possible by optimizing the desired classification probability. Usually the "best" attribute-value pair is chosen, as opposed to the best attribute with the tree-based algorithms. Suppose that we wished to generate a rule to classify persons as tall. The basic format for the rule is then

If ? then class = tall

The objective for the covering algorithms is to replace the "?" in this statement with predicates that can be used to obtain the "best" probability of being tall.
One simple approach is called 1R because it generates a simple set of rules that are equivalent to a DT with only one level. The basic idea is to choose the best attribute to perform the classification based on the training data. "Best" is defined here by counting the number of errors. In Table 4.4 this approach is illustrated using the height example, Output1. If we use only the gender attribute, there are a total of 6/15 errors, whereas if we use the height attribute, there are only 1/15. Thus, height would be chosen and the six rules stated in the table would be used. As with ID3, 1R tends to choose attributes with a large number of values, leading to overfitting. 1R can handle missing data by adding an additional attribute value for missing. Algorithm 4.10, which is adapted from [WF00], shows the outline for this algorithm.

TABLE 4.4: 1R Classification

Option  Attribute  Rules                    Errors  Total Errors
1       Gender     F → Medium               3/9     6/15
                   M → Tall                 3/6
2       Height     (0, 1.6] → Short         0/2     1/15
                   (1.6, 1.7] → Short       0/2
                   (1.7, 1.8] → Medium      0/3
                   (1.8, 1.9] → Medium      0/4
                   (1.9, 2.0] → Medium      1/2
                   (2.0, ∞) → Tall          0/2

ALGORITHM 4.10
Input:
  D    // training data
  R    // attributes to consider for rules
  C    // classes
Output:
  R    // rules
1R algorithm:   // 1R generates rules based on one attribute
  R = ∅;
  for each A ∈ R do
    R_A = ∅;
    for each possible value, v, of A do
      // v may be a range rather than a specific value
      for each C_j ∈ C do
        find count(C_j);   // the number of occurrences of this class for this attribute value
      let C_p be the class with the largest count;
      R_A = R_A ∪ {(A = v) → (class = C_p)};
    ERR_A = number of tuples incorrectly classified by R_A;
  R = R_A where ERR_A is minimum;

EXAMPLE 4.12

Using the data in Table 4.1 and the Output1 classification, the following shows the basic probability of putting a tuple in the tall class based on the given attribute-value pair:

Gender = F              0/9
Gender = M              3/6
Height ≤ 1.6            0/2
1.6 < Height ≤ 1.7      0/2
1.7 < Height ≤ 1.8      0/3
1.8 < Height ≤ 1.9      0/4
1.9 < Height ≤ 2.0      1/2
2.0 < Height            2/2

Based on this analysis, we would generate the rule

If 2.0 < height, then class = tall

Since all tuples that satisfy this predicate are tall, we do not add any additional predicates to this rule. We now need to generate additional rules for the tall class. We thus look at the remaining 13 tuples in the training set and recalculate the accuracy of the corresponding predicates:

Gender = F              0/9
Gender = M              1/4
Height ≤ 1.6            0/2
1.6 < Height ≤ 1.7      0/2
1.7 < Height ≤ 1.8      0/3
1.8 < Height ≤ 1.9      0/4
1.9 < Height ≤ 2.0      1/2

Based on this analysis, we see that the last height range is the most accurate and thus generate the rule

If 1.9 < height and height ≤ 2.0, then class = tall

However, only one of the tuples that satisfies this rule is actually tall, so we need to add another predicate to it. We then look only at the other predicates affecting these two tuples. We now see a problem in that both of these are males. The problem is actually caused by our "arbitrary" range divisions. We now divide the range into two subranges:

1.9 < Height ≤ 1.95     0/1
1.95 < Height ≤ 2.0     1/1

We thus add this second predicate to the rule to obtain

If 1.9 < height and height ≤ 2.0 and 1.95 < height, then class = tall

or, more simply,

If 1.95 < height, then class = tall

This problem does not exist if we look at tuples individually using the attribute-value pairs. However, in that case we would not generate the needed ranges for classifying the actual data. At this point, we have classified all tall tuples. The algorithm would then proceed by classifying the short and medium classes. This is left as an exercise.
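The 1R idea in Algorithm 4.10 can be sketched in a few lines of Python: for each attribute, map every attribute value to its majority class and count the misclassified tuples, then keep the attribute with the fewest errors. The training tuples and discretized ranges below are invented for illustration.

from collections import Counter, defaultdict

def one_r(data, attributes):
    """data: list of (attribute-value dict, class). Returns (best attribute, rules for its values)."""
    best = None
    for a in attributes:
        by_value = defaultdict(list)
        for attrs, cls in data:
            by_value[attrs[a]].append(cls)
        rules, errors = {}, 0
        for v, classes in by_value.items():
            majority, count = Counter(classes).most_common(1)[0]
            rules[v] = majority                # rule (a = v) -> majority class
            errors += len(classes) - count     # tuples of other classes are counted as errors
        if best is None or errors < best[0]:
            best = (errors, a, rules)
    return best[1], best[2]

# hypothetical tuples with already-discretized height ranges
data = [({"Gender": "F", "Height": "(1.6,1.7]"}, "Short"),
        ({"Gender": "M", "Height": "(1.8,1.9]"}, "Medium"),
        ({"Gender": "M", "Height": "(2.0,inf)"}, "Tall"),
        ({"Gender": "F", "Height": "(1.8,1.9]"}, "Medium")]
print(one_r(data, ["Gender", "Height"]))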
The PRISM algorithm, shown in Algorithm 4.11, takes this covering approach: for each class it repeatedly adds the attribute-value pair that best improves the rule's accuracy until the rule covers only tuples of that class.

ALGORITHM 4.11
Input:
  D    // training data
  C    // classes
Output:
  R    // rules
PRISM algorithm:   // PRISM generates rules based on best attribute-value pairs
  R = ∅;
  for each C_i ∈ C do
    repeat
      T = D;   // all instances of class C_i will be systematically removed from T
      p = true;   // create a new rule with an empty left-hand side
      r = (if p then C_i);
      repeat
        for each attribute A, value v pair found in T do
          calculate the fraction of the tuples in T that satisfy A = v and are in class C_i;
        find the A = v pair that maximizes this value;
        p = p ∧ (A = v);
        T = {tuples in T that satisfy A = v};
      until all tuples in T belong to C_i;
      D = D − T;
      R = R ∪ {r};
    until there are no tuples in D that belong to C_i;

4.7 COMBINING TECHNIQUES

Given a classification problem, no one classification technique always yields the best results. Therefore, there have been some proposals that look at combining techniques:

• A synthesis of approaches takes multiple techniques and blends them into a new approach. An example of this would be using a prediction technique, such as linear regression, to predict a future value for an attribute that is then used as input to a classification NN. In this way the NN is used to predict a future classification value.

• Multiple independent approaches can be applied to a classification problem, each yielding its own class prediction. The results of these individual techniques can then be combined in some manner. This approach has been referred to as combination of multiple classifiers (CMC).

One approach to combining independent classifiers assumes that there are n independent classifiers and that each generates the posterior probability P_k(C_i | t_i) for each class. The values are combined with a weighted linear combination

Σ_{k=1}^{n} w_k P_k(C_i | t_i)    (4.56)

Here the weights, w_k, can be assigned by a user or learned based on the past accuracy of each classifier. Another technique is to choose the classifier that has the best accuracy in a database sample. This is referred to as dynamic classifier selection (DCS). Example 4.13, which is modified from [LJ98], illustrates the use of DCS. Another variation is simple voting: assign the tuple to the class to which a majority of the classifiers have assigned it. This may have to be modified slightly in case there are many classes and no majority is found.

EXAMPLE 4.13

Two classifiers exist to classify tuples into two classes. A target tuple, X, needs to be classified. Using a nearest neighbor approach, the 10 tuples closest to X are identified. Figure 4.19 shows the 10 tuples closest to X. In Figure 4.19(a) the results for the first classifier are shown, while in Figure 4.19(b) those for the second classifier are shown. The tuples designated with triangles should be in class 1, while those shown as squares should be in class 2. Any shapes that are darkened indicate an incorrect classification by that classifier. To combine the classifiers using DCS, look at the general accuracy of each classifier. With classifier 1, 7 tuples in the neighborhood of X are correctly classified, while with the second classifier, only 6 are correctly classified. Thus, X will be classified according to how it is classified with the first classifier.

FIGURE 4.19: Combination of multiple classifiers. (a) Classifier 1; (b) Classifier 2. Triangles are tuples in class 1 and squares are tuples in class 2; darkened shapes are incorrectly classified.
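A minimal sketch of the two combination schemes just described follows: the weighted linear combination of posterior probabilities in Equation (4.56) and dynamic classifier selection based on accuracy in the neighborhood of X. The classifier outputs and weights are invented; the neighborhood counts 7 and 6 echo Example 4.13.

def weighted_combination(posteriors, weights):
    """Combine P_k(C_i | t) from n classifiers: for each class i, compute sum_k w_k * P_k(C_i | t)."""
    return [sum(w * p[i] for w, p in zip(weights, posteriors))
            for i in range(len(posteriors[0]))]

def dynamic_selection(correct_counts):
    """DCS: choose the classifier that classifies the most tuples near X correctly."""
    return max(range(len(correct_counts)), key=lambda k: correct_counts[k])

print(weighted_combination([[0.8, 0.2], [0.4, 0.6]], weights=[0.5, 0.5]))   # two classifiers, two classes
print(dynamic_selection([7, 6]))   # index 0 is the first classifier, the one chosen in Example 4.13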
4.8 REVIEW QUESTIONS

1. What are the issues in classification? Explain with an example.
2. Give an example for classification using division.
3. Give an example for classification using prediction.
4. Explain with an example Bayesian classification.
5. What do you mean by distance-based algorithms?
6. Explain decision tree-based algorithms.
7. What do you mean by entropy?
8. What do you mean by CART?
9. Explain scalable DT techniques.
10. Explain NN-based (neural network) algorithms.
11. What do you mean by rule-based algorithms?
12. Explain combining techniques.
13. How do you generate rules from a neural net?

CHAPTER 5 Clustering
5.1 INTRODUCTION
5.2 SIMILARITY AND DISTANCE MEASURES
5.3 OUTLIERS
5.4 HIERARCHICAL ALGORITHMS
5.5 PARTITIONAL ALGORITHMS
5.6 CLUSTERING LARGE DATABASES
5.7 CLUSTERING WITH CATEGORICAL ATTRIBUTES
5.8 COMPARISON
5.9 REVIEW QUESTIONS

5.1 INTRODUCTION

Clustering is similar to classification in that data are grouped. However, unlike classification, the groups are not predefined. Instead, the grouping is accomplished by finding similarities between data according to characteristics found in the actual data. The groups are called clusters. Some authors view clustering as a special type of classification. In this text, however, we follow a more conventional view in that the two are different. Many definitions for clusters have been proposed:

• Set of like elements. Elements from different clusters are not alike.
• The distance between points in a cluster is less than the distance between a point in the cluster and any point outside it.

A term similar to clustering is database segmentation, where like tuples (records) in a database are grouped together. This is done to partition or segment the database into components that then give the user a more general view of the data. In this text, we do not differentiate between segmentation and clustering. A simple example of clustering is found in Example 5.1. This example illustrates the fact that determining how to do the clustering is not straightforward.