CHAPTER 4

Classification

4.1 INTRODUCTION
4.2 STATISTICAL-BASED ALGORITHMS
4.3 DISTANCE-BASED ALGORITHMS
4.4 DECISION TREE-BASED ALGORITHMS
4.5 NEURAL NETWORK-BASED ALGORITHMS
4.6 RULE-BASED ALGORITHMS
4.7 COMBINING TECHNIQUES
4.8 REVIEW QUESTIONS

4.1 INTRODUCTION
Classification is perhaps the most familiar and most popular data mining
technique. Examples of classification applications include image and pattern recognition, medical diagnosis, loan approval, detecting faults in industry applications, and classifying financial market trends. Estimation and
prediction may be viewed as types of classification. When someone estimates your age or guesses the number of marbles in a jar, these are actually classification problems. Prediction can be thought of as classifying an attribute value into one of a set of possible classes. It is often viewed as forecasting a continuous value, while classification forecasts a discrete value. Example 1.1 in Chapter 1 illustrates the use of classification for credit card purchases. The use of decision trees and neural networks (NNs) to classify people according to their height was illustrated in Chapter 3. Before the use of current data mining techniques, classification was frequently performed by simply applying knowledge of the data. This is illustrated in Example 4.1.
As discussed in [KLR+98], there are three basic methods used to solve the classification problem:

• Specifying boundaries. Here classification is performed by dividing the input space of potential database tuples into regions where each region is associated with one class.

• Using probability distributions. For any given class, C_j, P(t_i | C_j) is the PDF for the class evaluated at one point, t_i.¹ If a probability of occurrence for each class, P(C_j), is known (perhaps determined by a domain expert), then P(C_j) P(t_i | C_j) is used to estimate the probability that t_i is in class C_j.

• Using posterior probabilities. Given a data value t_i, we would like to determine the probability that t_i is in a class C_j. This is denoted by P(C_j | t_i) and is called the posterior probability. One classification approach would be to determine the posterior probability for each class and then assign t_i to the class with the highest probability.

EXAMPLE 4.1

Teachers classify students as A, B, C, D, or F based on their marks. By using simple boundaries (60, 70, 80, 90), the following classification is possible:

90 ≤ mark            A
80 ≤ mark < 90       B
70 ≤ mark < 80       C
60 ≤ mark < 70       D
mark < 60            F

All approaches to performing classification assume some knowledge of the data. Often a training set is used to develop the specific parameters required by the technique. Training data consist of sample input data as well as the classification assignment for the data. Domain experts may also be used to assist in the process.

The naive divisions used in Example 4.1 as well as decision tree techniques are examples of the first modeling approach. Neural networks fall into the third category.
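The boundary-based approach of Example 4.1 amounts to a chain of range comparisons. The following minimal Python sketch (the function name and the test marks are illustrative additions, not from the text) makes that explicit.

def classify_mark(mark):
    # Assign a letter grade using the fixed boundaries 60, 70, 80, 90 from Example 4.1.
    if mark >= 90:
        return "A"
    elif mark >= 80:
        return "B"
    elif mark >= 70:
        return "C"
    elif mark >= 60:
        return "D"
    else:
        return "F"

print([classify_mark(m) for m in (95, 83, 71, 65, 42)])   # ['A', 'B', 'C', 'D', 'F']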
The classification problem is stated as shown in Definition 4.1:

DEFINITION 4.1. Given a database D = {t_1, t_2, ..., t_n} of tuples (items, records) and a set of classes C = {C_1, ..., C_m}, the classification problem is to define a mapping f : D → C where each t_i is assigned to one class. A class, C_j, contains precisely those tuples mapped to it; that is, C_j = {t_i | f(t_i) = C_j, 1 ≤ i ≤ n, and t_i ∈ D}.

Our definition views classification as a mapping from the database to the set of classes. Note that the classes are predefined, are nonoverlapping, and partition the entire database. Each tuple in the database is assigned to exactly one class. The classes that exist for a classification problem are indeed equivalence classes. In actuality, the problem usually is implemented in two phases:

1. Create a specific model by evaluating the training data. This step has as input the training data (including defined classification for each tuple) and as output a definition of the model developed. The model created classifies the training data as accurately as possible.

2. Apply the model developed in step 1 by classifying tuples from the target database.

Although the second step actually does the classification (according to the definition in Definition 4.1), most research has been applied to step 1. Step 2 is often straightforward.
FIGURE 4.1: Classification problem. (a) Definition of classes (Class A, Class B, Class C); (b) sample database to classify; (c) database classified.
Suppose we are given that a database consists of tuples of the form t = ⟨x, y⟩ where 0 ≤ x ≤ 8 and 0 ≤ y ≤ 10. Figure 4.1 illustrates the classification problem. Figure 4.1(a) shows the predefined classes by dividing the reference space, Figure 4.1(b) provides sample input data, and Figure 4.1(c) shows the classification of the data based on the defined classes.

A major issue associated with classification is that of overfitting. If the classification strategy fits the training data exactly, it may not be applicable to a broader population of data. For example, suppose that the training data has erroneous or noisy data. Certainly, in this case, fitting the data exactly is not desired.

¹In this discussion each tuple in the database is assumed to consist of a single value rather than a set of values.

In the following sections, various approaches to performing classification are examined. Table 4.1 contains data to be used throughout this chapter to illustrate the various techniques. This example assumes that the problem is to classify adults as short, medium, or tall. Table 4.1 lists height in meters. The last two columns of this table show two classifications that could be made, labeled Output1 and Output2, respectively. The Output1 classification uses the simple divisions shown below:

2 m < Height                 Tall
1.7 m < Height ≤ 2 m         Medium
Height ≤ 1.7 m               Short

The Output2 results require a much more complicated set of divisions using both height and gender attributes.

In this chapter we examine classification algorithms based on the categorization as seen in Figure 4.2. Statistical algorithms are based directly on the use of statistical information. Distance-based algorithms use similarity or distance measures to perform the classification. Decision tree and NN approaches use these structures to perform the classification. Rule-based classification algorithms generate if-then rules to perform the classification.
TABLE 4.1: Data for Height Classification

Name        Gender   Height    Output1   Output2
Kristina    F        1.6 m     Short     Medium
Jim         M        2 m       Tall      Medium
Maggie      F        1.9 m     Medium    Tall
Martha      F        1.88 m    Medium    Tall
Stephanie   F        1.7 m     Short     Medium
Bob         M        1.85 m    Medium    Medium
Kathy       F        1.6 m     Short     Medium
Dave        M        1.7 m     Short     Medium
Worth       M        2.2 m     Tall      Tall
Steven      M        2.1 m     Tall      Tall
Debbie      F        1.8 m     Medium    Medium
Todd        M        1.95 m    Medium    Medium
Kim         F        1.9 m     Medium    Tall
Amy         F        1.8 m     Medium    Medium
Wynette     F        1.75 m    Medium    Medium
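The examples in the rest of this chapter all work from Table 4.1, so it is convenient to have it in machine-readable form. The following Python encoding (the field order and the variable name are illustrative choices, not from the text) reproduces the table.

# (name, gender, height in meters, Output1, Output2) rows of Table 4.1
HEIGHT_DATA = [
    ("Kristina",  "F", 1.60, "Short",  "Medium"),
    ("Jim",       "M", 2.00, "Tall",   "Medium"),
    ("Maggie",    "F", 1.90, "Medium", "Tall"),
    ("Martha",    "F", 1.88, "Medium", "Tall"),
    ("Stephanie", "F", 1.70, "Short",  "Medium"),
    ("Bob",       "M", 1.85, "Medium", "Medium"),
    ("Kathy",     "F", 1.60, "Short",  "Medium"),
    ("Dave",      "M", 1.70, "Short",  "Medium"),
    ("Worth",     "M", 2.20, "Tall",   "Tall"),
    ("Steven",    "M", 2.10, "Tall",   "Tall"),
    ("Debbie",    "F", 1.80, "Medium", "Medium"),
    ("Todd",      "M", 1.95, "Medium", "Medium"),
    ("Kim",       "F", 1.90, "Medium", "Tall"),
    ("Amy",       "F", 1.80, "Medium", "Medium"),
    ("Wynette",   "F", 1.75, "Medium", "Medium"),
]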
FIGURE 4.2: Classification algorithm categorization (Statistical, Distance, DT, NN, Rules).

4.1.1 Issues in Classification
Missing Data. Missing data values cause problems during both the training phase and the classification process itself. Missing values in the training data must be handled and may produce an inaccurate result. Missing data in a tuple to be classified must be able to be handled by the resulting classification scheme. There are many approaches to handling missing data:

• Ignore the missing data.

• Assume a value for the missing data. This may be determined by using some method to predict what the value could be.

• Assume a special value for the missing data. This means that the value of missing data is taken to be a specific value all of its own.

Notice the similarity between missing data in the classification problem and that of nulls in traditional databases.

Measuring Performance. Table 4.1 shows two different classification results using two different classification tools. Determining which is best depends on the interpretation of the problem by users. The performance of classification algorithms is usually examined by evaluating the accuracy of the classification. However, since classification is often a fuzzy problem, the correct answer may depend on the user. Traditional algorithm evaluation approaches such as determining the space and time overhead can be used, but these approaches are usually secondary.

Classification accuracy is usually calculated by determining the percentage of tuples placed in the correct class. This ignores the fact that there also may be a cost associated with an incorrect assignment to the wrong class. This perhaps should also be determined.
An OC (operating characteristic) curve or ROC (receiver operating characteristic) curve or ROC (relative operating characteristic) curve shows the relationship between false positives and true positives. An OC curve was originally used in the communications area to examine false alarm rates. It has also been used in information retrieval to examine fallout (percentage of retrieved items that are not relevant) versus recall (percentage of relevant items that are retrieved). In the OC curve the horizontal axis has the percentage of false positives and the vertical axis has the percentage of true positives for a database sample. At the beginning of evaluating a sample, there are none of either category, while at the end there are 100 percent of each. When evaluating the results for a specific sample, the curve looks like a jagged stair-step, as seen in Figure 4.3, as each new tuple is either a false positive or a true positive. A more smoothed version of the OC curve can also be obtained.
FIGURE 4.3: Operating characteristic curve (percentage of false positives on the horizontal axis, percentage of true positives on the vertical axis).

TABLE 4.2: Confusion Matrix

                         Assignment
Actual Membership    Short    Medium    Tall
Short                  0        4         0
Medium                 0        5         3
Tall                   0        1         2
A confusion matrix illustrates the accuracy of the solution to a classification problem. Given m classes, a confusion matrix is an m × m matrix where entry c_{i,j} indicates the number of tuples from D that were assigned to class C_i but where the correct class is C_j. Obviously, the best solutions will have only zero values outside the diagonal. Table 4.2 shows a confusion matrix for the height example in Table 4.1 where the Output1 assignment is assumed to be correct and the Output2 assignment is what is actually made.
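The accuracy measure and the confusion matrix are straightforward to compute. The sketch below (function and variable names are illustrative additions) builds the matrix of Table 4.2 from the Output1 and Output2 columns of Table 4.1.

from collections import Counter

def confusion_matrix(actual, assigned, classes):
    # Count how often each actual class is mapped to each assigned class.
    counts = Counter(zip(actual, assigned))
    return [[counts[(a, p)] for p in classes] for a in classes]

output1 = ["Short", "Tall", "Medium", "Medium", "Short", "Medium", "Short",
           "Short", "Tall", "Tall", "Medium", "Medium", "Medium", "Medium", "Medium"]
output2 = ["Medium", "Medium", "Tall", "Tall", "Medium", "Medium", "Medium",
           "Medium", "Tall", "Tall", "Medium", "Medium", "Tall", "Medium", "Medium"]

classes = ["Short", "Medium", "Tall"]
matrix = confusion_matrix(output1, output2, classes)
accuracy = sum(a == p for a, p in zip(output1, output2)) / len(output1)
print(matrix)     # rows: actual Short/Medium/Tall, columns: assigned Short/Medium/Tall
print(accuracy)   # fraction of tuples that Output2 places in the "correct" Output1 class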
4.2 STATISTICAL-BASED ALGORITHMS

4.2.1 Regression

Regression problems deal with estimation of an output value based on input values. When used for classification, the input values are values from the database D and the output values represent the classes. Regression can be used to solve classification problems, but it can also be used for other applications such as forecasting. Although not explicitly described in this text, regression can be performed using many different types of techniques, including NNs. In actuality, regression takes a set of data and fits the data to a formula.

Looking at Figure 3.3 in Chapter 3, we see that a simple linear regression problem can be thought of as estimating the formula for a straight line (in a two-dimensional space). This can be equated to partitioning the data into two classes. With the banking example, these would be to approve or reject a loan application. The straight line is the break-even point or the division between the two classes.

In Chapter 2, we briefly introduced linear regression using the formula

y = c_0 + c_1 x_1 + ... + c_n x_n    (4.1)

By determining the regression coefficients c_0, c_1, ..., c_n, the relationship between the output parameter, y, and the input parameters, x_1, ..., x_n, can be estimated. All high school algebra students are familiar with determining the formula for a straight line, y = mx + b, given two points in the xy plane. They are determining the regression coefficients m and b. Here the two points represent the training data.

Admittedly, Example 3.5 is an extremely simple problem. However, it illustrates how we all use the basic classification or prediction techniques frequently. Figure 4.4 illustrates the more general use of linear regression with one input value. Here there is a sample of data that we wish to model (shown by the scatter dots) using a linear model. The line generated by the linear regression technique is shown in the figure. Notice, however, that the actual data points do not fit the linear model exactly. Thus, this model is an estimate of what the actual input-output relationship is. We can use the generated linear model to predict an output value given an input value, but unlike that for Example 3.5, the prediction is an estimate rather than the actual output value. If we attempt to fit data that are not linear to a linear model, the results will be a poor model of the data, as illustrated by Figure 4.4.

There are many reasons why the linear regression model may not be used to estimate output data. One is that the data do not fit a linear model. It is possible, however, that the data generally do actually represent a linear model, but the linear model generated is poor because noise or outliers exist in the data. Noise is erroneous data. Outliers are data values that are exceptions to the usual and expected data. Example 4.2 illustrates outliers. In these cases the observable data may actually be described by the following:

y = c_0 + c_1 x_1 + ... + c_n x_n + ε    (4.2)
Here ε is a random error with a mean of 0. As with point estimation, we can estimate the accuracy of the fit of a linear regression model to the actual data using a mean squared error function.

EXAMPLE 4.2

Suppose that a graduate level abstract algebra class has 100 students. Sahana consistently outperforms the other students on exams. On the final exam, Sahana gets a grade of 99. The next highest grade is 75, with the range of grades being between 5 and 99. Sahana clearly is resented by the other students in the class because she does not perform at the same level they do. She "ruins the curve." If we were to try to fit a model to the grades, this one outlier grade would cause problems because any model that attempted to include it would then not accurately model the remaining data.

We illustrate the process using a simple linear regression formula and assuming k points in our training sample. We thus have the following k formulas:

y_i = c_0 + c_1 x_1i + e_i,  i = 1, ..., k    (4.3)

With a simple linear regression, given an observable value (x_1i, y_i), e_i is the error, and thus the squared error technique introduced in Chapter 3 can be used to indicate the error. To minimize the error, a method of least squares is used to minimize the least squared error. This approach finds coefficients c_0, c_1 so that the squared error is minimized for the set of observable values. The sum of the squares of the errors is

L = Σ_{i=1}^{k} e_i² = Σ_{i=1}^{k} (y_i − c_0 − c_1 x_1i)²    (4.4)

Taking the partial derivatives (with respect to the coefficients) and setting equal to zero, we can obtain the least squares estimates for the coefficients, c_0 and c_1.

Regression can be used to perform classification using two different approaches:

1. Division: The data are divided into regions based on class.

2. Prediction: Formulas are generated to predict the output class value.

The first case views the data as plotted in an n-dimensional space without any explicit class values shown. Through regression, the space is divided into regions, one per class. With the second approach, a value for each class is included in the graph. Using regression, the formula for a line to predict class values is generated.

FIGURE 4.4: Example of poor fit for linear regression.

Example 4.3 illustrates the division process, while Example 4.4 illustrates the prediction process using the data from Table 4.1. For simplicity, we assume the training data include only data for short and medium people and that the classification is performed using the Output1 column values. If you extend this example to all three classes, you will see that it is a nontrivial task to use linear regression for classification. It also will become obvious that the result may be quite poor.
EXAMPLE 4.3

By looking at the data in the Output1 column from Table 4.1 and the basic understanding that the class to which a person is assigned is based only on the numeric value of his or her height, in this example we apply the linear regression concept to determine how to distinguish between the short and medium classes. Figure 4.5(a) shows the points under consideration. We thus have the linear regression formula of y = c_0 + ε. This implies that we are actually going to be finding the value for c_0 that best partitions the height numeric values into those that are short and those that are medium. Looking at the data in Table 4.1, we see that only 12 of the 15 entries can be used to differentiate between short and medium persons. We thus obtain the following values for y_i in our training data: {1.6, 1.9, 1.88, 1.7, 1.85, 1.6, 1.7, 1.8, 1.95, 1.9, 1.8, 1.75}. We wish to minimize

L = Σ_{i=1}^{12} e_i² = Σ_{i=1}^{12} (y_i − c_0)²

Taking the derivative with respect to c_0 and setting equal to zero we get

∂L/∂c_0 = −2 Σ_{i=1}^{12} y_i + 12 c_0 = 0

Solving for c_0 we find that

c_0 = (Σ_{i=1}^{12} y_i) / 12 = 1.786

We thus have the division between short and medium persons as being determined by y = 1.786, as seen in Figure 4.5(b).

FIGURE 4.5: Classification using division for Example 4.3. (a) Short and medium heights with classes; (b) division at y = 1.786.

EXAMPLE 4.4

We now look at predicting the class using the short and medium data as input and looking at the Output1 classification. The data are the same as those in Example 4.3 except that we now look at the classes as indicated in the training data. Since regression assumes numeric data, we assume that the value for the short class is 0 and the value for the medium class is 1. Figure 4.6(a) shows the data for this example: {(1.6, 0), (1.9, 1), (1.88, 1), (1.7, 0), (1.85, 1), (1.6, 0), (1.7, 0), (1.8, 1), (1.95, 1), (1.9, 1), (1.8, 1), (1.75, 1)}. In this case we are using the regression formula with one variable:

y = c_0 + c_1 x_1 + ε

We thus wish to minimize

L = Σ_{i=1}^{12} e_i² = Σ_{i=1}^{12} (y_i − c_0 − c_1 x_1i)²

Taking the partial derivative with respect to c_0 and setting equal to zero we get

∂L/∂c_0 = −2 Σ_{i=1}^{12} y_i + 12 c_0 + 2 c_1 Σ_{i=1}^{12} x_1i = 0

To simplify the notation, in the rest of the example we drop the range values for the summation because all are the same. Solving for c_0, we find that

c_0 = (Σ y_i) / 12 − c_1 (Σ x_1i) / 12

Now taking the partial of L with respect to c_1, substituting the value for c_0, and setting equal to zero we obtain

∂L/∂c_1 = 2 Σ (y_i − c_0 − c_1 x_1i)(−x_1i) = 0

Solving for c_1, we finally have

c_1 = (Σ x_1i y_i − (Σ x_1i)(Σ y_i)/12) / (Σ x_1i² − (Σ x_1i)²/12)

We can now solve for c_0 and c_1. Using the data from the 12 points in the training data, we have Σ x_1i = 21.43, Σ y_i = 8, Σ (x_1i y_i) = 14.83, and Σ (x_1i²) = 38.42. Thus, we get c_1 = 3.63 and c_0 = −5.816. The prediction for the class value is thus

ŷ = −5.816 + 3.63 x_1

This line is plotted in Figure 4.6(b).

FIGURE 4.6: Classification using prediction for Example 4.4. (a) Short and medium heights; (b) prediction line ŷ = −5.816 + 3.63 x_1.
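The arithmetic of Examples 4.3 and 4.4 can be verified with a short script. The sketch below is an illustrative addition; it applies the least squares estimates derived above to the 12 short and medium heights.

heights = [1.6, 1.9, 1.88, 1.7, 1.85, 1.6, 1.7, 1.8, 1.95, 1.9, 1.8, 1.75]
labels  = [0,   1,   1,    0,   1,    0,   0,   1,   1,    1,   1,   1]   # 0 = short, 1 = medium
n = len(heights)

# Example 4.3 (division): with y = c0 + e, the least squares estimate is the mean height.
c0_division = sum(heights) / n
print(round(c0_division, 3))          # about 1.786; heights below this are "short"

# Example 4.4 (prediction): fit y = c0 + c1*x by least squares.
sx  = sum(heights)
sy  = sum(labels)
sxy = sum(x * y for x, y in zip(heights, labels))
sxx = sum(x * x for x in heights)
c1 = (sxy - sx * sy / n) / (sxx - sx * sx / n)
c0 = sy / n - c1 * sx / n
print(round(c1, 2), round(c0, 2))     # roughly 3.6 and -5.8
threshold = (0.5 - c0) / c1           # height where the predicted class value is 0.5
print(round(threshold, 2))            # roughly 1.74, the short/medium division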
In Example 4.4 a line that predicts the class value is generated. This was done for two classes, but it also could have been done for all three classes. Unlike the division approach, where class membership is obvious based on the region within which a point occurs, with prediction the class to which a point belongs is less obvious. Here we predict a class value. In Figure 4.6(b) the class value is predicted based on the height value alone. Since the prediction line is continuous, the class membership is not always obvious. For example, if the prediction for a value is 0.4, what would its class be? We can determine the class by splitting the line. So a height is in the short class if its prediction value is less than 0.5 and it is in the medium class if its value is greater than 0.5. In Example 4.4 the value of x_1 where ŷ = 0.5 is 1.74. Thus, this is really the division between the short class and the medium class.

If the predictors in the linear regression function are modified by some function (square, square root, etc.), then the model looks like

y = c_0 + f_1(x_1) + ... + f_n(x_n)    (4.5)

where f_i is the function being used to transform the predictor. In this case the regression is called nonlinear regression. Linear regression techniques, while easy to understand, are not applicable to most complex data mining applications. They do not work well with nonnumeric data. They also make the assumption that the relationship between the input value and the output value is linear, which of course may not be the case.

Linear regression is not always appropriate because the data may not fit a straight line, but also because the straight line values can be greater than 1 and less than 0. Thus, they certainly cannot be used as the probability of occurrence of the target class. Another commonly used regression technique is called logistic regression. Instead of fitting the data to a straight line, logistic regression uses a logistic curve such as is illustrated in Figure 4.7. The formula for a univariate logistic curve is

p = e^(c_0 + c_1 x_1) / (1 + e^(c_0 + c_1 x_1))    (4.6)

FIGURE 4.7: Logistic curve.

The logistic curve gives a value between 0 and 1, so it can be interpreted as the probability of class membership. As with linear regression, it can be used when classification into two classes is desired. To perform the regression, the logarithmic function can be applied to obtain the logistic function

log_e (p / (1 − p)) = c_0 + c_1 x_1    (4.7)

Here p is the probability of being in the class and 1 − p is the probability that it is not. However, the logistic regression chooses values for c_0 and c_1 that maximize the probability of observing the given values.
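To see how Equation (4.6) turns a height into a two-class probability, consider the following sketch. The coefficient values are invented purely for illustration (they are not fitted to the data in the text); only the functional form of the logistic curve comes from the chapter.

import math

def logistic(x, c0, c1):
    # Univariate logistic curve: p = e^(c0 + c1*x) / (1 + e^(c0 + c1*x)).
    z = c0 + c1 * x
    return math.exp(z) / (1.0 + math.exp(z))

# Hypothetical coefficients chosen only to illustrate the 0/1 interpretation.
c0, c1 = -30.0, 17.0
for height in (1.6, 1.7, 1.75, 1.8, 1.9):
    p = logistic(height, c0, c1)
    label = "medium" if p >= 0.5 else "short"
    print(height, round(p, 3), label)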
4.2.2 Bayesian Classification

Assuming that the contributions of all attributes are independent and that each contributes equally to the classification problem, a simple classification scheme called naive Bayes classification has been proposed that is based on Bayes rule of conditional probability as stated in Definition 3.1. This approach was briefly outlined in Chapter 3. By analyzing the contribution of each "independent" attribute, a conditional probability is determined. A classification is made by combining the impact that the different attributes have on the prediction to be made. The approach is called "naive" because it assumes the independence between the various attribute values. Given a data value x_i, the probability that a related tuple, t_i, is in class C_j is described by P(C_j | x_i). Training data can be used to determine P(x_i), P(x_i | C_j), and P(C_j). From these values, Bayes theorem allows us to estimate the posterior probability P(C_j | x_i) and then P(C_j | t_i).

Given a training set, the naive Bayes algorithm first estimates the prior probability P(C_j) for each class by counting how often each class occurs in the training data. For each attribute, x_i, the number of occurrences of each attribute value x_i can be counted to determine P(x_i). Similarly, the probability P(x_i | C_j) can be estimated by counting how often each value occurs in the class in the training data. Note that we are looking at attribute values here. A tuple in the training data may have many different attributes, each with many values. This must be done for all attributes and all values of attributes. We then use these derived probabilities when a new tuple must be classified. This is why naive Bayes classification can be viewed as both a descriptive and a predictive type of algorithm. The probabilities are descriptive and are then used to predict the class membership for a target tuple.

When classifying a target tuple, the conditional and prior probabilities generated from the training set are used to make the prediction. This is done by combining the effects of the different attribute values from the tuple. Suppose that tuple t_i has p independent attribute values {x_i1, x_i2, ..., x_ip}. From the descriptive phase, we know P(x_ik | C_j), for each class C_j and attribute x_ik. We then estimate P(t_i | C_j) by

P(t_i | C_j) = Π_{k=1}^{p} P(x_ik | C_j)    (4.8)
At this point in the algorithm, we then have the needed prior probabilities P(C_j) for each class and the conditional probability P(t_i | C_j). To calculate P(t_i), we can estimate the likelihood that t_i is in each class. This can be done by finding the likelihood that this tuple is in each class and then adding all these values. The probability that t_i is in a class is the product of the conditional probabilities for each attribute value. The posterior probability P(C_j | t_i) is then found for each class. The class with the highest probability is the one chosen for the tuple. Example 4.5 illustrates the use of naive Bayes classification.

EXAMPLE 4.5

Using the Output1 classification results for Table 4.1, there are four tuples classified as short, eight as medium, and three as tall. To facilitate classification, we divide the height attribute into six ranges:

(0, 1.6], (1.6, 1.7], (1.7, 1.8], (1.8, 1.9], (1.9, 2.0], (2.0, ∞)

Table 4.3 shows the counts and subsequent probabilities associated with the attributes. With these training data, we estimate the prior probabilities:

P(short) = 4/15 = 0.267, P(medium) = 8/15 = 0.533, and P(tall) = 3/15 = 0.2

TABLE 4.3: Probabilities Associated with Attributes

                               Count                     Probabilities
Attribute   Value         Short  Medium  Tall       Short  Medium  Tall
Gender      M               1      2      3          1/4    2/8    3/3
            F               3      6      0          3/4    6/8    0/3
Height      (0, 1.6]        2      0      0          2/4    0      0
            (1.6, 1.7]      2      0      0          2/4    0      0
            (1.7, 1.8]      0      3      0          0      3/8    0
            (1.8, 1.9]      0      4      0          0      4/8    0
            (1.9, 2]        0      1      1          0      1/8    1/3
            (2, ∞)          0      0      2          0      0      2/3

We use these values to classify a new tuple. For example, suppose we wish to classify t = ⟨Adam, M, 1.95 m⟩. By using these values and the associated probabilities of gender and height, we obtain the following estimates:

P(t | short)  = 1/4 × 0   = 0
P(t | medium) = 2/8 × 1/8 = 0.031
P(t | tall)   = 3/3 × 1/3 = 0.333

Combining these, we get

Likelihood of being short  = 0 × 0.267     = 0         (4.9)
Likelihood of being medium = 0.031 × 0.533 = 0.0166    (4.10)
Likelihood of being tall   = 0.33 × 0.2    = 0.066     (4.11)

We estimate P(t) by summing up these individual likelihood values since t will be either short or medium or tall:

P(t) = 0 + 0.0166 + 0.066 = 0.0826    (4.12)

Finally, we obtain the actual probabilities of each event:

P(short | t)  = (0 × 0.267) / 0.0826     = 0       (4.13)
P(medium | t) = (0.031 × 0.533) / 0.0826 = 0.2     (4.14)
P(tall | t)   = (0.333 × 0.2) / 0.0826   = 0.799   (4.15)

Therefore, based on these probabilities, we classify the new tuple as tall because it has the highest probability.
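The counting and the posterior calculation of Example 4.5 can be reproduced with a small program. The sketch below uses illustrative names of its own and the same gender and height-range groupings as Table 4.3.

from collections import Counter, defaultdict

# (gender, height, Output1 class) from Table 4.1
train = [("F",1.6,"Short"),("M",2.0,"Tall"),("F",1.9,"Medium"),("F",1.88,"Medium"),
         ("F",1.7,"Short"),("M",1.85,"Medium"),("F",1.6,"Short"),("M",1.7,"Short"),
         ("M",2.2,"Tall"),("M",2.1,"Tall"),("F",1.8,"Medium"),("M",1.95,"Medium"),
         ("F",1.9,"Medium"),("F",1.8,"Medium"),("F",1.75,"Medium")]

def height_range(h):
    # Discretize height into the six ranges used in Example 4.5.
    bounds = [1.6, 1.7, 1.8, 1.9, 2.0]
    for i, b in enumerate(bounds):
        if h <= b:
            return i
    return len(bounds)

classes = ["Short", "Medium", "Tall"]
prior = Counter(c for _, _, c in train)
gender_counts = defaultdict(Counter)   # gender_counts[class][gender]
height_counts = defaultdict(Counter)   # height_counts[class][range index]
for g, h, c in train:
    gender_counts[c][g] += 1
    height_counts[c][height_range(h)] += 1

def classify(gender, height):
    likelihoods = {}
    for c in classes:
        p_gender = gender_counts[c][gender] / prior[c]
        p_height = height_counts[c][height_range(height)] / prior[c]
        likelihoods[c] = (prior[c] / len(train)) * p_gender * p_height
    total = sum(likelihoods.values())
    return {c: (v / total if total else 0.0) for c, v in likelihoods.items()}

print(classify("M", 1.95))   # highest posterior is Tall, matching Example 4.5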
The naive Bayes approach has several advantages. First, it is easy to use. Second, unlike other classification approaches, only one scan of the training data is required. The naive Bayes approach can easily handle missing values by simply omitting that probability when calculating the likelihoods of membership in each class. In cases where there are simple relationships, the technique often does yield good results.

Although the naive Bayes approach is straightforward to use, it does not always yield satisfactory results. First, the attributes usually are not independent. We could use a subset of the attributes by ignoring any that are dependent on others. The technique does not handle continuous data. Dividing the continuous values into ranges could be used to solve this problem, but the division of the domain into ranges is not an easy task, and how this is done can certainly impact the results.
4.3 DISTANCE-BASED ALGORITHMS

Each item that is mapped to the same class may be thought of as more similar to the other items in that class than it is to the items found in other classes. Therefore, similarity (or distance) measures may be used to identify the "alikeness" of different items in the database.

ALGORITHM 4.1
Input:
  c_1, ..., c_m      // Centers for each class
  t                  // Input tuple to classify
Output:
  c                  // Class to which t is assigned
Simple distance-based algorithm:
  dist = ∞;
  for i := 1 to m do
    if dis(c_i, t) < dist, then
      c = i;
      dist = dis(c_i, t);
Using a similarity measure for classification where the classes are predefined is somewhat simpler than using a similarity measure for clustering where the classes are not known in advance. Again, think of the IR example. Each IR query provides the class definition in the form of the IR query itself. So the classification problem then becomes one of determining similarity not among all tuples in the database but between each tuple and the query. This makes the problem an O(n) problem rather than an O(n²) problem.

Figure 4.8 illustrates the use of this approach to perform classification using the data found in Figure 4.1. The three large dark circles are the class representatives for the three classes. The dashed lines show the distance from each item to the closest center.

4.3.1 Simple Approach

Using the IR approach, if we have a representative of each class, we can perform classification by assigning each tuple to the class to which it is most similar. We assume here that each tuple, t_i, in the database is defined as a vector (t_i1, t_i2, ..., t_ik) of numeric values. Likewise, we assume that each class C_j is defined by a tuple (C_j1, C_j2, ..., C_jk) of numeric values. The classification problem is then restated in Definition 4.2.

DEFINITION 4.2. Given a database D = {t_1, t_2, ..., t_n} of tuples where each tuple t_i = (t_i1, t_i2, ..., t_ik) contains numeric values and a set of classes C = {C_1, ..., C_m} where each class C_j = (C_j1, C_j2, ..., C_jk) has numeric values, the classification problem is to assign each t_i to the class C_j such that sim(t_i, C_j) ≥ sim(t_i, C_l) for all C_l ∈ C, where C_l ≠ C_j.

To calculate these similarity measures, the representative vector for each class must be determined. Referring to the three classes in Figure 4.1(a), we can determine a representative for each class by calculating the center of each region. Thus class A is represented by (4, 7.5), class B by (2, 2.8), and class C by (6, 2.5). A simple classification technique, then, would be to place each item in the class where it is most similar (closest) to the center of that class. The representative for the class may be found in other ways. For example, in pattern recognition problems, a predefined pattern can be used to represent each class. Once a similarity measure is defined, each item to be classified will be compared to each predefined pattern. The item will be placed in the class with the largest similarity value. Algorithm 4.1 illustrates a straightforward distance-based approach assuming that each class, C_j, is represented by its center or centroid. In the algorithm we use c_i to be the center for its class. Since each tuple must be compared to the center for a class and there are a fixed (usually small) number of classes, the complexity to classify one tuple is O(n).
FIGURE 4.8: Classification using simple distance algorithm.
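Algorithm 4.1 translates almost directly into Python. The sketch below (names of our own choosing) uses Euclidean distance and the class centers quoted above for Figure 4.1(a).

import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def classify_by_center(t, centers):
    # Assign tuple t to the class whose center (centroid) is closest.
    best_class, best_dist = None, float("inf")
    for label, center in centers.items():
        d = euclidean(t, center)
        if d < best_dist:
            best_class, best_dist = label, d
    return best_class

# Class representatives quoted in the text for Figure 4.1(a).
centers = {"A": (4, 7.5), "B": (2, 2.8), "C": (6, 2.5)}
print(classify_by_center((3.5, 7.0), centers))   # closest to the center of class A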
4.3.2 K Nearest Neighbors

One common classification scheme based on the use of distance measures is that of the K nearest neighbors (KNN). The KNN technique assumes that the entire training set includes not only the data in the set but also the desired classification for each item. In effect, the training data become the model. When a classification is to be made for a new item, its distance to each item in the training set must be determined. Only the K closest entries in the training set are considered further. The new item is then placed in the class that contains the most items from this set of K closest items. Figure 4.9 illustrates the process used by KNN. Here the points in the training set are shown and K = 3. The three closest items in the training set are shown; t will be placed in the class to which most of these are members.

FIGURE 4.9: Classification using KNN.

Algorithm 4.2 outlines the use of the KNN algorithm. We use T to represent the training data. Since each tuple to be classified must be compared to each element in the training data, if there are q elements in the training set, this is O(q). Given n elements to be classified, this becomes an O(nq) problem. Given that the training data are of a constant size (although perhaps quite large), this can then be viewed as an O(n) problem.

ALGORITHM 4.2
Input:
  T      // Training data
  K      // Number of neighbors
  t      // Input tuple to classify
Output:
  c      // Class to which t is assigned
KNN algorithm:
  // Algorithm to classify tuple using KNN
  N = ∅;
  // Find set of neighbors, N, for t
  for each d ∈ T do
    if |N| < K, then
      N = N ∪ {d};
    else
      if ∃ u ∈ N such that sim(t, u) < sim(t, d), then
      begin
        N = N − {u};
        N = N ∪ {d};
      end
  // Find class for classification
  c = class to which the most u ∈ N are classified;

Example 4.6 illustrates this technique using the sample data from Table 4.1. The KNN technique is extremely sensitive to the value of K. A rule of thumb is that K ≤ √(number of training items) [KLR+98]. For this example, that value is 3.46. Commercial algorithms often use a default value of 10.

EXAMPLE 4.6

Using the sample data from Table 4.1 and the Output1 classification as the training set output value, we classify the tuple ⟨Pat, F, 1.6⟩. Only the height is used for distance calculation so that both the Euclidean and Manhattan distance measures yield the same results; that is, the distance is simply the absolute value of the difference between the values. Suppose that K = 5 is given. We then have that the K nearest neighbors to the input tuple are {(Kristina, F, 1.6), (Kathy, F, 1.6), (Stephanie, F, 1.7), (Dave, M, 1.7), (Wynette, F, 1.75)}. Of these five items, four are classified as short and one as medium. Thus, the KNN will classify Pat as short.
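A compact Python version of Algorithm 4.2, specialized to the one-dimensional height distance used in Example 4.6, might look as follows; the names are illustrative and ties are broken arbitrarily.

from collections import Counter

train = [("Kristina",1.6,"Short"),("Jim",2.0,"Tall"),("Maggie",1.9,"Medium"),
         ("Martha",1.88,"Medium"),("Stephanie",1.7,"Short"),("Bob",1.85,"Medium"),
         ("Kathy",1.6,"Short"),("Dave",1.7,"Short"),("Worth",2.2,"Tall"),
         ("Steven",2.1,"Tall"),("Debbie",1.8,"Medium"),("Todd",1.95,"Medium"),
         ("Kim",1.9,"Medium"),("Amy",1.8,"Medium"),("Wynette",1.75,"Medium")]

def knn_classify(height, training, k):
    # Keep the k training tuples closest in height and vote on their classes.
    neighbors = sorted(training, key=lambda row: abs(row[1] - height))[:k]
    votes = Counter(cls for _, _, cls in neighbors)
    return votes.most_common(1)[0][0]

print(knn_classify(1.6, train, k=5))   # Pat (height 1.6 m) is classified as Short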
4.4 DECISION TREE-BASED ALGORITHMS

The decision tree approach is most useful in classification problems. With this technique, a tree is constructed to model the classification process. Once the tree is built, it is applied to each tuple in the database and results in a classification for that tuple. There are two basic steps in the technique: building the tree and applying the tree to the database. Most research has focused on how to build effective trees as the application process is straightforward.

The decision tree approach to classification is to divide the search space into rectangular regions. A tuple is classified based on the region into which it falls. A definition for a decision tree used in classification is contained in Definition 4.3. There are alternative definitions; for example, in a binary DT the nodes could be labeled with the predicates themselves and each arc would be labeled with yes or no (like in the "Twenty Questions" game).

DEFINITION 4.3. Given a database D = {t_1, ..., t_n} where t_i = (t_i1, ..., t_ih) and the database schema contains the following attributes {A_1, A_2, ..., A_h}. Also given is a set of classes C = {C_1, ..., C_m}. A decision tree (DT) or classification tree is a tree associated with D that has the following properties:

• Each internal node is labeled with an attribute, A_i.

• Each arc is labeled with a predicate that can be applied to the attribute associated with the parent.

• Each leaf node is labeled with a class, C_j.

Solving the classification problem using decision trees is a two-step process:

1. Decision tree induction: Construct a DT using training data.

2. For each t_i ∈ D, apply the DT to determine its class.

Based on our definition of the classification problem, Definition 4.1, the constructed DT represents the logic needed to perform the mapping. Thus, it implicitly defines the mapping. Using the DT shown in Figure 3.5 from Chapter 3, the classification of the sample data found in Table 4.1 is that shown in the column labeled Output2. A different DT could yield a different classification. Since the application of a given tuple to a DT is relatively straightforward, we do not consider the second part of the problem further. Instead, we focus on algorithms to construct decision trees. Several algorithms are surveyed in the following subsections.
There are many advantages to the use of DTs for classification. DTs
certainly are easy to use and efficient. Rules can be generated that are easy
to interpret and understand. They scale well for large databases because
the tree size is independent of the database size. Each tuple in the database
must be filtered through the tree. This takes time proportional to the height
of the tree, which is fixed. Trees can be constructed for data with many
attributes.
ALGORITHM 4.3
Input:
  D      // Training data
Output:
  T      // Decision tree
DTBuild algorithm:
  // Simplistic algorithm to illustrate naive approach to building DT
  T = ∅;
  Determine best splitting criterion;
  T = Create root node and label with splitting attribute;
  T = Add arc to root node for each split predicate and label;
  for each arc do
    D = Database created by applying splitting predicate to D;
    if stopping point reached for this path, then
      T' = Create leaf node and label with appropriate class;
    else
      T' = DTBuild(D);
    T = Add T' to arc;
Disadvantages also exist for DT algorithms. First, they do not easily
handle continuous data. These attribute domains must be divided into
categories to be handled. The approach used is that the domain space
is divided into rectangular regions [such as is seen in Figure 4.1(a)]. Not
all classification problems are of this type. The division shown by the
simple loan classification problem in Figure 2.4(a) in Chapter 2 cannot be handled by DTs. Handling missing data is difficult because correct branches in the tree could not be taken. Since the DT is constructed from the training data, overfitting may occur. This can be overcome via tree pruning. Finally, correlations among attributes in the database are ignored
by the DT process.
The major factor in the performance of the DT building algorithm is
the size of the training set. The following issues are faced by most DT
algorithms:
• Choosing splitting attributes: Which attributes to use as splitting attributes impacts the performance of applying the built DT. Some attributes are better than others. In the data shown in Table 4.1, the name attribute definitely should not be used and the gender may or may not be used. The choice of attribute involves not only an examination of the data in the training set but also the informed input of domain experts.

• Ordering of splitting attributes: The order in which the attributes are chosen is also important. In Figure 4.10(a) the gender attribute is chosen first. Alternatively, the height attribute could be chosen first. As seen in Figure 4.10(b), in this case the height attribute must be examined a second time, requiring unnecessary comparisons.

• Splits: Associated with the ordering of the attributes is the number of splits to take. With some attributes, the domain is small, so the number of splits is obvious based on the domain (as with the gender attribute). However, if the domain is continuous or has a large number of values, the number of splits to use is not easily determined.

• Tree structure: To improve the performance of applying the tree for classification, a balanced tree with the fewest levels is desirable. However, in this case, more complicated comparisons with multiway branching [see Figure 4.10(c)] may be needed. Some algorithms build only binary trees.

• Stopping criteria: The creation of the tree definitely stops when the training data are perfectly classified. There may be situations when stopping earlier would be desirable to prevent the creation of larger trees. This is a trade-off between accuracy of classification and performance. In addition, stopping earlier may be performed to prevent overfitting. It is even conceivable that more levels than needed would be created in a tree if it is known that there are data distributions not represented in the training data.

• Training data: The structure of the DT created depends on the training data. If the training data set is too small, then the generated tree might not be specific enough to work properly with the more general data. If the training data set is too large, then the created tree may overfit.

• Pruning: Once a tree is constructed, some modifications to the tree might be needed to improve the performance of the tree during the classification phase. The pruning phase might remove redundant comparisons or remove subtrees to achieve better performance.

In the following subsections we examine several popular DT approaches.

FIGURE 4.10: Comparing decision trees. (a) Balanced tree; (b) deep tree; (c) bushy tree.
4.4.1 ID3

The ID3 technique to building a decision tree is based on information theory and attempts to minimize the expected number of comparisons. The basic idea of the induction algorithm is to ask questions whose answers provide the most information. This is similar to the intuitive approach taken by adults when playing the "Twenty Questions" game. The first question an adult might ask could be "Is the thing alive?" while a child might ask "Is it my Daddy?" The first question divides the search space into two large search domains, while the second performs little division of the space. The basic strategy used by ID3 is to choose splitting attributes with the highest information gain first. The amount of information associated with an attribute value is related to the probability of occurrence. Looking at the "Twenty Questions" example, the child's question divides the search space into two sets. One set (Daddy) has an infinitesimal probability associated with it and the other set is almost certain, while the question the adult makes divides the search space into two subsets with almost equal probability of occurring.

The concept used to quantify information is called entropy. Entropy is used to measure the amount of uncertainty or surprise or randomness in a set of data. Certainly, when all data in a set belong to a single class, there is no uncertainty. In this case the entropy is zero. The objective of decision tree classification is to iteratively partition the given data set into subsets where all elements in each final subset belong to the same class. Figure 4.11(a, b, and c) will help to explain the concept. Figure 4.11(a) shows log(1/p) as the probability p ranges from 0 to 1. This intuitively shows the amount of surprise based on the probability. When p = 1, there is no surprise. This means that if an event has a probability of 1 and you are told that the event occurred, you would not be surprised. As p → 0, the surprise increases. When we deal with a divide and conquer approach such as that used with decision trees, the division results in multiple probabilities whose sum is 1. In the "Twenty Questions" game, P(Daddy) < P(¬Daddy) and P(Daddy) + P(¬Daddy) = 1. To measure the information associated with this division, we must be able to combine the information associated with both events. That is, we must be able to calculate the average information associated with the division. This can be performed by adding the two values together and taking into account the probability that each occurs. Figure 4.11(b) shows the function p log(1/p), which is the expected information based on the probability of an event. To determine the expected information associated with two events, we add the individual values together. This function p log(1/p) + (1 − p) log(1/(1 − p)) is plotted in Figure 4.11(c). Note that the maximum occurs when the two probabilities are equal. This supports our intuitive idea that the more sophisticated questions posed by the adult are better than those posed by the child.
FIGURE 4.11: Entropy. (a) log(1/p); (b) p log(1/p); (c) H(p, 1 − p).

The formal definition of entropy is shown in Definition 4.4. The value for entropy is between 0 and 1 and reaches a maximum when the probabilities are all the same.

DEFINITION 4.4. Given probabilities p_1, p_2, ..., p_s where Σ_{i=1}^{s} p_i = 1, entropy is defined as

H(p_1, p_2, ..., p_s) = Σ_{i=1}^{s} p_i log(1/p_i)    (4.16)

Given a database state, D, H(D) finds the amount of order (or lack thereof) in that state. When that state is split into s new states S = {D_1, D_2, ..., D_s}, we can again look at the entropy of those states. Each step in ID3 chooses the state that orders splitting the most. A database state is completely ordered if all tuples in it are in the same class. ID3 chooses the splitting attribute with the highest gain in information, where gain is defined as the difference between how much information is needed to make a correct classification before the split versus how much information is needed after the split. Certainly, the split should reduce the information needed by the largest amount. This is calculated by determining the differences between the entropies of the original dataset and the weighted sum of the entropies from each of the subdivided datasets. The entropies of the split datasets are weighted by the fraction of the dataset being placed in that division. The ID3 algorithm calculates the gain of a particular split by the following formula:

Gain(D, S) = H(D) − Σ_{i=1}^{s} P(D_i) H(D_i)    (4.17)

Example 4.7 and associated Figure 4.12 illustrate this process using the height example. In this example, six divisions of the possible ranges of heights are used. This division into ranges is needed when the domain of an attribute is continuous or (as in this case) consists of many possible values. While the choice of these divisions is somewhat arbitrary, a domain expert should be able to perform the task.

EXAMPLE 4.7

The beginning state of the training data in Table 4.1 (with the Output1 classification) is that (4/15) are short, (8/15) are medium, and (3/15) are tall. Thus, the entropy of the starting set is

4/15 log(15/4) + 8/15 log(15/8) + 3/15 log(15/3) = 0.4384

Choosing the gender as the splitting attribute, there are nine tuples that are F and six that are M. The entropy of the subset that are F is

3/9 log(9/3) + 6/9 log(9/6) = 0.2764    (4.18)

whereas that for the M subset is

1/6 log(6/1) + 2/6 log(6/2) + 3/6 log(6/3) = 0.4392    (4.19)

The ID3 algorithm must determine what the gain in information is by using this split. To do this, we calculate the weighted sum of these last two entropies to get

((9/15) 0.2764) + ((6/15) 0.4392) = 0.34152    (4.20)

The gain in entropy by using the gender attribute is thus

0.4384 − 0.34152 = 0.09688    (4.21)

Looking at the height attribute, we have two tuples that are 1.6, two are 1.7, one is 1.75, two are 1.8, one is 1.85, one is 1.88, two are 1.9, one is 1.95, one is 2, one is 2.1, and one is 2.2. Determining the split values for height is not easy. Even though the training dataset has these 11 values, we know that there will be many more. Just as with continuous data, we divide into ranges:

(0, 1.6], (1.6, 1.7], (1.7, 1.8], (1.8, 1.9], (1.9, 2.0], (2.0, ∞)

There are 2 tuples in the first division with entropy (2/2(0) + 0 + 0) = 0, 2 in (1.6, 1.7] with entropy (2/2(0) + 0 + 0) = 0, 3 in (1.7, 1.8] with entropy (0 + 3/3(0) + 0) = 0, 4 in (1.8, 1.9] with entropy (0 + 4/4(0) + 0) = 0, 2 in (1.9, 2.0] with entropy (0 + 1/2(0.301) + 1/2(0.301)) = 0.301, and two in the last with entropy (0 + 0 + 2/2(0)) = 0. All of these states are completely ordered and thus have an entropy of 0 except for the (1.9, 2.0] state. The gain in entropy by using the height attribute is thus

0.4384 − 2/15(0.301) = 0.3983    (4.22)

Thus, this has the greater gain, and we choose height over gender as the first splitting attribute. Within the (1.9, 2.0] division there are two males, one medium and one tall. This has occurred because this grouping was too large. A further subdivision on height is needed, and this generates the DT seen in Figure 4.12(a).

FIGURE 4.12: Classification problem. (a) Original tree; (b) optimized tree.

Figure 4.12(a) illustrates a problem in that the tree has multiple splits with identical results. In addition, there is a subdivision of range (1.9, 2.0]. Figure 4.12(b) shows an optimized version of the tree.
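The entropy and gain arithmetic of Example 4.7 can be checked with the short sketch below. It uses base-10 logarithms, which is what the worked values above appear to assume; the function names are illustrative additions.

import math

def entropy(counts):
    # H(p1,...,ps) = sum p_i log(1/p_i); base-10 logs match the worked example.
    total = sum(counts)
    return sum((c / total) * math.log10(total / c) for c in counts if c > 0)

def gain(parent_counts, subset_counts):
    # Gain(D, S) = H(D) - sum over subsets of P(D_i) * H(D_i).
    n = sum(parent_counts)
    weighted = sum(sum(s) / n * entropy(s) for s in subset_counts)
    return entropy(parent_counts) - weighted

start = [4, 8, 3]                        # short, medium, tall under Output1
print(round(entropy(start), 4))          # about 0.4384

# Split on gender: F -> (3 short, 6 medium, 0 tall), M -> (1, 2, 3)
print(round(gain(start, [[3, 6, 0], [1, 2, 3]]), 4))          # about 0.0969

# Split on the six height ranges used in Example 4.7
height_split = [[2, 0, 0], [2, 0, 0], [0, 3, 0], [0, 4, 0], [0, 1, 1], [0, 0, 2]]
print(round(gain(start, height_split), 4))                    # about 0.3983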
4.4.2 C4.5

The decision tree algorithm C4.5 improves ID3 in the following ways:

• Missing data: When the decision tree is built, missing data are simply ignored. That is, the gain ratio is calculated by looking only at the other records that have a value for that attribute. To classify a record with a missing attribute value, the value for that item can be predicted based on what is known about the attribute values for the other records.

• Continuous data: The basic idea is to divide the data into ranges based on the attribute values for that item that are found in the training sample.

• Pruning: There are two primary pruning strategies proposed in C4.5:

  - With subtree replacement, a subtree is replaced by a leaf node if this replacement results in an error rate close to that of the original tree. Subtree replacement works from the bottom of the tree up to the root.

  - Another pruning strategy, called subtree raising, replaces a subtree by its most used subtree. Here a subtree is raised from its current location to a node higher up in the tree. Again, we must determine the increase in error rate for this replacement.

• Rules: C4.5 allows classification via either decision trees or rules generated from them. In addition, some techniques to simplify complex rules are proposed. One approach is to replace the left-hand side of a rule by a simpler version if all records in the training set are treated identically. An "otherwise" type of rule can be used to indicate what should be done if no other rules apply.

• Splitting: The ID3 approach favors attributes with many divisions and thus may lead to overfitting. In the extreme, an attribute that has a unique value for each tuple in the training set would be the best because there would be only one tuple (and thus one class) for each division. An improvement can be made by taking into account the cardinality of each division. This approach uses the GainRatio as opposed to Gain. The GainRatio is defined as

GainRatio(D, S) = Gain(D, S) / H(|D_1|/|D|, ..., |D_s|/|D|)    (4.23)

For splitting purposes, C4.5 uses the largest GainRatio that ensures a larger than average information gain. This is to compensate for the fact that the GainRatio value is skewed toward splits where the size of one subset is close to that of the starting one. Example 4.8 shows the calculation of GainRatio for the first split in Example 4.7.

EXAMPLE 4.8

To calculate the GainRatio for the gender split, we first find the entropy associated with the split ignoring classes:

H(9/15, 6/15) = 9/15 log(15/9) + 6/15 log(15/6) = 0.292    (4.24)

This gives the GainRatio value for the gender attribute as

0.09688 / 0.292 = 0.332    (4.25)

The entropy for the split on height (ignoring classes) is

H(2/15, 2/15, 3/15, 4/15, 2/15, 2/15)    (4.26)

4.4.3 CART

Classification and regression trees (CART) is a technique that generates a binary decision tree. As with ID3, entropy is used as a measure to choose the best splitting attribute and criterion. Unlike ID3, however, where a child is created for each subcategory, only two children are created. The splitting is performed around what is determined to be the best split point. At each step, an exhaustive search is used to determine the best split, where "best" is defined by

Φ(s/t) = 2 P_L P_R Σ_{j=1}^{m} |P(C_j | t_L) − P(C_j | t_R)|    (4.27)

This formula is evaluated at the current node, t, and for each possible splitting attribute and criterion, s. Here L and R are used to indicate the left and right subtrees of the current node in the tree. P_L and P_R are the probabilities that a tuple in the training set will be on the left or right side of the tree. This is defined as |tuples in subtree| / |tuples in training set|. We assume that the right branch is taken on equality. P(C_j | t_L) or P(C_j | t_R) is the probability that a tuple is in this class, C_j, and in the left or right subtree. This is defined as |tuples of class j in subtree| / |tuples at the target node|. At each step, only one criterion is chosen as the best over all possible criteria. Example 4.9 shows its use with the height example with Output1 results.
EXAMPLE 4.9

The first step is to determine the split attribute and criterion for the first split. We again assume that there are six subranges to consider with the height attribute. Using these ranges, we have the potential split values of 1.6, 1.7, 1.8, 1.9, 2.0. We thus have a choice of six split points, which yield the following goodness measures:

Φ(Gender) = 2(6/15)(9/15)(2/15 + 4/15 + 3/15) = 0.224    (4.28)
Φ(1.6) = 0    (4.29)
Φ(1.7) = 2(2/15)(13/15)(0 + 8/15 + 3/15) = 0.169    (4.30)
Φ(1.8) = 2(5/15)(10/15)(4/15 + 6/15 + 3/15) = 0.385    (4.31)
Φ(1.9) = 2(9/15)(6/15)(4/15 + 2/15 + 3/15) = 0.256    (4.32)
Φ(2.0) = 2(12/15)(3/15)(4/15 + 8/15 + 3/15) = 0.32    (4.33)

The largest of these is the split at 1.8. The remainder of this example is left as an exercise.
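Equation (4.27) is mechanical to evaluate once the class counts on each side of a candidate split are known. The sketch below (illustrative naming) computes the goodness of the height-at-1.8 split from Example 4.9.

def cart_goodness(left_counts, right_counts):
    # Phi = 2 * P_L * P_R * sum_j |P(C_j|t_L) - P(C_j|t_R)|, with probabilities
    # taken relative to the number of tuples at the current node.
    n = sum(left_counts) + sum(right_counts)
    p_l = sum(left_counts) / n
    p_r = sum(right_counts) / n
    diff = sum(abs(l - r) / n for l, r in zip(left_counts, right_counts))
    return 2 * p_l * p_r * diff

# Split at height 1.8 (right branch taken on equality): class order short, medium, tall.
left  = [4, 1, 0]    # heights below 1.8
right = [0, 7, 3]    # heights of at least 1.8
print(round(cart_goodness(left, right), 3))   # about 0.385, the best split in Example 4.9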
Since gender is really unordered, we assurneM < F.
As illustrated with the gender attribute, CART forces that an orderi
of the attributes be used. cART handles missing data by simply ignori
that record in calculating the goodnessof a split on that attribute.
tree stops growing when no split will improve the performance. Note t
even though it is the best for the training data, it may not be the
for all possible data to be added in the future. The CART algorithm
contains a pruning strategy. which we will not discuss here but which
be fouud in IKLR+981.
1.4.4 Scalable DT Techniques
We briefly examine sonte DT techniques that address creation of DTs for
scalability datasets.
The SPRINT (Scalable PaRallelizable INduction of decision Trees) algorithm addresses the scalability issue by ensuring that the CART technique can be applied regardless of availability of main memory. In addition, it can be easily parallelized. With SPRINT, a gini index is used to find the best split. Here gini for a database D is defined as

    gini(D) = 1 - Sum_j p_j^2    (4.34)

where p_j is the frequency of class C_j in D. The goodness of a split of D into subsets D1 and D2 is defined by

    gini_split(D) = (n1/n) gini(D1) + (n2/n) gini(D2)    (4.35)

where n_i is the number of tuples in D_i and n = n1 + n2. The split with the best gini value is chosen. Unlike the earlier approaches, SPRINT does not need to sort the data by goodness value at each node during the DT induction process. With continuous data, the split point is chosen to be the midpoint of every pair of consecutive values from the training set.
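As a small illustration (not SPRINT itself), the gini computations of Equations (4.34) and (4.35) can be written directly from class counts; the sample counts in the usage comment are made up.

def gini(counts):
    """gini(D) = 1 - sum of squared class frequencies (Equation 4.34)."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(counts1, counts2):
    """Weighted gini of a binary split into D1 and D2 (Equation 4.35)."""
    n1, n2 = sum(counts1), sum(counts2)
    n = n1 + n2
    return (n1 / n) * gini(counts1) + (n2 / n) * gini(counts2)

# Hypothetical per-class counts for the two sides of a candidate split:
# print(gini_split([4, 1, 0], [0, 7, 3]))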
By maintaining aggregate metadata concerning database attributes, the RainForest approach allows a choice of split attribute without needing the entire training set in memory. For each node of a DT, a table called the attribute-value class (AVC) label group is used. The table summarizes for an attribute the count of entries per class or attribute value grouping. Thus, the AVC table summarizes the information needed to determine splitting attributes. The size of the table is not proportional to the size of the database or training set, but rather to the product of the number of classes, unique attribute values, and potential splitting attributes. This reduction in size (for large training sets) facilitates the scaling of DT induction algorithms to extremely large training sets. During the tree-building phase, the training data are scanned, the AVC is built, and the best splitting attribute is chosen. The algorithm continues by splitting the training data and constructing the AVC for the next node.
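The following sketch shows the flavor of an AVC table for one node (illustrative only; the representation of tuples as Python dictionaries is an assumption, not the RainForest implementation).

from collections import defaultdict

def build_avc(tuples, attributes, class_label="class"):
    """Build attribute-value class counts for the tuples at one node.

    tuples: list of dicts, e.g. {"gender": "F", "height": 1.8, "class": "Medium"}
    Returns avc[attribute][value][class] = count.
    """
    avc = {a: defaultdict(lambda: defaultdict(int)) for a in attributes}
    for t in tuples:
        for a in attributes:
            avc[a][t[a]][t[class_label]] += 1
    return avc

# The size of this structure depends on the number of attributes, distinct
# attribute values, and classes -- not on the number of training tuples.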
4.5 NEURAL NETWORK-BASED ALGORITHMS
With neural networks (NNs), just as with decision trees, a model representing how to classify any given database tuple is constructed. The activation functions typically are sigmoidal. When a tuple must be classified, certain attribute values from that tuple are input into the directed graph at the corresponding source nodes. There often is one sink node for each class. The output value that is generated indicates the probability that the corresponding input tuple belongs to that class. The tuple will then be assigned to the class with the highest probability of membership. The learning process modifies the labeling of the arcs to better classify tuples. Given a starting structure and value for all the labels in the graph, as each tuple in the training set is sent through the network, the projected classification made by the graph can be compared with the actual classification. Based on the accuracy of the prediction, various labelings in the graph can change. This learning process continues with all the training data or until the classification accuracy is adequate.

Solving a classification problem using NNs involves several steps:

1. Determine the number of output nodes as well as what attributes should be used as input. The number of hidden layers (between the source and the sink nodes) also must be decided. This step is performed by a domain expert.

2. Determine weights (labels) and functions to be used for the graph.

3. For each tuple in the training set, propagate it through the network and evaluate the output prediction against the actual result. If the prediction is accurate, adjust labels to ensure that this prediction has a higher output weight the next time. If the prediction is not correct, adjust the weights to provide a lower output value for this class.

4. For each tuple ti in D, propagate ti through the network and make the appropriate classification.

There are many issues to be examined:

o Attributes (number of source nodes): This is the same issue as determining which attributes to use as splitting attributes.

o Number of hidden layers: In the simplest case, there is only one hidden layer.

o Number of hidden nodes: Choosing the best number of hidden nodes per hidden layer is one of the most difficult problems when using NNs. There have been many empirical and theoretical studies attempting to answer this question. The answer depends on the structure of the NN, types of activation functions, training algorithm, and problem being solved. If too few hidden nodes are used, the target function may not be learned (underfitting). If too many nodes are used, overfitting may occur. Rules of thumb are often given that are based on the size of the training set.

o Training data: As with DTs, with too much training data the NN may suffer from overfitting, while with too little it may not be able to classify accurately enough.

o Number of sinks: Although it is usually assumed that the number of output nodes is the same as the number of classes, this is not always the case. For example, with two classes there could be only one output node, with the resulting value being the probability of being in the associated class. Subtracting this value from one would give the probability of being in the second class.

o Interconnections: In the simplest case, each node is connected to all nodes in the next level.

o Weights: The weight assigned to an arc indicates the relative weight between those two nodes. Initial weights are usually assumed to be small positive numbers and are assigned randomly.

o Activation functions: Many different types of activation functions can be used.

o Learning technique: The technique for adjusting the weights is called the learning technique. Although many approaches can be used, the most common approach is some form of backpropagation, which is discussed in a subsequent subsection.

o Stop: The learning may stop when all the training tuples have propagated through the network or may be based on time or error rate.

There are many advantages to the use of NNs for classification:

o NNs are more robust than DTs because of the weights.

o The NN improves its performance by learning. This may continue even after the training set has been applied.

o The use of NNs can be parallelized for better performance.

o There is a low error rate and thus a high degree of accuracy once the appropriate training has been performed.

o NNs are more robust than DTs in noisy environments.

Conversely, NNs have many disadvantages:

o NNs are difficult to understand. Nontechnical users may have difficulty understanding how NNs work. While it is easy to explain decision trees, NNs are much more difficult to understand.

o Generating rules from NNs is not straightforward.

o Input attribute values must be numeric.

o Testing

o Verification

o As with DTs, overfitting may result.

o The learning phase may fail to converge.

o NNs may be quite expensive to use.
4.5.1 Propagation
The normal approach used for processing is called propagation. Given a tuple of values input to the NN, X = (x1, ..., xh), one value is input at each node in the input layer. Then the summation and activation functions are applied at each node, with an output value created for each output arc from that node. These values are in turn sent to the subsequent nodes. This process continues until a tuple of output values, Y = (y1, ..., ym), is produced from the nodes in the output layer. The process of propagation is shown in Algorithm 4.4 using a neural network with one hidden layer. Here a hyperbolic tangent activation function is used for the nodes in the hidden layer, while a sigmoid function is used for nodes in the output layer.
We assume that the constant c in the activation function has been provided. We also use k to be the number of edges coming into a node.

ALGORITHM 4.4
Input:
  N                    //Neural network
  X = (x1, ..., xh)    //Input tuple consisting of values for input attributes only
Output:
  Y = (y1, ..., ym)    //Tuple consisting of output values from NN
Propagation algorithm:
  //Algorithm illustrates propagation of a tuple through a NN
  for each node i in the input layer do
    Output xi on each output arc from i;
  for each hidden layer do
    for each node i do
      Si = Sum_{j=1}^{k} (wji xji);
      for each output arc from i do
        Output (1 - e^(-c Si)) / (1 + e^(-c Si));
  for each node i in the output layer do
    Si = Sum_{j=1}^{k} (wji xji);
    Output yi = 1 / (1 + e^(-c Si));
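A small Python sketch of this propagation step follows (illustrative only; the weight matrices, the constant c, and the sample call are assumptions, not values from the text). It uses the hyperbolic tangent form for the hidden layer and the sigmoid for the output layer, as in Algorithm 4.4.

import math

def propagate(x, w_hidden, w_output, c=1.0):
    """Forward propagation through a NN with one hidden layer.

    x:        input tuple (list of numbers)
    w_hidden: w_hidden[i][j] = weight from input j to hidden node i
    w_output: w_output[i][j] = weight from hidden node j to output node i
    """
    hidden = []
    for weights in w_hidden:
        s = sum(w * xj for w, xj in zip(weights, x))
        hidden.append((1 - math.exp(-c * s)) / (1 + math.exp(-c * s)))   # tanh-style
    y = []
    for weights in w_output:
        s = sum(w * hj for w, hj in zip(weights, hidden))
        y.append(1 / (1 + math.exp(-c * s)))                             # sigmoid
    return y

# Hypothetical usage with two inputs, two hidden nodes, and three outputs:
# print(propagate([0.0, 1.8], [[0.1, 0.5], [0.2, 0.4]],
#                 [[0.3, 0.3], [0.2, 0.6], [0.5, 0.1]]))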
A simple application of propagation is shown in Example 4.10 for the height data. Here the classification performed is the same as that seen with the decision tree in Figure 4.10(d).
EXAMPLE 4.10

Figure 4.13 shows a very simple NN used to classify university students as short, medium, or tall. There are two input nodes, one for the gender data and one for the height data. There are three output nodes, each associated with one class and using a simple threshold activation function. Activation function f3 is associated with the short class, f4 is associated with the medium class, and f5 is associated with the tall class. In this case, the weight of each arc from the height node is 1. The weights on the gender arcs are 0. This implies that in this case the gender values are ignored. The plots for the graphs of the three activation functions are shown in the figure.
FIGURE 4.13: Example propagation for tall data.

4.5.2 NN Supervised Learning

The NN starting state is modified based on feedback of its performance with the data in the training set. This type of learning is referred to as supervised because it is known a priori what the desired output should be. Unsupervised learning can also be performed if the output is not known. With unsupervised approaches, no external teacher set is used. A training set may be provided, but no labeling of the desired outcome is included. In this case, similarities and differences between different tuples in the training set are uncovered. In this chapter, we examine supervised learning.
Supervised learning in an NN is the process of adjusting the arc weights
based on its performance with a tuple from the training set. The behavior
of the training data is known a priori and thus can be used to fine-tune the
network for better behavior in future similar situations. Thus, the training
set can be used as a "teacher" during the training process. The output from
the network is compared to this known desired behavior. Algorithm 4.5
outlines the steps required. One potential problem with supervised learning
is that the error may not be continually reduced. It would, of course, be
hoped that each iteration in the learning process reduces the error so that
it is ultimately below an acceptable level. However, this is not always the
case. This may be due to the error calculation technique or to the approach
used for modifying the weights. This is actually the general problem of NNs.
They do not guarantee convergence or optimality.
ALGORITHM 4.5
Input:
  N    //Starting neural network
  X    //Input tuple from training set
  D    //Output tuple desired
Output:
  N    //Improved neural network
SupLearn algorithm:
  //Simplistic algorithm to illustrate approach to NN learning
  Propagate X through N producing output Y;
  Calculate error by comparing D to Y;
  Update weights on arcs in N to reduce error;

Notice that this algorithm must be associated with a means to calculate the error as well as some technique to adjust the weights. Many techniques have been proposed to calculate the error. Assuming that the output from node i is yi but should be di, the error produced from a node in any layer can be found by

    | yi - di |    (4.36)

The mean squared error (MSE) is found by

    (yi - di)^2 / 2    (4.37)

This MSE can then be used to find a total error over all nodes in the network or over only the output nodes. In the following discussion, the assumption is made that only the final output of the NN is known for a tuple in the training data. Thus, the total MSE error over all m output nodes in the NN is

    Sum_{i=1}^{m} (yi - di)^2 / m    (4.38)

This formula could be expanded over all tuples in the training set to see the total error over all of them. Thus, an error can be calculated for a specific test tuple or for the total set of all entries.

Figure 4.14 shows the structure and use of one node, j, in a neural network graph. The basic node structure is shown in part (a). Here the representative input arc has a weight of w?j, where ? is used to show that the input to node j is coming from another node shown here as ?. Of course, there probably are multiple input arcs to a node. The output weight is similarly labeled wj?. During propagation, data values input at the input layer flow through the network, with final values coming out of the network at the output layer. The propagation technique is shown in part (b) of Figure 4.14. Here the smaller dashed arrow underneath the regular graph arc shows the input value x?j flowing into node j. The activation function fj is applied to all the input values and weights, with output values resulting. There is an associated input function that is applied to the input values and weights before applying the activation function. This input function is typically a weighted sum of the input values. Here yj? shows the output value flowing (propagating) to the next node from node j. Thus, propagation occurs by applying the activation function at each node, which then places the output value on the arc to be sent as input to the next nodes. In most cases, the activation function produces only one output value that is propagated to the set of connected nodes. The NN can be used for classification and/or learning. During the classification process, only propagation occurs. However, when learning is used after the output of the classification occurs, a comparison to the known classification is used to determine how to change the weights in the graph. In the simplest types of learning, learning progresses from the output layer backward to the input layer. Weights are changed based on the changes that were made in weights in subsequent arcs. This backward learning process is called backpropagation and is illustrated in Figure 4.14(c). Weight wj? is modified to become wj? + delta wj?. A learning rule is applied to this delta wj? to determine the change at the next higher level, delta w?j.

FIGURE 4.14: Neural network usage. (a) Node j in NN; (b) Propagation at node j; (c) Backpropagation at node j.

The Hebb and delta rules are approaches to change the weight on an input arc to a node based on the knowledge that the output value from that node is incorrect. With both techniques, a learning rule is used to modify the input weights. Suppose for a given node, j, the input weights are represented as a tuple (w1j, ..., wkj), while the input and output values are (x1j, ..., xkj) and yj, respectively. The objective of a learning technique is to change the weights based on the output obtained for a specific input tuple. The change in weights using the Hebb rule is represented by the following rule

    delta wij = c xij yj    (4.39)

Here c is a constant often called the learning rate. A rule of thumb is that c = 1 / |entries in training set|.

A variation of this approach, called the delta rule, examines not only the output value yj but also the desired value dj for output. In this case the change in weight is found by the rule

    delta wij = c xij (dj - yj)    (4.40)

The nice feature of the delta rule is that it minimizes the error dj - yj at each node.
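As a small illustration (not from the text), the Hebb and delta rules of Equations (4.39) and (4.40) update a node's input weights as follows; the weights, inputs, and learning rate in the usage comment are made up.

def hebb_update(weights, x, y, c=0.05):
    """Hebb rule: delta w_ij = c * x_ij * y_j (Equation 4.39)."""
    return [w + c * xi * y for w, xi in zip(weights, x)]

def delta_update(weights, x, y, d, c=0.05):
    """Delta rule: delta w_ij = c * x_ij * (d_j - y_j) (Equation 4.40)."""
    return [w + c * xi * (d - y) for w, xi in zip(weights, x)]

# Hypothetical usage for a node with three inputs:
# new_w = delta_update([0.1, -0.2, 0.4], [1.0, 0.5, 0.0], y=0.7, d=1.0)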
Backpropagation is a learning technique that adjusts weights in the NN by propagating weight changes backward from the sink to the source nodes. Backpropagation is the most well known form of learning because it is easy to understand and generally applicable. Backpropagation can be thought of as a generalized delta rule approach.

ALGORITHM 4.6
Input:
  N                    //Starting neural network
  X = (x1, ..., xh)    //Input tuple from training set
  D = (d1, ..., dm)    //Output tuple desired
Output:
  N                    //Improved neural network
Backpropagation algorithm:
  //Illustrate backpropagation
  Propagation(N, X);
  E = 1/2 Sum_{i=1}^{m} (di - yi)^2;
  Gradient(N, E);

A simple version of the backpropagation algorithm is shown in Algorithm 4.6. The MSE is used to calculate the error. Each tuple in the training set is input to this algorithm. The last step of the algorithm uses gradient descent as the technique to modify the weights in the graph. The basic idea of gradient descent is to find the set of weights that minimizes the MSE. The partial derivative dE/dwji gives the slope (or gradient) of the error function for one weight. We thus wish to find the weight where this slope is zero. Figure 4.15 and Algorithm 4.7 illustrate the concept. The stated algorithm assumes only one hidden layer. More hidden layers would be handled in the same manner with the error propagated backward.

FIGURE 4.15: Gradient descent.

ALGORITHM 4.7
Input:
  N    //Starting neural network
  E    //Error found from back algorithm
Output:
  N    //Improved neural network
Gradient algorithm:
  //Illustrates incremental gradient descent
  for each node i in output layer do
    for each node j input to i do
      delta wji = eta (di - yi) yj yi (1 - yi);
      wji = wji + delta wji;
  layer = previous layer;
  for each node j in this layer do
    for each node k input to j do
      delta wkj = eta yk yj (1 - yj) Sum_m (dm - ym) wjm ym (1 - ym);
      wkj = wkj + delta wkj;

This algorithm changes weights by working backward from the output layer to the input layer. There are two basic versions of this algorithm. With the batch or offline approach, the weights are changed once after all tuples in the training set are applied and a total MSE is found. With the incremental or online approach, the weights are changed after each tuple in the training set is applied. The incremental technique is usually preferred because it requires less space and may actually examine more potential solutions (weights), thus leading to a better solution. In this equation, eta is referred to as the learning parameter. It typically is found in the range (0, 1), although it may be larger. This value determines how fast the algorithm learns.

Applying a learning rule back through multiple layers in the network may be difficult. Doing this for the hidden layers is not as easy as doing it with the output layer. Overall, however, we are trying to minimize the error at the output nodes, not at each node in the network. Thus, the approach that is used is to propagate the output errors backward through the network.

FIGURE 4.16: Nodes for gradient descent.

Figure 4.16 shows the structure we use to discuss the gradient descent algorithm. Here node i is at the output layer and node j is at the hidden layer just before it; yi is the output of i and yj is the output of j. The learning function in the gradient descent technique is based on using the following value for delta at the output layer:

    delta wji = -eta dE/dwji = -eta (dE/dyi) (dyi/dSi) (dSi/dwji)    (4.41)
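The incremental update of Algorithms 4.6 and 4.7 can be sketched in Python as follows; this is a sketch under the assumption of sigmoid activations at every node (so that dy/ds = y(1 - y)), and the learning rate and network shape are illustrative, not values from the text.

import math

def backprop_step(x, d, w_hidden, w_output, eta=0.1):
    """One incremental gradient-descent update for a 1-hidden-layer NN."""
    sig = lambda s: 1 / (1 + math.exp(-s))
    # Forward pass.
    h = [sig(sum(w * xk for w, xk in zip(row, x))) for row in w_hidden]
    y = [sig(sum(w * hj for w, hj in zip(row, h))) for row in w_output]
    # Deltas at the output layer: (d_i - y_i) y_i (1 - y_i).
    d_out = [(d[i] - y[i]) * y[i] * (1 - y[i]) for i in range(len(y))]
    # Deltas at the hidden layer use the old output weights.
    d_hid = [h[j] * (1 - h[j]) * sum(d_out[i] * w_output[i][j] for i in range(len(y)))
             for j in range(len(h))]
    # Weight updates (incremental/online version).
    for i in range(len(w_output)):
        for j in range(len(h)):
            w_output[i][j] += eta * d_out[i] * h[j]
    for j in range(len(w_hidden)):
        for k in range(len(x)):
            w_hidden[j][k] += eta * d_hid[j] * x[k]
    return w_hidden, w_output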
4.5.4 Perceptrons

The simplest NN is called a perceptron. A perceptron is a single neuron with multiple inputs and one output. The original perceptron proposed the use of a step activation function, but it is more common to see another type of function such as a sigmoidal function. A simple perceptron can be used to classify into two classes. Using a unipolar activation function, an output of 1 would be used to classify into one class, while an output of 0 would be used to place the tuple in the other class. Example 4.11 illustrates this.

EXAMPLE 4.11

Figure 4.18(a) shows a perceptron with two inputs and a bias input. The three weights are 3, 2, and -6, respectively. The activation function f4 is thus applied to the value S = 3 x1 + 2 x2 - 6. Using a simple unipolar step activation function, we get

    f4 = 1 if S > 0, and 0 otherwise.
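The decision rule of this perceptron can be checked directly with a few lines of Python (a sketch; the sample points in the comments are made up).

def perceptron_class(x1, x2):
    """Perceptron of Example 4.11: S = 3*x1 + 2*x2 - 6, unipolar step output."""
    s = 3 * x1 + 2 * x2 - 6
    return 1 if s > 0 else 0

# Points above/right of the line x2 = 3 - (3/2) x1 give output 1:
# print(perceptron_class(2, 2))   # 1
# print(perceptron_class(0, 0))   # 0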
An alternative way to view this classification problem is shown in Figure 4.18(b). Here x1 is shown on the horizontal axis and x2 is shown on the vertical axis. The area of the plane to the right of the line x2 = 3 - 3/2 x1 represents one class and the rest of the plane represents the other class.

FIGURE 4.18: Perceptron classification example. (a) Classification perceptron; (b) Classification problem.

The simple feedforward NN that was introduced in Chapter 3 is actually called a multilayer perceptron (MLP). An MLP is a network of perceptrons. Figure 3.6 showed an MLP used for classifying the height example given in Table 4.1. The neurons are placed in layers with outputs always flowing toward the output layer. If only one layer exists, it is called a perceptron. If multiple layers exist, it is an MLP.

In the 1950s a Russian mathematician, Andrey Kolmogorov, proved that an MLP needs no more than two hidden layers. Kolmogorov's theorem states that a mapping between two sets of numbers can be performed using an NN with only one hidden layer. In this case, the NN is to have one input node for each attribute input, and given n input attributes the hidden layer should have (2n + 1) nodes, each with input from each of the input nodes. The output layer has one node for each desired output value.

4.6 RULE-BASED ALGORITHMS

One straightforward way to perform classification is to generate if-then rules that cover all cases. For example, we could have the following rules to determine classification of grades:

    If 90 <= grade, then class = A
    If 80 <= grade and grade < 90, then class = B
    If 70 <= grade and grade < 80, then class = C
    If 60 <= grade and grade < 70, then class = D
    If grade < 60, then class = F    (4.55)

A classification rule, r = (a, c), consists of the if or antecedent part, a, and the then or consequent portion, c. The antecedent contains a predicate that can be evaluated as true or false against each tuple in the database (and obviously in the training data). These rules relate directly to the corresponding DT that could be created. A DT can always be used to generate rules, but they are not equivalent. There are differences between rules and trees:

o The tree has an implied order in which the splitting is performed. Rules have no order.

o A tree is created based on looking at all classes. When generating rules, only one class must be examined at a time.

There are algorithms that generate rules from trees as well as algorithms that generate rules without first creating DTs.

4.6.1 Generating Rules from a DT

The process to generate a rule from a DT is straightforward and is outlined in Algorithm 4.8. This algorithm will generate a rule for each leaf node in the decision tree. All rules with the same consequent could be combined together by ORing the antecedents of the simpler rules.

ALGORITHM 4.8
Input:
  T    //Decision tree
Output:
  R    //Rules
Gen algorithm:
  //Illustrate simple approach to generating classification rules from a DT
  R = {};
  for each path from root to a leaf in T do
    a = True;
    for each non-leaf node do
      a = a AND (label of node combined with label of incident outgoing arc);
    c = label of leaf node;
    R = R union r = (a, c);
Using this algorithm, the following rules are generated for the DT in Figure 4.12(a):

    {((Height <= 1.6 m), Short),
     (((Height > 1.6 m) AND (Height <= 1.7 m)), Short),
     (((Height > 1.7 m) AND (Height <= 1.8 m)), Medium),
     (((Height > 1.8 m) AND (Height <= 1.9 m)), Medium),
     (((Height > 1.9 m) AND (Height <= 2 m) AND (Height <= 1.95 m)), Medium),
     (((Height > 1.9 m) AND (Height <= 2 m) AND (Height > 1.95 m)), Tall),
     ((Height > 2 m), Tall)}

An optimized version of these rules is then:

    {((Height <= 1.7 m), Short),
     (((Height > 1.7 m) AND (Height <= 1.95 m)), Medium),
     ((Height > 1.95 m), Tall)}
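A compact sketch of the idea behind Algorithm 4.8 follows (illustrative only; the nested-tuple encoding of the tree is an assumption, not the text's representation).

def gen_rules(tree, antecedent=()):
    """Generate one rule per root-to-leaf path.

    A leaf is a class label (string); an internal node is a tuple
    (attribute, [(arc_label, subtree), ...]).
    """
    if isinstance(tree, str):                       # leaf: emit (antecedent, class)
        return [(antecedent, tree)]
    attribute, branches = tree
    rules = []
    for arc_label, subtree in branches:
        rules += gen_rules(subtree, antecedent + ((attribute, arc_label),))
    return rules

# Hypothetical tree that splits on Height only:
# tree = ("Height", [("<=1.7", "Short"), ("(1.7,1.95]", "Medium"), (">1.95", "Tall")])
# for a, c in gen_rules(tree): print(a, "->", c)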
4.6.2 Generating Rules from a Neural Net

To increase the understanding of an NN, classification rules may be derived from it. While the source NN may still be used for classification, the derived rules can be used to verify or interpret the network. The problem is that the rules do not explicitly exist. They are buried in the structure of the graph itself. In addition, if learning is still occurring, the rules themselves are dynamic. The rules generated tend both to be more concise and to have a lower error rate than rules used with DTs. The basic idea of the RX algorithm is to cluster output values with the associated hidden nodes and input. A major problem with rule extraction is the potential size of the set of rules that might be generated. For example, if you have a node with n inputs each having 5 values, there are 5^n different input combinations to this one node alone. These patterns would all have to be accounted for when constructing rules. To overcome this problem and that of having continuous ranges of output values from nodes, the output values for both the hidden and output layers are first discretized. This is accomplished by clustering the values and dividing continuous values into disjoint ranges. The rule extraction algorithm, RX, shown in Algorithm 4.9, is derived from [LSL95].

ALGORITHM 4.9
Input:
  D    //Training data
  N    //Initial neural network
Output:
  R    //Derived rules
RX algorithm:
  //Rule extraction algorithm to extract rules from NN
  cluster output node activation values;
  cluster hidden node activation values;
  generate rules that describe the output values in terms of the hidden activation values;
  generate rules that describe hidden output values in terms of inputs;
  combine the two sets of rules.
4.6.3 Generating Rules Without a DT or NN
These techniques are sometimes called covering algorithms because they attempt to generate rules that exactly cover a specific class [WF00]. Tree algorithms work in a top-down, divide-and-conquer approach, but this need not be the case for covering algorithms. They generate the best rule possible by optimizing the desired classification probability. Usually the "best" attribute-value pair is chosen, as opposed to the best attribute with the tree-based algorithms. Suppose that we wished to generate a rule to classify persons as tall. The basic format for the rule is then

    If ? then class = tall

The objective for the covering algorithms is to replace the "?" in this statement with predicates that can be used to obtain the "best" probability of being tall.
One simple approach is called 1R because it generates a simple set of rules that are equivalent to a DT with only one level. The basic idea is to choose the best attribute to perform the classification based on the training data. "Best" is defined here by counting the number of errors. In Table 4.4 this approach is illustrated using the height example, Output1. If we only use the gender attribute, there are a total of 6/15 errors, whereas if we use the height attribute, there are only 1/15. Thus, the height attribute would be chosen and the six rules stated in the table would be used. As with ID3, 1R tends to choose attributes with a large number of values, leading to overfitting. 1R can handle missing data by adding an additional attribute value for missing. Algorithm 4.10, which is adapted from [WF00], shows the outline for this algorithm.
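A rough Python sketch of the counting performed by 1R (see Table 4.4 and Algorithm 4.10 below) is given here; it is illustrative only and assumes discrete attribute values, so continuous attributes such as height would first be binned into ranges.

from collections import Counter, defaultdict

def one_r(tuples, attributes, class_label="class"):
    """Return (best_attribute, rules) where rules maps value -> majority class."""
    best = None
    for a in attributes:
        counts = defaultdict(Counter)                 # value -> class counts
        for t in tuples:
            counts[t[a]][t[class_label]] += 1
        rules = {v: c.most_common(1)[0][0] for v, c in counts.items()}
        errors = sum(sum(c.values()) - max(c.values()) for c in counts.values())
        if best is None or errors < best[2]:
            best = (a, rules, errors)
    return best[0], best[1]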
TABLE 4.4: 1R Classification

Option  Attribute  Rules                     Errors  Total Errors
1       Gender     F -> Medium               3/9     6/15
                   M -> Tall                 3/6
2       Height     (0, 1.6] -> Short         0/2     1/15
                   (1.6, 1.7] -> Short       0/2
                   (1.7, 1.8] -> Medium      0/3
                   (1.8, 1.9] -> Medium      0/4
                   (1.9, 2.0] -> Medium      1/2
                   (2.0, inf) -> Tall        0/2

ALGORITHM 4.10
Input:
  D    //Training data
  R    //Attributes to consider for rules
  C    //Classes
Output:
  R    //Rules
1R algorithm:
  //1R algorithm generates rules based on one attribute
  R = {};
  for each A in R do
    RA = {};
    for each possible value, v, of A do
      //v may be a range rather than a specific value
      for each Cj in C find count(Cj);
        //Here count is the number of occurrences of this class for this attribute
      let Cm be the class with the largest count;
      RA = RA union ((A = v) -> (class = Cm));
    ERRA = number of tuples incorrectly classified by RA;
  R = RA where ERRA is minimum;

EXAMPLE 4.12

Using the data in Table 4.1 and the Output1 classification, the following shows the basic probability of putting a tuple in the tall class based on the given attribute-value pair:

    Gender = F                 0/9
    Gender = M                 3/6
    Height <= 1.6              0/2
    1.6 < Height <= 1.7        0/2
    1.7 < Height <= 1.8        0/3
    1.8 < Height <= 1.9        0/4
    1.9 < Height <= 2.0        1/2
    2.0 < Height               2/2

Based on the analysis, we see that the last height range is the most accurate and thus generate the rule

    If 2.0 < height, then class = tall

Since all tuples that satisfy this predicate are tall, we do not add any additional predicates to this rule. We now need to generate additional rules for the tall class. We thus look at the remaining 13 tuples in the training set and recalculate the accuracy of the corresponding predicates:

    Gender = F                 0/9
    Gender = M                 1/4
    Height <= 1.6              0/2
    1.6 < Height <= 1.7        0/2
    1.7 < Height <= 1.8        0/3
    1.8 < Height <= 1.9        0/4
    1.9 < Height <= 2.0        1/2

Based on this analysis, we would generate the rule

    If 1.9 < height <= 2.0, then class = tall

However, only one of the tuples that satisfies this is actually tall, so we need to add another predicate to it. We then look only at the other predicates affecting these two tuples. We now see a problem in that both of these are males. The problem is actually caused by our "arbitrary" range divisions. We now divide the range into two subranges:

    1.9 < Height <= 1.95       0/1
    1.95 < Height <= 2.0       1/1

We thus add this second predicate to the rule to obtain

    If 1.9 < height <= 2.0 and 1.95 < height <= 2.0, then class = tall

or

    If 1.95 < height, then class = tall

This problem does not exist if we look at tuples individually using the attribute-value pairs. However, in that case we would not generate the needed ranges for classifying the actual data. At this point, we have classified all tall tuples. The algorithm would then proceed by classifying the short and medium classes. This is left as an exercise.
Algorithm 4.11 outlines PRISM, a covering algorithm that generates rules by repeatedly choosing the attribute-value pair with the best classification probability for the target class.

ALGORITHM 4.11
Input:
  D    //Training data
  C    //Classes
Output:
  R    //Rules
PRISM algorithm:
  //PRISM algorithm generates rules based on best attribute-value pairs
  R = {};
  for each Ci in C do
    repeat
      T = D;      //All instances of class Ci will be systematically removed from T
      p = true;   //Create new rule with empty left-hand side
      r = (if p then Ci);
      repeat
        for each attribute A value v pair found in T do
          calculate |tuples in T satisfying A = v and in class Ci| / |tuples in T satisfying A = v|;
        find A = v that maximizes this value;
        p = p AND (A = v);
        T = {tuples in T that satisfy A = v};
      until all tuples in T belong to Ci;
      D = D - T;
      R = R union r;
    until there are no tuples in D that belong to Ci;
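The core step of a covering algorithm such as PRISM -- score every attribute-value pair by the fraction of the tuples it covers that belong to the target class, then keep the best pair -- can be sketched as follows (illustrative only; the tuple representation is an assumption).

def best_pair(tuples, target_class, class_label="class"):
    """Return the (attribute, value) pair whose covered tuples are most often target_class."""
    best, best_acc = None, -1.0
    attributes = [a for a in tuples[0] if a != class_label]
    for a in attributes:
        for v in {t[a] for t in tuples}:
            covered = [t for t in tuples if t[a] == v]
            acc = sum(t[class_label] == target_class for t in covered) / len(covered)
            if acc > best_acc:
                best, best_acc = (a, v), acc
    return best, best_acc

# Repeatedly adding the best pair to the antecedent and restricting the tuples,
# as in Example 4.12, yields one rule; removing the covered tuples and repeating
# yields the rest.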
4.7 COMBINING TECHNIQUES

Given a classification problem, no one classification technique always yields the best results. Therefore, there have been some proposals that look at combining techniques.

o A synthesis of approaches takes multiple techniques and blends them into a new approach. An example of this would be using a prediction technique, such as linear regression, to predict a future value for an attribute that is then used as input to a classification NN. In this way the NN is used to predict a future classification value.

o Multiple independent approaches can be applied to a classification problem, each yielding its own class prediction. The results of these individual techniques can then be combined in some manner. This approach has been referred to as combination of multiple classifiers (CMC).

One approach to combine independent classifiers assumes that there are n independent classifiers and that each generates the posterior probability Pk(Ci | ti) for each class. The values are combined with a weighted linear combination

    Sum_{k=1}^{n} wk Pk(Ci | ti)    (4.56)

Here the weights, wk, can be assigned by a user or learned based on the past accuracy of each classifier. Another technique is to choose the classifier that has the best accuracy in a database sample. This is referred to as dynamic classifier selection (DCS). Example 4.13, which is modified from [LJ98], illustrates the use of DCS. Another variation is simple voting: assign the tuple to the class to which a majority of the classifiers have assigned it. This may have to be modified slightly in case there are many classes and no majority is found.

FIGURE 4.19: Combination of multiple classifiers. (a) Classifier 1; (b) Classifier 2. Triangles denote tuples in class 1 and squares tuples in class 2; darkened shapes are incorrectly classified.

EXAMPLE 4.13

Two classifiers exist to classify tuples into two classes. A target tuple, X, needs to be classified. Using a nearest neighbor approach, the 10 tuples closest to X are identified. Figure 4.19 shows these 10 tuples. In Figure 4.19(a) the results for the first classifier are shown, while in Figure 4.19(b) those for the second classifier are shown. The tuples designated with triangles should be in class 1, while those shown as squares should be in class 2. Any shapes that are darkened indicate an incorrect classification by that classifier. To combine the classifiers using DCS, look at the general accuracy of each classifier. With classifier 1, 7 tuples in the neighborhood of X are correctly classified, while with the second classifier, only 6 are correctly classified. Thus, X will be classified according to how it is classified with the first classifier.
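A minimal sketch of the two combination schemes described above -- the weighted linear combination of Equation (4.56) and simple voting -- follows; the classifier outputs and weights in the usage comments are made up.

from collections import Counter

def weighted_combination(posteriors, weights):
    """posteriors[k][c] = P_k(C_c | t); return the class with the largest weighted sum."""
    classes = posteriors[0].keys()
    scores = {c: sum(w * p[c] for w, p in zip(weights, posteriors)) for c in classes}
    return max(scores, key=scores.get)

def majority_vote(predictions):
    """Assign the class chosen by the most classifiers (ties broken arbitrarily)."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical usage with two classifiers and two classes:
# print(weighted_combination([{"C1": 0.7, "C2": 0.3}, {"C1": 0.4, "C2": 0.6}], [0.6, 0.4]))
# print(majority_vote(["C1", "C2", "C1"]))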
4.8 REVIEW QUESTIONS

1. What are the issues in classification? Explain with an example.
2. Give an example for classification using division.
3. Give an example for classification using prediction.
4. Explain with an example Bayesian classification.
5. What do you mean by distance-based algorithms?
6. Explain decision tree-based algorithms.
7. What do you mean by entropy?
8. What do you mean by CART?
9. Explain scalable DT techniques.
10. Explain NN-based (neural network) algorithms.
11. What do you mean by rule-based algorithms?
12. Explain combining techniques.
13. How do you generate rules from a neural net?

CHAPTER 5

Clustering

5.1 INTRODUCTION
5.2 SIMILARITY AND DISTANCE MEASURES
5.3 OUTLIERS
5.4 HIERARCHICAL ALGORITHMS
5.5 PARTITIONAL ALGORITHMS
5.6 CLUSTERING LARGE DATABASES
5.7 CLUSTERING WITH CATEGORICAL ATTRIBUTES
5.8 COMPARISON
5.9 REVIEW QUESTIONS
5.1 INTRODUCTION
Clustering is similar to classification in that data are grouped. However,
unlike classification, the groups are not predefined. Instead, the grouping
is accomplished by finding similarities between data according to characteristics found in the actual data. The groups are called clusters. Some
authors view clustering as a special type of classification. In this text, however, we follow a more conventional view in that the two are different. Many
definitions for clusters have been proposed:
o Set of like elements. Elements from different clusters are not alike.
o The distance between points in a cluster is less than the distance
between a point in the cluster and any point outside it.
A term similar to clustering is database segmentation, where like tuples
(records) in a database are grouped together. This is done to partition
or segment the database into components that then give the user a more
general view of the data. In this text, we do not differentiate between
segmentation and clustering. A simple example of clustering is found in
Example 5.1. This example illustrates the fact that determining how to do
the clustering is not straightforward.