Data Warehousing and
Data Mining
— Chapter 7 —
MIS 542
2013-2014 Fall
1
Chapter 7. Classification and Prediction
- What is classification? What is prediction?
- Issues regarding classification and prediction
- Classification by decision tree induction
- Bayesian Classification
- Classification by Neural Networks
- Classification by Support Vector Machines (SVM)
- Classification based on concepts from association rule mining
- Other Classification Methods
- Prediction
- Classification accuracy
- Summary
2
Classification vs. Prediction
- Classification:
  - predicts categorical class labels (discrete or nominal)
  - classifies data (constructs a model) based on the training set and the values (class labels) of a classifying attribute, and uses the model to classify new data
- Prediction:
  - models continuous-valued functions, i.e., predicts unknown or missing values
- Typical applications:
  - credit approval
  - target marketing
  - medical diagnosis
  - treatment effectiveness analysis
3
Classification—A Two-Step Process
- Model construction: describing a set of predetermined classes
  - Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
  - The set of tuples used for model construction is the training set
  - The model is represented as classification rules, decision trees, or mathematical formulae
- Model usage: classifying future or unknown objects
  - Estimate the accuracy of the model
    - The known label of each test sample is compared with the model's prediction
    - The accuracy rate is the percentage of test set samples correctly classified by the model
    - The test set is independent of the training set; otherwise over-fitting will occur
  - If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known
4
Classification Process (1): Model Construction

Training Data:

  NAME   RANK            YEARS   TENURED
  Mike   Assistant Prof    3     no
  Mary   Assistant Prof    7     yes
  Bill   Professor         2     yes
  Jim    Associate Prof    7     yes
  Dave   Assistant Prof    6     no
  Anne   Associate Prof    3     no

Classification algorithm -> Classifier (Model):

  IF rank = 'professor' OR years > 6
  THEN tenured = 'yes'
5
Classification Process (2): Use the Model in Prediction

The classifier built in step (1) is first applied to test data to estimate its accuracy, and then to unseen data.

Testing Data:

  NAME      RANK            YEARS   TENURED
  Tom       Assistant Prof    2     no
  Merlisa   Associate Prof    7     no
  George    Professor         5     yes
  Joseph    Assistant Prof    7     yes

Unseen data: (Jeff, Professor, 4) -> Tenured?
6
Supervised vs. Unsupervised Learning
- Supervised learning (classification)
  - Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
  - New data is classified based on the training set
- Unsupervised learning (clustering)
  - The class labels of the training data are unknown
  - Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
7
Chapter 7. Classification and Prediction
- What is classification? What is prediction?
- Issues regarding classification and prediction
- Classification by decision tree induction
- Bayesian Classification
- Classification by Neural Networks
- Classification by Support Vector Machines (SVM)
- Classification based on concepts from association rule mining
- Other Classification Methods
- Prediction
- Classification accuracy
- Summary
8
Issues Regarding Classification and Prediction (1): Data Preparation
- Data cleaning
  - Preprocess data in order to reduce noise and handle missing values
- Relevance analysis (feature selection)
  - Remove irrelevant or redundant attributes
- Data transformation
  - Generalize and/or normalize data
9
Issues Regarding Classification and Prediction (2): Evaluating Classification Methods
- Predictive accuracy
- Speed and scalability
  - time to construct the model
  - time to use the model
- Robustness
  - handling noise and missing values
- Scalability
  - efficiency for disk-resident databases
- Interpretability
  - understanding and insight provided by the model
- Goodness of rules
  - decision tree size
  - compactness of classification rules
10
Chapter 7. Classification and Prediction
- What is classification? What is prediction?
- Issues regarding classification and prediction
- Classification by decision tree induction
- Bayesian Classification
- Classification by Neural Networks
- Classification by Support Vector Machines (SVM)
- Classification based on concepts from association rule mining
- Other Classification Methods
- Prediction
- Classification accuracy
- Summary
11
Classification by Decision Tree Induction
- Decision tree
  - A flow-chart-like tree structure
  - Internal node denotes a test on an attribute
  - Branch represents an outcome of the test
  - Leaf nodes represent class labels or class distributions
- Decision tree generation consists of two phases
  - Tree construction
    - At start, all the training examples are at the root
    - Partition examples recursively based on selected attributes
  - Tree pruning
    - Identify and remove branches that reflect noise or outliers
- Use of a decision tree: classifying an unknown sample
  - Test the attribute values of the sample against the decision tree
12
Training Dataset

This follows an example from Quinlan's ID3.

  age     income   student  credit_rating  buys_computer
  <=30    high     no       fair           no
  <=30    high     no       excellent      no
  31…40   high     no       fair           yes
  >40     medium   no       fair           yes
  >40     low      yes      fair           yes
  >40     low      yes      excellent      no
  31…40   low      yes      excellent      yes
  <=30    medium   no       fair           no
  <=30    low      yes      fair           yes
  >40     medium   yes      fair           yes
  <=30    medium   yes      excellent      yes
  31…40   medium   no       excellent      yes
  31…40   high     yes      fair           yes
  >40     medium   no       excellent      no
13
Output: A Decision Tree for “buys_computer”

  age?
    <=30:   student?
              no  -> no
              yes -> yes
    31…40:  yes
    >40:    credit_rating?
              excellent -> no
              fair      -> yes
14
Algorithm for Decision Tree Induction
- Basic algorithm (a greedy algorithm)
  - The tree is constructed in a top-down, recursive, divide-and-conquer manner
  - At start, all the training examples are at the root
  - Attributes are categorical (continuous-valued attributes are discretized in advance)
  - Examples are partitioned recursively based on selected attributes
  - Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
- Conditions for stopping partitioning
  - All samples for a given node belong to the same class
  - There are no remaining attributes for further partitioning (majority voting is used to label the leaf)
  - There are no samples left
15
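A minimal Python sketch of this greedy recursion (an illustration, not from the slides): it assumes categorical attributes, rows given as tuples whose last element is the class label, and an attribute-selection function `select` such as the information-gain helper sketched after the next slide.

def build_tree(rows, attrs, select):
    """attrs maps attribute name -> column index; select(rows, idx) scores an attribute."""
    classes = [r[-1] for r in rows]
    if len(set(classes)) == 1:                       # stopping rule: node is pure
        return classes[0]
    if not attrs:                                    # no attributes left:
        return max(set(classes), key=classes.count)  # majority voting for the leaf
    best = max(attrs, key=lambda a: select(rows, attrs[a]))
    rest = {a: i for a, i in attrs.items() if a != best}
    branches = {}
    for v in set(r[attrs[best]] for r in rows):      # partition recursively
        subset = [r for r in rows if r[attrs[best]] == v]
        branches[v] = build_tree(subset, rest, select)
    return {best: branches}

With the `data`, `ATTRS` and `gain` names introduced in the later sketches, `build_tree(data, ATTRS, gain)` should reproduce the buys_computer tree shown a few slides later.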
Attribute Selection Measure: Information Gain (ID3/C4.5)
- Select the attribute with the highest information gain
- S contains si tuples of class Ci, for i = 1, …, m
- Information (entropy) required to classify an arbitrary tuple:

    I(s1, s2, …, sm) = - Σ_{i=1..m} (si/s) · log2(si/s)

- Entropy of attribute A with values {a1, a2, …, av}:

    E(A) = Σ_{j=1..v} ((s1j + … + smj)/s) · I(s1j, …, smj)

- Information gained by branching on attribute A:

    Gain(A) = I(s1, s2, …, sm) - E(A)
16
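A small Python sketch of these two measures (an illustration, not from the slides): each training row is assumed to be a tuple whose last element is the class label, and `idx` is the column of attribute A.

from math import log2

def info(rows):
    """I(s1,...,sm): expected information (entropy) of the class distribution."""
    n = len(rows)
    counts = {}
    for r in rows:
        counts[r[-1]] = counts.get(r[-1], 0) + 1
    return -sum((c / n) * log2(c / n) for c in counts.values())

def gain(rows, idx):
    """Gain(A) = I(s1,...,sm) - E(A), where A is the attribute in column idx."""
    expected = 0.0                                   # E(A)
    for v in set(r[idx] for r in rows):
        subset = [r for r in rows if r[idx] == v]
        expected += len(subset) / len(rows) * info(subset)
    return info(rows) - expected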
Attribute Selection by Information Gain Computation
- Class P: buys_computer = "yes" (9 samples); Class N: buys_computer = "no" (5 samples)
- I(p, n) = I(9, 5) = 0.940
- Compute the entropy for age (using the same 14-sample training set as before):

    age      pi   ni   I(pi, ni)
    <=30     2    3    0.971
    31…40    4    0    0
    >40      3    2    0.971

    E(age) = (5/14)·I(2,3) + (4/14)·I(4,0) + (5/14)·I(3,2) = 0.694

  Here (5/14)·I(2,3) means "age <= 30" covers 5 out of 14 samples, with 2 yes's and 3 no's. Hence

    Gain(age) = I(p, n) - E(age) = 0.246

- Similarly:
    Gain(income) = 0.029
    Gain(student) = 0.151
    Gain(credit_rating) = 0.048
17
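These numbers can be checked with the info()/gain() sketch above; the list below is an assumed encoding of the 14-sample training table (columns: age, income, student, credit_rating, buys_computer).

data = [
    ("<=30", "high", "no", "fair", "no"),         ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),      (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),         (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),        (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),     (">40", "medium", "no", "excellent", "no"),
]
ATTRS = {"age": 0, "income": 1, "student": 2, "credit_rating": 3}

print(round(info(data), 3))                  # 0.94  (= I(9,5))
for name, idx in ATTRS.items():
    print(name, round(gain(data, idx), 3))
# age 0.247, income 0.029, student 0.152, credit_rating 0.048
# (the slide's 0.246 and 0.151 are the same values truncated)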
Gain Ratio
- Add another attribute, transaction ID (TID); TID is different for each observation
- E(TID) = (1/14)·I(1,0) + (1/14)·I(1,0) + … + (1/14)·I(1,0) = 0
- Gain(TID) = 0.940 - 0 = 0.940
  - the highest gain, so TID would be chosen as the test attribute, which makes no sense
- Use gain ratio rather than gain
- Split information: a measure of the information value of the split
  - without considering class information
  - only the number and size of the child nodes
  - a kind of normalization for information gain
18





- Split information = -Σ_i (Si/S)·log2(Si/S)
  - the information needed to assign an instance to one of these branches
- Gain ratio = gain / split information
- In the previous example:
  - split info(TID) = -14·(1/14)·log2(1/14) = 3.807
  - gain ratio(TID) = 0.940 / 3.807 = 0.247
  - split info(age) = I(5,4,5) = -(5/14)·log2(5/14) - (4/14)·log2(4/14) - (5/14)·log2(5/14) = 1.577
  - gain ratio(age) = gain(age) / split info(age) = 0.246 / 1.577 = 0.156
19
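A short continuation of the Python sketch (reusing log2, info, gain, data and ATTRS from above) for the gain-ratio computation:

def split_info(rows, idx):
    """Entropy of the partition sizes themselves (ignores the class labels)."""
    n = len(rows)
    sizes = {}
    for r in rows:
        sizes[r[idx]] = sizes.get(r[idx], 0) + 1
    return -sum((c / n) * log2(c / n) for c in sizes.values())

def gain_ratio(rows, idx):
    return gain(rows, idx) / split_info(rows, idx)

print(round(split_info(data, ATTRS["age"]), 3))   # 1.577  (= I(5,4,5))
print(round(gain_ratio(data, ATTRS["age"]), 3))   # 0.156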
Exercise

Repeat the same exercise of constructing the tree
by the gain ratio criteria as the attribute selection
measure
 notice that TID has the highest gain ratio
 do not split by TID
20
Other Attribute Selection Measures
- Gini index (CART, IBM IntelligentMiner)
  - All attributes are assumed continuous-valued
  - Assume there exist several possible split values for each attribute
  - May need other tools, such as clustering, to get the possible split values
  - Can be modified for categorical attributes
21
Gini Index (IBM IntelligentMiner)
- If a data set T contains examples from n classes, the gini index gini(T) is defined as

    gini(T) = 1 - Σ_{j=1..n} pj²

  where pj is the relative frequency of class j in T.
- If T is split into two subsets T1 and T2 with sizes N1 and N2 respectively, the gini index of the split data is defined as

    gini_split(T) = (N1/N)·gini(T1) + (N2/N)·gini(T2)

- The attribute that provides the smallest gini_split(T) is chosen to split the node (need to enumerate all possible splitting points for each attribute).
22
Gini index (CART) Example
- Example: D has 9 tuples with buys_computer = "yes" and 5 with "no"

    gini(D) = 1 - (9/14)² - (5/14)² = 0.459

- Suppose the attribute income partitions D into D1: {low, medium} with 10 tuples and D2: {high} with 4 tuples

    gini_{income ∈ {low,medium}}(D) = (10/14)·gini(D1) + (4/14)·gini(D2) = 0.443

- The other binary splits on income give:
  - D1: {medium, high}, D2: {low}:   gini index 0.450
  - D1: {low, high},    D2: {medium}: gini index 0.458
- The lowest gini index is obtained for D1: {low, medium}, D2: {high}, so this is the best binary split on income
23
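A sketch of these gini computations in Python (reusing `data` and `ATTRS` from the information-gain sketch above):

def gini(rows):
    """gini(T) = 1 - sum_j pj^2 over the class frequencies in `rows`."""
    n = len(rows)
    counts = {}
    for r in rows:
        counts[r[-1]] = counts.get(r[-1], 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def gini_split(rows, idx, left_values):
    """Binary split: rows whose attribute value is in `left_values` form D1."""
    d1 = [r for r in rows if r[idx] in left_values]
    d2 = [r for r in rows if r[idx] not in left_values]
    n = len(rows)
    return len(d1) / n * gini(d1) + len(d2) / n * gini(d2)

print(round(gini(data), 3))                                   # 0.459
for subset in ({"low", "medium"}, {"medium", "high"}, {"low", "high"}):
    print(subset, round(gini_split(data, ATTRS["income"], subset), 3))
# {'low','medium'} 0.443   {'medium','high'} 0.45   {'low','high'} 0.458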
Extracting Classification Rules from Trees
- Represent the knowledge in the form of IF-THEN rules
- One rule is created for each path from the root to a leaf
- Each attribute-value pair along a path forms a conjunction
- The leaf node holds the class prediction
- Rules are easier for humans to understand
- Example:
    IF age = "<=30" AND student = "no" THEN buys_computer = "no"
    IF age = "<=30" AND student = "yes" THEN buys_computer = "yes"
    IF age = "31…40" THEN buys_computer = "yes"
    IF age = ">40" AND credit_rating = "excellent" THEN buys_computer = "no"
    IF age = ">40" AND credit_rating = "fair" THEN buys_computer = "yes"
24
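A small sketch of this path-to-rule idea; the nested dict below is a hypothetical encoding of the buys_computer tree shown earlier (not an API from any library):

tree = {"age": {
    "<=30":   {"student": {"no": "no", "yes": "yes"}},
    "31..40": "yes",
    ">40":    {"credit_rating": {"excellent": "no", "fair": "yes"}},
}}

def extract_rules(node, conditions=()):
    """One IF-THEN rule per root-to-leaf path; the conditions form a conjunction."""
    if not isinstance(node, dict):                   # leaf: class prediction
        cond = " AND ".join(f'{a} = "{v}"' for a, v in conditions)
        return [f'IF {cond} THEN buys_computer = "{node}"']
    (attr, branches), = node.items()
    rules = []
    for value, child in branches.items():
        rules += extract_rules(child, conditions + ((attr, value),))
    return rules

for r in extract_rules(tree):
    print(r)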
Approaches to Determine the Final Tree Size
- Separate training (2/3) and testing (1/3) sets
- Use cross-validation, e.g., 10-fold cross-validation
- Use all the data for training
  - but apply a statistical test (e.g., chi-square) to estimate whether expanding or pruning a node is likely to improve performance over the entire distribution
- Use the minimum description length (MDL) principle
  - halt growth of the tree when the encoding is minimized
25
Enhancements to Basic Decision Tree Induction
- Allow for continuous-valued attributes
  - Dynamically define new discrete-valued attributes that partition the continuous attribute values into a discrete set of intervals
- Handle missing attribute values
  - Assign the most common value of the attribute
  - Assign a probability to each of the possible values
- Attribute construction
  - Create new attributes based on existing ones that are sparsely represented
  - This reduces fragmentation, repetition, and replication
26
Missing Values
- T cases for attribute A; in F of these cases the value of A is unknown
- gain = (1 - F/T)·(info(T) - entropy(T, A)) + (F/T)·0
- For split info, add another branch for the cases whose values are unknown
27
Missing Values
- When a case has a known attribute value, it is assigned to subset Ti with probability 1
- If the attribute value is missing, the case is assigned to every subset Ti with a probability
  - give it a weight w: the probability that the case belongs to subset Ti
  - do that for each subset Ti
- Then the number of cases in each Ti may have a fractional value
28
Example
- One of the training cases has a missing age:
    T: (age = ?, income = medium, student = no, credit = excellent, class = buy)
- gain(age) is computed over the 13 cases with known age and then scaled:
    info(8,5) = -(8/13)·log2(8/13) - (5/13)·log2(5/13) = 0.961
    ent(age) = (5/13)·[-(2/5)·log2(2/5) - (3/5)·log2(3/5)]
             + (3/13)·[-(3/3)·log2(3/3) - 0]
             + (5/13)·[-(3/5)·log2(3/5) - (2/5)·log2(2/5)]
             = 0.747
    gain(age) = (13/14)·(0.961 - 0.747) = 0.199
29






- split info(age) is computed over all 14 cases, with "missing" as an extra branch:
    split info(age) = -(5/14)·log2(5/14)      [<=30]
                      -(3/14)·log2(3/14)      [31..40]
                      -(5/14)·log2(5/14)      [>40]
                      -(1/14)·log2(1/14)      [missing]
                    = 1.809
- gain ratio(age) = 0.199 / 1.809 = 0.110
30









- After splitting by age, the age <= 30 branch contains:

    age     student   class   weight
    <=30    yes       buy     1
    <=30    no        no      1
    <=30    no        no      1
    <=30    no        no      1
    <=30    yes       buy     1
    <=30    no        buy     5/13   (the case with missing age)

- The resulting branches, with fractional case counts:
    age <= 30  -> student?   no: 3 no + 5/13 buy     yes: 2 buy
    age 31..40 -> 3 + 3/13 buy
    age > 40   -> credit?    fair: 3 buy     excellent: 2 no + 5/13 buy
31










- What happens if a new case has to be classified?
    T: (age < 30, income = medium, student = ?, credit = fair, class = to be found)
- Based on age it goes to the first subtree, but student is unknown
  - with probability 2.0/5.4 it is a student
  - with probability 3.4/5.4 it is not a student
  - (5/13 is approximately 0.4, so the branch totals are 2.0 and 3.4 out of 5.4)
- P(buy)   = P(buy|stu)·P(stu) + P(buy|nostu)·P(nostu)
           = 1·(2.0/5.4) + (0.4/3.4)·(3.4/5.4) = 0.44
- P(nobuy) = P(nobuy|stu)·P(stu) + P(nobuy|nostu)·P(nostu)
           = 0·(2.0/5.4) + (3/3.4)·(3.4/5.4) = 0.56
32
Continuous variables
- Income is continuous
- Make a binary split
- Try all possible splitting points
- Compute entropy and gain in the same way
- But income can be used again as a test variable in any subtree
  - a single binary split does not use all the information in income
33
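A minimal sketch of this binary-split search for a numeric attribute, reusing `info()` from the information-gain sketch; the toy rows below are hypothetical, since income in the example table is coded categorically.

def best_numeric_split(rows, idx):
    """Try each midpoint between sorted distinct values; return (threshold, gain)."""
    values = sorted(set(r[idx] for r in rows))
    base = info(rows)
    best_t, best_gain = None, 0.0
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2.0
        left = [r for r in rows if r[idx] <= t]
        right = [r for r in rows if r[idx] > t]
        g = base - len(left) / len(rows) * info(left) - len(right) / len(rows) * info(right)
        if g > best_gain:
            best_t, best_gain = t, g
    return best_t, best_gain

rows = [(25, "no"), (32, "yes"), (41, "yes"), (38, "yes"), (29, "no"), (52, "no")]
print(best_numeric_split(rows, 0))   # (30.5, ~0.459) for this toy data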
Avoid Overfitting in Classification
- Overfitting: an induced tree may overfit the training data
  - too many branches, some of which may reflect anomalies due to noise or outliers
  - poor accuracy for unseen samples
- Two approaches to avoid overfitting
  - Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold
    - difficult to choose an appropriate threshold
  - Postpruning: remove branches from a "fully grown" tree, obtaining a sequence of progressively pruned trees
    - use a set of data different from the training data to decide which is the "best pruned tree"
34
Motivation for pruning
- Consider a trivial tree for two classes with probability p: no and 1-p: yes (conditioned on a given set of attribute values)
- (1) Assign each case to the majority class (no):
  - error(1) = 1 - p
- (2) Assign to "no" with probability p and to "yes" with probability 1-p:
  - error(2) = p·(1-p) + (1-p)·p = 2p(1-p) > 1-p for p > 0.5
- So the simple (majority) tree has a lower classification error
35




- If the error rate of a subtree is higher than the error obtained by replacing the subtree with its most frequent leaf or branch, prune the subtree
- How to estimate the prediction error?
  - do not use the training samples: pruning always increases the error on the training sample
  - estimate the error on a separate test set: cost-complexity or reduced-error pruning
36
Pessimistic error estimates
- Based on the training set only
- A subtree covers N cases, of which E are misclassified
- Error based on the training set: f = E/N
  - but this is not the true error
  - treat it as a sample from a population
  - estimate an upper bound on the population error based on a confidence limit
- Make a normal approximation to the binomial distribution
37






- Given a confidence level c (the default value is 25% in C4.5), find the confidence limit z such that
    P( (f - e) / sqrt(e(1-e)/N) > z ) = c
  where
  - N: number of samples
  - f = E/N: observed error rate
  - e: true error rate
- The upper confidence limit is used as a pessimistic estimate of the true but unknown error rate
- A first approximation to the confidence interval for the error:
    f ± z·sqrt(f(1-f)/N)
38





- Solving the above for e gives the upper limit

    e = ( f + z²/2N + z·sqrt( f/N - f²/N + z²/4N² ) ) / ( 1 + z²/N )

- z is the number of standard deviations corresponding to the confidence level c
  - for c = 0.25, z = 0.69
- Refer to Figure 6.2 on page 166 of Witten & Frank (WF)
39
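A small sketch of this pessimistic (upper-confidence-limit) estimate in Python; the two printed values correspond to nodes a and b of the worked example on the following slides.

from math import sqrt

def pessimistic_error(f, N, z=0.69):
    """Upper confidence limit on the true error rate (normal approximation to the
    binomial); z = 0.69 corresponds to C4.5's default 25% confidence level."""
    num = f + z * z / (2 * N) + z * sqrt(f / N - f * f / N + z * z / (4 * N * N))
    return num / (1 + z * z / N)

print(round(pessimistic_error(2 / 6, 6), 2))   # ~0.47 (f = 2/6, N = 6)
print(round(pessimistic_error(1 / 2, 2), 2))   # ~0.72 (f = 1/2, N = 2)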
Example
- Labour negotiation data
  - dependent variable (the class to be predicted): acceptability of contract: good or bad
  - independent variables:
    - duration
    - wage increase in 1st year: <=2.5%, >2.5%
    - working hours per week: <=36, >36
    - health plan contribution: none, half, full
- Figure 6.2 of WF shows a branch of a decision tree
40
The branch in question:

  wage increase 1st year?
    <=2.5:  working hours per week?
              <=36:  1 bad, 1 good
              >36:   health plan contribution?
                       none:  4 bad, 2 good   (node a)
                       half:  1 bad, 1 good   (node b)
                       full:  4 bad, 2 good   (node c)
    >2.5:   (other subtree, not shown)
41

- For node a: E = 2, N = 6, so f = 2/6 = 0.33
  - plugging into the formula, the upper confidence limit is e = 0.47
  - use 0.47 as a pessimistic estimate of the error, rather than the training error of 0.33
- For node b: E = 1, N = 2, so f = 1/2 = 0.50
  - plugging into the formula, the upper confidence limit is e = 0.72
- For node c: f = 2/6 = 0.33, and again e = 0.47
- Weighted average error of the children = (6·0.47 + 2·0.72 + 6·0.47)/14 = 0.51
- The error estimate for the parent (health plan contribution) is f = 5/14, giving e = 0.46 < 0.51
  - so prune the subtree (replace it with a leaf)
- Now the working hours per week node has two branches
42
After pruning, the subtree rooted at working hours per week is:

  working hours per week?
    <=36:  1 bad, 1 good        (e = 0.72)
    >36:   bad leaf, 14 cases   (e = 0.46)

- average pessimistic error of the two children = (2·0.72 + 14·0.46)/16
- pessimistic error of the pruned tree (a single leaf): f = 6/16, ep = ?
- Exercise: calculate the pessimistic error ep and decide whether to prune, based on ep
43
Extracting Classification Rules from Trees
- Represent the knowledge in the form of IF-THEN rules
- One rule is created for each path from the root to a leaf
- Each attribute-value pair along a path forms a conjunction
- The leaf node holds the class prediction
- Rules are easier for humans to understand
- Example:
    IF age = "<=30" AND student = "no" THEN buys_computer = "no"
    IF age = "<=30" AND student = "yes" THEN buys_computer = "yes"
    IF age = "31…40" THEN buys_computer = "yes"
    IF age = ">40" AND credit_rating = "excellent" THEN buys_computer = "no"
    IF age = ">40" AND credit_rating = "fair" THEN buys_computer = "yes"
44









- Rule R:  IF A THEN class C
- Rule R-: IF A- THEN class C, where A- is obtained by deleting condition X from A
- Make a table of the training cases covered:

                       class C    not class C
    satisfy X            Y1           E1
    do not satisfy X     Y2           E2

- Y1 + E1 cases satisfy R (Y1 correctly, E1 misclassified)
- Y2 + E2 cases are satisfied by R- but not by R
45





- The total number of cases covered by R- is Y1 + Y2 + E1 + E2 (some satisfy X, some do not)
- Use a pessimistic estimate of the true error for each rule, i.e. the upper confidence limit Upp_c(E, N)
- For rule R estimate Upp_c(E1, Y1 + E1), and for R- estimate Upp_c(E1 + E2, Y1 + E1 + Y2 + E2)
- If the pessimistic error rate of R- is lower than that of R, delete condition X
46





- Suppose a rule has n conditions
- Delete the conditions one by one and repeat:
  - compare each candidate R- with the pessimistic error of the original rule
  - if the minimum over the R- candidates is below that of R, delete the corresponding condition X
- until there is no further improvement in the pessimistic error
- Study the example on pages 49-50 of Quinlan (1993)
47
AnswerTree
- Variables
  - measurement levels
  - case weights
  - frequency variables
- Growing methods
- Stopping rules
- Tree parameters
  - costs, prior probabilities, scores, and profits
- Gain summary
- Accuracy of the tree
- Cost-complexity pruning
48
Variables
- Categorical variables
  - nominal or ordinal
- Continuous variables
- All growing methods accept all types of variables
  - QUEST requires that the target variable be nominal
- Target and predictor variables
  - target variable (dependent variable)
  - predictors (independent variables)
- Case weight and frequency variables
49
Case weight and frequency variables
- CASE WEIGHT VARIABLES
  - give unequal treatment to the cases
  - Example: direct marketing
    - 10,000 households respond and 1,000,000 do not respond
    - keep all responders but only 1% of the nonresponders (10,000)
    - case weight 1 for responders and case weight 100 for nonresponders
- FREQUENCY VARIABLES
  - count of a record representing more than one individual
50
Tree-Growing Methods
- CHAID
  - Chi-squared Automatic Interaction Detector
  - Kass (1980)
- Exhaustive CHAID
  - Biggs, de Ville, Suen (1991)
- C&RT
  - Classification and Regression Trees
  - Breiman, Friedman, Olshen, and Stone (1984)
- QUEST
  - Quick, Unbiased, Efficient Statistical Tree
  - Loh, Shih (1997)
51
CHAID
- Evaluates all values of a potential predictor
- Merges values that are judged to be statistically homogeneous with respect to the target variable
  - continuous target: F test
  - categorical target: chi-square test
- Not binary: can produce more than two branches at any particular level
  - wider trees than the binary methods
  - works for all types of variables
  - accepts case weights and frequency variables
  - treats missing values as a single valid category
52
CHAID Algorithm
- (1) For each predictor X:
  - find the pair of categories of X that is least significantly different (has the largest p-value) with respect to the target variable Y
    - Y continuous: use the F test
    - Y nominal: form a two-way cross-tabulation with the categories of X as rows and the categories of Y as columns, and use the chi-square test
- (2) For the pair of categories of X with the largest p-value:
  - if p > α_merge (a pre-specified merge level), merge this pair into a single compound category and go to step (1)
  - if p <= α_merge, go to step (3)
53
CHAID Algorithm (cont)
- (3) Compute adjusted p-values
- (4) Select the predictor X that has the smallest adjusted p-value (most significant) and compare its p-value to α_split (a pre-specified splitting level)
  - p <= α_split: split the node based on the categories of X
  - p > α_split: do not split the node (terminal node)
- Continue the tree-growing process until the stopping rules are met
54
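A minimal sketch of the chi-square computations behind steps (1)-(2), assuming SciPy is available (not part of the slides); the table is the X1-by-Y cross-tabulation from the example that follows.

from scipy.stats import chi2_contingency

table = [[23, 5, 19,  4],    # X1 = 1
         [12, 2, 15, 13],    # X1 = 2
         [ 0, 1,  0,  1],    # X1 = 3
         [ 0, 0,  1,  4]]    # X1 = 4

chi2, p, dof, _ = chi2_contingency(table, correction=False)
print(round(chi2, 2), dof, round(p, 4))      # ~25.6, 9, ~0.0023 (cf. the slide)

# Each pairwise merge test uses the corresponding two rows, e.g. X1 = 2 vs X1 = 4:
chi2, p, dof, _ = chi2_contingency([table[1], table[3]], correction=False)
print(round(chi2, 2), dof, round(p, 3))      # ~4.96, 3, ~0.175
# The pair with the largest p-value is merged first when p > alpha_merge
# (here that is X1 = 3 vs X1 = 4, p ~ 0.21 in the tables below).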
CHAID Example

Y (the output) has 4 categories: 1, 2, 3, 4. X1 and X2 are the inputs; X1 takes values 1, 2, 3, 4 and X2 takes values 0, 1, 2.

Distribution of Y:

  category   count
  1           35
  2            8
  3           35
  4           22
  total      100

Cross-tabulation of X1 by Y:

  X1 \ Y       1    2    3    4   row total
  1           23    5   19    4      51
  2           12    2   15   13      42
  3            0    1    0    1       2
  4            0    0    1    4       5
  col total   35    8   35   22     100

  Chi² = 25.63, df = 9 (p = 0.002342)
55
Pairwise chi-square tests for the categories of X1 (each test uses the corresponding two rows of the cross-tabulation above):

  categories of X1    Chi²    df   p
  {1, 2}              9.19     3   0.002682
  {1, 3}              8.01     3   0.045614
  {1, 4}             19.72     3   0.000193
  {2, 3}              7.23     3   0.064814
  {2, 4}              4.96     3   0.174565
  {3, 4}              3.08     2   0.214381

The pair {3, 4} is the least significantly different (largest p-value), so categories 3 and 4 of X1 are merged.
56
After merging categories 3 and 4 of X1 into the compound category (3,4), the tests are repeated:

  categories of X1    Chi²    df   p
  {1, (3,4)}          20.25    3   0.000150
  {2, (3,4)}           6.40    3   0.093339
  {1, (2,3,4)}        13.08    3   0.004448

Category 2 has the largest remaining p-value and is merged with (3,4); the final comparison {1} vs {2,3,4} is significant, so merging stops there.
57
The same procedure for predictor X2 (categories 0, 1, 2):

Cross-tabulation of X2 by Y:

  X2 \ Y       1    2    3    4   row total
  0           33    8   34   22      97
  1            1    0    1    0       2
  2            1    0    0    0       1
  col total   35    8   35   22     100

  Chi² = 2.76, df = 6 (p = 0.837257)

Pairwise merge tests for X2:

  categories of X2    Chi²    df   p
  {1, 2}              0.75     1   0.386476
  {0, 1}              0.88     3   0.828296
  {0, 2}              1.87     3   0.598558
  {(0,1), 2}          1.90     3   0.593045

None of the comparisons for X2 is significant, so X1 shows the stronger association with Y and would be selected for the split.
58
Exhaustive CHAID
- CHAID may not find the optimal split for a variable, since it stops merging categories as soon as all remaining categories are statistically different
- Exhaustive CHAID continues merging until only two super-categories are left
- It then examines the whole series of merges for the predictor
  - finds the set of categories giving the strongest association with the target
  - computes an adjusted p-value for that association
- Best split for each predictor; choose the predictor based on the adjusted p-values
- Otherwise identical to CHAID
  - takes longer to compute
  - safer to use
59
CART
- Binary tree-growing algorithm
  - may not present the data efficiently
- Partitions the data into two subsets
  - the same predictor variable may be used several times at different levels
  - supports misclassification costs
  - supports prior probability distributions
  - computation can take a long time with large data sets
  - surrogate splitting for missing values
60
Impurity measures (1)
- For categorical target variables
  - Gini, twoing, or (for ordinal targets) ordered twoing
- For continuous targets
  - least-squared deviation (LSD)
- Gini at node t, g(t):
    g(t) = Σ_{j≠i} p(j|t)·p(i|t) = 1 - Σ_j p(j|t)²
  - when cases are evenly distributed over k categories, g takes its maximum value of 1 - 1/k
61
Impurity measures (2)
- If misclassification costs are specified:
    g(t) = Σ_{j≠i} C(i|j)·p(j|t)·p(i|t)
  where C(i|j) is the cost of misclassifying a category j case as category i
- Gini criterion function for split s at node t:
    Φ(s, t) = g(t) - pL·g(tL) - pR·g(tR)
  - pL: proportion of cases in t sent to the left child node
  - pR: proportion of cases in t sent to the right child node
- The split s maximizing Φ(s, t) is chosen; this value is reported as the improvement in the tree
62
Impurity measures (3)
- Twoing
  - splits the classes into two superclasses
  - finds the best split on the predictor based on those two superclasses:
      Φ(s, t) = pL·pR·[ Σ_j |p(j|tL) - p(j|tR)| ]²
  - the split s that maximizes this criterion is chosen
  - C1 = { j : p(j|tL) >= p(j|tR) },  C2 = C - C1
  - costs are not taken into account
63
Impurity measures (4)
- Ordered twoing
  - for ordinal target variables
  - only contiguous categories can be combined to form superclasses
- LSD
  - for continuous targets
  - within-node variance for node t:
      R(t) = (1/Nw(t)) · Σ_{i∈t} wi·fi·(yi - ȳ(t))²
    where
    - Nw(t): weighted number of cases in node t
    - wi: value of the weighting variable for case i
    - fi: value of the frequency variable for case i
    - yi: value of the target variable
    - ȳ(t): weighted mean for node t
64



- LSD criterion function:
    Φ(s, t) = R(t) - pL·R(tL) - pR·R(tR)
- This value, weighted by the proportion of all cases in node t, is reported as the improvement in the tree
65
Steps in the CART analysis
- At the root node (t = 1), search for the split s* giving the largest decrease in impurity:
    Φ(s*, 1) = max_{s∈S} Φ(s, 1)
- Split node 1 into t = 2 and t = 3 using s*
- Repeat the split-searching process in t = 2 and t = 3
- Continue until one of the stopping rules is met
66
Stopping rules
- All cases have identical values for all predictors
- The node becomes pure: all cases have the same value of the target variable
- The depth of the tree has reached its pre-specified maximum value
- The number of cases in a node is less than a pre-specified minimum parent node size
- The split at a node would produce a child node with fewer cases than a pre-specified minimum child node size
- For CART only: the maximum decrease in impurity is less than a pre-specified value
67
QUEST
- Performs variable selection and split-point selection separately
- Computationally more efficient than CART
68
Classification in Large Databases
- Classification: a classical problem extensively studied by statisticians and machine learning researchers
- Scalability: classifying data sets with millions of examples and hundreds of attributes with reasonable speed
- Why decision tree induction in data mining?
  - relatively fast learning speed (compared to other classification methods)
  - convertible to simple and easy to understand classification rules
  - can use SQL queries for accessing databases
  - comparable classification accuracy with other methods
69
Scalable Decision Tree Induction Methods in Data Mining Studies
- SLIQ (EDBT'96, Mehta et al.)
  - builds an index for each attribute; only the class list and the current attribute list reside in memory
- SPRINT (VLDB'96, J. Shafer et al.)
  - constructs an attribute-list data structure
- PUBLIC (VLDB'98, Rastogi & Shim)
  - integrates tree splitting and tree pruning: stops growing the tree earlier
- RainForest (VLDB'98, Gehrke, Ramakrishnan & Ganti)
  - separates the scalability aspects from the criteria that determine the quality of the tree
  - builds an AVC-list (attribute, value, class label)
70
Data Cube-Based Decision-Tree Induction
- Integration of generalization with decision-tree induction (Kamber et al., 1997)
- Classification at primitive concept levels
  - e.g., precise temperature, humidity, outlook, etc.
  - low-level concepts, scattered classes, bushy classification trees
  - semantic interpretation problems
- Cube-based multi-level classification
  - relevance analysis at multiple levels
  - information-gain analysis with dimension + level
71
Chapter 7. Classification and Prediction
- What is classification? What is prediction?
- Issues regarding classification and prediction
- Classification by decision tree induction
- Bayesian Classification
- Classification by Neural Networks
- Classification by Support Vector Machines (SVM)
- Classification based on concepts from association rule mining
- Other Classification Methods
- Prediction
- Classification accuracy
- Summary
72
Bayesian Classification: Why?
- Probabilistic learning: calculate explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems
- Incremental: each training example can incrementally increase/decrease the probability that a hypothesis is correct; prior knowledge can be combined with observed data
- Probabilistic prediction: predict multiple hypotheses, weighted by their probabilities
- Standard: even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured
73
Bayesian Theorem: Basics
- Let X be a data sample whose class label is unknown
- Let H be the hypothesis that X belongs to class C
- For classification problems, determine P(H|X): the probability that the hypothesis holds given the observed data sample X
- P(H): prior probability of hypothesis H (the initial probability before we observe any data; reflects background knowledge)
- P(X): probability that the sample data is observed
- P(X|H): probability of observing sample X, given that the hypothesis holds
74
Bayesian Theorem
- Given training data X, the posterior probability of a hypothesis H, P(H|X), follows Bayes' theorem:

    P(H|X) = P(X|H)·P(H) / P(X)

- Informally: posterior = likelihood x prior / evidence
- MAP (maximum a posteriori) hypothesis:

    h_MAP = argmax_{h∈H} P(h|D) = argmax_{h∈H} P(D|h)·P(h)

- Practical difficulty: requires initial knowledge of many probabilities, and significant computational cost
75
Naïve Bayes Classifier
- A simplifying assumption: attributes are conditionally independent given the class:

    P(X|Ci) = Π_{k=1..n} P(xk|Ci)

- The probability of observing, say, two elements x1 and x2 together, given class Ci, is the product of the individual probabilities: P([x1, x2]|Ci) = P(x1|Ci)·P(x2|Ci)
- No dependence relation between attributes is modeled
- Greatly reduces the computation cost: only class distributions need to be counted
- Once P(X|Ci) is known, assign X to the class with maximum P(X|Ci)·P(Ci)
76
Example
- H: the hypothesis that X is an apple
- P(H): prior probability that X is an apple
- X: observed data, e.g. round and red
- P(H|X): probability that X is an apple, given that we observe it is red and round
- P(X|H): probability that a sample is red and round, given that it is an apple
- P(X): prior probability that a sample is red and round
77
Applying Bayesian Theorem
- P(H|X) = P(H, X) / P(X), by the definition of conditional probability
- Similarly, P(X|H) = P(H, X) / P(H), so P(H, X) = P(X|H)·P(H)
- Hence P(H|X) = P(X|H)·P(H) / P(X)
- So P(H|X) can be calculated from P(X|H), P(H), and P(X)
78
Bayesian classification
- The classification problem may be formalized using a-posteriori probabilities:
  - P(Ci|X) = probability that the sample tuple X = <x1, …, xk> is of class Ci
  - there are m classes Ci, i = 1, …, m
  - e.g. P(class = N | outlook = sunny, windy = true, …)
- Idea: assign to sample X the class label Ci such that P(Ci|X) is maximal:
    P(Ci|X) > P(Cj|X) for all 1 <= j <= m, j ≠ i
81
Estimating a-posteriori probabilities
- Bayes theorem: P(Ci|X) = P(X|Ci)·P(Ci) / P(X)
- P(X) is constant for all classes
- P(Ci) = relative frequency of class Ci samples
- The Ci that maximizes P(Ci|X) is the Ci that maximizes P(X|Ci)·P(Ci)
- Problem: computing P(X|Ci) directly is infeasible!
82
Naïve Bayesian Classification
- Naive assumption: attribute independence
    P(x1, …, xk|Ci) = P(x1|Ci) · … · P(xk|Ci)
- If the i-th attribute is categorical:
  - P(xi|Ci) is estimated as the relative frequency of samples having value xi for the i-th attribute in class Ci, i.e. sik/si
- If the i-th attribute is continuous:
  - P(xi|Ci) is estimated via a Gaussian density function
- Computationally easy in both cases
83
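A minimal sketch of the Gaussian estimate for a continuous attribute (the numbers below are hypothetical, not from the slides):

from math import exp, pi, sqrt

def gaussian_likelihood(x, values_in_class):
    """Estimate P(x | Ci) for a continuous attribute via a normal density whose
    mean and variance are taken from the class-Ci training values."""
    n = len(values_in_class)
    mu = sum(values_in_class) / n
    var = sum((v - mu) ** 2 for v in values_in_class) / (n - 1)
    return exp(-(x - mu) ** 2 / (2 * var)) / sqrt(2 * pi * var)

# e.g. hypothetical incomes (in $1000) of the "yes" class:
print(gaussian_likelihood(45.0, [40.0, 52.0, 38.0, 60.0, 47.0]))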
Training dataset
- Classes:
  - C1: buys_computer = "yes"
  - C2: buys_computer = "no"
- Data sample to classify:
    X = (age <= 30, income = medium, student = yes, credit_rating = fair)
- Training data: the same 14-sample buys_computer table used for decision tree induction (see the Training Dataset slide)
84
Solution
Given the new customer X = (age <= 30, income = medium, student = yes, credit_rating = fair), what is the probability of buying a computer?
Compute
  P(buys_computer = yes | X) and
  P(buys_computer = no | X)
Decision: report the two probabilities, or choose the class with the maximum conditional probability.
85
Compute
  P(buys_computer = yes | X) = P(X|yes)·P(yes) / P(X)
  P(buys_computer = no | X)  = P(X|no)·P(no) / P(X)
Drop P(X), which is the same for both classes.
Decision: the maximum of
  P(X|yes)·P(yes)
  P(X|no)·P(no)
86
Naïve Bayesian Classifier: Example
Compute P(X|Ci) for each class:
  P(X|C = yes)·P(yes) =
    P(age = "<=30" | buys_computer = "yes") ·
    P(income = "medium" | buys_computer = "yes") ·
    P(student = "yes" | buys_computer = "yes") ·
    P(credit_rating = "fair" | buys_computer = "yes") ·
    P(C = yes)
87

  P(X|C = no)·P(no) =
    P(age = "<=30" | buys_computer = "no") ·
    P(income = "medium" | buys_computer = "no") ·
    P(student = "yes" | buys_computer = "no") ·
    P(credit_rating = "fair" | buys_computer = "no") ·
    P(C = no)
88
Naïve Bayesian Classifier: Example
P(age = "<=30" | buys_computer = "yes") = 2/9 = 0.222
P(income = "medium" | buys_computer = "yes") = 4/9 = 0.444
P(student = "yes" | buys_computer = "yes") = 6/9 = 0.667
P(credit_rating = "fair" | buys_computer = "yes") = 6/9 = 0.667
P(age = "<=30" | buys_computer = "no") = 3/5 = 0.6
P(income = "medium" | buys_computer = "no") = 2/5 = 0.4
P(student = "yes" | buys_computer = "no") = 1/5 = 0.2
P(credit_rating = "fair" | buys_computer = "no") = 2/5 = 0.4
P(buys_computer = "yes") = 9/14 = 0.643
P(buys_computer = "no") = 5/14 = 0.357
89
P(X|buys_computer = "yes") = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X|buys_computer = "yes") x P(buys_computer = "yes") = 0.044 x 0.643 = 0.028
P(X|buys_computer = "no") = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
P(X|buys_computer = "no") x P(buys_computer = "no") = 0.019 x 0.357 = 0.007
X belongs to class "buys_computer = yes"
90
Summary of calculations
Compute P(X|Ci) for each class:
  P(age = "<=30" | buys_computer = "yes") = 2/9 = 0.222
  P(age = "<=30" | buys_computer = "no") = 3/5 = 0.6
  P(income = "medium" | buys_computer = "yes") = 4/9 = 0.444
  P(income = "medium" | buys_computer = "no") = 2/5 = 0.4
  P(student = "yes" | buys_computer = "yes") = 6/9 = 0.667
  P(student = "yes" | buys_computer = "no") = 1/5 = 0.2
  P(credit_rating = "fair" | buys_computer = "yes") = 6/9 = 0.667
  P(credit_rating = "fair" | buys_computer = "no") = 2/5 = 0.4

X = (age <= 30, income = medium, student = yes, credit_rating = fair)

P(X|Ci):
  P(X|buys_computer = "yes") = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
  P(X|buys_computer = "no") = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
P(X|Ci) x P(Ci):
  P(X|buys_computer = "yes") x P(buys_computer = "yes") = 0.044 x 0.643 = 0.028
  P(X|buys_computer = "no") x P(buys_computer = "no") = 0.019 x 0.357 = 0.007

X belongs to class "buys_computer = yes"
91
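A sketch that reproduces this hand calculation, reusing the `data` rows and `ATTRS` mapping from the decision-tree sketches earlier in the chapter:

def naive_bayes_scores(x, rows):
    """Return P(X|Ci)*P(Ci) for each class Ci under the independence assumption."""
    classes = set(r[-1] for r in rows)
    scores = {}
    for c in classes:
        in_class = [r for r in rows if r[-1] == c]
        score = len(in_class) / len(rows)              # prior P(Ci)
        for attr, value in x.items():
            idx = ATTRS[attr]
            match = sum(1 for r in in_class if r[idx] == value)
            score *= match / len(in_class)             # P(xk | Ci)
        scores[c] = score
    return scores

x = {"age": "<=30", "income": "medium", "student": "yes", "credit_rating": "fair"}
print(naive_bayes_scores(x, data))   # roughly {'yes': 0.028, 'no': 0.007}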
Naïve Bayesian Classifier: Comments
- Advantages:
  - easy to implement
  - good results obtained in most of the cases
- Disadvantages:
  - assumption of class-conditional independence, hence loss of accuracy
  - in practice, dependencies exist among variables
    - e.g., hospital patients: profile (age, family history, etc.), symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.)
    - dependencies among these cannot be modeled by a Naive Bayesian classifier
- How to deal with these dependencies? Bayesian belief networks
92
Bayesian Networks
- A Bayesian belief network allows a subset of the variables to be conditionally independent
- A graphical model of causal relationships
  - represents dependencies among the variables
  - gives a specification of the joint probability distribution
- Example graph with nodes X, Y, Z, P:
  - nodes: random variables
  - links: dependencies
  - X and Y are the parents of Z, and Y is the parent of P
  - there is no dependency between Z and P
  - the graph has no loops or cycles
93
Bayesian Belief Network: An Example
Nodes: FamilyHistory (FH), Smoker (S), LungCancer (LC), Emphysema, PositiveXRay, Dyspnea.

The conditional probability table (CPT) for the variable LungCancer shows the conditional probability for each possible combination of its parents (FH, S):

          (FH, S)   (FH, ~S)   (~FH, S)   (~FH, ~S)
  LC        0.8       0.5        0.7         0.1
  ~LC       0.2       0.5        0.3         0.9

The joint probability of any tuple (z1, …, zn) corresponding to the variables Z1, …, Zn is

  P(z1, …, zn) = Π_{i=1..n} P(zi | Parents(Zi))
94
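A minimal sketch of this factored joint probability for a three-variable fragment of the network; the priors for FamilyHistory and Smoker are hypothetical, while the LungCancer CPT is the one above.

p_fh = 0.10                    # hypothetical P(FamilyHistory = true)
p_s = 0.30                     # hypothetical P(Smoker = true)
p_lc = {                       # CPT for LungCancer, indexed by (FH, S) - from the slide
    (True, True): 0.8, (True, False): 0.5,
    (False, True): 0.7, (False, False): 0.1,
}

def joint(fh, s, lc):
    """P(FH = fh, S = s, LC = lc) = P(fh) * P(s) * P(lc | fh, s)."""
    p = (p_fh if fh else 1 - p_fh) * (p_s if s else 1 - p_s)
    p_lc_given_parents = p_lc[(fh, s)]
    return p * (p_lc_given_parents if lc else 1 - p_lc_given_parents)

print(round(joint(fh=False, s=True, lc=True), 3))   # 0.9 * 0.3 * 0.7 = 0.189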
Learning Bayesian Networks
- Several cases:
  - Given both the network structure and all variables observable: learn only the CPTs
  - Network structure known, some variables hidden: use gradient descent, analogous to neural network learning
  - Network structure unknown, all variables observable: search through the model space to reconstruct the graph topology
  - Unknown structure, all variables hidden: no good algorithms are known for this purpose
- Reference: D. Heckerman, Bayesian Networks for Data Mining
95