Data Warehousing and
Data Mining
— Chapter 7 —
MIS 542
2013-2014 Fall
1
Chapter 7. Classification and Prediction
What is classification? What is prediction?
Issues regarding classification and prediction
Classification by decision tree induction
Bayesian Classification
Classification by Neural Networks
Classification by Support Vector Machines (SVM)
Classification based on concepts from association rule
mining
Other Classification Methods
Prediction
Classification accuracy
Summary
2
Classification vs. Prediction
Classification:
 predicts categorical class labels (discrete or nominal)
 classifies data (constructs a model) based on the
training set and the values (class labels) in a
classifying attribute and uses it in classifying new data
Prediction:
 models continuous-valued functions, i.e., predicts
unknown or missing values
Typical Applications
 credit approval
 target marketing
 medical diagnosis
 treatment effectiveness analysis
3
Classification—A Two-Step Process
Model construction: describing a set of predetermined classes
 Each tuple/sample is assumed to belong to a predefined class,
as determined by the class label attribute
 The set of tuples used for model construction is training set
 The model is represented as classification rules, decision trees,
or mathematical formulae
Model usage: for classifying future or unknown objects
 Estimate accuracy of the model
 The known label of test sample is compared with the
classified result from the model
 Accuracy rate is the percentage of test set samples that are
correctly classified by the model
 The test set must be independent of the training set; otherwise the
accuracy estimate is overly optimistic and over-fitting goes undetected
 If the accuracy is acceptable, use the model to classify data
tuples whose class labels are not known
4
Classification Process (1): Model
Construction
Training Data -> Classification Algorithms -> Classifier (Model)
Training Data:
NAME   RANK             YEARS   TENURED
Mike   Assistant Prof   3       no
Mary   Assistant Prof   7       yes
Bill   Professor        2       yes
Jim    Associate Prof   7       yes
Dave   Assistant Prof   6       no
Anne   Associate Prof   3       no
Classifier (Model):
IF rank = ‘professor’
OR years > 6
THEN tenured = ‘yes’
5
Classification Process (2): Use the
Model in Prediction
Testing Data -> Classifier
Testing Data:
NAME      RANK             YEARS   TENURED
Tom       Assistant Prof   2       no
Merlisa   Associate Prof   7       no
George    Professor        5       yes
Joseph    Assistant Prof   7       yes
Unseen Data -> Classifier
(Jeff, Professor, 4)
Tenured?
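The two steps can be tried out in a few lines. The following is only a sketch: it hard-codes the rule and the testing data from the slides, and the function name `predict_tenured` is an illustrative choice, not part of the original material.

```python
# Rule from the model-construction slide:
# IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
def predict_tenured(rank, years):
    return "yes" if rank == "Professor" or years > 6 else "no"

# Testing data from the slide: (name, rank, years, known label)
test_set = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]

# Accuracy rate: share of test samples whose known label matches the prediction
correct = sum(predict_tenured(rank, years) == label
              for _, rank, years, label in test_set)
accuracy = correct / len(test_set)
print(f"accuracy on the test set: {accuracy:.2f}")

# Classify the unseen tuple (Jeff, Professor, 4)
print("Jeff tenured?", predict_tenured("Professor", 4))
```

Merlisa is the one test tuple the rule gets wrong (7 years, but not tenured), which is why the estimated accuracy is below 100%.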
6
Supervised vs. Unsupervised
Learning
Supervised learning (classification)
Supervision: The training data (observations,
measurements, etc.) are accompanied by labels
indicating the class of the observations
New data is classified based on the training set
Unsupervised learning (clustering)
The class labels of the training data are unknown
Given a set of measurements, observations, etc. with
the aim of establishing the existence of classes or
clusters in the data
7
Issues Regarding Classification and Prediction
(1): Data Preparation
Data cleaning
Preprocess data in order to reduce noise and handle
missing values
Relevance analysis (feature selection)
Remove the irrelevant or redundant attributes
Data transformation
Generalize and/or normalize data
9
Issues regarding classification and prediction
(2): Evaluating Classification Methods
Predictive accuracy
Speed
 time to construct the model
 time to use the model
Robustness
 handling noise and missing values
Scalability
 efficiency in disk-resident databases
Interpretability:
 understanding and insight provided by the model
Goodness of rules
 decision tree size
 compactness of classification rules
10
Classification by Decision Tree
Induction
Decision tree
 A flow-chart-like tree structure
 Internal node denotes a test on an attribute
 Branch represents an outcome of the test
 Leaf nodes represent class labels or class distribution
Decision tree generation consists of two phases
 Tree construction
 At start, all the training examples are at the root
 Partition examples recursively based on selected attributes
 Tree pruning
 Identify and remove branches that reflect noise or outliers
Use of decision tree: Classifying an unknown sample
 Test the attribute values of the sample against the decision tree
12
Training Dataset
This follows an example from Quinlan’s ID3
age      income   student   credit_rating   buys_computer
<=30     high     no        fair            no
<=30     high     no        excellent       no
31…40    high     no        fair            yes
>40      medium   no        fair            yes
>40      low      yes       fair            yes
>40      low      yes       excellent       no
31…40    low      yes       excellent       yes
<=30     medium   no        fair            no
<=30     low      yes       fair            yes
>40      medium   yes       fair            yes
<=30     medium   yes       excellent       yes
31…40    medium   no        excellent       yes
31…40    high     yes       fair            yes
>40      medium   no        excellent       no
13
Output: A Decision Tree for “buys_computer”
age?
 <=30: student?
  no: no
  yes: yes
 31…40: yes
 >40: credit rating?
  excellent: no
  fair: yes
14
Algorithm for Decision Tree Induction
Basic algorithm (a greedy algorithm)
 Tree is constructed in a top-down recursive divide-and-conquer
manner
 At start, all the training examples are at the root
 Attributes are categorical (if continuous-valued, they are
discretized in advance)
 Examples are partitioned recursively based on selected attributes
 Test attributes are selected on the basis of a heuristic or statistical
measure (e.g., information gain)
Conditions for stopping partitioning
 All samples for a given node belong to the same class
 There are no remaining attributes for further partitioning –
majority voting is employed for classifying the leaf
 There are no samples left
15
Attribute Selection Measure:
Information Gain (ID3/C4.5)
Select the attribute with the highest information gain
S contains s_i tuples of class C_i for i = 1, …, m
The information required to classify any arbitrary tuple:
$$I(s_1, s_2, \dots, s_m) = -\sum_{i=1}^{m} \frac{s_i}{s} \log_2 \frac{s_i}{s}$$
Entropy of attribute A with values {a_1, a_2, …, a_v}:
$$E(A) = \sum_{j=1}^{v} \frac{s_{1j} + \dots + s_{mj}}{s} \, I(s_{1j}, \dots, s_{mj})$$
Information gained by branching on attribute A:
$$Gain(A) = I(s_1, s_2, \dots, s_m) - E(A)$$
16
Attribute Selection by Information
Gain Computation
Class P: buys_computer = “yes”
Class N: buys_computer = “no”
I(p, n) = I(9, 5) =0.940
Compute the entropy for age:
age      p_i   n_i   I(p_i, n_i)
<=30     2     3     0.971
31…40    4     0     0
>40      3     2     0.971
$$E(age) = \frac{5}{14} I(2,3) + \frac{4}{14} I(4,0) + \frac{5}{14} I(3,2) = 0.694$$
(5/14) I(2,3) means “age <=30” has 5 out of 14 samples, with 2 yes’es
and 3 no’s. Hence
Gain(age) = I(p, n) - E(age) = 0.246
Similarly,
Gain(income) = 0.029
Gain(student) = 0.151
Gain(credit_rating) = 0.048
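The gains quoted above can be reproduced directly from the 14-tuple training table. This is a minimal sketch; the names `ROWS`, `info` and `gain` are illustrative choices, and the value for age may differ from the slide in the third decimal because the slide subtracts already rounded numbers.

```python
from collections import Counter
from math import log2

# (age, income, student, credit_rating, buys_computer), copied from the training table
ROWS = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]
ATTRS = ["age", "income", "student", "credit_rating"]

def info(labels):
    """I(s1,...,sm): information needed to classify a tuple, in bits."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def gain(rows, col):
    """Gain(A) = I(s1,...,sm) - E(A) for the attribute stored in column col."""
    groups = {}
    for r in rows:
        groups.setdefault(r[col], []).append(r[-1])
    expected = sum(len(g) / len(rows) * info(g) for g in groups.values())
    return info([r[-1] for r in rows]) - expected

gains = {name: gain(ROWS, i) for i, name in enumerate(ATTRS)}
for name, g in gains.items():
    print(f"Gain({name}) = {g:.3f}")
print("test attribute at the root:", max(gains, key=gains.get))
```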
17
Gain Ratio
Add another attribute: a transaction identifier TID
 TID is different for each observation
E(TID) = (1/14)*I(1,0) + (1/14)*I(1,0) + ... + (1/14)*I(1,0) = 0
gain(TID) = 0.940 - 0 = 0.940
 the highest gain, so TID would be chosen as the test attribute
 which makes no sense
Use gain ratio rather than gain
Split information: a measure of the information value of the
split:
 without considering class information
 only the number and size of the child nodes
 A kind of normalization for information gain
18
$$\text{Split information} = -\sum_i \frac{S_i}{S} \log_2 \frac{S_i}{S}$$
 information needed to assign an instance to one of
these branches
Gain ratio = gain(S)/split information(S)
in the previous example
Split info and gain ratio for TID:
 split info(TID) = -[(1/14)log2(1/14)]*14 = 3.807
 gain ratio(TID) = (0.940-0.0)/3.807 = 0.247
Split info for age: I(5,4,5) =
 -(5/14)log2(5/14) - (4/14)log2(4/14)
 - (5/14)log2(5/14) = 1.577
 gain ratio(age) = gain(age)/split info(age)
= 0.246/1.577 = 0.156
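The split information only depends on the branch sizes, so it is easy to check numerically. A small sketch, assuming the gains 0.246 (age) and 0.940 (TID) computed earlier; the function name `split_info` is an illustrative choice.

```python
from math import log2

def split_info(sizes):
    """Split information of a partition, given the sizes of its branches."""
    total = sum(sizes)
    return -sum(s / total * log2(s / total) for s in sizes if s > 0)

gain_age, gain_tid = 0.246, 0.940   # information gains taken from the example

si_age = split_info([5, 4, 5])      # age: <=30, 31..40, >40
si_tid = split_info([1] * 14)       # TID: 14 branches with one case each

ratio_age = gain_age / si_age
ratio_tid = gain_tid / si_tid
print(f"split info(age) = {si_age:.3f}, gain ratio(age) = {ratio_age:.3f}")
print(f"split info(TID) = {si_tid:.3f}, gain ratio(TID) = {ratio_tid:.3f}")
```

The large split information of TID shrinks its ratio considerably, but in this small example the ratio still exceeds that of age, which is why identifier-like attributes are usually excluded by hand.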
19
Exercise
Repeat the same exercise of constructing the tree
by the gain ratio criteria as the attribute selection
measure
 notice that TID has the highest gain ratio
 do not split by TID
20
Other Attribute Selection Measures
Gini index (CART, IBM IntelligentMiner)
All attributes are assumed continuous-valued
Assume there exist several possible split values for
each attribute
May need other tools, such as clustering, to get the
possible split values
Can be modified for categorical attributes
21
Gini Index (IBM IntelligentMiner)
If a data set T contains examples from n classes, the gini index
gini(T) is defined as
$$gini(T) = 1 - \sum_{j=1}^{n} p_j^2$$
where pj is the relative frequency of class j in T.
If a data set T of size N is split into two subsets T1 and T2 with sizes
N1 and N2 respectively, the gini index of the split data is
defined as
$$gini_{split}(T) = \frac{N_1}{N} gini(T_1) + \frac{N_2}{N} gini(T_2)$$
The attribute that provides the smallest ginisplit(T) is chosen to
split the node (need to enumerate all possible splitting
points for each attribute).
22
Gini index (CART) Example
Ex. D has 9 tuples in buys_computer = “yes” and 5 in “no”
$$gini(D) = 1 - \left(\frac{9}{14}\right)^2 - \left(\frac{5}{14}\right)^2 = 0.459$$
Suppose the attribute income partitions D into 10 tuples in D1: {low, medium}
and 4 in D2: {high}
$$gini_{income \in \{low,medium\}}(D) = \frac{10}{14} Gini(D_1) + \frac{4}{14} Gini(D_2) = 0.443$$
The other binary splits on income give
D1:{medium,high}, D2:{low} gini index: 0.450
D1:{low,high}, D2:{medium} gini index: 0.458
The split D1:{low,medium}, D2:{high} has the lowest gini index and is
therefore the best binary split on income
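The gini values for the three binary splits on income can be recomputed from the training table. This is a sketch only; `DATA` holds the (income, buys_computer) columns of the 14 tuples, and `gini` and `gini_split` are illustrative names.

```python
from collections import Counter

# (income, buys_computer) pairs in the order of the training table
DATA = [
    ("high", "no"), ("high", "no"), ("high", "yes"), ("medium", "yes"),
    ("low", "yes"), ("low", "no"), ("low", "yes"), ("medium", "no"),
    ("low", "yes"), ("medium", "yes"), ("medium", "yes"), ("medium", "yes"),
    ("high", "yes"), ("medium", "no"),
]

def gini(labels):
    """gini(T) = 1 - sum_j p_j^2."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(data, left_values):
    """Weighted gini index of the binary split D1 = left_values, D2 = the rest."""
    left = [y for x, y in data if x in left_values]
    right = [y for x, y in data if x not in left_values]
    n = len(data)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

gini_d = gini([y for _, y in DATA])
splits = {
    "{low,medium} | {high}": gini_split(DATA, {"low", "medium"}),
    "{medium,high} | {low}": gini_split(DATA, {"medium", "high"}),
    "{low,high} | {medium}": gini_split(DATA, {"low", "high"}),
}
print(f"gini(D) = {gini_d:.3f}")
for name, value in splits.items():
    print(f"{name}: {value:.3f}")
best = min(splits, key=splits.get)
print("best binary split on income:", best)
```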
23
Extracting Classification Rules from Trees
Represent the knowledge in the form of IF-THEN rules
One rule is created for each path from the root to a leaf
Each attribute-value pair along a path forms a conjunction
The leaf node holds the class prediction
Rules are easier for humans to understand
Example
IF age = “<=30” AND student = “no” THEN buys_computer = “no”
IF age = “<=30” AND student = “yes” THEN buys_computer = “yes”
IF age = “31…40” THEN buys_computer = “yes”
IF age = “>40” AND credit_rating = “excellent” THEN buys_computer = “no”
IF age = “>40” AND credit_rating = “fair” THEN buys_computer = “yes”
24
Approaches to Determine the Final
Tree Size
Separate training (2/3) and testing (1/3) sets
Use cross validation, e.g., 10-fold cross validation
Use all the data for training
but apply a statistical test (e.g., chi-square) to
estimate whether expanding or pruning a node may
improve the entire distribution
Use minimum description length (MDL) principle
halting growth of the tree when the encoding is
minimized
25
Enhancements to basic decision
tree induction
Allow for continuous-valued attributes
Dynamically define new discrete-valued attributes that
partition the continuous attribute value into a discrete
set of intervals
Handle missing attribute values
Assign the most common value of the attribute
Assign probability to each of the possible values
Attribute construction
Create new attributes based on existing ones that are
sparsely represented
This reduces fragmentation, repetition, and replication
26
Missing Values
T cases for attribute A
in F cases the value of A is unknown
gain = (1-F/T)*(info(T)-entropy(T,A)) + (F/T)*0
split info: add another branch for the cases whose
values are unknown
27
Missing Values
When a case has a known attribute value it is
assigned to Ti with probability 1
if the attribute value is missing it is assigned to Ti
with a probability
 give a weight w that the case belongs to subset
Ti
 do that for each subset Ti
Then the number of cases in each Ti has fractional
values
28
Example
One of the training cases has a missing age
T:(age = ?, inc=mid,stu=no,credit=ex,class buy= yes)
gain(age) = (13/14)*[inf(8,5) - ent(age)]
inf(8,5) = -(8/13)*log2(8/13) - (5/13)*log2(5/13)
= 0.961
ent(age) = (5/13)*[-(2/5)*log2(2/5) - (3/5)*log2(3/5)]
+ (3/13)*[-(3/3)*log2(3/3) + 0]
+ (5/13)*[-(3/5)*log2(3/5) - (2/5)*log2(2/5)]
= 0.747
gain(age) = (13/14)*(0.961-0.747) = 0.199
29
split info(age) = -(5/14)log2(5/14)   [<=30]
 - (3/14)log2(3/14)   [31..40]
 - (5/14)log2(5/14)   [>40]
 - (1/14)log2(1/14)   [missing]
= 1.809
gain ratio(age) = 0.199/1.809 = 0.110
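The missing-value adjustment can be checked with the class counts of the 13 cases whose age is known. A minimal sketch under that assumption; `known`, `info` and the other names are illustrative.

```python
from math import log2

def info(counts):
    """Information (in bits) of a class distribution given as counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

# (yes, no) counts per age value among the 13 cases with a known age
known = {"<=30": (2, 3), "31..40": (3, 0), ">40": (3, 2)}
n_missing = 1
n_known = sum(sum(c) for c in known.values())
n_total = n_known + n_missing

yes = sum(c[0] for c in known.values())
no = sum(c[1] for c in known.values())
info_known = info((yes, no))                                  # inf(8,5)
ent_age = sum(sum(c) / n_known * info(c) for c in known.values())

# Gain is computed on the known cases and scaled by the known fraction (1 - F/T)
gain_age = n_known / n_total * (info_known - ent_age)

# Split info treats the missing cases as one extra branch
split_age = info([sum(c) for c in known.values()] + [n_missing])
ratio_age = gain_age / split_age
print(f"gain = {gain_age:.3f}, split info = {split_age:.3f}, gain ratio = {ratio_age:.3f}")
```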
30
after splitting by age
age <=30 branch
age     student   class   weight
<=30    y         B       1
<=30    n         N       1
<=30    n         N       1
<=30    n         N       1
<=30    y         B       1
<=30    n         B       5/13
resulting tree with fractional case counts
age
 <=30: student
  no: 3 N, 5/13 B
  yes: 2 B
 31..40: 3 + 3/13 B
 >40: credit
  fair: 3 B
  excellent: 2 N, 5/13 B
31
What happens if a new case has to be classified?
 T: age<=30, income=mid, stu=?, credit=fair,
class=has to be found
based on age it goes to the first subtree
but student is unknown
 with probability (2.0/5.4) it is a student
 with probability (3.4/5.4) it is not a student
 (5/13 is approximately 0.4)
P(buy) = P(buy|stu)*P(stu) + P(buy|nostu)*P(nostu)
= 1*(2/5.4) + (5/44)*(3.4/5.4)
= 0.44
P(nbuy) = P(nbuy|stu)*P(stu) + P(nbuy|nostu)*P(nostu)
= 0*(2/5.4) + (39/44)*(3.4/5.4)
= 0.56
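The same calculation can be done with exact fractions instead of the rounded 5.4. This is a sketch; the dictionary `leaves` encodes the weighted case counts of the two student leaves under age <=30, and all names are illustrative.

```python
from fractions import Fraction

w = Fraction(5, 13)  # weight of the training case whose age was missing

# Weighted case counts in the leaves below age <= 30
leaves = {
    "student=yes": {"buy": Fraction(2), "no_buy": Fraction(0)},
    "student=no": {"buy": w, "no_buy": Fraction(3)},
}

leaf_totals = {name: sum(counts.values()) for name, counts in leaves.items()}
grand_total = sum(leaf_totals.values())

# P(class) = sum over leaves of P(class | leaf) * P(leaf)
p_buy = sum(counts["buy"] / leaf_totals[name] * leaf_totals[name] / grand_total
            for name, counts in leaves.items())
p_no_buy = 1 - p_buy
print(f"P(buy) = {float(p_buy):.2f}, P(no buy) = {float(p_no_buy):.2f}")
```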
32
Continuous variables
Income is continuous
make a binary split
try all possible splitting points
compute entropy and gain similarly
but you can use income as a test variable in
any subtree
 there is still information in income not used
perfectly
33
Avoid Overfitting in Classification
Overfitting: An induced tree may overfit the training data
 Too many branches, some may reflect anomalies due
to noise or outliers
 Poor accuracy for unseen samples
Two approaches to avoid overfitting
 Prepruning: Halt tree construction early—do not split a
node if this would result in the goodness measure
falling below a threshold
 Difficult to choose an appropriate threshold
 Postpruning: Remove branches from a “fully grown”
tree—get a sequence of progressively pruned trees
 Use a set of data different from the training data to
decide which is the “best pruned tree”
34
Motivation for pruning
A trivial tree with two classes: “no” with probability p and
“yes” with probability 1-p
 conditioned on a given set of attribute values
(1) assign each case to the majority class: no
 error(1): 1-p
(2) assign to no with probability p and to yes with probability 1-p
 error(2): p*(1-p) + (1-p)*p = 2p(1-p) > 1-p
for p>0.5
so the simple tree has the lower classification error
35
If the error rate of a subtree is higher than the
error obtained by replacing the subtree with its most
frequent leaf or branch
 prune the subtree
How to estimate the prediction error?
do not use the training samples
 pruning always increases the error on the training
sample
estimate the error based on a test set
 cost complexity or reduced error pruning
36
Pessimistic error estimates
Based on training set only
subtree covers N cases, E of them misclassified
error based on the training set: f=E/N
 but this is not the true error
 treat it as a sample from the population
 estimate an upper bound on the population error
based on the confidence limit
Make a normal approximation to the binomial
distribution
37
Given a confidence level c (the default value is 25% in
C4.5)
find the confidence limit z such that
$$P\left(\frac{f - e}{\sqrt{e(1-e)/N}} > z\right) = c$$
 N: number of samples
 f = E/N: observed error rate
 e: true error rate
the upper confidence limit is used as a pessimistic
estimate of the true but unknown error rate
first approximation to the confidence interval for the error:
$$f \pm z_{c/2} \sqrt{f(1-f)/N}$$
38
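Solving the above inequality for e gives the upper confidence limit
$$e = \frac{f + \frac{z^2}{2N} + z \sqrt{\frac{f}{N} - \frac{f^2}{N} + \frac{z^2}{4N^2}}}{1 + \frac{z^2}{N}}$$
z is the number of standard deviations corresponding
to a confidence level c
 for c=0.25 z=0.69
refer to Figure 6.2 on page 166 of WF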
39
Example
Labour negotiation data
 dependent variable or class to be predicted:
  acceptability of contract: good or bad
 independent variables:
  duration,
  wage increase 1st year: <2.5%, >2.5%
  working hours per week: <36, >36
  health plan contribution: none, half, full
Figure 6.2 of WF shows a branch of a decision tree
40
wage increase 1st year
 <2.5: working hours per week
  <=36: 1 bad, 1 good
  >36: health plan contribution
   none: 4 bad, 2 good (node a)
   half: 1 bad, 1 good (node b)
   full: 4 bad, 2 good (node c)
 >2.5: (branch not shown)
41
for node a
E=2, N=6 so f=2/6=0.33
plugging into the formula, the upper confidence limit is e=0.47
use 0.47 as a pessimistic estimate of the error
rather than the training error of 0.33
for node b
E=1, N=2 so f=1/2=0.50
plugging into the formula, the upper confidence limit is e=0.72
for node c f=2/6=0.33 and e=0.47
average error = (6*0.47+2*0.72+6*0.47)/14 = 0.51
The error estimate for the parent node (health plan contribution) is
f=5/14 and e=0.46 < 0.51
 so prune the node
now the working hours per week node has two branches
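The upper confidence limit is straightforward to evaluate in code. The sketch below assumes the closed form e = (f + z²/2N + z·sqrt(f/N − f²/N + z²/4N²)) / (1 + z²/N) with z = 0.69; the function name `upper_error` is an illustrative choice. Computed values can differ from the rounded figures on the slides in the second decimal.

```python
from math import sqrt

def upper_error(E, N, z=0.69):
    """Pessimistic (upper confidence limit) error for E errors in N cases."""
    f = E / N
    numerator = f + z * z / (2 * N) + z * sqrt(f / N - f * f / N + z * z / (4 * N * N))
    return numerator / (1 + z * z / N)

# Leaves a, b, c of the health plan contribution node: (errors, cases)
leaf_counts = [(2, 6), (1, 2), (2, 6)]
n_cases = sum(n for _, n in leaf_counts)

# Combine the leaf estimates, weighted by the number of cases they cover
subtree_error = sum(n * upper_error(e, n) for e, n in leaf_counts) / n_cases

# Estimate for the node if it were replaced by a single leaf (5 errors in 14 cases)
parent_error = upper_error(5, 14)

for e, n in leaf_counts:
    print(f"E={e}, N={n}: e = {upper_error(e, n):.2f}")
print(f"subtree: {subtree_error:.2f}, replaced by a leaf: {parent_error:.2f}")
print("prune" if parent_error < subtree_error else "keep")
```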
42
working hours per week
 <=36: 1 bad, 1 good (e=0.72)
 >36: bad (e=0.46)
average pessimistic error = (2*0.72+14*0.46)/16 =
the pessimistic error of the pruned tree:
f = 6/16, ep =
Exercise: calculate the pessimistic error and decide whether to prune
or not based on ep
43
Extracting Classification Rules from Trees
Represent the knowledge in the form of IF-THEN rules
One rule is created for each path from the root to a leaf
Each attribute-value pair along a path forms a conjunction
The leaf node holds the class prediction
Rules are easier for humans to understand
Example
IF age = “<=30” AND student = “no” THEN buys_computer = “no”
IF age = “<=30” AND student = “yes” THEN buys_computer = “yes”
IF age = “31…40” THEN buys_computer = “yes”
IF age = “>40” AND credit_rating = “excellent” THEN buys_computer = “no”
IF age = “>40” AND credit_rating = “fair” THEN buys_computer = “yes”
44
Rule R: A then class C
Rule R-: A- then class C
delete condition X from A to obtain A-
make a table
          class C   not class C
X         Y1        E1
not X     Y2        E2
Y1+E1 cases satisfy R: Y1 correct, E1 misclassified
Y2+E2 cases are satisfied by R- but not by R
45
The total cases by R- is Y1+Y2+E1+E2
 some satisfies X some not
use a pessimistic estimate of the true error for
each rule using the upper limit UppC%y(E,N)
for rule R estimate UppC%y(E1,Y1+E1)
and for R- estimate
UppC%y(E1+E2,Y1+E1+Y2+E2)
if pessimistic error rate of R- < that of R
 delete condition x
46
Suppose a rule has n conditions
delete conditions one by one
repeat
 compare with the pessimistic error of the
original rule
 if min(R-) < R, delete condition x
until there is no improvement in the pessimistic error
Study the example on pages 49-50 of Quinlan 93
47
AnswerTree
Variables
measurement levels
case weights
frequency variables
Growing methods
Stopping rules
Tree parameters
 costs, prior probabilities, scores and profits
Gain summary
Accuracy of tree
Cost-complexity pruning
48
Variables
Categorical Variables
 nominal or ordinal
Continuous Variables
All growing methods accept all types of variables
 QUEST requires that the target variable be nominal
Target and predictor variables
 target variable (dependent variable)
 predictor (independent variables)
Case weight and frequency variables
49
Case weight and frequency variables
CASE WEIGHT VARIABLES
unequal treatment of the cases
Ex: direct marketing
10,000 households respond
and 1,000,000 do not respond
the sample contains all responders but only 1% of the nonresponders (10,000)
case weight 1 for responders and
case weight 100 for nonresponders
FREQUENCY VARIABLES
 count of a record representing more than one
individual
50
Tree-Growing Methods
CHAID
 Chi-squared Automatic Interaction Detector
 Kass (1980)
Exhaustive CHAID
 Biggs, de Ville, Suen (1991)
CR&T
 Classification and Regression Trees
 Breiman, Friedman, Olshen, and Stone (1984)
QUEST
 Quick, Unbiased, Efficient Statistical Tree
 Loh, Shih (1997)
51
CHAID
evaluate all values of a potential predictor
merges values that are judged to be statistically
homogenous
target variable
 continuous F test
 categorical Chi-square test
not binary: produce more than two categories at
any particular level
 wider tree than do the binary methods
 works for all types of variables
 case weights and frequency variables
 missing values as a single valid category
52
CHAID Algorithm
(1) for each predictor X
 find the pair of categories of X that is least significantly different
(has the largest p value) with respect to the target variable Y
 Y continuous: use an F test
 Y nominal: form a two-way cross tabulation with the
categories of X as rows and the categories of Y as columns, and use
the Chi-squared test
(2) for the pair of categories of X with the largest p value
 if p > α_merge (prespecified merge level) then
merge this pair into a single compound category and
go to step (1)
 if p <= α_merge then
go to step (3)
53
CHAID Algorithm (cont)
(3) compute adjusted p values
(4) select the predictor X that has the smallest
adjusted p value (most significant)
 compare its p value to α_split (prespecified
splitting level)
 p <= α_split: split the node based on the categories of X
 p > α_split: do not split the node (terminal node)
continue the tree growing process until the
stopping rules are met
54
CHAID Example
Y: output has 4 categories
1,2,3,4
cat     #
1       35
2       8
3       35
4       22
total   100
X1 and X2 are inputs
X1: 1,2,3,4
X2: 0,1,2
X1/Y        1    2    3    4    row total
1           23   5    19   4    51
2           12   2    15   13   42
3           0    1    0    1    2
4           0    0    1    4    5
col total   35   8    35   22   100
Chi^2=25.63 df=9 (p=0.002342)
55
Pairwise comparison of the categories of X1
X1/Y        1    2    3    4    row total
1           23   5    19   4    51
2           12   2    15   13   42
col total   35   7    34   17   93
Chi^2=9.19 df=3 (p=0.02682)
X1/Y        1    2    3    4    row total
2           12   2    15   13   42
4           0    0    1    4    5
col total   12   2    16   17   47
Chi^2=4.96 df=3 (p=0.174565)
X1/Y        1    2    3    4    row total
1           23   5    19   4    51
3           0    1    0    1    2
col total   23   6    19   5    53
Chi^2=8.01 df=3 (p=0.045614)
X1/Y        2    3    4    row total
3           1    0    1    2
4           0    1    4    5
col total   1    1    5    7
Chi^2=3.08 df=2 (p=0.214381)
X1/Y        1    2    3    4    row total
1           23   5    19   4    51
4           0    0    1    4    5
col total   23   5    20   8    56
Chi^2=19.72 df=3 (p=0.000193)
X1/Y        1    2    3    4    row total
2           12   2    15   13   42
3           0    1    0    1    2
col total   12   3    15   14   44
Chi^2=7.23 df=3 (p=0.064814)
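The Chi-squared statistics in these pairwise tables can be recomputed with a few lines of plain Python. This is a sketch under the assumption that columns with a zero total are dropped before the test (which matches the df=2 reported for categories 3 and 4); the function name `chi_square` and the dictionary `X1` are illustrative, and p values are not computed here.

```python
def chi_square(table):
    """Pearson Chi-squared statistic and degrees of freedom for a table of counts."""
    # Drop columns whose total is zero so that no expected count is zero
    cols = [c for c in zip(*table) if sum(c) > 0]
    rows = [list(r) for r in zip(*cols)]
    n = sum(sum(r) for r in rows)
    row_tot = [sum(r) for r in rows]
    col_tot = [sum(c) for c in cols]
    stat = 0.0
    for i, row in enumerate(rows):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat, (len(rows) - 1) * (len(cols) - 1)

# Counts of Y = 1..4 for each category of X1, from the cross tabulation
X1 = {1: [23, 5, 19, 4], 2: [12, 2, 15, 13], 3: [0, 1, 0, 1], 4: [0, 0, 1, 4]}

stat_all, df_all = chi_square(list(X1.values()))
stat_34, df_34 = chi_square([X1[3], X1[4]])
print(f"full table: Chi^2 = {stat_all:.2f}, df = {df_all}")
print(f"categories 3 and 4: Chi^2 = {stat_34:.2f}, df = {df_34}")
```

Categories 3 and 4 give the smallest statistic relative to their degrees of freedom (largest p value), so they are the first pair considered for merging.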
56
After merging categories 3 and 4 of X1
X1/Y        1    2    3    4    row total
1           23   5    19   4    51
3,4         0    1    1    5    7
col total   23   6    20   9    58
Chi^2=20.25 df=3 (p=0.000150)
X1/Y        1    2    3    4    row total
2           12   2    15   13   42
3,4         0    1    1    5    7
col total   12   3    16   18   49
Chi^2=6.40 df=3 (p=0.093339)
After merging categories 2, 3 and 4 of X1
X1/Y        1    2    3    4    row total
1           23   5    19   4    51
2,3,4       12   3    16   18   49
col total   35   8    35   22   100
Chi^2=13.08 df=3 (p=0.004448)
57
X2/Y        1    2    3    4    row total
0           33   8    34   22   97
1           1    0    1    0    2
2           1    0    0    0    1
col total   35   8    35   22   100
Chi^2=2.76 df=6 (p=0.837257)
Pairwise comparison of the categories of X2
X2/Y        1    3    row total
1           1    1    2
2           1    0    1
col total   2    1    3
Chi^2=0.75 df=1 (p=0.386476)
X2/Y        1    2    3    4    row total
0           33   8    34   22   97
1           1    0    1    0    2
col total   34   8    35   22   99
Chi^2=0.88 df=3 (p=0.828296)
X2/Y        1    2    3    4    row total
0           33   8    34   22   97
2           1    0    0    0    1
col total   34   8    34   22   98
Chi^2=1.90 df=3 (p=0.593045)
After merging categories 0 and 1 of X2
X2/Y        1    2    3    4    row total
0,1         34   8    35   22   99
2           1    0    0    0    1
col total   35   8    35   22   100
Chi^2=1.87 df=3 (p=0.598558)
58
Exhaustive CHAID
CHAID may not find the optimal split for a variable, as
it stops merging categories
 once all remaining categories are statistically different
Exhaustive CHAID continues merging until two super categories are left
examines the series of merges for the predictor
 finds the set of categories giving the strongest association with
the target
 computes an adjusted p value for that
association
best split for each predictor
 choose the predictor based on p values
otherwise identical to CHAID
 longer to compute
 safer to use
59
CART
Binary tree growing algorithm
 may not present efficiently
partitions data into two subsets
 same predictor variable may be used several
times at different levels
 misclassification costs
 prior probability distribution
 Computation can take a long time with large
data sets
 Surrogate splitting for missing values
60
Impurity measures (1)
for categorical target variables
 Gini, twoing, or (for ordinal targets) ordered
twoing
for continuous targets
 least-squared deviation (LSD)
Gini
 at node t, g(t):
 $$g(t) = \sum_{j \ne i} p(j|t)\,p(i|t)$$ where i and j are categories
 $$g(t) = 1 - \sum_{j} p^2(j|t)$$
 when cases are evenly distributed g takes its
max value of 1-1/k, where k is the number of categories
61
Impurity measures (2)
if costs are specified
$$g(t) = \sum_{j \ne i} C(i|j)\,p(j|t)\,p(i|t)$$
C(i|j) specifies the cost of misclassifying a category
j case as category i
gini criterion function for split s at node t:
$$\Phi(s,t) = g(t) - p_L\,g(t_L) - p_R\,g(t_R)$$
pL: proportion of cases in t sent to the left child node
pR: proportion of cases in t sent to the right child node
the split s maximizing the value of Φ(s,t) is chosen
 this value is reported as the improvement in the tree
62
Impurity measures (3)
Twoing
splitting into two superclasses
best split on the predictor based on those two
superclasses
$$\Phi(s,t) = p_L\,p_R \left[\sum_{j} \left|p(j|t_L) - p(j|t_R)\right|\right]^2$$
the split s is chosen that maximizes this criterion
C1 = {j: p(j|tL) >= p(j|tR)}
C2 = C - C1
costs are not taken into account
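The twoing criterion is simple to evaluate for a candidate split once the class counts in the two child nodes are known. A minimal sketch of the formula as written above (some references scale it by a constant factor, which does not change which split is best); the function name `twoing` is an illustrative choice.

```python
def twoing(left_counts, right_counts):
    """Twoing value p_L * p_R * (sum_j |p(j|t_L) - p(j|t_R)|)^2 for one split.

    left_counts / right_counts: class counts in the left and right child node,
    listed in the same class order.
    """
    n_left, n_right = sum(left_counts), sum(right_counts)
    n = n_left + n_right
    p_left, p_right = n_left / n, n_right / n
    diff = sum(abs(l / n_left - r / n_right)
               for l, r in zip(left_counts, right_counts))
    return p_left * p_right * diff ** 2

# A split that separates two classes perfectly scores highest ...
perfect = twoing([5, 0], [0, 5])
# ... while a split that leaves the class proportions unchanged scores zero
useless = twoing([3, 3], [2, 2])
print(perfect, useless)
```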
63
Impurity measures (4)
Ordered Twoing
 for ordinal target variables
 contiguous categories can be combined to form
superclasses
LSD
 continuous targets
 within node variance for node t
 $$R(t) = \frac{1}{N_w(t)} \sum_{i \in t} w_i\,f_i\,(y_i - \bar{y}(t))^2$$
 where
 Nw(t): weighted number of cases in t
 wi: value of the weighting variable for case i
 fi: value of the frequency variable for case i
 yi: value of the target variable for case i
 y_bar(t): weighted mean for t
64
LSD criterion function
$$\Phi(s,t) = R(t) - p_L\,R(t_L) - p_R\,R(t_R)$$
this value, weighted by the proportion of all
cases in t, is the value reported as improvement
in the tree
65
Steps in the CART analysis
at the root node t=1, search for a split s*
 giving the largest decrease in impurity
 $$\Phi(s^*,1) = \max_{s \in S} \Phi(s,1)$$
split node 1 into t=2 and t=3 using s*
repeat the split searching process in t=2 and t=3
continue until one of the stopping rules is met
66
Stopping rules
all cases have identical values for all predictors
the node becomes pure; all cases have the same
value of the target variable
the depth of the tree has reached its prespecified
maximum value
the number of cases in a node less than a
prespecified minimum parent node size
split at node results in producing a child node with
cases less than a prespecified min child node size
for CART only: max decrease in impurity is less
than a prespecified value
67
QUEST
variable selection and split point selection
separately
computationally more efficient than CART
68
Classification in Large Databases
Classification—a classical problem extensively studied by
statisticians and machine learning researchers
Scalability: Classifying data sets with millions of examples
and hundreds of attributes with reasonable speed
Why decision tree induction in data mining?
relatively faster learning speed (than other classification
methods)
convertible to simple and easy to understand
classification rules
can use SQL queries for accessing databases
comparable classification accuracy with other methods
69
Scalable Decision Tree Induction
Methods in Data Mining Studies
SLIQ (EDBT’96 — Mehta et al.)
 builds an index for each attribute and only class list and
the current attribute list reside in memory
SPRINT (VLDB’96 — J. Shafer et al.)
 constructs an attribute list data structure
PUBLIC (VLDB’98 — Rastogi & Shim)
 integrates tree splitting and tree pruning: stop growing
the tree earlier
RainForest (VLDB’98 — Gehrke, Ramakrishnan & Ganti)
 separates the scalability aspects from the criteria that
determine the quality of the tree
 builds an AVC-list (attribute, value, class label)
70
Data Cube-Based Decision-Tree
Induction
Integration of generalization with decision-tree induction
(Kamber et al’97).
Classification at primitive concept levels
E.g., precise temperature, humidity, outlook, etc.
Low-level concepts, scattered classes, bushy
classification-trees
Semantic interpretation problems.
Cube-based multi-level classification
Relevance analysis at multi-levels.
Information-gain analysis with dimension + level.
71
Bayesian Classification: Why?
Probabilistic learning: Calculate explicit probabilities for
hypothesis, among the most practical approaches to
certain types of learning problems
Incremental: Each training example can incrementally
increase/decrease the probability that a hypothesis is
correct. Prior knowledge can be combined with observed
data.
Probabilistic prediction: Predict multiple hypotheses,
weighted by their probabilities
Standard: Even when Bayesian methods are
computationally intractable, they can provide a standard of
optimal decision making against which other methods can
be measured
73
Bayesian Theorem: Basics
Let X be a data sample whose class label is unknown
Let H be a hypothesis that X belongs to class C
For classification problems, determine P(H/X): the
probability that the hypothesis holds given the observed
data sample X
P(H): prior probability of hypothesis H (i.e. the initial
probability before we observe any data, reflects the
background knowledge)
P(X): probability that sample data is observed
P(X|H) : probability of observing the sample X, given that
the hypothesis holds
74
Bayesian Theorem
Given training data X, the posterior probability of a hypothesis
H, P(H|X), follows the Bayes theorem
$$P(H|X) = \frac{P(X|H)\,P(H)}{P(X)}$$
Informally, this can be written as
posterior = likelihood x prior / evidence
MAP (maximum a posteriori) hypothesis
$$h_{MAP} = \arg\max_{h \in H} P(h|D) = \arg\max_{h \in H} P(D|h)\,P(h)$$
Practical difficulty: require initial knowledge of many
probabilities, significant computational cost
75
Naïve Bayes Classifier
A simplifying assumption: attributes are conditionally
independent given the class:
$$P(X|C_i) = \prod_{k=1}^{n} P(x_k|C_i)$$
The probability of the joint occurrence of, say, two elements y1 and y2,
given the current class C, is the product of the
probabilities of each element taken separately, given the
same class: P([y1,y2]|C) = P(y1|C) * P(y2|C)
No dependence relation between attributes
Greatly reduces the computation cost: only the class
distribution needs to be counted.
Once the probability P(X|Ci) is known, assign X to the
class with maximum P(X|Ci)*P(Ci)
76
Example
H: X is an apple
P(H): prior probability that X is an apple
X: observed data, round and red
P(H/X): posterior probability that X is an apple given that we
observe that it is red and round
P(X/H): probability that a data sample is red and
round given that it is an apple (the likelihood)
P(X): prior probability that a data sample is red and round
77
Applying Bayesian Theorem
P(H/X) = P(H,X)/P(X) from the definition of conditional probability
Similarly:
P(X/H) = P(H,X)/P(H)
P(H,X) = P(X/H)P(H)
hence
P(H/X) = P(X/H)P(H)/P(X)
calculate P(H/X) from
P(X/H),P(H),P(X)
78
Bayesian classification
The classification problem may be formalized
using a-posteriori probabilities:
P(Ci|X) = prob. that the sample tuple
X=<x1,…,xk> is of class Ci.
There are m classes Ci i =1 to m
E.g. P(class=N | outlook=sunny,windy=true,…)
Idea: assign to sample X the class label Ci such
that P(Ci|X) is maximal
P(Ci|X) > P(Cj|X) for all 1<=j<=m, j≠i
81
Estimating a-posteriori probabilities
Bayes theorem:
P(Ci|X) = P(X|Ci)·P(Ci) / P(X)
P(X) is constant for all classes
P(Ci) = relative freq of class Ci samples
Ci such that P(Ci|X) is maximum =
Ci such that P(X|Ci)·P(Ci) is maximum
Problem: computing P(X|Ci) directly is infeasible!
82
Naïve Bayesian Classification
Naïve assumption: attribute independence
P(x1,…,xk|Ci) = P(x1|Ci)·…·P(xk|Ci)
If i-th attribute is categorical:
P(xi|Ci) is estimated as the relative freq of
samples having value xi as i-th attribute in class
Ci =sik/si .
If i-th attribute is continuous:
P(xi|Ci) is estimated thru a Gaussian density
function
Computationally easy in both cases
83
Training dataset
Class:
 C1: buys_computer = ‘yes’
 C2: buys_computer = ‘no’
Data sample
 X = (age<=30, income=medium, student=yes, credit_rating=fair)
age      income   student   credit_rating   buys_computer
<=30     high     no        fair            no
<=30     high     no        excellent       no
31…40    high     no        fair            yes
>40      medium   no        fair            yes
>40      low      yes       fair            yes
>40      low      yes       excellent       no
31…40    low      yes       excellent       yes
<=30     medium   no        fair            no
<=30     low      yes       fair            yes
>40      medium   yes       fair            yes
<=30     medium   yes       excellent       yes
31…40    medium   no        excellent       yes
31…40    high     yes       fair            yes
>40      medium   no        excellent       no
84
Solution
Given the new customer
X=(age<=30, income=medium,
student=yes, credit_rating=fair)
What is the probability of buying a computer?
Compute
P(buy computer = yes/X) and
P(buy computer = no/X)
Decision:
list as probabilities or
choose the class with the maximum conditional probability
85
Compute
P(buy computer = yes/X) =
P(X/yes)*P(yes)/P(X)
P(buy computer = no/X) =
P(X/no)*P(no)/P(X)
Drop P(X), which is the same for both classes
Decision: maximum of
 P(X/yes)*P(yes)
 P(X/no)*P(no)
86
Naïve Bayesian Classifier: Example
Compute P(X/Ci)*P(Ci) for each class
 P(X/C = yes)*P(yes) =
P(age=“<=30” | buys_computer=“yes”)*
P(income=“medium” | buys_computer=“yes”)*
P(student=“yes” | buys_computer=“yes”)*
P(credit_rating=“fair” | buys_computer=“yes”)*
P(C=yes)
87
P(X/C = no)*P(no) =
P(age=“<=30” | buys_computer=“no”)*
P(income=“medium” | buys_computer=“no”)*
P(student=“yes” | buys_computer=“no”)*
P(credit_rating=“fair” | buys_computer=“no”)*
P(C=no)
88
Naïve Bayesian Classifier: Example
P(age=“<=30” | buys_computer=“yes”) = 2/9 = 0.222
P(income=“medium” | buys_computer=“yes”) = 4/9 = 0.444
P(student=“yes” | buys_computer=“yes”) = 6/9 = 0.667
P(credit_rating=“fair” | buys_computer=“yes”) = 6/9 = 0.667
P(age=“<=30” | buys_computer=“no”) = 3/5 = 0.6
P(income=“medium” | buys_computer=“no”) = 2/5 = 0.4
P(student=“yes” | buys_computer=“no”) = 1/5 = 0.2
P(credit_rating=“fair” | buys_computer=“no”) = 2/5 = 0.4
P(buys_computer=“yes”) = 9/14 = 0.643
P(buys_computer=“no”) = 5/14 = 0.357
89
P(X|buys_computer=“yes”)
= 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X|buys_computer=“yes”) * P(buys_computer=“yes”)
= 0.044 * 0.643 = 0.028
P(X|buys_computer=“no”)
= 0.6 x 0.4 x 0.2 x 0.4 = 0.019
P(X|buys_computer=“no”) * P(buys_computer=“no”)
= 0.019 * 0.357 = 0.007
X belongs to class “buys_computer=yes”
90
Summary of calculations
Compute P(X/Ci) for each class
P(age=“<=30” | buys_computer=“yes”) = 2/9 = 0.222
P(age=“<=30” | buys_computer=“no”) = 3/5 = 0.6
P(income=“medium” | buys_computer=“yes”) = 4/9 = 0.444
P(income=“medium” | buys_computer=“no”) = 2/5 = 0.4
P(student=“yes” | buys_computer=“yes”) = 6/9 = 0.667
P(student=“yes” | buys_computer=“no”) = 1/5 = 0.2
P(credit_rating=“fair” | buys_computer=“yes”) = 6/9 = 0.667
P(credit_rating=“fair” | buys_computer=“no”) = 2/5 = 0.4
X=(age<=30, income=medium, student=yes, credit_rating=fair)
P(X|Ci): P(X|buys_computer=“yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X|buys_computer=“no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
P(X|Ci)*P(Ci): P(X|buys_computer=“yes”) * P(buys_computer=“yes”)
= 0.044 * 0.643 = 0.028
P(X|buys_computer=“no”) * P(buys_computer=“no”)
= 0.019 * 0.357 = 0.007
X belongs to class “buys_computer=yes”
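The worked example can be checked with a short script. This is a sketch under simple assumptions: all attributes are treated as categorical, probabilities are plain relative frequencies with no smoothing, and the function names train and score are illustrative only. Exact fractions are used so that rounding does not affect the comparison.

```python
from collections import Counter
from fractions import Fraction

# Training tuples: (age, income, student, credit_rating, buys_computer)
DATA = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31...40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31...40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31...40", "medium", "no", "excellent", "yes"),
    ("31...40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]


def train(rows):
    """Count class sizes and (attribute index, value, class) occurrences."""
    class_counts = Counter(row[-1] for row in rows)
    value_counts = Counter()
    for row in rows:
        label = row[-1]
        for i, value in enumerate(row[:-1]):
            value_counts[(i, value, label)] += 1
    return class_counts, value_counts


def score(x, class_counts, value_counts):
    """Return P(X|Ci) * P(Ci) for every class under the naive assumption."""
    total = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        s = Fraction(n, total)  # P(Ci)
        for i, value in enumerate(x):
            s *= Fraction(value_counts[(i, value, label)], n)  # P(xi|Ci)
        scores[label] = s
    return scores


x = ("<=30", "medium", "yes", "fair")
class_counts, value_counts = train(DATA)
scores = score(x, class_counts, value_counts)
prediction = max(scores, key=scores.get)

for label, s in scores.items():
    print(label, round(float(s), 3))
print("predicted class:", prediction)
```

Without smoothing, a single attribute value never seen in a class makes that class's whole product zero; a Laplace correction is a common remedy, but it is not part of the example above.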
91
Naïve Bayesian Classifier: Comments
Advantages :
 Easy to implement
 Good results obtained in most of the cases
Disadvantages:
 Assumes class conditional independence, which can cost
accuracy when the assumption does not hold
 In practice, dependencies exist among variables
 E.g., hospital patient data: profile (age, family history, etc.),
symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.)
 Dependencies among these cannot be modeled by the Naïve Bayesian
Classifier
How to deal with these dependencies?
 Bayesian Belief Networks
92
Bayesian Networks
A Bayesian belief network allows a subset of the variables
to be conditionally independent
A graphical model of causal relationships
Represents dependency among the variables
Gives a specification of joint probability distribution
[Figure: directed graph with nodes X, Y, Z, P and links X → Z, Y → Z, Y → P]
Nodes: random variables
Links: dependency
X,Y are the parents of Z, and Y is the
parent of P
No direct dependency between Z and P (no link connects them)
Has no loops or cycles
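One way to sketch this structure in code is a mapping from each node to its list of parents, with a topological sort used to confirm that the graph has no cycles. The use of Python's graphlib module (Python 3.9+) and the helper name is_dag are choices made here for illustration.

```python
from graphlib import CycleError, TopologicalSorter

# Each node maps to its parents: X and Y are parents of Z, Y is the parent of P
parents = {"X": [], "Y": [], "Z": ["X", "Y"], "P": ["Y"]}


def is_dag(parent_map) -> bool:
    """True if the parent map describes a directed acyclic graph."""
    try:
        list(TopologicalSorter(parent_map).static_order())
        return True
    except CycleError:
        return False


# A valid ordering lists every parent before its children
order = list(TopologicalSorter(parents).static_order())
print(order, is_dag(parents))
```

A topological order is also the order in which the variables can be sampled or their joint probability evaluated, since each node depends only on nodes that come before it.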
93
Bayesian Belief Network: An Example
[Figure: network with nodes FamilyHistory (FH), Smoker (S), LungCancer (LC),
Emphysema, PositiveXRay, Dyspnea]
The conditional probability table
for the variable LungCancer:

       (FH, S)   (FH, ~S)   (~FH, S)   (~FH, ~S)
LC     0.8       0.5        0.7        0.1
~LC    0.2       0.5        0.3        0.9

Shows the conditional probability
for each possible combination of its
parents
$P(z_1, \ldots, z_n) = \prod_{i=1}^{n} P(z_i \mid \mathrm{Parents}(Z_i))$
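A sketch of the product formula on the three-node fragment FamilyHistory, Smoker, LungCancer. The LungCancer table is the one from the slide; the prior probabilities for FamilyHistory and Smoker are hypothetical, and treating those two nodes as having no parents is an assumption.

```python
from itertools import product

# P(LungCancer = true | FamilyHistory, Smoker), from the slide's table
P_LC = {
    (True, True): 0.8,
    (True, False): 0.5,
    (False, True): 0.7,
    (False, False): 0.1,
}

# Hypothetical priors (assumptions, not given in the slides)
P_FH = 0.1  # P(FamilyHistory = true)
P_S = 0.3   # P(Smoker = true)


def bernoulli(p_true: float, value: bool) -> float:
    """Probability of a boolean value given P(value = true)."""
    return p_true if value else 1.0 - p_true


def joint(fh: bool, s: bool, lc: bool) -> float:
    """P(fh, s, lc) = P(fh) * P(s) * P(lc | fh, s)."""
    return bernoulli(P_FH, fh) * bernoulli(P_S, s) * bernoulli(P_LC[(fh, s)], lc)


# The joint distribution sums to 1 over all assignments
total = sum(joint(fh, s, lc) for fh, s, lc in product([True, False], repeat=3))

# Marginal P(LungCancer = true), summing out the parents
p_lc = sum(joint(fh, s, True) for fh, s in product([True, False], repeat=2))
print(total, p_lc)
```

Each factor in joint is one term of the product over nodes; nodes without parents contribute their prior.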
94
Learning Bayesian Networks
Several cases
 Network structure given, all variables
observable: learn only the CPTs
 Network structure known, some hidden variables:
method of gradient descent, analogous to neural
network learning
 Network structure unknown, all variables observable:
search through the model space to reconstruct graph
topology
 Unknown structure, all hidden variables: no good
algorithms known for this purpose
D. Heckerman, Bayesian networks for data mining
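For the first case (known structure, all variables observable), learning a CPT can be sketched as counting relative frequencies of the child value for each combination of parent values. The function name learn_cpt and the records are hypothetical, and no smoothing is applied.

```python
from collections import Counter


def learn_cpt(records, child, parents):
    """Estimate P(child = v | parents = key) by relative frequency."""
    joint_counts = Counter()
    parent_counts = Counter()
    for r in records:
        key = tuple(r[p] for p in parents)
        joint_counts[(key, r[child])] += 1
        parent_counts[key] += 1
    return {
        (key, value): n / parent_counts[key]
        for (key, value), n in joint_counts.items()
    }


# Hypothetical fully observed records (made up for illustration)
records = (
    [{"FH": True, "S": True, "LC": True}] * 3
    + [{"FH": True, "S": True, "LC": False}] * 1
    + [{"FH": False, "S": False, "LC": True}] * 1
    + [{"FH": False, "S": False, "LC": False}] * 4
)

cpt = learn_cpt(records, child="LC", parents=("FH", "S"))
print(cpt[((True, True), True)], cpt[((False, False), True)])
```

Parent combinations that never occur in the data get no entry at all, which is one reason practical estimators add prior counts.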
95