Classification and Prediction
CS 5331 by Rattikorn Hewett
Texas Tech University
Outline
- Overview
  - Tasks
  - Evaluation
  - Issues
- Classification techniques
  - Decision Tree
  - Bayesian-based
  - Neural network
  - Others
- Prediction techniques
- Summary
Tasks
- Classification & Prediction
  - describe data with respect to a target class/output
  - create, from a data set, a model that
    - describes a target class in terms of features (or attributes or variables) of the class or system generating the data
  - use the model to
    - classify an unknown target class
    - predict an unknown output
- Example Applications
  - credit approval
  - medical diagnosis
  - treatment effectiveness analysis
  - What about image recognition or customer profiling?
Classification vs. Prediction

[Diagram: labeled data feeds an algorithm that builds a model; data with an unknown class label or output is fed to the model, which returns the class label or predicted output.]

- Comparisons:

                          Classification                     Prediction
  Data                    labeled (by known target class)    labeled (by known output)
  Model                   classifier/classification model    predictive model
  Use of the model        classify unknown label             predict unknown output
  Typical unknown values  categorical                        continuous/ordered
Three-step process
- Model Construction (Training/Learning)
  - Build a model that describes training data labeled with known classes/outputs
  - A model could be of various forms: rules, decision trees, formulae
- Model Evaluation (Testing)
  - Use a labeled testing data set to estimate model accuracy
    - The known label of each test sample is compared with the classified result from the model
    - Accuracy = % of test cases correctly classified by the model
- Model Usage (Classification)
  - Use the model with acceptable accuracy to classify or predict unknowns
Example

Training data set (Model Construction):

inst  Credit history  Debt  Collateral  Income  Risk
 1    Bad             H     None        L       H
 2    U               H     None        M       H
 3    U               L     None        M       M
 4    U               L     None        L       H
 5    U               L     None        H       L
 6    U               L     Ok          H       L
 7    Bad             L     None        L       H
 8    Bad             L     Ok          H       M
 9    Good            L     None        H       L
10    Good            H     Ok          H       L
11    Good            H     None        L       H
12    Good            H     None        M       M
13    Good            H     None        H       L
14    Bad             H     None        M       H

Classification Algorithm -> Classifier (Model):
1. If income = L then Risk = H
2. If income = H and Credit History = U then Risk = L
…

Test data set (Model Evaluation; Accuracy = ?):

inst  Credit history  Debt  Collateral  Income  Risk
 1    Bad             H     Ok          L       H
 2    Good            L     None        M       H
 3    U               L     None        M       M
 4    U               M     None        L       H
 5    U               L     None        H       L

Unknown samples (Model Usage):

inst  Credit history  Debt  Collateral  Income  Risk
 1    U               H     Ok          L       ?
 2    Bad             L     None        M       ?
Machine learning context

[Flowchart: labeled data is split into training data and testing data. The algorithm takes the training data as input and outputs a model (a classifier or predictive model). The model is evaluated on the testing data: if it is not sufficiently accurate, training continues; if yes, the model is applied to data with an unknown class label or output to produce the class label or predicted output.]
Supervised vs. Unsupervised
- Supervised learning
  - Class labels and # of classes are known
  - Labeled data: the training data (observations, measurements, features, etc.) are accompanied by labels indicating the class of the observations
  - Goal is to describe a class/concept in terms of features/observations
  -> classification/prediction
- Unsupervised learning
  - Class labels are unknown and # of classes may not be known
  - Unlabeled data: the training data does not have class labels
  - Goal is to establish the existence of classes, concepts, or clusters in the data
  -> clustering
Terms
- Machine learning aims to abstract regularities from a given data set
  - Mining ~ learning ~ induction (reasoning from specifics to the general)
- A training set consists of training instances
  - The terms instances, tuples (or rows), examples, samples, and objects are used synonymously
- A testing set is independent of a training set
  - If a training set is used for testing during the learning process, the resulting model will overfit the training data
  -> the model could incorporate anomalies in the training data that are not present in the overall sample population
  -> a model that may not represent the population (poor generalization)
Outline
- Overview
  - Tasks
  - Evaluation
  - Issues
- Classification techniques
  - Decision Tree
  - Bayesian-based
  - Neural network
  - Others
- Prediction techniques
- Summary
Evaluation
- A resulting model (classifier) is tested against a testing data set to measure its
  - Accuracy, or
  - Error rate = # misclassified test cases / # test cases, in %
- Experimental methods
  - Holdout method
  - Random subsampling
  - Cross validation
  - Bootstrapping
Holdout method
- Randomly partition a data set into two independent sets: a training set and a test set (typically 2/3 and 1/3)
  - The training set is used to derive a classifier, whose accuracy is estimated using the test set

[Diagram: Data is split into Training Data, which yields the Classifier (Model), and Testing Data, which is used to Estimate Accuracy.]

- Use with data sets that have a large number of samples
Random sub-sampling
- Repeat the holdout method k times
- Overall accuracy = average of the accuracies from each iteration

[Diagram: the data is repeatedly split into training and testing sets; each split yields a classifier and an accuracy estimate, and the estimates are averaged into the overall accuracy.]

- What happens when the data requires supervised discretization (e.g., Fayyad & Irani's algorithm)?
Cross validation
- k-fold cross validation
  - Divide a data set into k approximately equal subsets of samples
  - Use k-1 subsets as the training set and the remaining subset as the testing set
  - Obtain a classifier/accuracy from each pair of training and testing sets (k possibilities)

[Diagram: each of the k train/test splits yields a classifier and an accuracy estimate; the k estimates are averaged into the overall accuracy.]
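A minimal sketch of k-fold cross validation in Python. The `train` argument is a hypothetical placeholder for any classifier-learning routine (assumed to return an object with a `.predict(x)` method); it is not part of any particular library.

```python
import random

def k_fold_accuracy(samples, labels, train, k=10, seed=0):
    """Estimate accuracy by k-fold cross validation (a sketch)."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]          # k roughly equal subsets
    accuracies = []
    for test_idx in folds:
        test_set = set(test_idx)
        train_idx = [i for i in indices if i not in test_set]
        model = train([samples[i] for i in train_idx],  # learn on k-1 subsets
                      [labels[i] for i in train_idx])
        correct = sum(model.predict(samples[i]) == labels[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))      # accuracy on held-out fold
    return sum(accuracies) / k                          # overall = average accuracy
```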
Cross validation & Bootstrapping
- Stratified cross validation – the folds are stratified so that the class distribution of the samples in each fold is approximately the same as that in the initial data
- Stratified 10-fold cross-validation is recommended when a sample set is of medium size
- When the sample set size is very small -> Bootstrapping
  - Leave-one-out; sample the training set with replacement
Improving Accuracy
- Use a committee of classifiers and combine the results of each classifier
- Two basic techniques:
  - Bagging
  - Boosting
Bagging
- Generate independent training sets by random sampling with replacement from the original sample set
- Construct a classifier for each training set using the same classification algorithm
- To classify an unknown sample x, each classifier returns its class prediction, which counts as one vote
- The Bagged Classifier C* counts the votes and assigns x to the class with the "most" votes
Bagging

[Diagram: from the training data set, n bootstrap training samples are drawn (Training Sample 1 … n), each yielding a classifier (Classifier 1 … n). On the test data set, the aggregated classifier C* combines the individual predictions by majority vote to produce the resulting classes.]
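A short sketch of bagging under the same assumption as above: `train` is a hypothetical stand-in for the (single) classification algorithm applied to every bootstrap sample.

```python
import random
from collections import Counter

def bagged_classifier(samples, labels, train, n_models=25, seed=0):
    """Bagging sketch: each model is trained on a bootstrap sample
    (random sampling with replacement) of the original training set."""
    rng = random.Random(seed)
    n = len(samples)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]       # sample with replacement
        models.append(train([samples[i] for i in idx],
                            [labels[i] for i in idx]))
    def predict(x):
        votes = Counter(m.predict(x) for m in models)    # one vote per classifier
        return votes.most_common(1)[0][0]                # class with the most votes
    return predict
```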
Boosting
- Assign a weight to each training sample
- A series of classifiers is learned. After classifier Ct is learned,
  - Calculate the error and re-weight the examples based on the error
  - The weights are updated to allow the subsequent classifier Ct+1 to "pay more attention" to the misclassification errors made by Ct
- The final boosted classifier C* combines the votes of each classifier, where the weight of each classifier's vote is a function of its accuracy on the training set
Boosting

[Diagram: a weighted training data set is used to learn Classifier 1; the weights are updated and the re-weighted data set is used to learn Classifier 2, and so on through Classifier n. On the test data set, the aggregated classifier C* combines the classifiers by weighted votes to produce the resulting classes; the weights are based on each classifier's resulting accuracy.]
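The slides leave the exact re-weighting formula open; the sketch below uses the AdaBoost update, a standard instance of this scheme, assuming `error` is Ct's weighted training error (0 < error < 0.5).

```python
import math

def reweight(weights, misclassified, error):
    """One AdaBoost-style boosting step.

    weights[i]       current weight of training sample i
    misclassified[i] True if classifier Ct got sample i wrong
    Returns the renormalized weights and alpha, the classifier's vote weight."""
    alpha = 0.5 * math.log((1 - error) / error)      # more accurate -> bigger vote
    new_w = [w * math.exp(alpha if bad else -alpha)  # boost misclassified samples
             for w, bad in zip(weights, misclassified)]
    total = sum(new_w)                               # renormalize to a distribution
    return [w / total for w in new_w], alpha
```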
Outline
- Overview
  - Tasks
  - Evaluation
  - Issues
- Classification techniques
  - Decision Tree
  - Bayesian-based
  - Neural network
  - Others
- Prediction techniques
- Summary
Issues on classification/prediction
- How do we compare different techniques?
  - No single method works well for all data sets
- Some evaluation criteria:
  - Accuracy – indicates generalization power
  - Speed – time to train/learn a model and time to use the model
  - Robustness – tolerance to noise and incorrect data
  - Scalability – acceptable run time to learn as data size and complexity grow
  - Interpretability – comprehensibility of and insight from the resulting model
Issues on classification/prediction
- Any alternatives to the accuracy measure?
  - Sensitivity = t-pos/pos
  - Specificity = t-neg/neg
  - Accuracy = sensitivity × (pos/(pos+neg)) + specificity × (neg/(pos+neg))
             = (t-pos + t-neg)/(pos+neg)

where pos = # positive samples, neg = # negative samples, t-pos = # of true positives, t-neg = # of true negatives, f-pos = # of false positives, f-neg = # of false negatives:

                       Actual: cancer   Actual: ~cancer
  Predicted: cancer    t-pos            f-pos
  Predicted: ~cancer   f-neg            t-neg
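A small sketch computing these measures from confusion-matrix counts; the example counts at the bottom are made up for illustration.

```python
def evaluate(t_pos, f_neg, t_neg, f_pos):
    """Sensitivity, specificity, and accuracy from confusion-matrix counts."""
    pos = t_pos + f_neg                 # all actually-positive samples
    neg = t_neg + f_pos                 # all actually-negative samples
    sensitivity = t_pos / pos
    specificity = t_neg / neg
    accuracy = (sensitivity * pos + specificity * neg) / (pos + neg)
    # same value as the direct definition:
    assert abs(accuracy - (t_pos + t_neg) / (pos + neg)) < 1e-12
    return sensitivity, specificity, accuracy

# e.g., evaluate(t_pos=90, f_neg=10, t_neg=140, f_pos=60)
# -> (0.9, 0.7, 0.766...)
```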
Outline
- Overview
  - Tasks
  - Evaluation
  - Issues
- Classification techniques
  - Decision Tree
  - Bayesian-based
  - Neural network
  - Others
- Prediction techniques
- Summary
Input/Output

Training data set: the 14-sample credit table from the earlier Example slide (Credit history, Debt, Collateral, Income -> Risk).

Decision Tree Model:

Income
  L  -> H Risk
  M  -> Credit history
          U    -> Debt
                   H -> H Risk
                   L -> M Risk
          Bad  -> H Risk
          Good -> M Risk
  H  -> Credit history
          U    -> L Risk
          Bad  -> M Risk
          Good -> L Risk

Do the tree and the data set cover the same information?
Decision Tree (DT) Classification
- Typically has two phases:
  - Tree construction
  - Tree pruning
- Tree construction (Induction)
  - Initially, all the training examples are at the root
  - Recursively partition the examples based on selected attributes
- Tree pruning
  - Remove branches that may reflect noise/outliers in the data
  - Gives faster or more accurate classification
Decision Tree Induction
Basic Algorithm
- Data are categorical
- The tree starts as a single node representing all data
- Recursively,
  - select a split-attribute node that best separates the sample classes
  - If all samples at a given node belong to the same class
    -> the node becomes a leaf labeled with that class label
  - If there are no remaining attributes on which samples may be partitioned, or there are no samples for an attribute-value branch
    -> a leaf is labeled with the majority class label among the samples
Example

(Training data: the 14-sample credit table from the earlier Example slide.)

{1,2,...,14}
Income
  L: {1,4,7,11}      -> H Risk
  M: {2,3,12,14}
  H: {5,6,8,9,10,13}
Example

{1,2,...,14}
Income
  L: {1,4,7,11}      -> H Risk
  M: {2,3,12,14}
  H: {5,6,8,9,10,13} -> Credit history
        U:    {5,6}      -> L Risk
        Bad:  {8}        -> M Risk
        Good: {9,10,13}  -> L Risk
Example

{1,2,...,14}
Income
  L: {1,4,7,11}      -> H Risk
  M: {2,3,12,14}     -> Credit history
        U:    {2,3}
        Bad:  {14}       -> H Risk
        Good: {12}       -> M Risk
  H: {5,6,8,9,10,13} -> Credit history
        U:    {5,6}      -> L Risk
        Bad:  {8}        -> M Risk
        Good: {9,10,13}  -> L Risk
Example

{1,2,...,14}
Income
  L: {1,4,7,11}      -> H Risk
  M: {2,3,12,14}     -> Credit history
        U:    {2,3}      -> Debt
                 H: {2}  -> H Risk
                 L: {3}  -> M Risk
        Bad:  {14}       -> H Risk
        Good: {12}       -> M Risk
  H: {5,6,8,9,10,13} -> Credit history
        U:    {5,6}      -> L Risk
        Bad:  {8}        -> M Risk
        Good: {9,10,13}  -> L Risk
Selecting split-attribute
- Based on a measure called a goodness function
- Different algorithms use different goodness functions:
  - Information gain (e.g., in ID3/C4.5 [Quinlan, 94])
    - Assumes categorical attribute values – modifiable to continuous-valued attributes
    - Choose the attribute with the highest information gain
  - Gini index (e.g., in CART, IBM IntelligentMiner)
    - Assumes continuous-valued attributes – modifiable to categorical attributes
    - Assumes several possible split values for each attribute – may need clustering to find possible split values
    - Choose the attribute with the smallest gini index
  - Others, e.g., gain ratio, distance-based measure, etc.
Information gain
- Choose the attribute with the highest information gain to be the test attribute for the current node
- Highest information gain = greatest entropy reduction
  -> least randomness in the sample partitions
  -> minimum information needed to classify the sample partitions
Information gain
- Let S be a set of labeled samples; the information needed to classify a given sample is

  Ent(S) = - Σ_{i∈C} p_i log2(p_i)

  where p_i is the probability that a sample from S is in class i. Here p_i = |C_i|/|S|, where C_i is the set of samples from S whose class label is i.
- The expected information needed to classify a sample if S is partitioned into subsets by attribute A:

  I(A) = Σ_{i∈dom(A)} (|S_i|/|S|) Ent(S_i)

  where S_i is the partition of the sample set whose A attribute value is i, dom(A) is the set of all possible values of A, and |S_i|/|S| is the probability that a sample from S has A attribute value i.
- Information gain:

  Gain(A) = Ent(S) - I(A)
Example
Goal – to find the attribute A with the highest Gain(A) on S
S = {1, 2, 3, …, 14} – each number represents a sample row of the credit table shown earlier
C = {H Risk, M Risk, L Risk}, with members {1,2,4,7,11,14}, {3,8,12}, and {5,6,9,10,13}

Ent(S) = - p(H Risk) log2(p(H Risk)) - p(M Risk) log2(p(M Risk)) - p(L Risk) log2(p(L Risk))
       = - 6/14 log2(6/14) - 3/14 log2(3/14) - 5/14 log2(5/14) = 1.531

To find Gain(Income) on S: dom(Income) = {H Income, M Income, L Income}, with members {5,6,8,9,10,13}, {2,3,12,14}, and {1,4,7,11}

I(Income) = p(H income) Ent(H income) + p(M income) Ent(M income) + p(L income) Ent(L income)
          = 6/14 Ent({5,6,8,9,10,13}) + 4/14 Ent({2,3,12,14}) + 4/14 Ent({1,4,7,11})

Ent(H income) = Ent({5,6,8,9,10,13}) = - 0 log2(0) - 1/6 log2(1/6) - 5/6 log2(5/6) = 0.65
Ent(M income) = Ent({2,3,12,14})     = - 2/4 log2(2/4) - 2/4 log2(2/4) - 0 log2(0) = 1
Ent(L income) = Ent({1,4,7,11})      = - 4/4 log2(4/4) - 0 log2(0) - 0 log2(0)     = 0

I(Income) = 6/14(0.65) + 4/14(1) + 4/14(0) = 0.564
Gain(Income) = Ent({1,…,14}) - I(Income) = 1.531 - 0.564 = 0.967
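A small sketch of this computation in Python, assuming the 14-row credit table above is available as `rows`, a list of dicts with keys "Credit history", "Debt", "Collateral", "Income", and "Risk" (a representation chosen here for illustration).

```python
import math

def ent(labels):
    """Ent(S) = -sum_i p_i log2(p_i) over the class labels appearing in S."""
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def gain(rows, attr, target="Risk"):
    """Gain(A) = Ent(S) - sum_{i in dom(A)} |S_i|/|S| * Ent(S_i)."""
    S = [r[target] for r in rows]
    i_a = 0.0
    for v in set(r[attr] for r in rows):           # each value in dom(A)
        s_i = [r[target] for r in rows if r[attr] == v]
        i_a += len(s_i) / len(rows) * ent(s_i)     # weighted partition entropy
    return ent(S) - i_a

# With the 14-sample table, ent of the Risk column is about 1.531 and
# gain(rows, "Income") is about 0.967, matching the hand computation above.
```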
Example
Goal – to find the attribute A with the highest Gain(A) on S = {1, 2, 3, …, 14}, C = {H Risk, M Risk, L Risk}

Gain(Income) = Ent(S) - I(Income) = 1.531 - 0.564 = 0.967
Similarly,
Gain(credit history) = 0.266
Gain(debt) = 0.063
Gain(collateral) = 0.206
-> Select Income as the root

S = {1, 2, 3, …, 14}
Income
  L: {1,4,7,11}      -> H Risk
  M: {2,3,12,14}     -> recursive process on S = {2,3,12,14} to pick the next attribute: Credit history
  H: {5,6,8,9,10,13} -> recursive process on S = {5,6,8,9,10,13} to pick the next attribute: Credit history
Gini Index
- For a data set T containing examples from n classes, define

  gini(T) = 1 - Σ_{j=1}^{n} p_j²

  where p_j is the relative frequency of class j in T
- If a data set T is split into two subsets T1 and T2, the gini index of the split data is defined as

  gini_split(T) = (|T1|/|T|) gini(T1) + (|T2|/|T|) gini(T2)

- The attribute providing the smallest gini_split(T) is chosen to split the node (need to enumerate all possible splitting points for each attribute)
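A direct sketch of these two definitions:

```python
def gini(labels):
    """gini(T) = 1 - sum_j p_j^2, p_j = relative frequency of class j in T."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def gini_split(left, right):
    """Weighted gini of a binary split of T into T1 (left) and T2 (right)."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# e.g., gini(["H", "H", "M", "L"]) = 1 - (0.5**2 + 0.25**2 + 0.25**2) = 0.625
```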
DT induction algorithm
Characteristics:
- Greedy search algorithm
  - Makes the optimal choice at each step – selects the "best" split-attribute for each tree node
- Top-down recursive divide-and-conquer manner
  - From root to leaves
  - Split a node into several branches; for each branch, recursively run the algorithm to build a subtree
DT Construction Algorithm
Primary Issues:
- Split criterion – for selecting the split-attribute
  - Different algorithms use different goodness functions: information gain, gini index, etc.
- Branching scheme – how many sample partitions
  - Binary branches (gini index) vs. multiple branches (information gain)
- Stopping condition – when to stop further splitting
  - Fully-grown vs. stopping early
- Labeling class
  - A node is labeled with the most common class
From Trees to Rules
- Represent the knowledge in the form of IF-THEN rules
- Create one rule for each path from the root to a leaf
  - Each attribute-value pair along a path forms a conjunction
  - The leaf node holds the class prediction
- Rules are easier for humans to understand

(Tree: the credit-risk decision tree shown earlier.)

Example:
IF income = "M" AND credit-history = "U" AND debt = "H" THEN risk = "H"
IF income = "M" AND credit-history = "Good" THEN risk = "M"
Overfitting & Tree Pruning
- A constructed tree may overfit the training examples due to noise or too small a set of training data -> poor accuracy for unseen samples
- Approaches to avoid overfitting
  - Prepruning: stop growing the tree early – do not split a node if this would result in the goodness measure falling below a threshold
    - Difficult to choose an appropriate threshold
  - Postpruning: remove branches from a "fully grown" tree – get a sequence of progressively pruned trees, then decide which is the "best pruned tree"
  - Combined
- Postpruning is more expensive than prepruning but more effective
Postpruning
- Merge a subtree into a leaf node
  - If accuracy without splitting > accuracy with splitting, then don't split
    => replace the subtree with a leaf node, labeled with the majority class
- Techniques:
  - Cost complexity pruning – based on expected error rates
    - For each non-leaf node in the tree, compute the expected error rate if the subtree were and were not pruned – combine error rates from each branch using weights based on sample frequency ratios
    - Requires an independent test sample to estimate the accuracy of each of the progressively pruned trees
  - Minimum description length pruning – based on the # of tree encoding bits
    - The "best pruned tree" minimizes the # of encoding bits
    - No test sample set is required
Postpruning (cont)
The correct tree size can be determined by:
- Using a separate test set to evaluate the pruning
  - E.g., CART
- Using all the data for training but
  - applying a statistical test (e.g., chi-square) to evaluate the pruning
  - E.g., C4.5
- Using the minimum description length (MDL) principle
  - halting growth of the tree when the encoding is minimized
  - E.g., SLIQ, SPRINT
Tree induction Enhancement
- Allow for continuous-valued attributes
  - Dynamically define a discrete value that partitions the continuous values into a discrete set of intervals
    - sort the continuous attribute values, identify adjacent values with different target classes, generate candidate thresholds midway between them, and select the one with maximum gain
- Handle missing attribute values
  - Assign the most common value OR a probability to each possible value
- Reduce tree fragmentation, repeated testing, and subtree replication
  - Attribute construction – create new attributes based on existing ones that are sparsely represented
Classification in Large Databases
- Classification: a classical problem extensively studied by statisticians and machine learning researchers
- Goal: classifying data sets with millions of examples and hundreds of attributes with reasonable speed
- Decision tree induction seems to be a good candidate:
  - relatively fast learning speed (compared to other classification methods)
  - convertible to simple and easy-to-understand classification rules
  - can be used to generate SQL queries for accessing databases
  - comparable classification accuracy to other methods
- Primary issue: scalability
  - most algorithms assume the data can fit in memory – need to swap data in/out of main and cache memories
Scaling up decision tree methods
- Incremental tree construction [Quinlan, 86]
  - Uses partial data to build a tree
  - Additional examples are tested, and misclassified examples are used to rebuild the tree interactively
- Data reduction [Catlett, 91] – still a main-memory algorithm
  - Reduces data size by sampling and discretization
- Data partition and merge [Chan and Stolfo, 91]
  - Builds a tree for each partition of the data
  - Merges the trees into one tree
  - But … the resulting accuracy is reported to be reduced
Recent efforts in Data Mining Studies
- SLIQ (EDBT'96 – Mehta et al.) & SPRINT (VLDB'96 – J. Shafer et al.)
  - presort disk-resident data sets (that are too large to fit in main memory)
  - handle continuous-valued attributes
  - define new data structures to facilitate tree construction
- PUBLIC (VLDB'98 – Rastogi & Shim)
  - integrates tree splitting and tree pruning
- RainForest (VLDB'98 – Gehrke, Ramakrishnan & Ganti)
  - a framework for scaling decision tree induction that separates scalability from quality criteria
SLIQ
- Tree building uses a data structure: attribute lists and a class list
- Each attribute has an associated attribute list, indexed by record ID (RID)
- Each tuple is represented by a linkage from each attribute list entry to a class list entry, and from there to a leaf node of the decision tree

Input tuples:                     Attribute lists (presorted):       Class list:
RID  credit-rate  age  buy-car?   credit-rate  RID     age  RID      RID  buy-car?  node
1    good         30   yes        good         1       28   2        1    yes       (leaf ptr)
2    fair         28   no         fair         2       30   1        2    no        (leaf ptr)
SLIQ
Efficiency comes from the fact that:
- Only the class list and the current attribute list are memory-resident
  - The class list is dynamically modified during tree construction
- Attribute lists are disk-resident, presorted (reducing the cost of evaluating splits), and indexed (eliminating re-sorting of the data)
- Uses a fast subsetting algorithm to determine split-attributes
  - If the number of possible subsets exceeds a threshold, use greedy search
- Uses inexpensive MDL-based tree pruning
SPRINT
- Proposes a new data structure: attribute lists that eliminate SLIQ's requirement of a memory-resident class list

Input tuples:                     Attribute 1:                  Attribute 2:
RID  credit-rate  age  buy-car?   credit-rate  buy-car?  RID    age  buy-car?  RID
1    good         30   yes        good         yes       1      28   no        2
2    fair         28   no         fair         no        2      30   yes       1

- Outperforms SLIQ when the class list is too large to fit in main memory
  - but needs a hash tree to connect the different joins, which could be costly with a large training set
- Designed to ease parallelization
PUBLIC
- Integrates splitting and pruning
- Observation & idea:
  - a large portion of the tree ends up being pruned
  - Can we use a top-down approach to predetermine this and stop growing the tree earlier?
- How?
  - Before expanding a node, compute a lower-bound estimate of the cost of the subtree rooted at the node
  - If a node is predicted to be pruned (based on the cost estimate), return it as a leaf; otherwise go on splitting it
RainForest
- A generic framework that
  - separates the scalability aspects from the criteria that determine the quality of the tree
  - applies to any decision tree induction algorithm
  - maintains an AVC-set (attribute, value, class label)
- Reports a speedup over SPRINT
Data Cube-Based DT Induction
- Integration of generalization with DT induction – once the tree is derived, use a concept hierarchy to generalize
  - Generalization at low-level concepts -> large, bushy trees
  - Generalization at high-level concepts -> lost interestingness
- Cube-based multi-level classification
  - Relevance analysis at multiple levels
Visualization in Classification
[Figure: DBMiner – presentation of classification rules]

Visualization in Classification
[Figure: SGI/MineSet 3.0 – decision tree visualization]

Visualization in Classification
[Figure: Interactive visual mining by Perception-Based Classification (PBC)]
Outline
- Overview
  - Tasks
  - Evaluation
  - Issues
- Classification techniques
  - Decision Tree
  - Bayesian-based
  - Neural network
  - Others
- Prediction techniques
- Summary
Bayesian classification
Basic features/advantages:
- Probabilistic classification
  - Gives the explicit probability that a given example belongs to a certain class
  - Predicts multiple hypotheses, weighted by their probabilities
- Probabilistic learning
  - Each training example can incrementally increase or decrease the probability of a hypothesis
- Probabilities have theoretical support
  - Provide standard measures to facilitate decision-making and comparison with other methods
Bayesian-based approaches
- Basic approaches
  - Naïve Bayes classification
  - Bayesian network learning
- Both are based on Bayes Theorem
Bayes Theorem
- X – an observed training example
- H – a hypothesis that X belongs to a class C
- Goal of classification: determine P(H|X), the posterior probability that H holds conditioned on X, from the prior probabilities P(H) and P(X) and the likelihood P(X|H)
- Bayes Theorem:

  P(H|X) = P(X|H) P(H) / P(X)

- Informally, this can be written as
  posterior = likelihood × prior / evidence (the RHS can be estimated)
- Practical issues:
  - requires initial knowledge of many probabilities
  - significant computational cost
Naïve Bayes Classifier
- A data example X is an n-dimensional vector -> n attributes
- Naïve assumption of class conditional independence
  - For a given class label, attribute values are conditionally independent of one another
  - No dependence relationships between attributes

  P(X|Ci) = Π_{k=1..n} P(x_k|Ci), where Ci is a class label

- The assumption reduces the computation cost: only count the class distribution
- Once the probability P(X|Ci) is known, assign X to the class with maximum P(X|Ci)P(Ci) – Why? and How?
Naïve Bayes Classifier
Once the probability P(X|Ci) is known, assign X to the class with maximum P(X|Ci)P(Ci) – Why?
- The classifier predicts the class of X to be the one that gives the maximum P(Ci|X) over all possible classes Ci
- By Bayes theorem, P(Ci|X) = P(X|Ci) P(Ci) / P(X)
- Since P(X) is the same for all classes, only P(X|Ci)P(Ci) needs to be maximized
Example
X = (age = "<=30", income = "medium", student = "yes", credit_rating = "fair")
Compute P(X|Ci) for each class:

age    income  student  credit_rate  buys_car
<=30   high    no       fair         no
<=30   high    no       excellent    no
31…40  high    no       fair         yes
>40    medium  no       fair         yes
>40    low     yes      fair         yes
>40    low     yes      excellent    no
31…40  low     yes      excellent    yes
<=30   medium  no       fair         no
<=30   low     yes      fair         yes
>40    medium  yes      fair         yes
<=30   medium  yes      excellent    yes
31…40  medium  no       excellent    yes
31…40  high    yes      fair         yes
>40    medium  no       excellent    no

P(X|Ci):
P(X|buys_car="yes") = P(age="<=30"|buys_car="yes") × P(income="medium"|buys_car="yes")
                      × P(student="yes"|buys_car="yes") × P(credit_rating="fair"|buys_car="yes")
                    = 2/9 × 4/9 × 6/9 × 6/9 = 0.044
P(X|buys_car="no")  = 3/5 × 2/5 × 1/5 × 2/5 = 0.019

P(X|Ci) × P(Ci):
P(X|buys_car="yes") × P(buys_car="yes") = 0.044 × 9/14 = 0.028
P(X|buys_car="no")  × P(buys_car="no")  = 0.019 × 5/14 = 0.007

-> X belongs to class buys_car = "yes"
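A sketch of this classifier in Python, assuming the table above is available as `rows`, a list of dicts with keys "age", "income", "student", "credit_rate", and "buys_car" (names chosen here for illustration). It uses raw frequency counts with no smoothing, exactly as on the slide.

```python
from collections import Counter

def naive_bayes(rows, x, target="buys_car"):
    """Pick the class Ci maximizing P(X|Ci)P(Ci) under the naive
    class-conditional-independence assumption."""
    classes = Counter(r[target] for r in rows)
    best, best_score = None, -1.0
    for c, n_c in classes.items():
        score = n_c / len(rows)                   # prior P(Ci)
        for attr, value in x.items():             # product of P(x_k|Ci)
            match = sum(1 for r in rows
                        if r[target] == c and r[attr] == value)
            score *= match / n_c
        if score > best_score:
            best, best_score = c, score
    return best, best_score

# For X = {"age": "<=30", "income": "medium", "student": "yes",
#          "credit_rate": "fair"} on the table above, this returns
# ("yes", ~0.028), matching the slide's computation.
```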
Naïve Bayesian Classifiers
- P(x_k|Ci) for a continuous-valued attribute A_k can be estimated from a normal density function
- Advantages:
  - Easy to implement, and good results are obtained in most cases
- Disadvantages:
  - Loss of accuracy when the class conditional independence assumption is violated
  - In practice, dependencies exist among variables, e.g.,
    - hospitals: patients: profile (age, family history, etc.); symptoms, etc.
    - Dependencies among these cannot be modeled by a Naïve Bayesian Classifier
- How to deal with these dependencies?
  -> Bayesian Belief Networks
Bayesian Networks
- A Bayesian belief network (Bayes net, Bayesian network) allows class conditional dependencies to be represented
- A graphical model of dependency among the variables
  - Gives a specification of the joint probability distribution
  - Two components: a DAG (Directed Acyclic Graph) and CPTs (Conditional Probability Tables) representing P(X | parents(X)) for each node X

[Diagram: a DAG with nodes X, Y, Z, P.]
- Nodes: random variables
- Links: dependency
- X, Y are the parents of Z; Y is the parent of P
- No dependency between Z and P
- Has no loops or cycles

An Example: Bayesian Belief Networks

[Diagram: a belief network over FamilyHistory (FH), Smoker (S), LungCancer (LC), Emphysema, PositiveXRay, and Dyspnea; FamilyHistory and Smoker are the parents of LungCancer.]

CPT of Lung Cancer (LC):

        (FH,S)  (FH,~S)  (~FH,S)  (~FH,~S)
  LC    0.8     0.5      0.7      0.1
  ~LC   0.2     0.5      0.3      0.9

P(LC = yes | FH = yes and S = no) = 0.5

The joint probability of a tuple (z1, …, zn) corresponding to variables Z1, …, Zn is

  P(z1, …, zn) = Π_{i=1..n} P(z_i | Parents(Z_i))
Classification with Bayes Nets
- The classification process can return a probability distribution over the class attribute instead of a single class label
- Learning Bayes Nets
  - Causal learning/discovery is not the same as causal reasoning
  - Output: a Bayes net representing regularities in an input data set
  - Network structures may be known or unknown:
    - structure known + all variables observable: training to learn the CPTs
    - structure known + some hidden variables: use gradient descent – analogous to neural network learning
    - structure unknown + all variables observable: search through the model space to reconstruct the structure
    - structure unknown + all hidden variables: no good algorithms known
  - Two basic approaches: Bayesian learning and constraint-based learning
Outline
- Overview
  - Tasks
  - Evaluation
  - Issues
- Classification techniques
  - Decision Tree
  - Bayesian-based
  - Neural network
  - Others
- Prediction techniques
- Summary
Classification & function approximation
Classification: predicts categorical class labels
Typical Applications:
{credit history, salary} à Credit approval (Yes/No)
{Temp, Humidity} àRain (Yes/No)
Mathematically,
n 
n 
x ∈ X = {0,1}n , y ∈ Y = {0,1}
h: X →Y
y = h( x )
We want to estimate function h
69
Linear Classification

[Figure: points of class 'x' and class 'o' in the plane, separated by a straight line.]

- Binary classification problem
  - The data above the line belongs to class 'x'
  - The data below the line belongs to class 'o'
- Discriminative classifiers, e.g.,
  - SVM (support vector machine)
  - Perceptron (neural net approach)
Discriminative Classifiers
- Advantages
  - prediction accuracy is generally high compared to Bayesian methods
  - robust: works when training examples contain errors
  - fast evaluation of the learned target function (Bayesian networks are normally slow)
- Criticism
  - long training time
  - difficult to understand the learned function (weights), whereas Bayesian networks can be used easily for pattern discovery
  - not easy to incorporate domain knowledge, which is easy in the form of priors on the data or distributions
Neural Networks
- Analogy to biological systems
- Massive parallelism, allowing for computational efficiency
- The first learning algorithm came in 1959 (Rosenblatt), who suggested that if a target output value is provided for a single neuron with fixed inputs, one can incrementally change the weights to learn to produce these outputs using the perceptron learning rule
A Neuron

[Diagram: input vector x = (x0, x1, …, xn) with weight vector w = (w0, w1, …, wn); a weighted sum Σ plus a bias θ_k (which acts as a threshold to vary the activity) feeds an activation (sigmoid) function f, producing output y.]

- The n-dimensional input vector x is mapped into variable y by means of the scalar product and a nonlinear function mapping, e.g.,

  y = sign(Σ_{i=0..n} w_i x_i + θ_k)
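A minimal sketch of a single neuron's forward computation, with both the sign activation from the formula above and the sigmoid used later in backpropagation:

```python
import math

def neuron_output(x, w, theta, activation="sigmoid"):
    """One neuron: weighted sum of inputs plus bias, then an activation."""
    s = sum(wi * xi for wi, xi in zip(w, x)) + theta  # weighted sum + bias
    if activation == "sign":
        return 1 if s >= 0 else -1                    # y = sign(sum w_i x_i + theta)
    return 1.0 / (1.0 + math.exp(-s))                 # sigmoid squashing

# e.g., neuron_output([1, 0, 1], [0.2, -0.3, 0.4], theta=-0.4) ~ 0.55
```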
Multilayer Neural Networks

[Diagram: a three-layer network. The input vector (x1, x2, x3) enters input nodes 1, 2, 3; hidden nodes 4 and 5 receive weighted connections (w14, w15, w34, w35, …); output node 6 receives w46 and w56 and produces the output vector.]

- Three-layer neural network
  - Feed-forward – no feedback
  - Fully connected
- Multilayer feed-forward networks of linear threshold functions with enough hidden units can closely approximate any function
Defining a network structure
Before training, …
- Determine the # of units in the input layer, hidden layer(s), and output layer
  - No clear rules on the "best" number of hidden layer units
- Normalize input values to speed up learning
  - Continuous values -> range between 0 and 1
  - Discrete values -> Boolean values
- Accuracy depends on both the network topology and the initial weights
Training Neural Networks
- Goal
  - Find a set of weights that makes almost all the tuples in the training data classified correctly (or gives acceptable error rates)
- Steps
  - Initialize the weights with random values
  - Feed the input tuples into the network one by one
  - For each unit
    - Compute the net input to the unit as a linear combination of all the inputs to the unit
    - Compute the output value using the activation function
    - Compute the error
    - Update the weights and the bias
  - Repeat until a terminal condition is satisfied
Backpropagation
- Learning ~ searching using a gradient descent method
- Searches for a set of weights
  - that minimizes the mean squared distance between the network's class prediction and the actual class label of the samples
- A learning rate helps avoid the local minimum problem
  - If the rate is too low -> learning proceeds at a slow pace
  - If the rate is too high -> oscillation between solutions
  - Rule of thumb: set it to 1/t, where t = # of iterations through the training set so far
Backpropagation

[Diagram: the three-layer network with weights w14, w15, w24, w25, w34, w35, w46, w56 and biases θ4, θ5, θ6.]

- Initialization
  - Weights & biases: small random numbers (e.g., from -1 to 1): w14, w15, w24, w25, w34, w35, w46, w56, θ4, θ5, θ6
- For each training example, e.g., x = (x1, x2, x3) = (1, 0, 1):
- Propagate the input forward

  I_j = Σ_i w_ij O_i + θ_j
  O_j = 1 / (1 + e^(-I_j))   (sigmoid activation function)

  I4 = w14·O1 + w24·O2 + w34·O3 + θ4, where O1 = x1 = 1; I5 similarly
  Compute O4, O5
  I6 = w46·O4 + w56·O5 + θ6; then compute O6
Backpropagation
- Backpropagate the error
  - For the output unit: Err_j = O_j (1 - O_j)(T_j - O_j)
    Err6 = O6 (1 - O6)(T6 - O6), where T6 is the given target class label = 1
  - For a hidden unit: Err_j = O_j (1 - O_j) Σ_k Err_k w_jk
    Err5 = O5 (1 - O5)(Err6 · w56)
    Err4 = O4 (1 - O4)(Err6 · w46)
- Update weights & biases

  w_ij = w_ij + (l) · Err_j · O_i
  θ_j = θ_j + (l) · Err_j
Backpropagation
- Update weights & biases

  w_ij = w_ij + (l) · Err_j · O_i
  θ_j = θ_j + (l) · Err_j

  - Case updating – after each training example
  - Epoch updating – after each epoch (an iteration through the training set)
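A sketch of one case update for the 3-2-1 network above, assuming weights are stored in a dict keyed by unit pairs and biases in a dict keyed by unit (a representation chosen here for readability; the learning rate 0.9 is arbitrary):

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def backprop_step(x, target, w, theta, lr=0.9):
    """One forward/backward pass: units 1-3 are inputs, 4-5 hidden, 6 output.
    w = {(i, j): weight}, theta = {j: bias}."""
    O = {1: x[0], 2: x[1], 3: x[2]}                   # input units pass x through
    for j in (4, 5):                                  # forward: hidden layer
        O[j] = sigmoid(sum(w[(i, j)] * O[i] for i in (1, 2, 3)) + theta[j])
    O[6] = sigmoid(w[(4, 6)] * O[4] + w[(5, 6)] * O[5] + theta[6])
    err = {6: O[6] * (1 - O[6]) * (target - O[6])}    # backpropagate the error
    for j in (4, 5):
        err[j] = O[j] * (1 - O[j]) * err[6] * w[(j, 6)]
    for (i, j), wij in w.items():                     # update weights & biases
        w[(i, j)] = wij + lr * err[j] * O[i]
    for j in err:
        theta[j] = theta[j] + lr * err[j]
    return O[6]                                       # current network prediction
```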
Backpropagation
- Repeat the training process (i.e., forward propagation, etc.)
- Until
  - all the changes to w_ij in the previous epoch were below some specified threshold, or
  - the % of samples misclassified in the previous epoch is below a threshold, or
  - a pre-specified number of epochs has expired
Pruning and Rule Extraction
- A major drawback of neural net learning is that its model is hard to interpret
- A fully connected network is hard to articulate -> network pruning
  - Remove weighted links that do not decrease the classification accuracy of the network
- Extracting rules from a trained, pruned network
  - Discretize the activation values: replace each individual activation value by the cluster average, maintaining the network accuracy
  - Derive rules relating activation values and output, using the discretized activation values to enumerate the outputs
  - Similarly, find the rules relating the input and the activation values
  - Combine the above two to obtain rules relating the output to the input
Outline
- Overview
  - Tasks
  - Evaluation
  - Issues
- Classification techniques
  - Decision Tree
  - Bayesian-based
  - Neural network
  - Others
- Prediction techniques
- Summary
Other Classification Methods
- Support vector machine
- Classification-based association
- K-nearest neighbor classifier
- Case-based reasoning
- Soft computing approaches
  - Genetic algorithm
  - Rough set approach
  - Fuzzy set approaches
Support vector machine (SVM)
- Basic idea
  - find the best boundary between classes
  - build a classifier on top of it
- Boundary points are called support vectors
- Two basic types:
  - Linear SVM
  - Non-linear SVM
SVM

[Figure: two linear separators for the same data, one with a small margin and one with a large margin; the support vectors are the points on the margin boundaries.]
Optimal Hyperplane: separable case
- Class 1 and class 2 are separable
- Hyperplane: x^T β + β0 = 0 (β a unit vector, β0 a bias)
- Crossed points are support vectors – the points that maximize the margin between the two classes
- Given a training set of N pairs of x_i with labels y_i, the goal is to maximize the margin C subject to the constraint:

  y_i (x_i^T β + β0) > C,  i = 1, …, N

[Figure: a separating hyperplane with margin C; the support vectors are marked with X.]
Non-separable case
- When the data set is non-separable,
  - assign a slack ξ_i to each support vector, changing the constraint to

    y_i (x_i^T β + β0) > C (1 - ξ_i),
    where ∀i, ξ_i > 0 and Σ_{i=1..N} ξ_i < const.

[Figure: the hyperplane x^T β + β0 = 0 with margin C; points inside the margin have slack ξ*.]
General SVM
- No good optimal linear classifier exists. Can we do better?
  -> A non-linear boundary (circled points are support vectors)
- Idea:
  - Map the original problem space into a larger space so that the boundary in the new space is linearly separable
  - Use a kernel (one does not always exist) to compute distances in the new space (which could be infinite-dimensional)
Example
- When the data is not linearly separable,
  - project the data to a high-dimensional space where it is linearly separable, and then use a linear SVM

[Figure: one-dimensional data at -1, 0, +1 labeled +, -, + is not linearly separable; after mapping into two dimensions (points (0,0), (1,0), (0,1)), the classes become linearly separable.]
Performance of SVM
- For a general support vector machine,

  E[P(error)] ≤ E[(# of support vectors) / (# of training samples)]

- SVM has been very successful in lots of applications
SVM vs. Neural Network
- SVM
  - Relatively new concept
  - Nice generalization properties
  - Hard to learn – learned in batch mode using quadratic programming techniques
  - Using kernels, can learn very complex functions
- Neural Network
  - Quite old
  - Generalizes well but doesn't have a strong mathematical foundation
  - Can easily be learned in incremental fashion
  - To learn complex functions – use a multilayer perceptron (non-trivial)
Open problems of SVM
- How to choose an appropriate kernel function
  - Different kernels give different results, although generally better than hyperplanes
- For a very large training set, the set of support vectors might be large; speed thus becomes a bottleneck
- An optimal design for multi-class SVM classifiers
SVM Related Links
- http://svm.dcs.rhbnc.ac.uk/
- http://www.kernel-machines.org/
- C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery, 2(2), 1998.
- SVMlight – software (in C): http://ais.gmd.de/~thorsten/svm_light
Other Classification Methods
- Support vector machine
- Classification-based association
- K-nearest neighbor classifier
- Case-based reasoning
- Soft computing approaches
  - Genetic algorithm
  - Rough set approach
  - Fuzzy set approaches
Association-Based Classification
- Several methods
  - ARCS: association rule + clustering (Lent et al. '97)
    - Beats C4.5 in (mainly) scalability and also accuracy
  - Associative classification (Liu et al. '98)
    - Mines high-support, high-confidence rules of the form "cond_set => y", where y is a class label
  - CAEP: Classification by aggregating emerging patterns (Dong et al. '99)
    - Emerging patterns (EPs): itemsets whose support increases significantly from one class to another
    - Mines EPs based on minimum support and growth rate
Instance-Based Methods
- Instance-based learning:
  - Delays generalization until a new instance must be classified ("lazy evaluation" or "lazy learning")
  - Note: "eager learning" (e.g., decision tree) constructs a generalization model before classifying a new sample
- Thus, lazy learning gives fast training but slow classification
- Examples of lazy learning:
  - k-nearest neighbor approach
  - Locally weighted regression
  - Case-based reasoning
- Requires efficient indexing techniques
k-nearest neighbor classifiers
- A training sample = a point in n-dimensional space
- When an unknown sample x is given for classification,
  - Search for the k training samples (the k "nearest neighbors") that are closest to x
    - e.g., "closeness" ~ Euclidean distance
  - For a discrete-valued class, x is assigned to the most common class among its k nearest neighbors
  - For a continuous-valued class (e.g., in prediction), x is assigned the average value of the continuous-valued class associated with the k nearest neighbors
- Potentially expensive when the number of nearest neighbors k is large
k-nearest neighbor classifiers (cont)
- Distance-weighted nearest neighbor algorithm
  - Weight the contribution of each of the k neighbors according to their distance to the query point x_q, e.g.,

    w ≡ 1 / d(x_q, x_i)²

    - giving greater weight to closer neighbors
  - Similarly for real-valued target functions
- Robust to noisy data by averaging the k nearest neighbors
- Curse of dimensionality: the distance between neighbors could be dominated by irrelevant attributes
  - To overcome it, stretch the axes or eliminate the least relevant attributes
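A minimal sketch of both variants (plain majority vote and distance-weighted votes), assuming training data as a list of (point, label) pairs with numeric tuple points:

```python
import math
from collections import Counter

def knn_classify(train, x, k=3, weighted=False):
    """k-NN sketch: closeness = Euclidean distance; with weighted=True,
    each neighbor's vote is scaled by 1/d^2 (an exact match gets a plain vote)."""
    dist = lambda a, b: math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    neighbors = sorted(train, key=lambda p: dist(p[0], x))[:k]  # k closest
    votes = Counter()
    for point, label in neighbors:
        d = dist(point, x)
        votes[label] += 1.0 / d ** 2 if weighted and d > 0 else 1.0
    return votes.most_common(1)[0][0]    # most common (or heaviest) class

# e.g., knn_classify([((0, 0), "o"), ((1, 0), "o"), ((5, 5), "x")], (0.5, 0.2))
# -> "o"
```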
Case-Based Reasoning (CBR)
- Similar to k-nearest neighbor
  - Instance-based learner (lazy learner) that analyzes similar samples
- But … a sample is a "case" (a complex symbolic description), not a point in Euclidean space
- Idea: given a new case to classify,
  - CBR searches for a training case that is identical
    -> gives the stored solution
  - Otherwise, it finds cases with similar components
    -> modifies the stored solutions into a solution for the new case
- Requires background knowledge and tight coupling among case retrieval, knowledge-based reasoning, and problem solving
- Challenges:
  - Similarity metrics
  - Efficient techniques for indexing training cases and combining solutions
Genetic Algorithms (GA)
- Based on ideas of natural evolution
  - Create an initial population consisting of randomly generated rules, represented by bit strings
  - Generate rule offspring by applying genetic operators
    - Crossover – swap substrings between pairs of rules
    - Mutation – randomly selected bits are inverted
  - Form a new population containing the fittest rules
    - The fitness of a rule is measured by its classification accuracy on a set of training examples
  - Repeat the process until the population has "evolved" to satisfy a fitness threshold
- GA facilitates parallelization and is popular for solving optimization problems
Rough Set Approach
- Rough sets are used to "roughly" define equivalence classes in the training data set
- A rough set for a given class C is approximated by
  - a lower approximation (samples certain to be in C)
  - an upper approximation (samples that cannot be described as not belonging to C)
- Rough sets can also be used for relevance analysis
  - Finding the minimal subsets of attributes is NP-hard

[Figure: a region approximated by a covering of rectangles; the rectangles correspond to equivalence classes.]
Fuzzy Set
- Fuzzy logic
  - uses truth values between 0.0 and 1.0 to represent the degree of membership (such as using a fuzzy membership graph)
- Fuzzy logic allows classification at a high level of abstraction
  - Attribute values are converted to fuzzy values, e.g., income is mapped into the discrete categories {low, medium, high} with calculated fuzzy values
  - For a given new sample, more than one fuzzy value may apply
  - Each applicable rule contributes a vote for membership in the categories
  - Typically, the truth values for each predicted category are summed/combined
Outline
- Overview
  - Tasks
  - Evaluation
  - Issues
- Classification techniques
  - Decision Tree
  - Bayesian-based
  - Neural network
  - Others
- Prediction techniques
- Summary
What Is Prediction?
- Prediction is similar to classification
  - First, construct a model; then use it to predict unknown values
  - Mostly deals with modeling of continuous-valued functions
  - The major method for prediction is regression
    - Linear and multiple regression
    - Non-linear regression
Regression Models
- Linear regression: data are modeled using a straight line, e.g.,
  - Y = α + β X, where Y = response variable, X = predictor variable
  - Assumes the variance of Y is constant
  - α and β are regression coefficients, to be estimated using the least squares method (errors between the actual data and the estimated line are minimized)
- Multiple regression: extends the linear model to more than one predictor variable, e.g., Y = b0 + b1 X1 + b2 X2
- Nonlinear regression: many nonlinear models can be transformed into linear ones
- Log-linear models: approximate discrete probability distributions
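A short sketch of the least squares fit for the simple linear model Y = α + β X, using the closed-form estimates:

```python
def fit_line(xs, ys):
    """Least-squares estimates for Y = alpha + beta * X:
    beta  = sum (x - x_bar)(y - y_bar) / sum (x - x_bar)^2
    alpha = y_bar - beta * x_bar"""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    beta = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
            / sum((x - x_bar) ** 2 for x in xs))
    alpha = y_bar - beta * x_bar
    return alpha, beta

# e.g., fit_line([1, 2, 3, 4], [3, 5, 7, 9]) -> (1.0, 2.0), i.e., Y = 1 + 2X
```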
Summary
- Classification is an extensively studied problem (mainly in statistics, machine learning & neural networks)
- Classification is probably one of the most widely used data mining techniques, with a lot of extensions
- Scalability is still an important issue for database applications; thus combining classification with database techniques should be a promising topic
- Research directions: classification of non-relational data, e.g., text, spatial, multimedia, etc.
References
- C. Apte and S. Weiss. Data mining with decision trees and decision rules. Future Generation Computer Systems, 13, 1997.
- L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth International Group, 1984.
- C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery, 2(2): 121-168, 1998.
- P. K. Chan and S. J. Stolfo. Learning arbiter and combiner trees from partitioned data for scaling machine learning. In Proc. 1st Int. Conf. Knowledge Discovery and Data Mining (KDD'95), pages 39-44, Montreal, Canada, August 1995.
- U. M. Fayyad. Branching on attribute values in decision tree generation. In Proc. 1994 AAAI Conf., pages 601-606, AAAI Press, 1994.
- J. Gehrke, R. Ramakrishnan, and V. Ganti. RainForest: A framework for fast decision tree construction of large datasets. In Proc. 1998 Int. Conf. Very Large Data Bases, pages 416-427, New York, NY, August 1998.
- J. Gehrke, V. Ganti, R. Ramakrishnan, and W.-Y. Loh. BOAT – Optimistic Decision Tree Construction. In SIGMOD'99, Philadelphia, Pennsylvania, 1999.
References (cont)
- M. Kamber, L. Winstone, W. Gong, S. Cheng, and J. Han. Generalization and decision tree induction: Efficient classification in data mining. In Proc. 1997 Int. Workshop Research Issues on Data Engineering (RIDE'97), Birmingham, England, April 1997.
- B. Liu, W. Hsu, and Y. Ma. Integrating Classification and Association Rule Mining. In Proc. 1998 Int. Conf. Knowledge Discovery and Data Mining (KDD'98), New York, NY, August 1998.
- W. Li, J. Han, and J. Pei. CMAR: Accurate and Efficient Classification Based on Multiple Class-Association Rules. In Proc. 2001 Int. Conf. on Data Mining (ICDM'01), San Jose, CA, November 2001.
- J. Magidson. The CHAID approach to segmentation modeling: Chi-squared automatic interaction detection. In R. P. Bagozzi, editor, Advanced Methods of Marketing Research, pages 118-159. Blackwell Business, Cambridge, Massachusetts, 1994.
- M. Mehta, R. Agrawal, and J. Rissanen. SLIQ: A fast scalable classifier for data mining. In Proc. Int. Conf. Extending Database Technology (EDBT'96), Avignon, France, March 1996.
References (cont)
- T. M. Mitchell. Machine Learning. McGraw Hill, 1997.
- S. K. Murthy. Automatic Construction of Decision Trees from Data: A Multi-Disciplinary Survey. Data Mining and Knowledge Discovery, 2(4): 345-389, 1998.
- J. R. Quinlan. Induction of decision trees. Machine Learning, 1:81-106, 1986.
- J. R. Quinlan. Bagging, boosting, and C4.5. In Proc. 13th Natl. Conf. on Artificial Intelligence (AAAI'96), pages 725-730, Portland, OR, August 1996.
- R. Rastogi and K. Shim. PUBLIC: A decision tree classifier that integrates building and pruning. In Proc. 1998 Int. Conf. Very Large Data Bases, pages 404-415, New York, NY, August 1998.
- J. Shafer, R. Agrawal, and M. Mehta. SPRINT: A scalable parallel classifier for data mining. In Proc. 1996 Int. Conf. Very Large Data Bases, pages 544-555, Bombay, India, September 1996.
- S. M. Weiss and C. A. Kulikowski. Computer Systems that Learn: Classification and Prediction Methods from Statistics, Neural Nets, Machine Learning, and Expert Systems. Morgan Kaufmann, 1991.
- S. M. Weiss and N. Indurkhya. Predictive Data Mining. Morgan Kaufmann, 1997.