Lecture 6
Classification and Prediction
(Part A)
Zhou Shuigeng
May 6, 2007
Outline
What is Classification/Prediction?
Issues Regarding Classification/Prediction
Feature Selection
Classification by Decision Tree Induction
Bayesian Classification
Classification Evaluation
Predictive Modeling
Goal: learn a mapping: y = f(x;θ)
Need:
A model structure
A score function
An optimization strategy
Categorical y ∈ {c1,…,cm}: classification
Real-valued y:
regression
Note: usually assume {c1,…,cm} are mutually exclusive
and exhaustive
Classification vs. Prediction
Classification:
predicts categorical class labels (discrete or nominal)
constructs a model based on the training set and the values (class labels) of a classifying attribute, and uses the model to classify new data
Prediction:
models continuous-valued functions, i.e., predicts
unknown or missing values
Typical Applications
credit approval
target marketing
medical diagnosis
treatment effectiveness analysis
Two Classes of Classifiers
Discriminative Classification
Provides a decision surface or boundary to separate the different classes
In real life, it is often impossible to separate out
the classes perfectly
Instead, seek function f(x;θ) that maximizes some
measure of separation between the classes. We
call f(x; θ) the discriminant function.
Probabilistic Classification
Focus on modeling the probability distribution of
each class
Discriminative vs. Probabilistic Classification
[Figure: example decision boundaries D1, D2, D3 separating the classes]
Classification—A Two-Step Process
Model construction: describing a set of predetermined classes
Each tuple/sample is assumed to belong to a predefined
class, as determined by the class label attribute
The set of tuples used for model construction is the training set
The model is represented as classification rules, decision
trees, or mathematical formulae
Model usage: for classifying future or unknown objects
Estimate accuracy of the model
The known label of a test sample is compared with the classified result from the model
Accuracy rate is the percentage of test set samples that are correctly classified by the model
The test set is independent of the training set; otherwise overfitting will occur
If the accuracy is acceptable, use the model to classify data
tuples whose class labels are not known
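A minimal Python sketch of this two-step workflow (illustrative, not from the lecture); train and classify are hypothetical placeholders for any classification algorithm:

# Step 1: build a model from the training set.
# Step 2: estimate accuracy on an independent test set, then use the model
# on tuples whose class labels are unknown.
def evaluate(train, classify, training_set, test_set):
    model = train(training_set)                        # model construction
    correct = sum(1 for x, label in test_set
                  if classify(model, x) == label)      # model usage on test data
    accuracy = correct / len(test_set)
    return model, accuracy

# if the accuracy is acceptable, classify new, unlabeled tuples:
# predictions = [classify(model, x) for x in unseen_data]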
Classification Process (1): Model Construction

A classification algorithm is applied to the training data to produce a classifier (model).

Training data:

NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

Classifier (model):
IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’
Classification Process (2): Use the Model in Prediction

The classifier is applied to the testing data to check its accuracy, and then to unseen data, e.g. (Jeff, Professor, 4) → Tenured?

Testing data:

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes
Supervised vs. Unsupervised
Learning
Supervised learning (classification)
Supervision: The training data (observations,
measurements, etc.) are accompanied by labels
indicating the class of the observations
New data is classified based on the training set
Unsupervised learning (clustering)
The class labels of training data are unknown
Given a set of measurements, observations, etc.
with the aim of establishing the existence of
classes or clusters in the data
Outline
What is Classification/Prediction?
Issues Regarding Classification/Prediction
Feature Selection
Classification by Decision Tree Induction
Bayesian Classification
Classification Evaluation
Issues Regarding Classification and
Prediction (1): Data Preparation
Data cleaning
  Preprocess data in order to reduce noise and handle missing values
Relevance analysis (feature selection)
  Remove irrelevant or redundant attributes
Data transformation
  Generalize and/or normalize data
Issues regarding classification and prediction
(2): Evaluating Classification Methods
Predictive accuracy
Speed and scalability
time to construct the model
time to use the model
Robustness
handling noise and missing values
Scalability
efficiency in disk-resident databases
Interpretability:
understanding and insight provided by the model
Goodness of rules
decision tree size
compactness of classification rules
Outline
What is Classification/Prediction?
Issues Regarding Classification/Prediction
Feature Selection
Classification by Decision Tree Induction
Bayesian Classification
Classification Evaluation
Summary
Feature Selection - Definition
Attempt to select a minimal-size subset of features according to certain criteria:
  classification accuracy does not drop significantly
  the class distribution given only the selected features is as close as possible to the original class distribution given all features
Feature selection: dimensions
Evaluation measure: accuracy, consistency, classic
Search strategy: complete, heuristic, nondeterministic
Generation scheme: forward, backward, random
Four Basic Steps
1. Generation: produce a candidate subset from the original feature set
2. Subset evaluation: measure the goodness of the subset
3. Stopping criterion: if not satisfied, go back to generation; otherwise continue
4. Learning/testing: validate the selected subset with the learning algorithm
Search Procedure - Approaches
Complete
  exhaustive search for the optimal subset
  guaranteed optimality
  able to backtrack
Heuristic
  "hill-climbing": at each iteration, the remaining features not yet selected/rejected are considered for selection/rejection
  simple to implement and fast in producing results
  produces sub-optimal results
Random
  sets a maximum number of possible iterations
  optimality depends on the values of some parameters
Evaluation Functions - Method Types
Filter methods
  Distance measures
  Information measures
  Dependence measures
  Consistency measures
Wrapper methods
  Classifier error rate measures
Some Measures for Text Classification
Information gain:
  IG(t) = Σ_{c∈C} [ P(c, t) log( P(c, t) / (P(c) P(t)) ) + P(c, t̄) log( P(c, t̄) / (P(c) P(t̄)) ) ]
Mutual information:
  MI(t, c) = log( P(t | c) / P(t) )
Chi-square:
  χ²(c, f) = ( P(c, f) P(c̄, f̄) − P(c, f̄) P(c̄, f) )² / ( P(c) P(f) P(c̄) P(f̄) )
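For illustration, a small Python sketch (not from the lecture) that computes the information gain of a term from a hypothetical contingency table of (class, term-present) document counts:

from math import log2

def information_gain(counts):
    # counts: {(class_label, has_term): number of documents}
    total = sum(counts.values())
    classes = {c for c, _ in counts}
    term_states = {x for _, x in counts}
    p_c = {c: sum(v for (ci, _), v in counts.items() if ci == c) / total for c in classes}
    p_x = {x: sum(v for (_, xi), v in counts.items() if xi == x) / total for x in term_states}
    ig = 0.0
    for (c, x), n in counts.items():
        p_cx = n / total
        if p_cx > 0:                             # base-2 log used here
            ig += p_cx * log2(p_cx / (p_c[c] * p_x[x]))
    return ig

# hypothetical counts: (class, term present?) -> number of documents
counts = {("sports", True): 40, ("sports", False): 10,
          ("politics", True): 5, ("politics", False): 45}
print(information_gain(counts))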
Problem with the Filter Model
Assesses the merits of features from the data alone and ignores the classification algorithm
  The generated feature subset may not be optimal for the target classification algorithm
Wrapper Model
Uses the actual target classification algorithm to evaluate the accuracy of each candidate subset, as in the sketch below
Evaluation criterion: classifier error rate
Generation method can be heuristic, complete or random
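A possible sketch of a wrapper with heuristic (forward) generation; cv_error is a hypothetical callback that trains the target classifier on a candidate subset and returns its cross-validated error rate:

def forward_wrapper_selection(all_features, cv_error, max_size=None):
    selected, best_err = [], cv_error([])            # start from the empty subset
    max_size = max_size or len(all_features)
    while len(selected) < max_size:
        candidates = [f for f in all_features if f not in selected]
        if not candidates:
            break
        # evaluate each one-feature extension with the target classifier itself
        scored = [(cv_error(selected + [f]), f) for f in candidates]
        err, best_f = min(scored)
        if err >= best_err:                          # stop when no extension helps
            break
        selected.append(best_f)
        best_err = err
    return selected, best_err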
Wrapper Compared to Filter
Pros: higher accuracy
Cons: higher computation cost; lacks generality
Outline
What is Classification/Prediction?
Issues Regarding Classification/Prediction
Feature Selection
Classification by Decision Tree Induction
Bayesian Classification
Classification Evaluation
Training Dataset
This example follows Quinlan's ID3.

age    income  student  credit_rating  buys_computer
<=30   high    no       fair           no
<=30   high    no       excellent      no
31…40  high    no       fair           yes
>40    medium  no       fair           yes
>40    low     yes      fair           yes
>40    low     yes      excellent      no
31…40  low     yes      excellent      yes
<=30   medium  no       fair           no
<=30   low     yes      fair           yes
>40    medium  yes      fair           yes
<=30   medium  yes      excellent      yes
31…40  medium  no       excellent      yes
31…40  high    yes      fair           yes
>40    medium  no       excellent      no
Output: A Decision Tree for “buys_computer”
age?
  <=30: student?
    no: buys_computer = no
    yes: buys_computer = yes
  31…40: buys_computer = yes
  >40: credit_rating?
    excellent: buys_computer = no
    fair: buys_computer = yes
Algorithm for Decision Tree Induction
Basic algorithm (a greedy algorithm)
Tree is constructed in a top-down recursive divide-and-conquer manner
At start, all the training examples are at the root
Attributes are categorical (if continuous-valued, they are
discretized in advance)
Examples are partitioned recursively based on selected
attributes
Test attributes are selected on the basis of a heuristic or
statistical measure (e.g., information gain)
Conditions for stopping partitioning
All samples for a given node belong to the same class
There are no remaining attributes for further partitioning –
majority voting is employed for classifying the leaf
There are no samples left
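The basic loop can be sketched in a few lines of Python; this is an illustrative ID3-style version (not the lecture's code), assuming categorical attributes stored as dicts:

from collections import Counter
from math import log2

def entropy(rows, target):
    counts = Counter(r[target] for r in rows)
    total = len(rows)
    return -sum((n / total) * log2(n / total) for n in counts.values())

def build_tree(rows, attributes, target):
    classes = {r[target] for r in rows}
    if len(classes) == 1:                      # all samples belong to one class
        return classes.pop()
    if not attributes:                         # no attributes left: majority voting
        return Counter(r[target] for r in rows).most_common(1)[0][0]
    def gain(a):                               # information gain of attribute a
        e = 0.0
        for v in {r[a] for r in rows}:
            subset = [r for r in rows if r[a] == v]
            e += len(subset) / len(rows) * entropy(subset, target)
        return entropy(rows, target) - e
    best = max(attributes, key=gain)           # attribute with highest information gain
    tree = {best: {}}
    for v in {r[best] for r in rows}:          # partition recursively on its values
        subset = [r for r in rows if r[best] == v]
        tree[best][v] = build_tree(subset, [a for a in attributes if a != best], target)
    return tree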
Decision Tree Induction Algorithms
Many Algorithms
Hunt’s Algorithm (one of the earliest)
CART
ID3, C4.5
SLIQ,SPRINT
Attribute Selection Measure:
Information Gain (ID3/C4.5)
Select the attribute with the highest information gain
S contains si tuples of class Ci for i = {1, …, m}
information measure (expected info required to classify any arbitrary tuple):
  I(s1, s2, …, sm) = − Σ_{i=1}^{m} (si / s) log2(si / s)
entropy of attribute A with values {a1, a2, …, av}:
  E(A) = Σ_{j=1}^{v} ((s1j + … + smj) / s) · I(s1j, …, smj)
information gained by branching on attribute A:
  Gain(A) = I(s1, s2, …, sm) − E(A)
Attribute Selection by Information Gain Computation

Class P: buys_computer = "yes"
Class N: buys_computer = "no"
I(p, n) = I(9, 5) = 0.940

Compute the entropy for age (using the 14-sample training set above):

age    pi  ni  I(pi, ni)
<=30   2   3   0.971
31…40  4   0   0
>40    3   2   0.971

E(age) = (5/14) I(2, 3) + (4/14) I(4, 0) + (5/14) I(3, 2) = 0.694

Here (5/14) I(2, 3) means "age <= 30" has 5 out of 14 samples, with 2 yes's and 3 no's. Hence

Gain(age) = I(p, n) − E(age) = 0.246

Similarly,
Gain(income) = 0.029
Gain(student) = 0.151
Gain(credit_rating) = 0.048
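These numbers can be checked with a short Python snippet (an illustration, not part of the slides):

from math import log2

def info(*counts):                               # I(s1, ..., sm)
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

i_all = info(9, 5)                               # I(9, 5)
e_age = (5/14) * info(2, 3) + (4/14) * info(4, 0) + (5/14) * info(3, 2)
print(round(i_all, 3), round(e_age, 3), round(i_all - e_age, 3))
# ≈ 0.940 0.694 0.246, i.e. Gain(age) = 0.246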
Other Attribute Selection Measures
Gini index (CART, IBM IntelligentMiner)
All attributes are assumed continuous-valued
Assume there exist several possible split values
for each attribute
May need other tools, such as clustering, to get
the possible split values
Can be modified for categorical attributes
Gini Index (IBM IntelligentMiner)
If a data set T contains examples from n classes, the gini index gini(T) is defined as
  gini(T) = 1 − Σ_{j=1}^{n} pj²
where pj is the relative frequency of class j in T.
If the data set T is split into two subsets T1 and T2 with sizes N1 and N2 respectively, the gini index of the split data is defined as
  gini_split(T) = (N1/N) gini(T1) + (N2/N) gini(T2)
The attribute that provides the smallest gini_split(T) is chosen to split the node (need to enumerate all possible split points for each attribute), as in the sketch below.
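A small Python sketch of these two formulas (illustrative only), where labels1 and labels2 are the class labels of the two subsets T1 and T2:

from collections import Counter

def gini(labels):
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def gini_split(labels1, labels2):
    n = len(labels1) + len(labels2)
    return len(labels1) / n * gini(labels1) + len(labels2) / n * gini(labels2)

# hypothetical split: T1 holds 4 'yes' / 1 'no', T2 holds 5 'yes' / 4 'no'
print(gini_split(["yes"] * 4 + ["no"], ["yes"] * 5 + ["no"] * 4))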
Extracting Classification Rules from Trees
Represent the knowledge in the form of IF-THEN rules
One rule is created for each path from the root to a leaf
Each attribute-value pair along a path forms a conjunction
The leaf node holds the class prediction
Rules are easier for humans to understand
Example
IF age = “<=30” AND student = “no” THEN buys_computer = “no”
IF age = “<=30” AND student = “yes” THEN buys_computer = “yes”
IF age = “31…40”
THEN buys_computer = “yes”
IF age = “>40” AND credit_rating = “excellent” THEN buys_computer = “no”
IF age = “>40” AND credit_rating = “fair” THEN buys_computer = “yes”
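For illustration, a short sketch (not from the lecture) that emits one IF-THEN rule per root-to-leaf path, assuming the nested-dict tree representation used in the earlier induction sketch ({attribute: {value: subtree-or-class-label}}):

def extract_rules(tree, conditions=()):
    if not isinstance(tree, dict):               # leaf: emit the accumulated rule
        body = " AND ".join(f"{a} = '{v}'" for a, v in conditions) or "TRUE"
        return [f"IF {body} THEN class = '{tree}'"]
    (attr, branches), = tree.items()             # internal node: one attribute test
    rules = []
    for value, subtree in branches.items():      # each branch extends the conjunction
        rules.extend(extract_rules(subtree, conditions + ((attr, value),)))
    return rules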
Avoid Overfitting in Classification
Overfitting: An induced tree may overfit the training
data
Too many branches, some may reflect anomalies
due to noise or outliers
Poor accuracy for unseen samples
Two approaches to avoid overfitting
Prepruning: Halt tree construction early—do not
split a node if this would result in the goodness
measure falling below a threshold
Difficult to choose an appropriate threshold
Postpruning: Remove branches from a “fully grown”
tree—get a sequence of progressively pruned trees
Use a set of data different from the training
data to decide which is the “best pruned tree”
Approaches to Determine
the Final Tree Size
Separate training (2/3) and testing (1/3) sets
Use cross validation, e.g., 10-fold cross validation
Use all the data for training
but apply a statistical test (e.g., chi-square) to
estimate whether expanding or pruning a node may
improve the entire distribution
Use the minimum description length (MDL) principle: halt growth of the tree when the encoding is minimized
Enhancements to basic decision
tree induction
Allow for continuous-valued attributes
Dynamically define new discrete-valued attributes
that partition the continuous attribute value into a
discrete set of intervals
Handle missing attribute values
Assign the most common value of the attribute
Assign probability to each of the possible values
Attribute construction
  Create new attributes based on existing ones that are sparsely represented
  This reduces fragmentation, repetition, and replication
Classification in Large Databases
Classification—a classical problem extensively studied
by statisticians and machine learning researchers
Scalability: Classifying data sets with millions of
examples and hundreds of attributes with reasonable
speed
Why decision tree induction in data mining?
relatively faster learning speed (than other
classification methods)
convertible to simple and easy to understand
classification rules
can use SQL queries for accessing databases
comparable classification accuracy with other
methods
Scalable Decision Tree Induction
Methods in Data Mining Studies
SLIQ (EDBT’96 — Mehta et al.)
builds an index for each attribute and only class list
and the current attribute list reside in memory
SPRINT (VLDB’96 — J. Shafer et al.)
constructs an attribute list data structure
PUBLIC (VLDB’98 — Rastogi & Shim)
integrates tree splitting and tree pruning: stop
growing the tree earlier
RainForest (VLDB’98 — Gehrke, Ramakrishnan & Ganti)
separates the scalability aspects from the criteria
that determine the quality of the tree
builds an AVC-list (attribute, value, class label)
Data Cube-Based Decision-Tree Induction
Integration of generalization with decision-tree
induction (Kamber et al’97).
Classification at primitive concept levels
E.g., precise temperature, humidity, outlook, etc.
Low-level concepts, scattered classes, bushy
classification-trees
Semantic interpretation problems.
Cube-based multi-level classification
Relevance analysis at multi-levels.
Information-gain analysis with dimension + level.
Presentation of Classification Results
Visualization of a Decision Tree in
SGI/MineSet 3.0
Interactive Visual Mining by Perception-Based Classification (PBC)
Outline
What is Classification/Prediction?
Issues Regarding Classification/Prediction
Feature Selection
Classification by Decision Tree Induction
Bayesian Classification
Classification Evaluation
Bayesian Classification
A probabilistic framework for solving classification
problems
Probabilistic learning: Calculate explicit probabilities
for hypothesis, among the most practical approaches to
certain types of learning problems
Incremental: Each training example can incrementally
increase/decrease the probability that a hypothesis is
correct. Prior knowledge can be combined with
observed data.
Probabilistic prediction: Predict multiple hypotheses,
weighted by their probabilities
Standard: Even when Bayesian methods are
computationally intractable, they can provide a
standard of optimal decision making against which
other methods can be measured
Bayesian Theorem: Basics
Let X be a data sample whose class label is unknown
Let H be a hypothesis that X belongs to class C
For classification problems, determine P(H|X): the
probability that the hypothesis holds given the
observed data sample X
P(H): prior probability of hypothesis H (i.e. the initial
probability before we observe any data, reflects the
background knowledge)
P(X): probability that sample data is observed
P(X|H) : probability of observing the sample X, given
that the hypothesis holds
Bayesian Theorem
Given training data X, posteriori probability of a
hypothesis H, P(H|X) follows the Bayes theorem
P(H|X) = P(X|H) P(H) / P(X)
Informally, this can be written as
posterior = likelihood × prior / evidence
MAP (maximum a posteriori) hypothesis:
  h_MAP ≡ argmax_{h∈H} P(h | D) = argmax_{h∈H} P(D | h) P(h)
Practical difficulty: require initial knowledge of many
probabilities, significant computational cost
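As a toy numeric illustration of the rule above (made-up numbers, not from the lecture):

p_h = 0.01                      # prior P(H)
p_x_given_h = 0.9               # likelihood P(X|H)
p_x = 0.05                      # evidence P(X)
p_h_given_x = p_x_given_h * p_h / p_x
print(p_h_given_x)              # posterior P(H|X) = 0.18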
Basic Formulas for Probabilities
Product Rule: probability P(A ∧ B) of a conjunction of two events
A and B:
P(A ∧ B) = P(A | B) P(B) = P(B | A) P(A)
Sum Rule: probability of a disjunction of two events A and B:
P(A ∨ B) = P(A) + P(B) - P(A ∧ B)
Theorem of total probability: if events A1, …, An are mutually exclusive with Σ_{i=1}^{n} P(Ai) = 1, then
  P(B) = Σ_{i=1}^{n} P(B | Ai) P(Ai)
Naïve Bayes Classifier (I)
Along with decision trees, neural networks, nearest
neighbors, one of the most practical learning
methods
When to use
Moderate or large training set available
Attributes that describe instances are
conditionally independent given classification
Successful applications:
Diagnosis
Classifying text documents
Naïve Bayes Classifier (II)
Assume target function f : X → V, where each instance x is described by attributes <a1, a2, …, an>.
Most probable value of f(x) is:
  v_MAP = argmax_{vj∈V} P(vj | a1, a2, …, an) = argmax_{vj∈V} P(a1, a2, …, an | vj) P(vj)
Naïve Bayes assumption:
  P(a1, a2, …, an | vj) = Π_i P(ai | vj)
which gives the Naïve Bayes classifier:
  v_NB = argmax_{vj∈V} P(vj) Π_i P(ai | vj)
Naïve Bayes Algorithm
Naïve_Bayes_Learn(examples)
  For each target value vj
    P̂(vj) ← estimate of P(vj)
    For each attribute value ai of each attribute a
      P̂(ai | vj) ← estimate of P(ai | vj)
Classify_New_Instance(x)
  v_NB = argmax_{vj∈V} P̂(vj) Π_{ai∈x} P̂(ai | vj)
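A brief Python sketch of this learn/classify pair for categorical attributes (illustrative, not the lecture's code); examples is assumed to be a list of (attribute-dict, target-value) pairs:

from collections import Counter, defaultdict

def naive_bayes_learn(examples):
    class_counts = Counter(v for _, v in examples)
    value_counts = defaultdict(Counter)          # (attribute, class) -> value counts
    for attrs, v in examples:
        for a, ai in attrs.items():
            value_counts[(a, v)][ai] += 1
    n = len(examples)
    priors = {v: c / n for v, c in class_counts.items()}   # estimates of P(vj)
    return priors, value_counts, class_counts

def naive_bayes_classify(model, x):
    priors, value_counts, class_counts = model
    best_v, best_score = None, -1.0
    for v, prior in priors.items():
        score = prior                            # P(vj) * product of P(ai | vj)
        for a, ai in x.items():
            score *= value_counts[(a, v)][ai] / class_counts[v]
        if score > best_score:
            best_v, best_score = v, score
    return best_v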
Naïve Bayes: Subtleties (I)
Conditional independence assumption is often
violated
...but it works surprisingly well anyway. Note that we don't need the estimated posteriors P̂(vj | x) to be correct; we need only that
  argmax_{vj∈V} P̂(vj) Π_i P̂(ai | vj) = argmax_{vj∈V} P(vj) P(a1, …, an | vj)
see [Domingos & Pazzani, 1996] for analysis
Naive Bayes posteriors often unrealistically close
to 1 or 0
Pedro Domingos, Michael Pazzani. Beyond Independence: Conditions for the Optimality of the Simple Bayesian Classifier (1996).
Naïve Bayes: Subtleties (II)
What if none of the training instances with target value vj have attribute value ai? Then
  P̂(ai | vj) = 0, and so P̂(vj) Π_i P̂(ai | vj) = 0
Typical solution is the Bayesian (m-)estimate for P̂(ai | vj):
  P̂(ai | vj) = (nc + m·p) / (n + m)
where
  n is the number of training examples for which v = vj
  nc is the number of examples for which v = vj and a = ai
  p is a prior estimate for P̂(ai | vj)
  m is the weight given to the prior (i.e. the number of "virtual" examples)
Training dataset
Class:
  C1: buys_computer = "yes"
  C2: buys_computer = "no"
Data sample:
  X = (age <= 30, income = medium, student = yes, credit_rating = fair)

(Training data: the same 14-tuple buys_computer table shown earlier.)
Naïve Bayesian Classifier: Example
Compute P(X|Ci) for each class
P(age=“<30” | buys_computer=“yes”) = 2/9=0.222
P(age=“<30” | buys_computer=“no”) = 3/5 =0.6
P(income=“medium” | buys_computer=“yes”)= 4/9 =0.444
P(income=“medium” | buys_computer=“no”) = 2/5 = 0.4
P(student=“yes” | buys_computer=“yes)= 6/9 =0.667
P(student=“yes” | buys_computer=“no”)= 1/5=0.2
P(credit_rating=“fair” | buys_computer=“yes”)=6/9=0.667
P(credit_rating=“fair” | buys_computer=“no”)=2/5=0.4
X=(age<=30 ,income =medium, student=yes,credit_rating=fair)
P(X|Ci) : P(X|buys_computer=“yes”)= 0.222 x 0.444 x 0.667 x 0.667 =0.044
P(X|buys_computer=“no”)= 0.6 x 0.4 x 0.2 x 0.4 =0.019
P(X|Ci)*P(Ci ) : P(X|buys_computer=“yes”) * P(buys_computer=“yes”)=0.028
P(X|buys_computer=“no”) * P(buys_computer=“no”)=0.007
X belongs to class “buys_computer=yes”
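A quick numeric check of the example (illustrative snippet, not part of the slides), using priors P(yes) = 9/14 and P(no) = 5/14 estimated from the 14 training tuples:

p_x_given_yes = (2/9) * (4/9) * (6/9) * (6/9)
p_x_given_no  = (3/5) * (2/5) * (1/5) * (2/5)
print(round(p_x_given_yes, 3), round(p_x_given_no, 3))                 # ≈ 0.044 0.019
print(round(p_x_given_yes * 9/14, 3), round(p_x_given_no * 5/14, 3))   # ≈ 0.028 0.007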
Naïve Bayesian Classifier: Comments
Advantages
Easy to implement
Good results obtained in most of the cases
Disadvantages
Assumption: class conditional independence, therefore loss of accuracy
Practically, dependencies exist among variables
  E.g., hospital patients: profile (age, family history, etc.), symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.)
Dependencies among these cannot be modeled by Naïve
Bayesian Classifier
How to deal with these dependencies?
Bayesian Belief Networks
Bayesian Belief Networks
Interesting because:
Naïve Bayes assumption of conditional
independence too restrictive
But it’s intractable without some such
assumptions...
Bayesian Belief networks describe conditional
independence among subsets of variables
→ allows combining prior knowledge about
(in)dependencies among variables with observed
training data (also called Bayes Nets)
Conditional Independence
Definition: X is conditionally independent of Y given
Z if the probability distribution governing X is
independent of the value of Y given the value of Z;
that is, if
(∀xi, yj, zk) P (X= xi|Y= yj, Z= zk) = P (X= xi|Z= zk)
more compactly, we write
P (X|Y, Z) = P (X|Z)
Example: Thunder is conditionally independent of
Rain, given Lightning
P(Thunder|Rain, Lightning) = P(Thunder|Lightning)
Bayesian Belief Network (I)
Provides graphical representation of
probabilistic relationships among a set of
random variables
Consists of:
  A directed acyclic graph (DAG)
    A node corresponds to a variable
    An arc corresponds to a dependence relationship between a pair of variables
  A probability table associating each node with its immediate parents
Bayesian Belief Network (II)
Network represents a set of conditional
independence assertions:
A node in a Bayesian network is conditionally
independent of all of its nondescendants, if its
parents are known
Probability Table
If X does not have any parents, table
contains prior probability P(X)
If X has only one parent (Y), table
contains conditional probability P(X|Y)
If X has multiple parents (Y1, Y2,…, Yk),
table contains conditional probability
P(X|Y1, Y2,…, Yk)
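As a small illustration (not from the lecture), the first two cases can be stored as plain dictionaries for a hypothetical two-node network Y → X:

p_y = {"T": 0.3, "F": 0.7}                       # Y has no parents: prior P(Y)
p_x_given_y = {                                  # X has one parent Y: table P(X | Y)
    "T": {"T": 0.9, "F": 0.1},
    "F": {"T": 0.2, "F": 0.8},
}

def joint(x, y):
    # joint probability by the chain rule: P(X, Y) = P(X | Y) P(Y)
    return p_x_given_y[y][x] * p_y[y]

print(joint("T", "T"))                           # 0.9 * 0.3 = 0.27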
Inference in Bayesian Networks
How can one infer the (probabilities of) values of one or more
network variables, given observed values of others?
Bayes net contains all information needed for this inference
If only one variable with unknown value, easy to infer it
In general case, problem is NP hard
In practice, can succeed in many cases
Exact inference methods work well for some network
structures
Monte Carlo methods “simulate” the network randomly to
calculate approximate solutions
Learning of Bayesian Networks
Several variants of this mining task
Network structure might be known or
unknown
Training examples might provide values of all
network variables, or just some
If structure is known and all variables are observed
  Then it's as easy as training a Naïve Bayes classifier
Summary: Bayesian Belief Networks
Combine prior knowledge with observed data
Impact of prior knowledge (when correct!) is to
lower the sample complexity
Active research area
Extend from boolean to real-valued variables
Parameterized distributions instead of
tables
Extend to first-order instead of
propositional systems
More effective inference methods
…
Outline
What is Classification/Prediction?
Issues Regarding Classification/Prediction
Feature Selection
Classification by Decision Tree Induction
Bayesian Classification
Classification Evaluation
Classification Accuracy: Estimating
Error Rates
Partition: training-and-testing
  use two independent data sets, e.g., training set (2/3) and test set (1/3)
  used for data sets with a large number of samples
Cross-validation
  divide the data set into k subsamples
  use k−1 subsamples as training data and one subsample as test data (k-fold cross-validation)
  for data sets of moderate size
Bootstrapping (leave-one-out)
  for small data sets
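An illustrative k-fold cross-validation loop (not from the lecture); train and error_rate are hypothetical placeholders for the classification method being evaluated:

def k_fold_error(rows, k, train, error_rate):
    folds = [rows[i::k] for i in range(k)]       # k disjoint subsamples
    errors = []
    for i in range(k):
        test = folds[i]                          # one subsample held out as test data
        training = [r for j, f in enumerate(folds) if j != i for r in f]
        model = train(training)                  # train on the other k-1 subsamples
        errors.append(error_rate(model, test))
    return sum(errors) / k                       # average error over the k runs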
Bagging and Boosting
General idea:
  Training data + classification method (CM) → classifier C
  Altered training data + CM → classifier C1
  Altered training data + CM → classifier C2
  ……
  Aggregation of the classifiers → combined classifier C*
Bagging
Given a set S of s samples
Generate a bootstrap sample T from S. Cases in S may
not appear in T or may appear more than once.
Repeat this sampling procedure, getting a sequence of k
independent training sets
A corresponding sequence of classifiers C1,C2,…,Ck is
constructed for each of these training sets, by using
the same classification algorithm
To classify an unknown sample X, let each classifier predict or vote
The Bagged Classifier C* counts the votes and assigns
X to the class with the “most” votes
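A short sketch of this procedure (illustrative only); train is a hypothetical callback implementing the single classification algorithm, and each learned classifier is a callable that predicts a class label:

import random
from collections import Counter

def bagging(samples, k, train):
    classifiers = []
    for _ in range(k):
        # bootstrap sample T from S: draw |S| cases with replacement
        t = [random.choice(samples) for _ in samples]
        classifiers.append(train(t))
    return classifiers

def bagged_classify(classifiers, x):
    votes = Counter(c(x) for c in classifiers)   # each classifier predicts/votes
    return votes.most_common(1)[0][0]            # class with the most votes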
Boosting Technique — Algorithm
Assign every example an equal weight 1/N
For t = 1, 2, …, T Do
Obtain a hypothesis (classifier) h(t) under w(t)
  Calculate the error of h(t) and re-weight the examples based on the error; each classifier is dependent on the previous ones, and samples that are incorrectly predicted are weighted more heavily
  Normalize w(t+1) to sum to 1 (the weights assigned to different classifiers sum to 1)
Output a weighted sum of all the hypotheses, with each hypothesis weighted according to its accuracy on the training set
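An AdaBoost-style sketch of this loop (illustrative, not the lecture's exact algorithm); it assumes a hypothetical weak learner train(examples, weights) returning a classifier h with outputs in {-1, +1}, and labels y also in {-1, +1}:

import math

def boost(examples, labels, T, train):
    n = len(examples)
    w = [1.0 / n] * n                            # equal initial weights 1/N
    hypotheses = []
    for _ in range(T):
        h = train(examples, w)                   # hypothesis h(t) under w(t)
        err = sum(wi for wi, x, y in zip(w, examples, labels) if h(x) != y)
        err = min(max(err, 1e-10), 1 - 1e-10)    # guard against error 0 or 1
        alpha = 0.5 * math.log((1 - err) / err)  # more accurate -> larger weight
        # incorrectly predicted samples get larger weights for the next round
        w = [wi * math.exp(-alpha * y * h(x)) for wi, x, y in zip(w, examples, labels)]
        s = sum(w)
        w = [wi / s for wi in w]                 # normalize w(t+1) to sum to 1
        hypotheses.append((alpha, h))
    def classifier(x):                           # weighted vote of all hypotheses
        return 1 if sum(a * h(x) for a, h in hypotheses) >= 0 else -1
    return classifier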
Bagging and Boosting
Freund et al.: "Experiments with a New Boosting Algorithm" (AdaBoost)
Breiman: "Bagging Predictors"
W. Wilbur: "Boosting Naïve Bayesian Learning on a Large Subset of MEDLINE"
Multiple Classifiers Combination
Parallel combination
Sequential combination
References (1)
C. Apte and S. Weiss. Data mining with decision trees and decision rules. Future
Generation Computer Systems, 13, 1997.
L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees.
Wadsworth International Group, 1984.
C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition.
Data Mining and Knowledge Discovery, 2(2): 121-168, 1998.
P. K. Chan and S. J. Stolfo. Learning arbiter and combiner trees from partitioned
data for scaling machine learning. In Proc. 1st Int. Conf. Knowledge Discovery and
Data Mining (KDD'95), pages 39-44, Montreal, Canada, August 1995.
U. M. Fayyad. Branching on attribute values in decision tree generation. In Proc.
1994 AAAI Conf., pages 601-606, AAAI Press, 1994.
J. Gehrke, R. Ramakrishnan, and V. Ganti. Rainforest: A framework for fast decision
tree construction of large datasets. In Proc. 1998 Int. Conf. Very Large Data Bases,
pages 416-427, New York, NY, August 1998.
J. Gehrke, V. Ganti, R. Ramakrishnan, and W.-Y. Loh. BOAT -- Optimistic Decision Tree Construction. In SIGMOD'99, Philadelphia, Pennsylvania, 1999.
References (2)
M. Kamber, L. Winstone, W. Gong, S. Cheng, and J. Han. Generalization and decision
tree induction: Efficient classification in data mining. In Proc. 1997 Int. Workshop
Research Issues on Data Engineering (RIDE'97), Birmingham, England, April 1997.
B. Liu, W. Hsu, and Y. Ma. Integrating Classification and Association Rule Mining.
Proc. 1998 Int. Conf. Knowledge Discovery and Data Mining (KDD'98) New York, NY,
Aug. 1998.
W. Li, J. Han, and J. Pei, CMAR: Accurate and Efficient Classification Based on
Multiple Class-Association Rules, , Proc. 2001 Int. Conf. on Data Mining (ICDM'01),
San Jose, CA, Nov. 2001.
J. Magidson. The Chaid approach to segmentation modeling: Chi-squared automatic
interaction detection. In R. P. Bagozzi, editor, Advanced Methods of Marketing
Research, pages 118-159. Blackwell Business, Cambridge, Massachusetts, 1994.
M. Mehta, R. Agrawal, and J. Rissanen. SLIQ : A fast scalable classifier for data
mining. (EDBT'96), Avignon, France, March 1996.
References (3)
T. M. Mitchell. Machine Learning. McGraw Hill, 1997.
S. K. Murthy. Automatic Construction of Decision Trees from Data: A Multi-Disciplinary Survey. Data Mining and Knowledge Discovery, 2(4): 345-389, 1998.
J. R. Quinlan. Induction of decision trees. Machine Learning, 1:81-106, 1986.
J. R. Quinlan. Bagging, boosting, and c4.5. In Proc. 13th Natl. Conf. on Artificial
Intelligence (AAAI'96), 725-730, Portland, OR, Aug. 1996.
R. Rastogi and K. Shim. PUBLIC: A decision tree classifier that integrates building and
pruning. In Proc. 1998 Int. Conf. Very Large Data Bases, 404-415, New York, NY,
August 1998.
J. Shafer, R. Agrawal, and M. Mehta. SPRINT : A scalable parallel classifier for
data mining. In Proc. 1996 Int. Conf. Very Large Data Bases, 544-555, Bombay,
India, Sept. 1996.
S. M. Weiss and C. A. Kulikowski. Computer Systems that Learn: Classification and
Prediction Methods from Statistics, Neural Nets, Machine Learning, and Expert
Systems. Morgan Kaufman, 1991.
S. M. Weiss and N. Indurkhya. Predictive Data Mining. Morgan Kaufmann, 1997.