Lecture 6: Classification and Prediction (Part A)
Zhou Shuigeng
Data Mining: Tech. & Appl.
May 6, 2007
Outline

- What is Classification/Prediction?
- Issues Regarding Classification/Prediction
- Feature Selection
- Classification by Decision Tree Induction
- Bayesian Classification
- Classification Evaluation
Predictive Modeling

- Goal: learn a mapping y = f(x; θ)
- Need:
  - a model structure
  - a score function
  - an optimization strategy
- Categorical y ∈ {c1, …, cm}: classification
- Real-valued y: regression
- Note: {c1, …, cm} are usually assumed to be mutually exclusive and exhaustive
Classification vs. Prediction

- Classification:
  - predicts categorical class labels (discrete or nominal)
  - constructs a model from the training set and the values (class labels) of a classifying attribute, then uses it to classify new data
- Prediction:
  - models continuous-valued functions, i.e., predicts unknown or missing values
- Typical applications:
  - credit approval
  - target marketing
  - medical diagnosis
  - treatment effectiveness analysis
Two Classes of Classifiers

- Discriminative classification
  - Provides a decision surface or boundary to separate the different classes
  - In real life, it is often impossible to separate the classes perfectly
  - Instead, seek a function f(x; θ) that maximizes some measure of separation between the classes; we call f(x; θ) the discriminant function
- Probabilistic classification
  - Focuses on modeling the probability distribution of each class
Discriminative vs. Probabilistic Classification

[Figure: candidate decision boundaries D1, D2, D3 separating two classes]
Classification: A Two-Step Process

- Model construction: describing a set of predetermined classes
  - Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
  - The set of tuples used for model construction is the training set
  - The model is represented as classification rules, decision trees, or mathematical formulae
- Model usage: classifying future or unknown objects
  - Estimate the accuracy of the model
    - The known label of each test sample is compared with the model's classification result
    - The accuracy rate is the percentage of test-set samples that are correctly classified by the model
    - The test set must be independent of the training set; otherwise overfitting will occur
  - If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known
Classification Process (1): Model Construction

Training data → classification algorithm → classifier (model)

Training data:

NAME   RANK            YEARS   TENURED
Mike   Assistant Prof  3       no
Mary   Assistant Prof  7       yes
Bill   Professor       2       yes
Jim    Associate Prof  7       yes
Dave   Assistant Prof  6       no
Anne   Associate Prof  3       no

Resulting classifier (model):

IF rank = 'professor' OR years > 6
THEN tenured = 'yes'
Classification Process (2): Use the Model in Prediction

Testing data → classifier → predicted label

Testing data:

NAME     RANK            YEARS   TENURED
Tom      Assistant Prof  2       no
Merlisa  Associate Prof  7       no
George   Professor       5       yes
Joseph   Assistant Prof  7       yes

Unseen data: (Jeff, Professor, 4) → Tenured?
Supervised vs. Unsupervised Learning

- Supervised learning (classification)
  - Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
  - New data are classified based on the training set
- Unsupervised learning (clustering)
  - The class labels of the training data are unknown
  - Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
Issues Regarding Classification and Prediction (1): Data Preparation

- Data cleaning
  - Preprocess data in order to reduce noise and handle missing values
- Relevance analysis (feature selection)
  - Remove irrelevant or redundant attributes
- Data transformation
  - Generalize and/or normalize data
Issues Regarding Classification and Prediction (2): Evaluating Classification Methods

- Predictive accuracy
- Speed and scalability
  - time to construct the model
  - time to use the model
- Robustness
  - handling noise and missing values
- Scalability
  - efficiency on disk-resident databases
- Interpretability
  - understanding and insight provided by the model
- Goodness of rules
  - decision tree size
  - compactness of classification rules
Feature Selection - Definition

- Attempt to select a minimal-size subset of features according to criteria such as:
  - the classification accuracy does not drop significantly
  - the class distribution given only the selected features is as close as possible to the original class distribution given all features
Feature Selection: Dimensions

Feature selection methods can be characterized along three dimensions:

- Evaluation measure: accuracy, consistency, classic measures
- Search strategy: complete, heuristic, nondeterministic
- Generation scheme: forward, backward, random
Four Basic Steps

1. Generation: generate a candidate subset from the original feature set
2. Subset evaluation: measure the goodness of the subset
3. Stopping criterion: if not satisfied, go back to step 1; otherwise stop
4. Learning/testing with the selected subset
Search Procedure - Approaches

- Complete
  - complete search for the optimal subset
  - guaranteed optimality
  - able to backtrack
- Random
  - a maximum number of iterations can be set
  - optimality depends on the values of some parameters
- Heuristic
  - "hill-climbing": at each iteration, the remaining features not yet selected/rejected are considered for selection/rejection
  - simple to implement
  - fast in producing results
  - produces sub-optimal results
Evaluation Functions - Method Types

- Filter methods
  - Distance measures
  - Information measures
  - Dependence measures
  - Consistency measures
- Wrapper methods
  - Classifier error rate measures
Some Measures for Text Classification

- Information gain (t̄ denotes the absence of term t):

  IG(t) = Σ_{c∈C} [ P(c, t) log( P(c, t) / (P(c) P(t)) ) + P(c, t̄) log( P(c, t̄) / (P(c) P(t̄)) ) ]

- Mutual information:

  MI(t, c) = log( P(t | c) / P(t) )

- Chi-square (c̄, f̄ denote the absence of class c and feature f):

  χ²(c, f) = ( P(c, f) P(c̄, f̄) − P(c, f̄) P(c̄, f) )² / ( P(c) P(f) P(c̄) P(f̄) )
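The snippet below is a small Python sketch (not from the original slides) of computing these scores for a single term/class pair from a 2x2 document-count contingency table; the counts and the binary-class setting are illustrative assumptions.

```python
import math

def term_class_scores(n_tc, n_tc_bar, n_tbar_c, n_tbar_cbar):
    """IG(t) and chi-square(c, t) for one term and one (binary) class.

    n_tc        : documents containing t that belong to class c
    n_tc_bar    : documents containing t that do not belong to c
    n_tbar_c    : documents without t that belong to c
    n_tbar_cbar : documents without t that do not belong to c
    """
    N = n_tc + n_tc_bar + n_tbar_c + n_tbar_cbar
    # Joint and marginal probabilities estimated from the counts.
    p = {(1, 1): n_tc / N, (1, 0): n_tc_bar / N,
         (0, 1): n_tbar_c / N, (0, 0): n_tbar_cbar / N}
    p_t = {1: p[1, 1] + p[1, 0], 0: p[0, 1] + p[0, 0]}
    p_c = {1: p[1, 1] + p[0, 1], 0: p[1, 0] + p[0, 0]}

    # Information gain: the sum over class membership and term presence/absence.
    ig = sum(p[t, c] * math.log(p[t, c] / (p_t[t] * p_c[c]))
             for t in (0, 1) for c in (0, 1) if p[t, c] > 0)

    # Chi-square in the probability form above; multiplying by N would give
    # the usual chi-square test statistic.
    num = (p[1, 1] * p[0, 0] - p[1, 0] * p[0, 1]) ** 2
    den = p_t[1] * p_t[0] * p_c[1] * p_c[0]
    chi2 = num / den if den > 0 else 0.0
    return ig, chi2

print(term_class_scores(40, 10, 20, 130))  # hypothetical counts
```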
Problem with the Filter Model

- Assesses the merits of features from the data alone and ignores the classification algorithm
  - the generated feature subset may not be optimal for the target classification algorithm
Wrapper Model

- Uses the actual target classification algorithm to evaluate the accuracy of each candidate subset
- Evaluation criterion
  - Classifier error rate
- Generation method can be heuristic, complete, or random
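As a concrete illustration of the wrapper idea, here is a minimal Python sketch of greedy forward generation evaluated by classifier error rate; `evaluate(features)`, which is assumed to train and test the actual target classifier on a feature subset and return its error, is an illustrative callback, not something defined on the slides.

```python
def forward_wrapper_selection(all_features, evaluate):
    """Greedy forward selection driven by the target classifier's error rate."""
    selected, best_err = [], float("inf")
    while True:
        candidates = [f for f in all_features if f not in selected]
        if not candidates:
            break
        # Try adding each remaining feature; keep the one with the lowest error.
        errs = {f: evaluate(selected + [f]) for f in candidates}
        f_best = min(errs, key=errs.get)
        if errs[f_best] >= best_err:   # stop when no candidate improves the error
            break
        selected.append(f_best)
        best_err = errs[f_best]
    return selected, best_err
```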
Wrapper Compared to Filter

- (+) higher accuracy
- (-) higher computation cost
- (-) lacks generality
Training Dataset

This follows an example from Quinlan's ID3.

age     income   student  credit_rating  buys_computer
<=30    high     no       fair           no
<=30    high     no       excellent      no
31…40   high     no       fair           yes
>40     medium   no       fair           yes
>40     low      yes      fair           yes
>40     low      yes      excellent      no
31…40   low      yes      excellent      yes
<=30    medium   no       fair           no
<=30    low      yes      fair           yes
>40     medium   yes      fair           yes
<=30    medium   yes      excellent      yes
31…40   medium   no       excellent      yes
31…40   high     yes      fair           yes
>40     medium   no       excellent      no
Output: A Decision Tree for "buys_computer"

age?
  <=30:  student?
           no  → no
           yes → yes
  31…40: yes
  >40:   credit_rating?
           excellent → no
           fair      → yes
Algorithm for Decision Tree Induction

- Basic algorithm (a greedy algorithm)
  - The tree is constructed in a top-down, recursive, divide-and-conquer manner
  - At the start, all the training examples are at the root
  - Attributes are categorical (continuous-valued attributes are discretized in advance)
  - Examples are partitioned recursively based on selected attributes
  - Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
- Conditions for stopping partitioning
  - All samples for a given node belong to the same class
  - There are no remaining attributes for further partitioning (majority voting is used to label the leaf)
  - There are no samples left
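A minimal Python sketch of this greedy procedure is given below. It assumes categorical attributes, uses information gain (defined on the following slides) as the selection measure, and represents each example as a dictionary mapping attribute names to values; the data layout and function names are illustrative, not taken from a specific system.

```python
import math
from collections import Counter

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def info_gain(examples, labels, attr):
    total = len(labels)
    remainder = 0.0
    for value in set(ex[attr] for ex in examples):
        subset = [lab for ex, lab in zip(examples, labels) if ex[attr] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

def build_tree(examples, labels, attributes):
    # Stop: all samples at this node belong to the same class.
    if len(set(labels)) == 1:
        return labels[0]
    # Stop: no remaining attributes -> majority vote labels the leaf.
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Greedy choice: the attribute with the highest information gain.
    best = max(attributes, key=lambda a: info_gain(examples, labels, a))
    tree = {best: {}}
    for value in set(ex[best] for ex in examples):
        idx = [i for i, ex in enumerate(examples) if ex[best] == value]
        sub_ex = [examples[i] for i in idx]
        sub_lab = [labels[i] for i in idx]
        rest = [a for a in attributes if a != best]
        tree[best][value] = build_tree(sub_ex, sub_lab, rest)
    return tree
```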
Decision Tree Induction Algorithms

- Many algorithms:
  - Hunt's Algorithm (one of the earliest)
  - CART
  - ID3, C4.5
  - SLIQ, SPRINT
Attribute Selection Measure: Information Gain (ID3/C4.5)

- Select the attribute with the highest information gain
- S contains s_i tuples of class C_i for i = 1, …, m
- The information required to classify an arbitrary tuple:

  I(s_1, s_2, …, s_m) = − Σ_{i=1}^{m} (s_i / s) log2(s_i / s)

- The entropy of attribute A with values {a_1, a_2, …, a_v}:

  E(A) = Σ_{j=1}^{v} ((s_1j + … + s_mj) / s) · I(s_1j, …, s_mj)

- The information gained by branching on attribute A:

  Gain(A) = I(s_1, s_2, …, s_m) − E(A)
Attribute Selection by Information Gain Computation

Using the 14-sample training data shown earlier:

- Class P: buys_computer = "yes" (p = 9)
- Class N: buys_computer = "no" (n = 5)
- I(p, n) = I(9, 5) = 0.940
- Compute the entropy for age:

  age     p_i  n_i  I(p_i, n_i)
  <=30    2    3    0.971
  31…40   4    0    0
  >40     3    2    0.971

  E(age) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694

  Here (5/14) I(2,3) means that "age <=30" covers 5 out of the 14 samples, with 2 yes's and 3 no's.

  Hence Gain(age) = I(p, n) − E(age) = 0.246

- Similarly:

  Gain(income) = 0.029
  Gain(student) = 0.151
  Gain(credit_rating) = 0.048
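The following short Python check (not part of the slides) recomputes these gains from the 14-sample table; because the slide rounds intermediate entropies, the printed values differ in the third decimal place for age and student (0.247 vs. 0.246, 0.152 vs. 0.151).

```python
import math
from collections import Counter

data = [  # (age, income, student, credit_rating, buys_computer)
    ("<=30", "high", "no", "fair", "no"), ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"), (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"), (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"), (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"), ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"), (">40", "medium", "no", "excellent", "no"),
]
attrs = {"age": 0, "income": 1, "student": 2, "credit_rating": 3}
labels = [row[-1] for row in data]

def info(lab):
    n = len(lab)
    return -sum(c / n * math.log2(c / n) for c in Counter(lab).values())

def gain(col):
    total = info(labels)
    for v in set(row[col] for row in data):
        subset = [row[-1] for row in data if row[col] == v]
        total -= len(subset) / len(data) * info(subset)
    return total

for name, col in attrs.items():
    print(name, round(gain(col), 3))
# Prints approximately: age 0.247, income 0.029, student 0.152, credit_rating 0.048
```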
Other Attribute Selection Measures

- Gini index (CART, IBM IntelligentMiner)
  - All attributes are assumed continuous-valued
  - Assume there exist several possible split values for each attribute
  - May need other tools, such as clustering, to get the possible split values
  - Can be modified for categorical attributes
Gini Index (IBM IntelligentMiner)

- If a data set T contains examples from n classes, the gini index gini(T) is defined as

  gini(T) = 1 − Σ_{j=1}^{n} p_j²

  where p_j is the relative frequency of class j in T.
- If a data set T is split into two subsets T1 and T2 with sizes N1 and N2 respectively, the gini index of the split data is defined as

  gini_split(T) = (N1/N) gini(T1) + (N2/N) gini(T2)

- The attribute that provides the smallest gini_split(T) is chosen to split the node (this requires enumerating all possible splitting points for each attribute)
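A minimal Python sketch of these two definitions follows; the class counts used in the example calls are hypothetical.

```python
def gini(class_counts):
    """gini(T) = 1 - sum of squared class relative frequencies."""
    n = sum(class_counts)
    return 1.0 - sum((c / n) ** 2 for c in class_counts)

def gini_split(counts_t1, counts_t2):
    """Size-weighted gini of a binary split of T into T1 and T2."""
    n1, n2 = sum(counts_t1), sum(counts_t2)
    n = n1 + n2
    return n1 / n * gini(counts_t1) + n2 / n * gini(counts_t2)

# Hypothetical node with 9 "yes" and 5 "no" examples, and one candidate split:
print(gini([9, 5]))                # gini of the unsplit node
print(gini_split([7, 1], [2, 4]))  # gini of the split data
```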
Extracting Classification Rules from Trees

- Represent the knowledge in the form of IF-THEN rules
- One rule is created for each path from the root to a leaf
- Each attribute-value pair along a path forms a conjunction
- The leaf node holds the class prediction
- Rules are easier for humans to understand
- Example (from the tree above):

  IF age = "<=30" AND student = "no" THEN buys_computer = "no"
  IF age = "<=30" AND student = "yes" THEN buys_computer = "yes"
  IF age = "31…40" THEN buys_computer = "yes"
  IF age = ">40" AND credit_rating = "excellent" THEN buys_computer = "no"
  IF age = ">40" AND credit_rating = "fair" THEN buys_computer = "yes"
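The path-to-rule idea can be sketched in a few lines of Python; the nested-dictionary tree layout {attribute: {value: subtree or class label}} is an assumed representation matching the earlier induction sketch, not a particular library's.

```python
def extract_rules(tree, conditions=()):
    if not isinstance(tree, dict):             # leaf: holds the class prediction
        conj = " AND ".join(f'{a} = "{v}"' for a, v in conditions)
        return [f'IF {conj} THEN class = "{tree}"']
    (attr, branches), = tree.items()
    rules = []
    for value, subtree in branches.items():    # one rule per root-to-leaf path
        rules.extend(extract_rules(subtree, conditions + ((attr, value),)))
    return rules

buys_tree = {"age": {"<=30": {"student": {"no": "no", "yes": "yes"}},
                     "31...40": "yes",
                     ">40": {"credit_rating": {"excellent": "no", "fair": "yes"}}}}
for rule in extract_rules(buys_tree):
    print(rule)
```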
Avoid Overfitting in Classification

- Overfitting: an induced tree may overfit the training data
  - Too many branches, some of which may reflect anomalies due to noise or outliers
  - Poor accuracy on unseen samples
- Two approaches to avoid overfitting
  - Prepruning: halt tree construction early; do not split a node if this would cause the goodness measure to fall below a threshold
    - Difficult to choose an appropriate threshold
  - Postpruning: remove branches from a "fully grown" tree to obtain a sequence of progressively pruned trees
    - Use a set of data different from the training data to decide which is the "best pruned tree"
Approaches to Determine the Final Tree Size

- Separate training (2/3) and testing (1/3) sets
- Use cross-validation, e.g., 10-fold cross-validation
- Use all the data for training
  - but apply a statistical test (e.g., chi-square) to estimate whether expanding or pruning a node would improve the entire distribution
- Use the minimum description length (MDL) principle
  - halt growth of the tree when the encoding is minimized
Enhancements to Basic Decision Tree Induction

- Allow for continuous-valued attributes
  - Dynamically define new discrete-valued attributes that partition the continuous attribute values into a discrete set of intervals
- Handle missing attribute values
  - Assign the most common value of the attribute
  - Assign a probability to each of the possible values
- Attribute construction
  - Create new attributes based on existing ones that are sparsely represented
  - This reduces fragmentation, repetition, and replication
Classification in Large Databases

- Classification is a classical problem extensively studied by statisticians and machine learning researchers
- Scalability: classifying data sets with millions of examples and hundreds of attributes at reasonable speed
- Why decision tree induction in data mining?
  - relatively faster learning speed than other classification methods
  - convertible to simple and easy-to-understand classification rules
  - can use SQL queries for accessing databases
  - classification accuracy comparable with other methods
Scalable Decision Tree Induction Methods in Data Mining Studies

- SLIQ (EDBT'96, Mehta et al.)
  - builds an index for each attribute; only the class list and the current attribute list reside in memory
- SPRINT (VLDB'96, J. Shafer et al.)
  - constructs an attribute-list data structure
- PUBLIC (VLDB'98, Rastogi & Shim)
  - integrates tree splitting and tree pruning: stops growing the tree earlier
- RainForest (VLDB'98, Gehrke, Ramakrishnan & Ganti)
  - separates the scalability aspects from the criteria that determine the quality of the tree
  - builds an AVC-list (attribute, value, class label)
Data Cube-Based Decision-Tree Induction

- Integration of generalization with decision-tree induction (Kamber et al., 1997)
- Classification at primitive concept levels
  - E.g., precise temperature, humidity, outlook, etc.
  - Low-level concepts, scattered classes, bushy classification trees
  - Semantic interpretation problems
- Cube-based multi-level classification
  - Relevance analysis at multiple levels
  - Information-gain analysis with dimension and level
Presentation of Classification Results

[Screenshot: example presentation of classification results]
Visualization of a Decision Tree in SGI/MineSet 3.0

[Screenshot: decision tree visualization in SGI/MineSet 3.0]
Interactive Visual Mining by Perception-Based Classification (PBC)

[Screenshot: interactive visual mining with PBC]
Bayesian Classification

- A probabilistic framework for solving classification problems
- Probabilistic learning: calculates explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems
- Incremental: each training example can incrementally increase or decrease the probability that a hypothesis is correct; prior knowledge can be combined with observed data
- Probabilistic prediction: predicts multiple hypotheses, weighted by their probabilities
- Standard: even when Bayesian methods are computationally intractable, they provide a standard of optimal decision making against which other methods can be measured
Bayesian Theorem: Basics

- Let X be a data sample whose class label is unknown
- Let H be the hypothesis that X belongs to class C
- For classification problems, determine P(H|X): the probability that the hypothesis holds given the observed data sample X
- P(H): prior probability of hypothesis H (the initial probability before we observe any data; reflects background knowledge)
- P(X): probability that the sample data is observed
- P(X|H): probability of observing the sample X, given that the hypothesis holds
Bayesian Theorem

- Given training data X, the posterior probability of a hypothesis H, P(H|X), follows Bayes' theorem:

  P(H|X) = P(X|H) P(H) / P(X)

- Informally, this can be written as: posterior = likelihood × prior / evidence
- MAP (maximum a posteriori) hypothesis:

  h_MAP ≡ argmax_{h∈H} P(h|D) = argmax_{h∈H} P(D|h) P(h)

- Practical difficulty: requires initial knowledge of many probabilities, and has significant computational cost
Basic Formulas for Probabilities

- Product rule: probability P(A ∧ B) of a conjunction of two events A and B:

  P(A ∧ B) = P(A|B) P(B) = P(B|A) P(A)

- Sum rule: probability of a disjunction of two events A and B:

  P(A ∨ B) = P(A) + P(B) − P(A ∧ B)

- Theorem of total probability: if events A1, …, An are mutually exclusive with Σ_{i=1}^{n} P(Ai) = 1, then

  P(B) = Σ_{i=1}^{n} P(B|Ai) P(Ai)
Naïve Bayes Classifier (I)

- Along with decision trees, neural networks, and nearest neighbors, one of the most practical learning methods
- When to use:
  - a moderate or large training set is available
  - the attributes that describe instances are conditionally independent given the classification
- Successful applications:
  - diagnosis
  - classifying text documents
Naïve Bayes Classifier (II)

- Assume the target function f : X → V, where each instance x is described by attributes <a1, a2, …, an>.
- The most probable value of f(x) is:

  v_MAP = argmax_{vj∈V} P(vj | a1, a2, …, an) = argmax_{vj∈V} P(a1, a2, …, an | vj) P(vj)

- Naïve Bayes assumption:

  P(a1, a2, …, an | vj) = Π_i P(ai | vj)

  which gives the Naïve Bayes classifier:

  v_NB = argmax_{vj∈V} P(vj) Π_i P(ai | vj)
Naïve Bayes Algorithm

- Naïve_Bayes_Learn(examples):
  - For each target value vj: P(vj) ← estimate P^(vj)
  - For each attribute value ai of each attribute a: P(ai|vj) ← estimate P^(ai|vj)
- Classify_New_Instance(x):

  v_NB = argmax_{vj∈V} P^(vj) Π_{ai∈x} P^(ai | vj)
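A minimal Python sketch of these two procedures, using simple relative-frequency estimates (no smoothing) and representing each instance as a dictionary of attribute values, is shown below; the data layout is an assumption for illustration.

```python
import math
from collections import Counter, defaultdict

def nb_learn(instances, labels):
    class_counts = Counter(labels)                                # counts of each vj
    attr_counts = defaultdict(Counter)                            # counts of (ai, vj)
    for x, v in zip(instances, labels):
        for attr, val in x.items():
            attr_counts[v][(attr, val)] += 1
    n = len(labels)
    priors = {v: c / n for v, c in class_counts.items()}          # P^(vj)
    conds = {v: {av: c / class_counts[v] for av, c in cnt.items()}  # P^(ai | vj)
             for v, cnt in attr_counts.items()}
    return priors, conds

def nb_classify(model, x):
    priors, conds = model
    def log_score(v):
        s = math.log(priors[v])
        for attr, val in x.items():
            s += math.log(conds[v].get((attr, val), 1e-9))        # crude floor for unseen values
        return s
    return max(priors, key=log_score)                             # v_NB = argmax ...
```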
Naïve Bayes: Subtleties (I)

- The conditional independence assumption is often violated
  - ...but it works surprisingly well anyway. Note that the estimated posteriors P^(vj|x) don't need to be correct; we need only that

    argmax_{vj∈V} P^(vj) Π_i P^(ai|vj) = argmax_{vj∈V} P(vj) P(a1, …, an|vj)

  - see [Domingos & Pazzani, 1996] for analysis
  - Naïve Bayes posteriors are often unrealistically close to 1 or 0

Pedro Domingos, Michael Pazzani. Beyond Independence: Conditions for the Optimality of the Simple Bayesian Classifier (1996).
Naïve Bayes: Subtleties (II)

- What if none of the training instances with target value vj have attribute value ai? Then P^(ai|vj) = 0, and so the product P^(vj) Π_i P^(ai|vj) = 0.
- The typical solution is a Bayesian (m-estimate) for P^(ai|vj):

  P^(ai|vj) = (nc + m·p) / (n + m)

  where
  - n is the number of training examples for which v = vj
  - nc is the number of examples for which v = vj and a = ai
  - p is a prior estimate for P^(ai|vj)
  - m is the weight given to the prior (i.e., the number of "virtual" examples)
Training Dataset

- Classes:
  - C1: buys_computer = "yes"
  - C2: buys_computer = "no"
- Data sample to classify:
  X = (age = "<=30", income = "medium", student = "yes", credit_rating = "fair")
- Training data: the 14-sample buys_computer table shown earlier (Quinlan's ID3 example)
Naïve Bayesian Classifier: Example

- Compute P(X|Ci) for each class:

  P(age = "<=30" | buys_computer = "yes") = 2/9 = 0.222
  P(age = "<=30" | buys_computer = "no") = 3/5 = 0.6
  P(income = "medium" | buys_computer = "yes") = 4/9 = 0.444
  P(income = "medium" | buys_computer = "no") = 2/5 = 0.4
  P(student = "yes" | buys_computer = "yes") = 6/9 = 0.667
  P(student = "yes" | buys_computer = "no") = 1/5 = 0.2
  P(credit_rating = "fair" | buys_computer = "yes") = 6/9 = 0.667
  P(credit_rating = "fair" | buys_computer = "no") = 2/5 = 0.4

- For X = (age = "<=30", income = "medium", student = "yes", credit_rating = "fair"):

  P(X | buys_computer = "yes") = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
  P(X | buys_computer = "no") = 0.6 × 0.4 × 0.2 × 0.4 = 0.019

  P(X | buys_computer = "yes") × P(buys_computer = "yes") = 0.028
  P(X | buys_computer = "no") × P(buys_computer = "no") = 0.007

- Therefore X belongs to class buys_computer = "yes"
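A quick arithmetic check of the numbers above:

```python
# Priors and class-conditional products from the 14-sample buys_computer data.
p_yes, p_no = 9 / 14, 5 / 14
px_yes = 2/9 * 4/9 * 6/9 * 6/9   # age<=30, income=medium, student=yes, credit=fair | yes
px_no = 3/5 * 2/5 * 1/5 * 2/5    # the same attribute values | no
print(round(px_yes, 3), round(px_no, 3))                   # 0.044 and 0.019
print(round(px_yes * p_yes, 3), round(px_no * p_no, 3))    # 0.028 and 0.007
```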
NBC: Another Example

[Figure: a second worked Naïve Bayes example, shown graphically in the original slides]
Naïve Bayesian Classifier: Comments

- Advantages
  - Easy to implement
  - Good results obtained in most cases
- Disadvantages
  - Assumes class-conditional independence, which can cost accuracy
  - In practice, dependencies exist among variables
    - E.g., hospital patients: profile (age, family history, etc.), symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.)
    - Dependencies among these cannot be modeled by the Naïve Bayesian classifier
- How to deal with these dependencies?
  - Bayesian belief networks
Bayesian Belief Networks

Interesting because:

- The Naïve Bayes assumption of conditional independence is too restrictive
- But the problem is intractable without some such assumptions...
- Bayesian belief networks (also called Bayes nets) describe conditional independence among subsets of variables
  - This allows combining prior knowledge about (in)dependencies among variables with observed training data
Conditional Independence

- Definition: X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y given the value of Z; that is, if

  (∀ xi, yj, zk) P(X = xi | Y = yj, Z = zk) = P(X = xi | Z = zk)

  More compactly, we write P(X|Y, Z) = P(X|Z)

- Example: Thunder is conditionally independent of Rain, given Lightning:

  P(Thunder | Rain, Lightning) = P(Thunder | Lightning)
Bayesian Belief Network (I)

- Provides a graphical representation of probabilistic relationships among a set of random variables
- Consists of:
  - A directed acyclic graph (DAG)
    - each node corresponds to a variable
    - each arc corresponds to a dependence relationship between a pair of variables
  - A probability table associating each node with its immediate parents
Bayesian Belief Network (II)

- The network represents a set of conditional independence assertions:
  - A node in a Bayesian network is conditionally independent of all of its nondescendants, given its parents
Probability Table

- If X does not have any parents, the table contains the prior probability P(X)
- If X has only one parent Y, the table contains the conditional probability P(X|Y)
- If X has multiple parents Y1, Y2, …, Yk, the table contains the conditional probability P(X|Y1, Y2, …, Yk)
Example of Bayesian Belief Network

[Figure: an example Bayesian belief network]
Inference in Bayesian Networks

- How can one infer the (probabilities of) values of one or more network variables, given observed values of others?
  - The Bayes net contains all the information needed for this inference
  - If only one variable has an unknown value, it is easy to infer
  - In the general case, the problem is NP-hard
- In practice, inference can succeed in many cases:
  - Exact inference methods work well for some network structures
  - Monte Carlo methods "simulate" the network randomly to calculate approximate solutions
Learning of Bayesian Networks

- Several variants of this mining task:
  - The network structure might be known or unknown
  - The training examples might provide values of all network variables, or only some
- If the structure is known and all variables are observed:
  - then it is as easy as training a Naïve Bayes classifier
Summary: Bayesian Belief Networks

- Combine prior knowledge with observed data
- The impact of prior knowledge (when correct!) is to lower the sample complexity
- Active research area:
  - extend from boolean to real-valued variables
  - parameterized distributions instead of tables
  - extend to first-order instead of propositional systems
  - more effective inference methods
  - ...
Classification Accuracy: Estimating Error Rates

- Partition: training-and-testing
  - use two independent data sets, e.g., training set (2/3) and test set (1/3)
  - used for data sets with a large number of samples
- Cross-validation
  - divide the data set into k subsamples
  - use k−1 subsamples as training data and one subsample as test data (k-fold cross-validation)
  - for data sets of moderate size
- Bootstrapping (leave-one-out)
  - for small data sets
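A minimal k-fold cross-validation sketch in Python follows; `train(X, y)` returning a model with a `predict(x)` method is an assumed interface for whatever classifier is being evaluated.

```python
import random

def k_fold_error(X, y, train, k=10, seed=0):
    """Estimate the error rate of a classifier by k-fold cross-validation."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]          # k roughly equal subsamples
    errors = 0
    for i in range(k):
        test = folds[i]
        tr = [j for f in range(k) if f != i for j in folds[f]]
        # Train on k-1 subsamples, test on the held-out subsample.
        model = train([X[j] for j in tr], [y[j] for j in tr])
        errors += sum(model.predict(X[j]) != y[j] for j in test)
    return errors / len(X)                         # overall error-rate estimate
```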
Bagging and Boosting

General idea:

- Training data → classification method (CM) → classifier C
- Altered training data → CM → classifier C1
- Altered training data → CM → classifier C2
- ...
- Aggregation of C1, C2, ... → combined classifier C*
Bagging

- Given a set S of s samples
- Generate a bootstrap sample T from S; cases in S may not appear in T or may appear more than once
- Repeat this sampling procedure to obtain a sequence of k independent training sets
- A corresponding sequence of classifiers C1, C2, …, Ck is constructed, one for each of these training sets, using the same classification algorithm
- To classify an unknown sample X, let each classifier predict, i.e., vote
- The bagged classifier C* counts the votes and assigns X to the class with the most votes
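The procedure above can be sketched in Python as follows; `train(X, y)` returning a classifier with a `predict(x)` method is again an assumed interface.

```python
import random
from collections import Counter

def bagging(X, y, train, k=25, seed=0):
    rng = random.Random(seed)
    s = len(X)
    classifiers = []
    for _ in range(k):
        # Bootstrap sample T from S: draw s cases with replacement, so some
        # cases may be missing from T and others may appear more than once.
        idx = [rng.randrange(s) for _ in range(s)]
        classifiers.append(train([X[i] for i in idx], [y[i] for i in idx]))

    def bagged_predict(x):
        votes = Counter(c.predict(x) for c in classifiers)  # each classifier votes
        return votes.most_common(1)[0][0]                   # class with the most votes
    return bagged_predict
```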
Boosting Technique - Algorithm

- Assign every example an equal weight 1/N
- For t = 1, 2, …, T do:
  - Obtain a hypothesis (classifier) h(t) under the weights w(t)
  - Calculate the error of h(t) and re-weight the examples based on the error; each classifier depends on the previous ones, and samples that are incorrectly predicted are weighted more heavily
  - Normalize w(t+1) to sum to 1 (weights assigned to different classifiers sum to 1)
- Output a weighted sum of all the hypotheses, with each hypothesis weighted according to its accuracy on the training set
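Below is a sketch of this reweighting loop in the AdaBoost style; it assumes class labels in {-1, +1} and a `weak_learn(X, y, w)` function that returns a classifier `h(x)` trained with example weights w (both are assumptions for illustration, not details from the slide).

```python
import math

def boost(X, y, weak_learn, T=10):
    n = len(X)
    w = [1.0 / n] * n                      # equal initial weights 1/N
    hypotheses = []                        # (classifier, weight) pairs
    for _ in range(T):
        h = weak_learn(X, y, w)
        err = sum(wi for wi, xi, yi in zip(w, X, y) if h(xi) != yi)
        err = max(err, 1e-12)              # guard against a perfect weak learner
        if err >= 0.5:                     # weak learner no better than chance: stop
            break
        alpha = 0.5 * math.log((1 - err) / err)
        hypotheses.append((h, alpha))
        # Re-weight: incorrectly predicted samples get heavier, then normalize to sum to 1.
        w = [wi * math.exp(-alpha * yi * h(xi)) for wi, xi, yi in zip(w, X, y)]
        z = sum(w)
        w = [wi / z for wi in w]

    def predict(x):                        # weighted sum (vote) of all hypotheses
        s = sum(alpha * h(x) for h, alpha in hypotheses)
        return 1 if s >= 0 else -1
    return predict
```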
Bagging and Boosting

- "Experiments with a new boosting algorithm" - Freund et al. (AdaBoost)
- "Bagging Predictors" - Breiman
- "Boosting Naïve Bayesian Learning on a large subset of MEDLINE" - W. Wilbur
Multiple Classifiers Combination

- Parallel combination
- Sequential combination
References (1)

- C. Apte and S. Weiss. Data mining with decision trees and decision rules. Future Generation Computer Systems, 13, 1997.
- L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth International Group, 1984.
- C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery, 2(2): 121-168, 1998.
- P. K. Chan and S. J. Stolfo. Learning arbiter and combiner trees from partitioned data for scaling machine learning. In Proc. 1st Int. Conf. Knowledge Discovery and Data Mining (KDD'95), pages 39-44, Montreal, Canada, August 1995.
- U. M. Fayyad. Branching on attribute values in decision tree generation. In Proc. 1994 AAAI Conf., pages 601-606, AAAI Press, 1994.
- J. Gehrke, R. Ramakrishnan, and V. Ganti. RainForest: A framework for fast decision tree construction of large datasets. In Proc. 1998 Int. Conf. Very Large Data Bases, pages 416-427, New York, NY, August 1998.
- J. Gehrke, V. Ganti, R. Ramakrishnan, and W.-Y. Loh. BOAT -- Optimistic Decision Tree Construction. In SIGMOD'99, Philadelphia, Pennsylvania, 1999.
References (2)

- M. Kamber, L. Winstone, W. Gong, S. Cheng, and J. Han. Generalization and decision tree induction: Efficient classification in data mining. In Proc. 1997 Int. Workshop on Research Issues on Data Engineering (RIDE'97), Birmingham, England, April 1997.
- B. Liu, W. Hsu, and Y. Ma. Integrating Classification and Association Rule Mining. In Proc. 1998 Int. Conf. Knowledge Discovery and Data Mining (KDD'98), New York, NY, Aug. 1998.
- W. Li, J. Han, and J. Pei. CMAR: Accurate and Efficient Classification Based on Multiple Class-Association Rules. In Proc. 2001 Int. Conf. on Data Mining (ICDM'01), San Jose, CA, Nov. 2001.
- J. Magidson. The CHAID approach to segmentation modeling: Chi-squared automatic interaction detection. In R. P. Bagozzi, editor, Advanced Methods of Marketing Research, pages 118-159. Blackwell Business, Cambridge, Massachusetts, 1994.
- M. Mehta, R. Agrawal, and J. Rissanen. SLIQ: A fast scalable classifier for data mining. In Proc. EDBT'96, Avignon, France, March 1996.
References (3)

- T. M. Mitchell. Machine Learning. McGraw Hill, 1997.
- S. K. Murthy. Automatic Construction of Decision Trees from Data: A Multi-Disciplinary Survey. Data Mining and Knowledge Discovery, 2(4): 345-389, 1998.
- J. R. Quinlan. Induction of decision trees. Machine Learning, 1:81-106, 1986.
- J. R. Quinlan. Bagging, boosting, and C4.5. In Proc. 13th Natl. Conf. on Artificial Intelligence (AAAI'96), pages 725-730, Portland, OR, Aug. 1996.
- R. Rastogi and K. Shim. PUBLIC: A decision tree classifier that integrates building and pruning. In Proc. 1998 Int. Conf. Very Large Data Bases, pages 404-415, New York, NY, August 1998.
- J. Shafer, R. Agrawal, and M. Mehta. SPRINT: A scalable parallel classifier for data mining. In Proc. 1996 Int. Conf. Very Large Data Bases, pages 544-555, Bombay, India, Sept. 1996.
- S. M. Weiss and C. A. Kulikowski. Computer Systems that Learn: Classification and Prediction Methods from Statistics, Neural Nets, Machine Learning, and Expert Systems. Morgan Kaufmann, 1991.
- S. M. Weiss and N. Indurkhya. Predictive Data Mining. Morgan Kaufmann, 1997.