CENG 464
Introduction to Data Mining
Supervised vs. Unsupervised Learning
• Supervised learning (classification)
– Supervision: The training data (observations,
measurements, etc.) are accompanied by labels indicating
the class of the observations
– New data is classified based on the training set
• Unsupervised learning (clustering)
– The class labels of the training data are unknown
– Given a set of measurements, observations, etc. with the
aim of establishing the existence of classes or clusters in
the data
2
Classification: Definition
• Given a collection of records (training set)
– Each record contains a set of attributes, one of the attributes is the
class.
• Find a model for class attribute as a function of the values of other
attributes.
• Goal: previously unseen records should be assigned a class as
accurately as possible.
– A test set is used to determine the accuracy of the model. Usually, the
given data set is divided into training and test sets, with training set
used to build the model and test set used to validate it.
3
Classification: Definition (example training set)

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Prediction Problems: Classification vs.
Numeric Prediction
• Classification :
– predicts categorical class labels (discrete or nominal)
– classifies data (constructs a model) based on the training set and the
values (class labels) in a classifying attribute and uses it in classifying
new data
• Numeric Prediction
– models continuous-valued functions, i.e., predicts unknown or missing
values
• Typical applications
– Credit/loan approval:
– Medical diagnosis: if a tumor is cancerous or benign
– Fraud detection: if a transaction is fraudulent
– Web page categorization: which category it is
5
Classification—A Two-Step Process
• Model construction: describing a set of predetermined classes
– Each tuple/sample is assumed to belong to a predefined class, as
determined by the class label attribute
– The set of tuples used for model construction is training set
– The model is represented as classification rules, decision trees, or
mathematical formulae
• Model usage: for classifying future or unknown objects
– Estimate accuracy of the model
• The known label of test sample is compared with the classified
result from the model
• Accuracy rate is the percentage of test set samples that are
correctly classified by the model
• Test set is independent of training set (otherwise overfitting)
– If the accuracy is acceptable, use the model to classify new data
• Note: If the test set is used to select models, it is called validation (test) set
6
Process (1): Model Construction

Training Data:

NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

The training data is fed into a classification algorithm, which produces the classifier (model):

IF rank = ‘professor’ OR years > 6
THEN tenured = ‘yes’
Process (2): Using the Model in Prediction

Testing Data:

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

The classifier is then applied to unseen data, e.g. (Jeff, Professor, 4) → Tenured?
Illustrating Classification Task

Training and test sets are randomly sampled from the given data. A supervised learning algorithm learns a model from the training set (induction); the model is then applied to the test set (deduction) and its accuracy is measured. The goal is to find a mapping or function that can predict the class label of a given tuple X.

Training Set:

Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set:

Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?
Classification Techniques
• Decision Tree based Methods
• Bayes Classification Methods
• Rule-based Methods
• Nearest-Neighbor Classifiers
• Artificial Neural Networks
• Support Vector Machines
• Memory based reasoning
Example of a Decision Tree

Training Data: the 10-record Refund/Marital Status/Taxable Income/Cheat table shown above.

Model: Decision Tree (splitting attributes: Refund, MarSt, TaxInc)

Refund?
  Yes → NO
  No  → MarSt?
          Single, Divorced → TaxInc?
                               < 80K → NO
                               > 80K → YES
          Married → NO

Root node and internal nodes hold attribute test conditions; leaf nodes hold class labels.
Another Example of Decision Tree

The same training data also fits a different tree:

MarSt?
  Married → NO
  Single, Divorced → Refund?
                       Yes → NO
                       No  → TaxInc?
                                < 80K → NO
                                > 80K → YES

There could be more than one tree that fits the same data!
Decision Tree Classification Task

The tree induction algorithm learns a decision tree (the model) from the training set (induction); the tree is then applied to the test set to deduce class labels (deduction). The training set (Tid 1-10 with Attrib1, Attrib2, Attrib3 and Class) and the test set (Tid 11-15 with unknown class) are the same as in the "Illustrating Classification Task" slide above.
Apply Model to Test Data

Test Data:

Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?

Start from the root of the tree and follow the branches that match the record:
1. Refund? The record has Refund = No, so take the "No" branch to the MarSt node.
2. MarSt? The record has Marital Status = Married, so take the "Married" branch.
3. The "Married" branch leads to a leaf labelled NO.

Assign Cheat = "No" to the record.

(Model: the same decision tree as before: Refund = Yes → NO; Refund = No → MarSt; MarSt = Single, Divorced → TaxInc (< 80K → NO, > 80K → YES); MarSt = Married → NO.)
Decision Tree Classification Task

(Recap: the same induction/deduction workflow and the same training and test sets as above. The tree induction algorithm learns a decision tree from the training set, and the tree is applied to the test set to deduce the class labels.)
Decision Tree Induction
• Many Algorithms:
– Hunt’s Algorithm
– ID3, C4.5
– CART
– SLIQ,SPRINT
21
Algorithm for Decision Tree Induction
• Basic algorithm (a greedy algorithm)
– Tree is constructed in a top-down recursive divide-and-conquer
manner
– At start, all the training examples are at the root
– Attributes are categorical (if continuous-valued, they are
discretized in advance)
– Examples are partitioned recursively based on selected
attributes
– Test attributes are selected on the basis of a heuristic or
statistical measure (e.g., information gain)
• Conditions for stopping partitioning
– All samples for a given node belong to the same class
– There are no remaining attributes for further partitioning –
majority voting is employed for classifying the leaf
– There are no samples left
22
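A minimal Python sketch of this greedy top-down procedure (not any specific algorithm from the list above; the tuple-based data layout, the helper names build_tree/majority_class and the pluggable select_attribute heuristic are illustrative assumptions):

from collections import Counter

def majority_class(rows):
    """Majority vote over the class label (last element of each row)."""
    return Counter(row[-1] for row in rows).most_common(1)[0][0]

def build_tree(rows, attributes, select_attribute, default=None):
    """Greedy top-down, divide-and-conquer tree induction.
    rows: list of tuples whose last element is the class label.
    attributes: list of (name, column index) pairs still available for splitting.
    select_attribute: heuristic (e.g. information gain) picking the best pair."""
    if not rows:                                   # no samples left
        return ("leaf", default)
    if len({row[-1] for row in rows}) == 1:        # all samples in the same class
        return ("leaf", rows[0][-1])
    if not attributes:                             # no attributes left: majority vote
        return ("leaf", majority_class(rows))
    name, idx = select_attribute(rows, attributes)
    remaining = [a for a in attributes if a[1] != idx]
    children = {}
    for value in {row[idx] for row in rows}:       # partition on the chosen attribute
        subset = [row for row in rows if row[idx] == value]
        children[value] = build_tree(subset, remaining, select_attribute,
                                     default=majority_class(rows))
    return ("node", name, idx, children)

The three early returns correspond to the three stopping conditions listed above.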
Tree Induction
• Greedy strategy.
– Split the records based on an attribute test that
optimizes certain criterion.
• Issues
– Determine how to split the records
• How to specify the attribute test condition?
• How to determine the best split?
– Determine when to stop splitting
23
How to Specify Test Condition?
• Depends on attribute types
– Nominal
– Ordinal
– Continuous
• Depends on number of ways to split
– 2-way split
– Multi-way split
24
Splitting Based on Nominal Attributes
• Multi-way split: use as many partitions as distinct values.
  Example: CarType → {Family}, {Sports}, {Luxury}
• Binary split: divides the values into two subsets; need to find the optimal partitioning.
  Example: CarType → {Sports, Luxury} vs. {Family}, OR {Family, Luxury} vs. {Sports}
Splitting Based on Ordinal Attributes
• Multi-way split: use as many partitions as distinct values.
  Example: Size → {Small}, {Medium}, {Large}
• Binary split: divides the values into two subsets; need to find the optimal partitioning.
  Example: Size → {Small, Medium} vs. {Large}, OR {Small} vs. {Medium, Large}
• What about the split {Small, Large} vs. {Medium}? It does not respect the order of the values.
Splitting Based on Continuous Attributes
• Different ways of handling
– Discretization to form an ordinal categorical
attribute
• Static – discretize once at the beginning
• Dynamic – ranges can be found by equal interval
bucketing, equal frequency bucketing
(percentiles), or clustering.
– Binary Decision: (A < v) or (A ≥ v)
• consider all possible splits and finds the best cut
• can be more compute intensive
27
Splitting Based on Continuous Attributes
(i) Binary split: Taxable Income > 80K? → Yes / No
(ii) Multi-way split: Taxable Income? → < 10K, [10K, 25K), [25K, 50K), [50K, 80K), > 80K
How to determine the Best Split

Before splitting: 10 records of class C0, 10 records of class C1.

Candidate test conditions:
• Own Car?    Yes: C0 6, C1 4;  No: C0 4, C1 6
• Car Type?   Family: C0 1, C1 3;  Sports: C0 8, C1 0;  Luxury: C0 1, C1 7
• Student ID? c1 … c10: C0 1, C1 0 each;  c11 … c20: C0 0, C1 1 each

Which test condition is the best?
How to determine the Best Split
• Greedy approach: nodes with a homogeneous class distribution are preferred
• Need a measure of node impurity:
  C0: 5, C1: 5 → non-homogeneous, high degree of impurity
  C0: 9, C1: 1 → homogeneous, low degree of impurity
Attribute Selection-Splitting Rules Measures
(Measures of Node Impurity)
Provides a ranking for each attribute describing the given
training tuples. The attribute having the best score for the
measure is chosen as the splitting attribute for the given tuples.
• Information Gain-Entropy
• Gini Index
• Misclassification error
31
Brief Review of Entropy
[Figure: entropy curve for the two-class case, m = 2]
Attribute Selection Measure:
Information Gain (ID3/C4.5)
• Select the attribute with the highest information gain
• This attribute minimizes the information needed to classify the tuples in the resulting partitions and reflects the least randomness or impurity in these partitions
• Let p_i be the probability that an arbitrary tuple in D belongs to class C_i, estimated by |C_{i,D}| / |D|
• Expected information (entropy) needed to classify a tuple in D:
  Info(D) = - sum_{i=1}^{m} p_i log2(p_i)
• Information needed (after using A to split D into v partitions) to classify D:
  Info_A(D) = sum_{j=1}^{v} (|D_j| / |D|) × Info(D_j)
• Information gained by branching on attribute A:
  Gain(A) = Info(D) - Info_A(D)
33
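A small Python sketch of these formulas (the function names info and info_gain and the list-based data layout are illustrative assumptions):

import math
from collections import Counter

def info(labels):
    """Expected information (entropy): Info(D) = -sum p_i log2 p_i."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(values, labels):
    """Gain(A) = Info(D) - Info_A(D); A is given as the list of attribute
    values aligned with the class labels."""
    n = len(labels)
    partitions = {}
    for v, y in zip(values, labels):
        partitions.setdefault(v, []).append(y)
    info_a = sum(len(part) / n * info(part) for part in partitions.values())
    return info(labels) - info_a

# buys_computer example from the next slide: Info(D) = I(9,5) is about 0.940
labels = ["yes"] * 9 + ["no"] * 5
print(round(info(labels), 3))   # 0.94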
Attribute Selection: Information Gain

• Class P: buys_computer = “yes” (9 tuples); Class N: buys_computer = “no” (5 tuples)

  Info(D) = I(9,5) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940

• Splitting on age:

  age          p_i  n_i  I(p_i, n_i)
  youth         2    3   0.971
  middle-aged   4    0   0
  senior        3    2   0.971

  Info_age(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694

  (5/14) I(2,3) means the “youth” group has 5 out of the 14 samples, with 2 yes’es and 3 no’s.

• Hence Gain(age) = Info(D) - Info_age(D) = 0.246

• Similarly, Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit_rating) = 0.048

Training data (buys_computer):

age          income  student  credit_rating  buys_computer
youth        high    no       fair           no
youth        high    no       excellent      no
middle-aged  high    no       fair           yes
senior       medium  no       fair           yes
senior       low     yes      fair           yes
senior       low     yes      excellent      no
middle-aged  low     yes      excellent      yes
youth        medium  no       fair           no
youth        low     yes      fair           yes
senior       medium  yes      fair           yes
youth        medium  yes      excellent      yes
middle-aged  medium  no       excellent      yes
middle-aged  high    yes      fair           yes
senior       medium  no       excellent      no
Computing Information-Gain for
Continuous-Valued Attributes
• Let attribute A be a continuous-valued attribute
• Must determine the best split point for A
– Sort the value A in increasing order
– Typically, the midpoint between each pair of adjacent values
is considered as a possible split point
• (ai+ai+1)/2 is the midpoint between the values of ai and ai+1
– The point with the minimum expected information
requirement for A is selected as the split-point for A
• Split: D1 is the set of tuples in D satisfying A ≤ split-point, and D2 is the set of tuples in D satisfying A > split-point
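A possible Python sketch of this split-point search, using entropy and the midpoint candidates described above (all names are illustrative):

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split_point(values, labels):
    """Sort A, consider midpoints of adjacent distinct values as candidate
    split points, and return the one minimising Info_A(D)."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_info, best_split = float("inf"), None
    for i in range(n - 1):
        if pairs[i][0] == pairs[i + 1][0]:
            continue
        split = (pairs[i][0] + pairs[i + 1][0]) / 2     # (a_i + a_{i+1}) / 2
        left = [y for x, y in pairs if x <= split]
        right = [y for x, y in pairs if x > split]
        info_a = len(left) / n * entropy(left) + len(right) / n * entropy(right)
        if info_a < best_info:
            best_info, best_split = info_a, split
    return best_split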
Gain Ratio for Attribute Selection (C4.5)
• Information gain measure is biased towards attributes with a
large number of values
• C4.5 (a successor of ID3) uses gain ratio to overcome the
problem (normalization to information gain)
  SplitInfo_A(D) = - sum_{j=1}^{v} (|D_j| / |D|) log2(|D_j| / |D|)
– GainRatio(A) = Gain(A)/SplitInfo(A)
• Ex.
– gain_ratio(income) = 0.029/1.557 = 0.019
• The attribute with the maximum gain ratio is selected as the
splitting attribute
36
Gini Index (CART, IBM IntelligentMiner)
• If a data set D contains examples from n classes, the gini index gini(D) is defined as
  gini(D) = 1 - sum_{j=1}^{n} p_j^2
  where p_j is the relative frequency of class j in D
• If a data set D is split on A into two subsets D1 and D2, the gini index gini_A(D) is defined as
  gini_A(D) = (|D1| / |D|) gini(D1) + (|D2| / |D|) gini(D2)
• Reduction in impurity:
  Δgini(A) = gini(D) - gini_A(D)
• The attribute that provides the smallest gini_split(D) (or the largest reduction in impurity) is chosen to split the node (need to enumerate all the possible splitting points for each attribute)
37
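A short Python sketch of these Gini computations (function names and the binary-split signature are illustrative assumptions):

from collections import Counter

def gini(labels):
    """gini(D) = 1 - sum_j p_j^2."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(left_labels, right_labels):
    """gini_A(D) for a binary split of D into D1 and D2."""
    n = len(left_labels) + len(right_labels)
    return (len(left_labels) / n) * gini(left_labels) + \
           (len(right_labels) / n) * gini(right_labels)

# 9 "yes" / 5 "no" tuples as on the next slide: gini(D) is about 0.459
print(round(gini(["yes"] * 9 + ["no"] * 5), 3))   # 0.459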
Computation of Gini Index
• Ex. D has 9 tuples with buys_computer = “yes” and 5 with “no”:
  gini(D) = 1 - (9/14)^2 - (5/14)^2 = 0.459
• Suppose the attribute income partitions D into 10 tuples in D1: {low, medium} and 4 tuples in D2: {high}:
  gini_{income ∈ {low,medium}}(D) = (10/14) Gini(D1) + (4/14) Gini(D2)
  Gini_{income ∈ {low,high}} is 0.458 and Gini_{income ∈ {medium,high}} is 0.450; thus, split on {low, medium} (and {high}) since it has the lowest Gini index
• All attributes are assumed continuous-valued
• May need other tools, e.g., clustering, to get the possible split values
• Can be modified for categorical attributes
Comparing Attribute Selection Measures
• The three measures, in general, return good results but
– Information gain:
• biased towards multivalued attributes
– Gain ratio:
• tends to prefer unbalanced splits in which one partition is
much smaller than the others
– Gini index:
• biased to multivalued attributes
• has difficulty when # of classes is large
• tends to favor tests that result in equal-sized partitions
and purity in both partitions
39
Other Attribute Selection Measures
• CHAID: a popular decision tree algorithm, measure based on χ2 test for
independence
• C-SEP: performs better than info. gain and gini index in certain cases
• G-statistic: has a close approximation to χ2 distribution
• MDL (Minimal Description Length) principle (i.e., the simplest solution is
preferred):
– The best tree as the one that requires the fewest # of bits to both (1)
encode the tree, and (2) encode the exceptions to the tree
• Multivariate splits (partition based on multiple variable combinations)
– CART: finds multivariate splits based on a linear comb. of attrs.
• Which attribute selection measure is the best?
– Most give good results; none is significantly superior to the others
40
Overfitting and Tree Pruning
• Overfitting: An induced tree may overfit the training data
– Too many branches, some may reflect anomalies due to
noise or outliers
– Poor accuracy for unseen samples
• Two approaches to avoid overfitting
– Prepruning: Halt tree construction early, i.e., do not split a node
if this would result in the goodness measure falling below a
threshold
• Difficult to choose an appropriate threshold
– Postpruning: Remove branches from a “fully grown” tree—
get a sequence of progressively pruned trees
• Use a set of data different from the training data to
decide which is the “best pruned tree”
41
Decision Tree Based Classification
• Advantages:
– Inexpensive to construct
– Extremely fast at classifying unknown records
– Easy to interpret for small-sized trees
– Accuracy is comparable to other classification
techniques for many simple data sets
42
Chapter 8. Classification: Basic Concepts
• Classification: Basic Concepts
• Decision Tree Induction
• Bayes Classification Methods
• Rule-Based Classification
• Model Evaluation and Selection
• Techniques to Improve Classification Accuracy:
Ensemble Methods
• Summary
43
Bayesian Classification: Why?
• A statistical classifier: performs probabilistic prediction, i.e.,
predicts class membership probabilities
• Foundation: Based on Bayes’ Theorem.
• Performance: A simple Bayesian classifier, naïve Bayesian
classifier, has comparable performance with decision tree and
selected neural network classifiers
• Incremental: Each training example can incrementally
increase/decrease the probability that a hypothesis is correct —
prior knowledge can be combined with observed data
• Standard: Even when Bayesian methods are computationally
intractable, they can provide a standard of optimal decision
making against which other methods can be measured
44
Bayes’ Theorem: Basics
• Total probability theorem: P(B) = sum_{i=1}^{M} P(B | A_i) P(A_i)
• Bayes’ theorem: P(H | X) = P(X | H) P(H) / P(X)
– Let X be a data sample (“evidence”): class label is unknown
– Let H be a hypothesis that X belongs to class C
– Classification is to determine P(H|X), (i.e., posteriori probability): the
probability that the hypothesis holds given the observed data sample X
– P(H) (prior probability): the initial probability
• E.g., X will buy computer, regardless of age, income, …
– P(X): probability that sample data is observed
– P(X|H) (likelihood): the probability of observing the sample X, given that
the hypothesis holds
• E.g., Given that X will buy computer, the prob. that X is 31..40,
medium income
Prediction Based on Bayes’ Theorem
• Given training data X, posteriori probability of a hypothesis H,
P(H|X), follows Bayes’ theorem:
P(H | X) = P(X | H) P(H) / P(X)
• Informally, this can be viewed as
posteriori = likelihood x prior/evidence
• Predicts X belongs to Ci iff the probability P(Ci|X) is the highest
among all the P(Ck|X) for all the k classes
• Practical difficulty: It requires initial knowledge of many
probabilities, involving significant computational cost
46
Classification Is to Derive the Maximum Posteriori
• Let D be a training set of tuples and their associated class labels,
and each tuple is represented by an n-D attribute vector X = (x1,
x2, …, xn)
• Suppose there are m classes C1, C2, …, Cm.
• Classification is to derive the maximum posteriori, i.e., the
maximal P(Ci|X)
• This can be derived from Bayes’ theorem:
  P(C_i | X) = P(X | C_i) P(C_i) / P(X)
• Since P(X) is constant for all classes, only
  P(X | C_i) P(C_i)
  needs to be maximized
47
Naïve Bayes Classifier
• A simplified assumption: attributes are conditionally independent (i.e., no dependence relation between attributes):
  P(X | C_i) = prod_{k=1}^{n} P(x_k | C_i) = P(x_1 | C_i) × P(x_2 | C_i) × ... × P(x_n | C_i)
• This greatly reduces the computation cost: only count the class distribution
• If A_k is categorical, P(x_k | C_i) is the # of tuples in C_i having value x_k for A_k divided by |C_{i,D}| (# of tuples of C_i in D)
• If A_k is continuous-valued, P(x_k | C_i) is usually computed based on a Gaussian distribution with mean μ and standard deviation σ:
  g(x, μ, σ) = (1 / (sqrt(2π) σ)) e^{-(x-μ)^2 / (2σ^2)}   and   P(x_k | C_i) = g(x_k, μ_{C_i}, σ_{C_i})
48
Naïve Bayes Classifier: Training Dataset

Classes:
C1: buys_computer = ‘yes’
C2: buys_computer = ‘no’

Data to be classified:
X = (age = youth, income = medium, student = yes, credit_rating = fair)

Training data: the 14-tuple buys_computer data set shown earlier (age, income, student, credit_rating, buys_computer).
Naïve Bayes Classifier: An Example

(Using the 14-tuple buys_computer training data.)

• P(Ci): P(buys_computer = “yes”) = 9/14 = 0.643
         P(buys_computer = “no”) = 5/14 = 0.357
• Compute P(X|Ci) for each class:
  P(age = “youth” | buys_computer = “yes”) = 2/9 = 0.222
  P(age = “youth” | buys_computer = “no”) = 3/5 = 0.6
  P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
  P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
  P(student = “yes” | buys_computer = “yes”) = 6/9 = 0.667
  P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
  P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
  P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4
• X = (age = youth, income = medium, student = yes, credit_rating = fair)
  P(X|Ci):        P(X | buys_computer = “yes”) = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
                  P(X | buys_computer = “no”) = 0.6 × 0.4 × 0.2 × 0.4 = 0.019
  P(X|Ci)·P(Ci):  P(X | buys_computer = “yes”) · P(buys_computer = “yes”) = 0.028
                  P(X | buys_computer = “no”) · P(buys_computer = “no”) = 0.007
Therefore, X belongs to class “buys_computer = yes”
50
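A minimal Python sketch of a categorical naïve Bayes classifier along these lines (the names train_naive_bayes/predict and the tuple-based data layout are assumptions; no Laplacian correction is applied, see the next slide):

from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Estimate P(C_i) and P(x_k | C_i) by counting (categorical attributes only)."""
    priors = {c: n / len(labels) for c, n in Counter(labels).items()}
    cond = defaultdict(Counter)            # (class, attribute index) -> value counts
    for row, c in zip(rows, labels):
        for k, value in enumerate(row):
            cond[(c, k)][value] += 1
    return priors, cond

def predict(priors, cond, x):
    """Pick the class maximising P(C_i) * prod_k P(x_k | C_i)."""
    best_class, best_score = None, -1.0
    for c, prior in priors.items():
        score = prior
        for k, value in enumerate(x):
            counts = cond[(c, k)]
            score *= counts[value] / sum(counts.values())   # 0 if value unseen for this class
        if score > best_score:
            best_class, best_score = c, score
    return best_class

On the buys_computer data with X = (youth, medium, yes, fair), this reproduces the result above: P(X|yes)·P(yes) ≈ 0.028 > P(X|no)·P(no) ≈ 0.007, so the prediction is “yes”.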
Avoiding the Zero-Probability Problem
• Naïve Bayesian prediction requires each conditional probability to be non-zero; otherwise, the predicted probability will be zero:
  P(X | C_i) = prod_{k=1}^{n} P(x_k | C_i)
• Ex. Suppose a dataset with 1000 tuples, income=low (0),
income= medium (990), and income = high (10)
• Use Laplacian correction (or Laplacian estimator)
– Adding 1 to each case
Prob(income = low) = 1/1003
Prob(income = medium) = 991/1003
Prob(income = high) = 11/1003
– The “corrected” prob. estimates are close to their
“uncorrected” counterparts
51
Naïve Bayes Classifier: Comments
• Advantages
– Easy to implement
– Robust to noise
– Can handle null values
– Good results obtained in most of the cases
• Disadvantages
– Assumption: class conditional independence, therefore loss of
accuracy
– Practically, dependencies exist among variables
• E.g., hospitals: patients: Profile: age, family history, etc.
Symptoms: fever, cough etc., Disease: lung cancer, diabetes,
etc.
• Dependencies among these cannot be modeled by Naïve Bayes
Classifier
• How to deal with these dependencies? Bayesian Belief Networks
52
Chapter 8. Classification: Basic Concepts
• Classification: Basic Concepts
• Decision Tree Induction
• Bayes Classification Methods
• Rule-Based Classification
• Model Evaluation and Selection
• Techniques to Improve Classification Accuracy:
Ensemble Methods
• Summary
53
Using IF-THEN Rules for Classification
• Represent the knowledge in the form of IF-THEN rules
  R1: IF age = youth AND student = yes THEN buys_computer = yes
  – Rule antecedent/precondition vs. rule consequent
  – If a rule is satisfied by X, it covers the tuple; the rule is said to be triggered
  – If R1 is the rule satisfied, then the rule fires by returning the class prediction
• Assessment of a rule: coverage and accuracy
  – n_covers = # of tuples covered by R
  – n_correct = # of tuples correctly classified by R
  coverage(R) = n_covers / |D|
  accuracy(R) = n_correct / n_covers
  where D is the training data set
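A small Python sketch of these two measures, assuming a rule is represented as an (antecedent, consequent) pair and each tuple as an attribute dictionary plus a class label (all names illustrative):

def rule_coverage_accuracy(rule, data):
    """rule: (antecedent, consequent); antecedent maps attribute -> required value,
    consequent is the predicted class. data: list of (attribute dict, class) pairs."""
    antecedent, consequent = rule
    covered = [(x, y) for x, y in data
               if all(x.get(a) == v for a, v in antecedent.items())]
    if not covered:
        return 0.0, 0.0
    correct = sum(1 for _, y in covered if y == consequent)
    coverage = len(covered) / len(data)       # n_covers / |D|
    accuracy = correct / len(covered)         # n_correct / n_covers
    return coverage, accuracy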
Using IF-THEN Rules for Classification
• If more than one rule is triggered, we need conflict resolution
  – Size ordering: assign the highest priority to the triggering rule that has the “toughest” requirement (i.e., with the most attribute tests)
  – Rule ordering: prioritize rules beforehand
    • Class-based ordering: classes are sorted in order of decreasing importance, e.g. by prevalence or by misclassification cost per class. Within each class, rules are not ordered
    • Rule-based ordering (decision list): rules are organized into one long priority list, according to some measure of rule quality such as accuracy, coverage or size. The first rule satisfying X fires the class prediction; any other rule satisfying X is ignored. Each rule in the list implies the negation of the rules that come before it → difficult to interpret
• What if no rule is fired for X? Use a default rule!
55
Rule Extraction from a Decision Tree
• Rules are easier to understand than large trees
• One rule is created for each path from the root to a leaf; the attribute-value pairs along the path are logically ANDed to form the rule antecedent
• Each attribute-value pair along a path forms a conjunction; the leaf holds the class prediction
• Rules are mutually exclusive and exhaustive
  – Mutually exclusive: no two rules will be triggered for the same tuple
  – Exhaustive: there is one rule for each possible attribute-value combination → no need for a default rule
56
Rule Extraction from a Decision Tree

age?
  youth       → student?        (no → no, yes → yes)
  middle_aged → yes
  senior      → credit_rating?  (excellent → no, fair → yes)

• Example: rule extraction from our buys_computer decision tree
  IF age = youth AND student = no                THEN buys_computer = no
  IF age = youth AND student = yes               THEN buys_computer = yes
  IF age = middle_aged                           THEN buys_computer = yes
  IF age = senior AND credit_rating = excellent  THEN buys_computer = no
  IF age = senior AND credit_rating = fair       THEN buys_computer = yes
Rule Induction: Sequential Covering Method
• Sequential covering algorithm: Extracts rules directly from training
data
• Typical sequential covering algorithms: FOIL, AQ, CN2, RIPPER
• Rules are learned sequentially, each for a given class Ci will cover
many tuples of Ci but none (or few) of the tuples of other classes
• Steps:
– Rules are learned one at a time
– Each time a rule is learned, the tuples covered by the rules are
removed
– Repeat the process on the remaining tuples until termination
condition, e.g., when no more training examples or when the
quality of a rule returned is below a user-specified threshold
• Compared with decision-tree induction, which learns a set of rules
simultaneously, sequential covering learns rules one at a time
58
Sequential Covering Algorithm

When learning a rule for a class C, we would like the rule to cover all or most of the training tuples of class C and none or few of the tuples from other classes.

while (enough target tuples left)
  generate a rule
  remove positive target tuples satisfying this rule

[Figure: successive rules, each covering a different region of the positive examples (Rule 1, Rule 2, Rule 3).]
59
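A rough Python sketch of this loop, assuming a user-supplied learn_one_rule routine (e.g. the greedy specialization described next) that returns a rule together with a quality score; rules and tuples use the same representation as in the earlier sketch, and all names are illustrative:

def sequential_covering(data, target_class, learn_one_rule, min_quality=0.0):
    """Learn rules for target_class one at a time; after each rule is learned,
    remove the positive tuples it covers and repeat on what remains."""
    rules, remaining = [], list(data)
    while any(y == target_class for _, y in remaining):
        rule, quality = learn_one_rule(remaining, target_class)
        if quality < min_quality:                  # rule quality below threshold: stop
            break
        antecedent, _ = rule
        covered = [(x, y) for x, y in remaining
                   if y == target_class and
                   all(x.get(a) == v for a, v in antecedent.items())]
        if not covered:                            # rule makes no progress: stop
            break
        rules.append(rule)
        remaining = [t for t in remaining if t not in covered]
    return rules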
How to Learn-One-Rule?
Two approaches:
Specialization:
• Start with the most general rule possible: the empty rule → class y
• The best attribute-value pair from a list A is added to the antecedent
• Continue until the rule performance measure cannot improve further
  – IF income = high THEN loan_decision = accept
  – IF income = high AND credit_rating = excellent THEN loan_decision = accept
• Greedy algorithm: always add the attribute-value pair which is best at the moment
60
How to Learn-One-Rule?
Two approaches:
Generalization:
• Start with a randomly selected positive tuple, converted to a rule that covers it
  – The tuple (overcast, high, false, P) can be converted to the rule:
    IF outlook = overcast AND humidity = high AND windy = false THEN class = P
• Choose one attribute-value pair and remove it so that the rule covers more positive examples
• Repeat the process until the rule starts to cover negative examples
61
How to Learn-One-Rule?
Rule-quality measures are used to decide if appending a test to the rule’s condition will result in an improved rule: accuracy, coverage.
• Consider R1, which covers 40 tuples and correctly classifies 38 of them, whereas R2 covers 2 tuples and correctly classifies all of them: which rule is better? Is accuracy enough?
• Different measures: FOIL gain, likelihood ratio statistic, chi-square statistic
62
How to Learn-One-Rule?
Rule-quality measure FOIL_Gain: checks whether ANDing a new condition to rule R (yielding R′) results in a better rule
• considers both coverage and accuracy
  – FOIL_Gain (in FOIL & RIPPER) assesses the information gained by extending the condition:
    FOIL_Gain = pos′ × ( log2( pos′ / (pos′ + neg′) ) - log2( pos / (pos + neg) ) )
    where pos and neg are the numbers of positively and negatively covered tuples of R, and pos′ and neg′ are the numbers of positively and negatively covered tuples of R′
• favors rules that have high accuracy and cover many positive tuples
• There is no separate test set for evaluating rules, but rule pruning can be performed by removing a condition:
    FOIL_Prune(R) = (pos - neg) / (pos + neg)
    where pos and neg are the numbers of positive and negative tuples covered by R.
    If FOIL_Prune is higher for the pruned version of R, prune R
63
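These two measures can be written directly in Python (a sketch; the function signatures are illustrative):

import math

def foil_gain(pos, neg, pos_new, neg_new):
    """FOIL_Gain for extending rule R (covering pos/neg tuples) to R'
    (covering pos_new/neg_new tuples)."""
    if pos_new == 0:
        return 0.0
    return pos_new * (math.log2(pos_new / (pos_new + neg_new))
                      - math.log2(pos / (pos + neg)))

def foil_prune(pos, neg):
    """FOIL_Prune(R) = (pos - neg) / (pos + neg); prune R if the pruned
    version of R scores higher."""
    return (pos - neg) / (pos + neg)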
Nearest Neighbour Approach
• General Idea
– The Model: a set of training examples stored in memory
– Lazy Learning: delaying the decision to the time of
classification. In other words, there is no training!
– To classify an unseen record: compute its proximity to all
training examples and locate 1 or k nearest neighbours
examples. The nearest neighbours determine the class of the
record (e.g. majority vote)
– Rationale: “If it walks like a duck, quacks like a duck, and looks
like a duck, it probably is a duck”.
Nearest Neighbour Approach
• kNN Classification Algorithm

algorithm kNN (Tr: training set; k: integer; r: data record): Class
begin
  for each training example t in Tr do
    calculate proximity d(t, r) upon descriptive attributes
  end for;
  select the top k nearest neighbours into set D accordingly;
  Class := majority class in D;
  return Class
end;
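A runnable Python version of the same idea for numeric attributes, using Euclidean distance as the proximity measure (names and the toy data are illustrative):

import math
from collections import Counter

def knn_classify(training, k, record):
    """training: list of (attribute vector, class) pairs with numeric attributes.
    Returns the majority class among the k nearest neighbours of record."""
    def distance(a, b):                      # Euclidean proximity measure
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    neighbours = sorted(training, key=lambda t: distance(t[0], record))[:k]
    return Counter(c for _, c in neighbours).most_common(1)[0][0]

# Example with two descriptive attributes
train = [((1.0, 1.0), "P"), ((1.2, 0.8), "P"), ((5.0, 5.0), "N"), ((5.5, 4.5), "N")]
print(knn_classify(train, k=3, record=(1.1, 1.0)))   # "P"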
Nearest Neighbour Approach
• PEBLS Algorithm
  – Class-based similarity measure is used
  – A nearest neighbour algorithm (k = 1)
  – Examples in memory have weights (exemplars)
  – Simple training: assigning and refining weights
  – A different proximity measure
  – Algorithm outline:
    1. Build value difference tables for descriptive attributes (in preparation for measuring distances between examples)
    2. For each training example, refine the weight of its nearest neighbour
    3. Refine the weights of some training examples when classifying validation examples
Nearest Neighbour Approach
• PEBLS: Value Difference Table

For an attribute A with values A1, A2, …, Am, the table stores d(Ai, Aj) for every pair of values:

  d(V1, V2) = sum_{i=1}^{k} | C_{i,V1} / C_{V1} - C_{i,V2} / C_{V2} | ^ r ,   with r set to 1

  C_{V1}: total number of examples with value V1
  C_{V2}: total number of examples with value V2
  C_{i,V1}: total number of examples with value V1 and of class i
  C_{i,V2}: total number of examples with value V2 and of class i

Weather training data (Outlook, Temperature, Humidity, Windy → Class):

Outlook   Temperature  Humidity  Windy  Class
sunny     hot          high      FALSE  N
sunny     hot          high      TRUE   N
overcast  hot          high      FALSE  P
rain      mild         high      FALSE  P
rain      cool         normal    FALSE  P
rain      cool         normal    TRUE   N
overcast  cool         normal    TRUE   P
sunny     mild         high      FALSE  N
sunny     cool         normal    FALSE  P
rain      mild         normal    FALSE  P
sunny     mild         normal    TRUE   P
overcast  mild         high      TRUE   P
overcast  hot          normal    FALSE  P
rain      mild         high      TRUE   N

Value Difference Table for Outlook:

           sunny  overcast  rain
sunny      0      1.2       0.4
overcast   1.2    0         0.8
rain       0.4    0.8       0

  d(sunny, overcast) = |2/5 - 4/4| + |3/5 - 0/4| = 1.2
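A small Python sketch of how such a value difference table could be built from the data (r = 1 as on the slide; function and variable names are illustrative):

from collections import Counter, defaultdict

def value_difference_table(values, labels, r=1):
    """PEBLS-style value difference between attribute values:
    d(v1, v2) = sum_i | C_i,v1 / C_v1 - C_i,v2 / C_v2 | ** r."""
    totals = Counter(values)                           # C_v
    per_class = defaultdict(Counter)                   # C_i,v
    for v, c in zip(values, labels):
        per_class[v][c] += 1
    classes = set(labels)
    table = {}
    for v1 in totals:
        for v2 in totals:
            table[(v1, v2)] = sum(
                abs(per_class[v1][c] / totals[v1] - per_class[v2][c] / totals[v2]) ** r
                for c in classes)
    return table

# Outlook column and class column of the weather data above
outlook = ["sunny"] * 5 + ["overcast"] * 4 + ["rain"] * 5
classes = ["N", "N", "P", "N", "P",          # sunny: 2 P, 3 N
           "P", "P", "P", "P",               # overcast: 4 P, 0 N
           "P", "P", "N", "P", "N"]          # rain: 3 P, 2 N
print(round(value_difference_table(outlook, classes)[("sunny", "overcast")], 1))  # 1.2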
Nearest Neighbour Approach
• PEBLS: Distance Function

  Δ(X, Y) = w_X · w_Y · sum_{i=1}^{m} d(x_i, y_i)^2

  where w_X, w_Y: weights for X and Y; m: the number of attributes; x_i, y_i: values of the i-th attribute for X and Y.

  w_X = T / C

  where T: the total number of times that X is selected as the nearest neighbour, and C: the total number of times that X correctly classifies examples.
Nearest Neighbour Approach
• PEBLS: Distance Function (Example)

Using the 14-row weather training data above, the value difference tables are:

Outlook:
           sunny  overcast  rain
sunny      0      1.2       0.4
overcast   1.2    0         0.8
rain       0.4    0.8       0

Temperature:
       hot   mild  cool
hot    0     0.33  0.5
mild   0.33  0     0.33
cool   0.5   0.33  0

Humidity:
        high   normal
high    0      0.857
normal  0.857  0

Windy:
       TRUE   FALSE
TRUE   0      0.418
FALSE  0.418  0

Assuming row1.weight = row2.weight = 1:

Δ(row1, row2) = d(row1_outlook, row2_outlook)^2 + d(row1_temperature, row2_temperature)^2
              + d(row1_humidity, row2_humidity)^2 + d(row1_windy, row2_windy)^2
              = d(sunny, sunny)^2 + d(hot, hot)^2 + d(high, high)^2 + d(false, true)^2
              = 0 + 0 + 0 + (1/2)^2 = 1/4
Nearest Neighbour Approach
• PEBLS: Example

[The slide lists the 14 weather training examples together with the exemplar weights assigned to them after training (values such as 1, 1.5 and 2), and shows an unseen record (outlook = overcast, temperature = hot, humidity = high, windy = false) being classified as class P by its nearest neighbour.]
Artificial Neural Network Approach
– Our brains are made up of about 100 billion
tiny units called neurons.
– Each neuron is connected to thousands of
other neurons and communicates with them
via electrochemical signals.
– Signals coming into the neuron are received
via junctions called synapses, these in turn
are located at the end of branches of the
neuron cell called dendrites.
– The neuron continuously receives signals
from these inputs
– What the neuron does is sum up the inputs to
itself in some way and then, if the end result
is greater than some threshold value, the
neuron fires.
– It generates a voltage and outputs a signal
along something called an axon.
Artificial Neural Network Approach
• General Idea
– The Model: A network of connected artificial neurons
– Training: select a specific network topology and use the training
example to tune the weights attached on the links connecting
the neurons
– To classify an unseen record X, feed the descriptive attribute
values of the record into the network as inputs. The network
computes an output value that can be converted to a class label
Artificial Neural Network Approach
• Artificial Neuron (Unit)

  Inputs i1, i2, i3 with weights w1, w2, w3 feed a summing unit S, whose output is y.

  Sum function:             x = w1·i1 + w2·i2 + w3·i3
  Transformation function:  sigmoid(x) = 1 / (1 + e^(-x))

  [Figure: the S-shaped sigmoid curve of y against x.]
Artificial Neural Network Approach
• A neural network can have many hidden layers, but one layer is normally considered sufficient
• The more units a hidden layer has, the more capacity for pattern recognition
• Constant inputs can be fed into the units in the hidden and output layers as inputs
• A network with links from lower layers to upper layers is a feed-forward network
• A network with links between nodes of the same layer is a recurrent network
Artificial Neural Network Approach
• Artificial Neuron (Perceptron)

  The same structure as the unit above: inputs i1, i2, i3 with weights w1, w2, w3 feed a summing unit S with output y.

  Sum function:             x = w1·i1 + w2·i2 + w3·i3
  Transformation function:  sigmoid(x) = 1 / (1 + e^(-x))
Artificial Neural Network Approach
• General Principle for Training an ANN

algorithm trainNetwork (Tr: training set): Network
begin
  R := initial network with a particular topology;
  initialise the weight vector with random values w(0);
  repeat
    for each training example t = <xi, yi> in Tr do
      compute the predicted class output ŷ(k);
      for each weight wj in the weight vector do
        update the weight wj: wj(k+1) := wj(k) + λ(yi - ŷ(k))xij
      end for;
    end for;
  until stopping criterion is met;
  return R
end;

λ: the learning factor (learning rate). The larger its value, the bigger the weight changes.
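A minimal runnable Python sketch of this weight-update loop for a single threshold unit (a step activation is used instead of the sigmoid and a bias term is added; the function name, the learning rate and the AND example are illustrative assumptions):

def train_perceptron(examples, n_features, rate=0.1, epochs=100):
    """examples: list of (x, y) with x a tuple of numeric inputs and y in {0, 1}.
    Weight update: w_j <- w_j + rate * (y - y_hat) * x_j, plus a bias term."""
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        for x, y in examples:
            y_hat = 1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else 0
            error = y - y_hat
            w = [wj + rate * error * xj for wj, xj in zip(w, x)]
            b += rate * error
    return w, b

# Learn the logical AND function
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train_perceptron(data, n_features=2)
print([(1 if w[0]*x1 + w[1]*x2 + b > 0 else 0) for (x1, x2), _ in data])  # [0, 0, 0, 1]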
Artificial Neural Network Approach
• Using ANN for Classification
– Multiple hidden layers:
• Do not know the actual class value and hence difficult to adjust the
weight
• Solution: Back-propagation (layer by layer from the output layer)
– Model Overfitting: use validation examples to further tune the
weights in the network
– Descriptive attributes should be normalized or converted to
binary
– Training examples are used repeatedly. The training cost is
therefore very high.
– Difficulty in explaining classification decisions
Artificial Neural Network Approach
• Network Topology
  – # of nodes in the input layer: determined by the number and data types of the attributes:
    • Continuous and binary attributes: 1 node for each attribute
    • Categorical attribute: convert to numeric or binary; an attribute with k labels needs at least log2 k nodes
  – # of nodes in the output layer: determined by the number of classes
    • 2-class problem: 1 node
    • k-class problem: at least log2 k nodes
  – # of hidden layers and nodes in the hidden layers: difficult to decide
  – In networks with hidden layers, weights are updated using backpropagation
Model Evaluation and Selection
• Evaluation metrics: How can we measure accuracy? Other metrics
to consider?
• Use validation test set of class-labeled tuples instead of training set
when assessing accuracy
• Methods for estimating a classifier’s accuracy:
– Holdout method, random subsampling
– Cross-validation
– Bootstrap
• Comparing classifiers:
– Confidence intervals
– Cost-benefit analysis and ROC Curves
79
Classifier Evaluation Metrics: Confusion Matrix

Confusion Matrix:

Actual class \ Predicted class   yes                    no
yes                              True Positives (TP)    False Negatives (FN)
no                               False Positives (FP)   True Negatives (TN)

Example of Confusion Matrix:

Actual class \ Predicted class   buy_computer = yes   buy_computer = no   Total
buy_computer = yes               6954                 46                  7000
buy_computer = no                412                  2588                3000
Total                            7366                 2634                10000

• TP and TN are the correctly predicted tuples
• May have extra rows/columns to provide totals
80
Classifier Evaluation Metrics: Accuracy, Error Rate, Sensitivity and Specificity

A \ P   Y    N
Y       TP   FN   P
N       FP   TN   N
        P’   N’   All

• Classifier accuracy, or recognition rate: percentage of test set tuples that are correctly classified
  Accuracy = (TP + TN) / All
• Error rate: misclassification rate
  Error rate = 1 - accuracy = (FP + FN) / All
• Class Imbalance Problem:
  – One class may be rare, e.g. fraud, or HIV-positive
  – Significant majority of the negative class and minority of the positive class
  – Sensitivity: true positive recognition rate; Sensitivity = TP / P
  – Specificity: true negative recognition rate; Specificity = TN / N
81
Classifier Evaluation Metrics:
Precision and Recall, and F-measures
• Precision: exactness – what % of tuples that the classifier
labeled as positive are actually positive
• Recall: completeness – what % of positive tuples did the
classifier label as positive?
• Perfect score is 1.0
• Inverse relationship between precision & recall
• F measure (F1 or F-score): harmonic mean of precision and
recall,
• Fß: weighted measure of precision and recall
– assigns ß times as much weight to recall as to precision
82
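A small Python sketch computing these metrics from confusion-matrix counts (the function name and returned dictionary are illustrative; F_beta uses the standard weighted harmonic mean of precision and recall):

def classification_metrics(tp, fn, fp, tn, beta=1.0):
    """Accuracy, error rate, sensitivity/recall, specificity, precision and F_beta."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn)            # recall, true positive rate
    specificity = tn / (tn + fp)            # true negative rate
    precision = tp / (tp + fp)
    f_beta = ((1 + beta ** 2) * precision * sensitivity /
              (beta ** 2 * precision + sensitivity))
    return {"accuracy": accuracy, "error_rate": 1 - accuracy,
            "sensitivity": sensitivity, "specificity": specificity,
            "precision": precision, "f": f_beta}

# buy_computer confusion matrix from the earlier slide
print(classification_metrics(tp=6954, fn=46, fp=412, tn=2588))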
Classifier Evaluation Metrics: Example

Actual class \ Predicted class  cancer = yes  cancer = no  Total  Recognition (%)
cancer = yes                    90            210          300    30.00 (sensitivity)
cancer = no                     140           9560         9700   98.56 (specificity)
Total                           230           9770         10000  96.40 (accuracy)

Precision = ??
Recall = ??
83
Evaluating Classifier Accuracy:
Holdout & Cross-Validation Methods
• Holdout method
– Given data is randomly partitioned into two independent sets
• Training set (e.g., 2/3) for model construction
• Test set (e.g., 1/3) for accuracy estimation
– Random sampling: a variation of holdout
• Repeat holdout k times, accuracy = avg. of the accuracies
obtained
• Cross-validation (k-fold, where k = 10 is most popular)
– Randomly partition the data into k mutually exclusive subsets,
each approximately equal size
– At i-th iteration, use Di as test set and others as training set
– Leave-one-out: k folds where k = # of tuples, for small sized
data
84
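A short Python sketch of k-fold cross-validation, assuming pluggable train_fn and accuracy_fn callables (all names are illustrative):

import random

def k_fold_cross_validation(data, k, train_fn, accuracy_fn, seed=0):
    """Randomly partition data into k folds; at iteration i use fold D_i as the
    test set and the rest as the training set; report the average accuracy."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        test = folds[i]
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        model = train_fn(train)
        scores.append(accuracy_fn(model, test))
    return sum(scores) / k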
Evaluating Classifier Accuracy: Bootstrap
• Bootstrap
– Works well with small data sets
– Samples the given training tuples uniformly with replacement
• i.e., each time a tuple is selected, it is equally likely to be selected
again and re-added to the training set
– Examples used for training set can be used for test set too
85
Ensemble Methods: Increasing the Accuracy
• Ensemble methods
– Use a combination of models to increase accuracy
– Combine a series of k learned models, M1, M2, …, Mk, with
the aim of creating an improved model M*
• Popular ensemble methods
– Bagging, boosting, Ensemble
86
Classification of Class-Imbalanced Data Sets
• Class-imbalance problem: Rare positive example but numerous
negative ones, e.g., medical diagnosis, fraud, oil-spill, fault, etc.
• Traditional methods assume a balanced distribution of classes
and equal error costs: not suitable for class-imbalanced data
• Typical methods for imbalance data in 2-class classification:
– Oversampling: re-sampling of data from positive class
– Under-sampling: randomly eliminate tuples from negative
class
87
Model Selection: ROC Curves
• ROC (Receiver Operating Characteristics) curves: for visual comparison of classification models
• Originated from signal detection theory
• Shows the trade-off between the true positive rate and the false positive rate
• The vertical axis represents the true positive rate; the horizontal axis represents the false positive rate
• The area under the ROC curve is a measure of the accuracy of the model; a model with perfect accuracy has an area of 1.0
• The plot also shows a diagonal line: for every TP, the model is equally likely to encounter an FP
• The closer the curve is to the diagonal line (i.e., the closer the area is to 0.5), the less accurate the model
88
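A small Python sketch that produces the (FPR, TPR) points of such a curve by sweeping a decision threshold over classifier scores (the function name and data layout are illustrative; it assumes both classes are present):

def roc_points(scores, labels):
    """scores: classifier scores for the positive class; labels: true classes
    (1 = positive, 0 = negative). Returns (FPR, TPR) points obtained by
    sweeping the decision threshold over the observed scores."""
    p = sum(labels)
    n = len(labels) - p
    points = []
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / n, tp / p))
    return [(0.0, 0.0)] + points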
Issues Affecting Model Selection
• Accuracy
– classifier accuracy: predicting class label
• Speed
– time to construct the model (training time)
– time to use the model (classification/prediction time)
• Robustness: handling noise and missing values
• Scalability: efficiency in disk-resident databases
• Interpretability
– understanding and insight provided by the model
• Other measures, e.g., goodness of rules, such as decision tree
size or compactness of classification rules
89
Comparison of Techniques
• Comparison of Approaches

The approaches compared are Decision Tree, Nearest Neighbours, Rule-based, Artificial Neural Network and Bayesian Classifier, along four criteria:

Model interpretability: ease of understanding classification decisions
Model maintainability: ease of modifying the model in the presence of new training examples
Training cost: computational cost for building a model
Classification cost: computational cost for classifying an unseen record
Comparison of Techniques
• Comparison of Approaches

Classification Approach    Interpretability  Maintainability  Training Cost  Classification Cost
Decision Tree              High              Low              High           Low
Nearest Neighbours         Medium            High             Low            High
Rule-based                 High              Low              High           Medium
Artificial Neural Network  Low               Low              High           Low
Bayesian Classifier        Medium            Medium           Medium         Low
Decision Tree Induction in Weka
• Overview
  – ID3 (only works for categorical attributes)
  – J48 (Java implementation of C4.5)
  – RandomTree (with K attributes)
  – RandomForest (a forest of random trees)
  – REPTree (regression tree with reduced error pruning)
  – BFTree (best-first tree, using Gain or Gini)
  – FT (functional tree, logistic regression as split nodes)
  – SimpleCart (CART with cost-complexity pruning)
Decision Tree Induction in Weka
• Preparation
  – Pre-processing attributes if necessary
  – Specifying the class attribute
  – Selecting attributes
Decision Tree Induction in Weka
• Constructing Classification Models (ID3)
  1. Choosing a method and setting parameters
  2. Setting a test option
  3. Starting the process
  4. Viewing the model and evaluation results
  5. Selecting the option to view the tree
Decision Tree Induction in Weka
• J48 (unpruned tree)
Decision Tree Induction in Weka
• RandomTree
Decision Tree Induction in Weka
• Classifying Unseen Records
1. Preparing unseen records in an ARFF file
Class values are left
as “unknown” (“?”)
Decision Tree Induction in Weka
• Classifying Unseen Records
  2. Classifying unseen records in the file
     1. Selecting this option and clicking the Set… button
     2. Pressing the button and loading the file
     3. Pressing the button to start the classification
Decision Tree Induction in Weka
• Classifying Unseen Records
  3. Saving classification results into a file
     1. Selecting the option to pop up the visualisation
     2. Setting both X and Y to instance_number
     3. Saving the results into a file
• Classifying Unseen Records
4. Classification Results in an ARFF file
Class labels
assinged
Comparison of Techniques
• Comparison of Performance in Weka
  – A system module known as the Experimenter
  – Designed for comparing the performance of classification techniques over a single data set or a collection of data sets
  – Data miners set up an experiment with:
    • Selected data set(s)
    • Selected algorithm(s) and the number of repeated runs
    • Selected test option (e.g. cross-validation)
    • Selected p value (indicating confidence)
  – Outputs the accuracy rates of the algorithms
  – Pairwise comparison of algorithms, with significantly better and worse accuracies marked out
Comparison of Techniques
• Setting up an Experiment in Weka
  – New or existing experiment
  – Naming the file to store experiment results
  – Choosing a test option
  – Adding data sets (the list of data sets selected)
  – Adding algorithms (the list of selected algorithms)
  – No. of times each algorithm is repeated
Comparison of Techniques
• Experiment Results in Weka
  – Loading the experiment data
  – Choosing the analysis method
  – Setting the value of significance
  – Performing the analysis
  – Results of pairwise comparisons
Classification in Practice
• Process of a Classification Project
  1. Locate data
  2. Prepare data
  3. Choose a classification method
  4. Construct the model and tune the model
  5. Measure its accuracy and go back to step 3 or 4 until the accuracy is satisfactory
  6. Further evaluate the model from other aspects such as complexity, comprehensibility, etc.
  7. Deliver the model and test it in a real environment. Further modify the model if necessary
Classification in Practice
• Data Preparation
  – Identify descriptive features (input attributes)
  – Identify or define the class
  – Determine the sizes of the training, validation and test sets
  – Select examples
    • Spread and coverage of classes
    • Spread and coverage of attribute values
    • Null values
    • Noisy data
  – Prepare the input values (categorical to continuous, continuous to categorical)
References (1)
C. Apte and S. Weiss. Data mining with decision trees and decision rules. Future
Generation Computer Systems, 13, 1997
C. M. Bishop, Neural Networks for Pattern Recognition. Oxford University Press,
1995
L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees.
Wadsworth International Group, 1984
C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Data
Mining and Knowledge Discovery, 2(2): 121-168, 1998
P. K. Chan and S. J. Stolfo. Learning arbiter and combiner trees from partitioned data
for scaling machine learning. KDD'95
H. Cheng, X. Yan, J. Han, and C.-W. Hsu, Discriminative Frequent Pattern Analysis for
Effective Classification, ICDE'07
H. Cheng, X. Yan, J. Han, and P. S. Yu, Direct Discriminative Pattern Mining for
Effective Classification, ICDE'08
W. Cohen. Fast effective rule induction. ICML'95
G. Cong, K.-L. Tan, A. K. H. Tung, and X. Xu. Mining top-k covering rule groups for
gene expression data. SIGMOD'05
106
References (3)
•
T.-S. Lim, W.-Y. Loh, and Y.-S. Shih. A comparison of prediction accuracy, complexity,
and training time of thirty-three old and new classification algorithms. Machine
Learning, 2000.
•
J. Magidson. The Chaid approach to segmentation modeling: Chi-squared automatic
interaction detection. In R. P. Bagozzi, editor, Advanced Methods of Marketing
Research, Blackwell Business, 1994.
•
M. Mehta, R. Agrawal, and J. Rissanen. SLIQ : A fast scalable classifier for data mining.
EDBT'96.
•
T. M. Mitchell. Machine Learning. McGraw Hill, 1997.
•
S. K. Murthy, Automatic Construction of Decision Trees from Data: A Multi-Disciplinary Survey, Data Mining and Knowledge Discovery 2(4): 345-389, 1998
•
J. R. Quinlan. Induction of decision trees. Machine Learning, 1:81-106, 1986.
•
J. R. Quinlan and R. M. Cameron-Jones. FOIL: A midterm report. ECML’93.
•
J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
•
J. R. Quinlan. Bagging, boosting, and c4.5. AAAI'96.
107
References (4)
R. Rastogi and K. Shim. Public: A decision tree classifier that integrates building and
pruning. VLDB’98.
J. Shafer, R. Agrawal, and M. Mehta. SPRINT : A scalable parallel classifier for data
mining. VLDB’96.
J. W. Shavlik and T. G. Dietterich. Readings in Machine Learning. Morgan Kaufmann,
1990.
P. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Addison Wesley,
2005.
S. M. Weiss and C. A. Kulikowski. Computer Systems that Learn: Classification and
Prediction Methods from Statistics, Neural Nets, Machine Learning, and Expert
Systems. Morgan Kaufman, 1991.
S. M. Weiss and N. Indurkhya. Predictive Data Mining. Morgan Kaufmann, 1997.
I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and
Techniques, 2ed. Morgan Kaufmann, 2005.
X. Yin and J. Han. CPAR: Classification based on predictive association rules. SDM'03
H. Yu, J. Yang, and J. Han. Classifying large data sets using SVM with hierarchical
clusters. KDD'03.
108