Data Mining: Concepts and Techniques
— Chapter 6 —
Adapted to the Computational Biology course, IST 2012/13, by Sara C. Madeira
Jiawei Han
Department of Computer Science
University of Illinois at Urbana-Champaign
www.cs.uiuc.edu/~hanj
©2006 Jiawei Han and Micheline Kamber, All rights reserved
Outline
•  What is classification?
•  Issues regarding classification
•  Classification by decision tree induction
•  Bayesian classification
•  Lazy learners (or learning from your neighbors)
•  Model Evaluation
Classification
•  Classification
   •  predicts categorical class labels (discrete or nominal)
   •  classifies data: constructs a model from the training set and the values (class labels) of a classifying attribute, and uses it to classify new data
•  Prediction
   •  models continuous-valued functions, i.e., predicts unknown or missing values
•  Typical applications
   •  Credit approval
   •  Target marketing
   •  Medical diagnosis
   •  Fraud detection
Classification—A Two-Step Process
•  Model construction: describing a set of predetermined classes
   •  Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
   •  The set of tuples used for model construction is the training set
   •  The model is represented as classification rules, decision trees, or mathematical formulae
•  Model usage: classifying future or unknown objects
   •  Estimate the accuracy of the model
      •  The known label of each test sample is compared with the classified result from the model
      •  The accuracy rate is the percentage of test set samples that are correctly classified by the model
      •  The test set is independent of the training set, otherwise over-fitting will occur
   •  If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known
Process (1): Model Construction
[Figure: the training data is fed to a classification algorithm, which produces a classifier (model). Example of a learned rule:]
IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’
Process (2): Using the Model in Prediction
[Figure: the classifier is applied to testing data to estimate accuracy, and then to unseen data. Example query:]
(Jeff, Professor, 4) → Tenured?
Supervised vs. Unsupervised Learning
•  Supervised learning (classification)
   •  Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
   •  New data is classified based on the training set
•  Unsupervised learning (clustering)
   •  The class labels of the training data are unknown
   •  Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
Outline
•  What is classification?
•  Issues regarding classification
•  Classification by decision tree induction
•  Bayesian classification
•  Lazy learners (or learning from your neighbors)
•  Model Evaluation
Issues: Data Preparation
•  Data cleaning
   •  Preprocess data in order to reduce noise and handle missing values
•  Relevance analysis (feature selection)
   •  Remove irrelevant or redundant attributes
•  Data transformation
   •  Generalize and/or normalize data
Issues: Evaluating Classification Methods
•  Accuracy
   •  classifier accuracy: predicting the class label
   •  predictor accuracy: estimating the value of the predicted attribute
•  Speed
   •  time to construct the model (training time)
   •  time to use the model (classification/prediction time)
•  Robustness: handling noise and missing values
•  Scalability: efficiency when the amount of data grows
•  Interpretability
   •  understanding and insight provided by the model
•  Other measures, e.g., goodness of rules, such as decision tree size or compactness of classification rules
Outline
•  What is classification?
•  Issues regarding classification
•  Classification by decision tree induction
•  Bayesian classification
•  Lazy learners (or learning from your neighbors)
•  Model Evaluation
Decision Tree Induction: Training Dataset
This follows an example of Quinlan’s ID3 (Playing Tennis).
[The training dataset table (age, income, student, credit_rating, buys_computer) is shown on the slide.]
Output: A Decision Tree for “buys_computer”
age?
  <=30:   student?
            no:  no
            yes: yes
  31..40: yes
  >40:    credit rating?
            excellent: no
            fair:      yes
Algorithm for Decision Tree Induction
•  Basic algorithm (a greedy algorithm)
   •  The tree is constructed in a top-down recursive divide-and-conquer manner
   •  At the start, all the training examples are at the root
   •  Attributes are categorical (if continuous-valued, they are discretized in advance)
   •  Examples are partitioned recursively based on selected attributes
   •  Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
•  Conditions for stopping partitioning
   •  All samples for a given node belong to the same class
   •  There are no remaining attributes for further partitioning – majority voting is employed for classifying the leaf
   •  There are no samples left
Attribute Selection Measure: Information Gain (ID3/C4.5)
•  Select the attribute with the highest information gain
•  Let pi be the probability that an arbitrary tuple in D belongs to class Ci, estimated by |Ci,D|/|D|
•  Expected information (entropy) needed to classify a tuple in D:
   Info(D) = Entropy(D) = − Σ_{i=1..m} pi log2(pi)
•  Information needed (after using A to split D into v partitions) to classify D:
   InfoA(D) = Σ_{j=1..v} (|Dj|/|D|) × Info(Dj)
•  Information gained by branching on attribute A:
   Gain(A) = Info(D) − InfoA(D)
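These formulas translate directly into a few lines of code. Below is a minimal sketch (not from the original slides; the dataset representation and function names are illustrative) that computes Info(D) and Gain(A) for data given as (attribute-dict, class-label) pairs.

```python
from collections import Counter
from math import log2

def info(labels):
    """Info(D) = -sum_i p_i log2(p_i) over the class labels in D."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr):
    """Gain(A) = Info(D) - Info_A(D); rows is a list of (attribute_dict, label) pairs."""
    labels = [label for _, label in rows]
    partitions = {}
    for attrs, label in rows:
        partitions.setdefault(attrs[attr], []).append(label)
    info_a = sum(len(part) / len(rows) * info(part) for part in partitions.values())
    return info(labels) - info_a
```

On the buys_computer data used in the next slide, info_gain(rows, 'age') should reproduce the 0.246 value derived there.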
Attribute Selection: Information Gain
•  Class P: buys_computer = “yes”
•  Class N: buys_computer = “no”
   Info(D) = I(9,5) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.940   (= Entropy([9+,5-]))
   Infoage(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694
   Here (5/14) I(2,3) means “age <=30” has 5 out of 14 samples, with 2 yes’es and 3 no’s. Hence
   Gain(age) = Info(D) − Infoage(D) = 0.246
•  Similarly, Gain(income) = 0.029, Gain(student) = 0.151, and Gain(credit_rating) = 0.048, so age is selected as the splitting attribute.
Computing Information Gain for Continuous-Valued Attributes
•  Let attribute A be a continuous-valued attribute
•  Must determine the best split point for A
   •  Sort the values of A in increasing order
   •  Typically, the midpoint between each pair of adjacent values is considered as a possible split point
      •  (ai + ai+1)/2 is the midpoint between the values of ai and ai+1
   •  The point with the minimum expected information requirement for A is selected as the split point for A
•  Split:
   •  D1 is the set of tuples in D satisfying A ≤ split-point, and D2 is the set of tuples in D satisfying A > split-point
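A rough sketch of this midpoint search (illustrative only, not part of the slides): it evaluates every midpoint of adjacent distinct sorted values and keeps the one with the minimum expected information requirement.

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_split_point(values, labels):
    """Return (split_point, expected_info) minimizing Info_A(D) for a continuous attribute."""
    pairs = sorted(zip(values, labels))
    best = None
    for (a, _), (b, _) in zip(pairs, pairs[1:]):
        if a == b:
            continue  # identical adjacent values yield no new midpoint
        mid = (a + b) / 2
        left = [lab for v, lab in pairs if v <= mid]
        right = [lab for v, lab in pairs if v > mid]
        expected_info = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if best is None or expected_info < best[1]:
            best = (mid, expected_info)
    return best
```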
Gain Ratio for Attribute Selection (C4.5)
•  The information gain measure is biased towards attributes with a large number of values
•  C4.5 (a successor of ID3) uses gain ratio to overcome the problem (normalization of information gain):
   SplitInfoA(D) = − Σ_{j=1..v} (|Dj|/|D|) × log2(|Dj|/|D|)
   GainRatio(A) = Gain(A) / SplitInfoA(D)
•  Ex. (splitting on income, with 4 “low”, 6 “medium”, and 4 “high” tuples):
   SplitInfoincome(D) = −(4/14) log2(4/14) − (6/14) log2(6/14) − (4/14) log2(4/14) = 1.557
   gain_ratio(income) = 0.029/1.557 = 0.019
•  The attribute with the maximum gain ratio is selected as the splitting attribute
 
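A matching sketch of SplitInfo and gain ratio, using the same (attribute-dict, label) row representation as the earlier snippet (names are illustrative):

```python
from collections import Counter
from math import log2

def split_info(rows, attr):
    """SplitInfo_A(D) = -sum_j (|Dj|/|D|) log2(|Dj|/|D|) over the partitions induced by attr."""
    n = len(rows)
    counts = Counter(attrs[attr] for attrs, _ in rows)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def gain_ratio(rows, attr, gain):
    """GainRatio(A) = Gain(A) / SplitInfo_A(D); gain comes from an info_gain routine."""
    return gain / split_info(rows, attr)
```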
Overfitting and Tree Pruning
•  Overfitting: an induced tree may overfit the training data
   •  Too many branches, some of which may reflect anomalies due to noise or outliers
   •  Poor accuracy for unseen samples
•  Two approaches to avoid overfitting
   •  Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold
      •  Difficult to choose an appropriate threshold
   •  Postpruning: remove branches from a “fully grown” tree, obtaining a sequence of progressively pruned trees
      •  Use a set of data different from the training data to decide which is the “best pruned tree”
Enhancements to Basic Decision Tree Induction
•  Allow for continuous-valued attributes
   •  Dynamically define new discrete-valued attributes that partition the continuous attribute values into a discrete set of intervals
•  Handle missing attribute values
   •  Assign the most common value of the attribute
   •  Assign a probability to each of the possible values
•  Attribute construction
   •  Create new attributes based on existing ones that are sparsely represented
   •  This reduces fragmentation, repetition, and replication
Example: Decision Tree ID3 for “Play Tennis”
Training Examples
Day  Outlook   Temp.  Humidity  Wind    Play Tennis
D1   Sunny     Hot    High      Weak    No
D2   Sunny     Hot    High      Strong  No
D3   Overcast  Hot    High      Weak    Yes
D4   Rain      Mild   High      Weak    Yes
D5   Rain      Cool   Normal    Weak    Yes
D6   Rain      Cool   Normal    Strong  No
D7   Overcast  Cool   Normal    Weak    Yes
D8   Sunny     Mild   High      Weak    No
D9   Sunny     Cool   Normal    Weak    Yes
D10  Rain      Mild   Normal    Weak    Yes
D11  Sunny     Mild   Normal    Strong  Yes
D12  Overcast  Mild   High      Strong  Yes
D13  Overcast  Hot    Normal    Weak    Yes
D14  Rain      Mild   High      Strong  No
Decision Tree for PlayTennis
Outlook
  Sunny:    Humidity
              High:   No
              Normal: Yes
  Overcast: Yes
  Rain:     Wind
              Strong: No
              Weak:   Yes
Decision Tree for “Play Tennis”
[The same tree as above, annotated:]
•  Each internal node tests an attribute
•  Each branch corresponds to an attribute value
•  Each leaf node assigns a classification
Decision Tree for “Play Tennis”
Classifying a new instance: (Outlook=Sunny, Temperature=Hot, Humidity=High, Wind=Weak, PlayTennis=?)
Following the tree, Outlook=Sunny → Humidity=High → PlayTennis = No
[The full PlayTennis tree from above is shown again on the slide.]
Decision Tree for Conjunction
Outlook=Sunny ∧ Wind=Weak
Outlook
  Sunny:    Wind
              Strong: No
              Weak:   Yes
  Overcast: No
  Rain:     No
Decision Tree for Disjunction
Outlook=Sunny ∨ Wind=Weak
Outlook
  Sunny:    Yes
  Overcast: Wind
              Strong: No
              Weak:   Yes
  Rain:     Wind
              Strong: No
              Weak:   Yes
Decision Tree for XOR
Outlook=Sunny XOR Wind=Weak
Outlook
  Sunny:    Wind
              Strong: Yes
              Weak:   No
  Overcast: Wind
              Strong: No
              Weak:   Yes
  Rain:     Wind
              Strong: No
              Weak:   Yes
Decision Tree
•  Decision trees represent disjunctions of conjunctions:
Outlook
  Sunny:    Humidity
              High:   No
              Normal: Yes
  Overcast: Yes
  Rain:     Wind
              Strong: No
              Weak:   Yes
(Outlook=Sunny ∧ Humidity=Normal) ∨ (Outlook=Overcast) ∨ (Outlook=Rain ∧ Wind=Weak)
Top-Down Induction of Decision Trees: ID3
1.  A ← the “best” decision attribute for the next node
2.  Assign A as the decision attribute for the node
3.  For each value of A, create a new descendant
4.  Sort the training examples to the leaf nodes according to the attribute value of the branch
5.  If all training examples are perfectly classified (same value of the target attribute) stop, else iterate over the new leaf nodes (see the sketch below)
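The five steps above can be written as a short recursive procedure. The following is a minimal sketch (illustrative names and data layout, not code from the slides): rows are (attribute-dict, label) pairs and the tree is returned as nested dictionaries.

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr):
    labels = [lab for _, lab in rows]
    parts = {}
    for attrs, lab in rows:
        parts.setdefault(attrs[attr], []).append(lab)
    return entropy(labels) - sum(len(p) / len(rows) * entropy(p) for p in parts.values())

def id3(rows, attributes):
    """rows: list of (attribute_dict, label). Returns a nested-dict tree or a class label."""
    labels = [lab for _, lab in rows]
    if len(set(labels)) == 1:                  # all examples perfectly classified: stop
        return labels[0]
    if not attributes:                         # no attributes left: majority vote at the leaf
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: info_gain(rows, a))      # step 1: pick the "best" attribute
    tree = {best: {}}
    for value in {attrs[best] for attrs, _ in rows}:              # step 3: one branch per value
        subset = [(a, l) for a, l in rows if a[best] == value]    # step 4: sort examples to the branch
        tree[best][value] = id3(subset, [a for a in attributes if a != best])  # step 5: recurse
    return tree
```

Run on the PlayTennis table above with attributes Outlook, Temp., Humidity and Wind, this should produce a tree rooted at Outlook, matching the final ID3 tree derived in the following slides.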
Which Attribute is “best”?
A1=?  splits [29+,35-] into True: [21+,5-] and False: [8+,30-]
A2=?  splits [29+,35-] into True: [18+,33-] and False: [11+,2-]
Information Gain
•  Gain(D,A): expected reduction in entropy due to sorting D on attribute A
   Gain(D,A) = Entropy(D) − Σ_{v∈values(A)} (|Dv|/|D|) Entropy(Dv)
   Entropy([29+,35-]) = −29/64 log2(29/64) − 35/64 log2(35/64) = 0.99
A1=?  splits [29+,35-] into True: [21+,5-] and False: [8+,30-]
A2=?  splits [29+,35-] into True: [18+,33-] and False: [11+,2-]
Information Gain
Entropy([21+,5-]) = 0.71
Entropy([8+,30-]) = 0.74
Gain(D,A1) = Entropy(D) − (26/64)·Entropy([21+,5-]) − (38/64)·Entropy([8+,30-]) = 0.27

Entropy([18+,33-]) = 0.94
Entropy([11+,2-]) = 0.62
Gain(D,A2) = Entropy(D) − (51/64)·Entropy([18+,33-]) − (13/64)·Entropy([11+,2-]) = 0.12

0.27 > 0.12, so A1 is the better attribute.
Training Examples
(The PlayTennis training table shown earlier is repeated on this slide.)
Selecting the First Attribute
D = [9+,5-], E = 0.940

Humidity
  High:   [3+,4-], E = 0.985
  Normal: [6+,1-], E = 0.592
Gain(D,Humidity) = 0.940 − (7/14)·0.985 − (7/14)·0.592 = 0.151

Wind
  Weak:   [7+,2-], E = 0.764
  Strong: [2+,3-], E = 0.971
Gain(D,Wind) = 0.940 − (9/14)·0.764 − (5/14)·0.971 = 0.102
Selecting the First Attribute
D = [9+,5-], E = 0.940

Outlook
  Sunny:    [2+,3-], E = 0.971
  Overcast: [4+,0-], E = 0.0
  Rain:     [3+,2-], E = 0.971
Gain(D,Outlook) = 0.940 − (5/14)·0.971 − (4/14)·0.0 − (5/14)·0.971 = 0.247
Outlook gives the largest gain, so it becomes the root attribute.
Selecting the Next Attribute
[D1,D2,…,D14], [9+,5-]
Outlook
  Sunny:    Dsunny = [D1,D2,D8,D9,D11], [2+,3-] → ?
  Overcast: [D3,D7,D12,D13], [4+,0-] → Yes
  Rain:     [D4,D5,D6,D10,D14], [3+,2-] → ?

Gain(Dsunny, Humidity) = 0.970 − (3/5)·0.0 − (2/5)·0.0 = 0.970
Gain(Dsunny, Temp.)    = 0.970 − (2/5)·0.0 − (2/5)·1.0 − (1/5)·0.0 = 0.570
Gain(Dsunny, Wind)     = 0.970 − (2/5)·1.0 − (3/5)·0.918 = 0.019
Humidity gives the highest gain on Dsunny, so it is chosen for that branch.
Selecting the Next Attribute
Outlook
  Sunny:    Humidity
              High:   No   [D1,D2]
              Normal: Yes  [D8,D9,D11]
  Overcast: Yes  [D3,D7,D12,D13]
  Rain:     ?    [D4,D5,D6,D10,D14], [3+,2-]

Gain(Drain, Humidity) = …
Gain(Drain, Temp.) = …
Gain(Drain, Wind) = …
Final ID3 Decision Tree
Outlook
  Sunny:    Humidity
              High:   No   [D1,D2]
              Normal: Yes  [D8,D9,D11]
  Overcast: Yes  [D3,D7,D12,D13]
  Rain:     Wind
              Strong: No   [D6,D14]
              Weak:   Yes  [D4,D5,D10]
Outline
•  What is classification?
•  Issues regarding classification
•  Classification by decision tree induction
•  Bayesian classification
•  Lazy learners (or learning from your neighbors)
•  Model Evaluation
Bayesian Classification: Why?
•  A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities
•  Foundation: based on Bayes’ theorem
•  Performance: a simple Bayesian classifier, the naïve Bayesian classifier, has performance comparable with decision tree and selected neural network classifiers
•  Incremental: each training example can incrementally increase/decrease the probability that a hypothesis is correct; prior knowledge can be combined with observed data
•  Standard: even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured
Bayesian Theorem: Basics
•  Let X be a data sample (“evidence”): the class label is unknown
•  Let H be the hypothesis that X belongs to class C
•  Classification is to determine P(H|X), the probability that the hypothesis holds given the observed data sample X
•  P(H) (prior probability): the initial probability
   •  E.g., X will buy a computer, regardless of age, income, …
•  P(X): the probability that the sample data is observed
•  P(X|H) (likelihood of X given H): the probability of observing the sample X, given that the hypothesis holds
   •  E.g., given that X will buy a computer, the probability that X is 31..40 with medium income
Bayesian Theorem
•  Given training data X, the posterior probability of a hypothesis H, P(H|X), follows Bayes’ theorem:
   P(H|X) = P(X|H) P(H) / P(X)
•  Informally, this can be written as: posterior = likelihood × prior / evidence
•  X is predicted to belong to class Ci iff the probability P(Ci|X) is the highest among all the P(Ck|X) for the k classes
•  Practical difficulty: requires initial knowledge of many probabilities, at significant computational cost
Towards Naïve Bayesian Classifier
•  Let D be a training set of tuples and their associated class labels, where each tuple is represented by an n-D attribute vector X = (x1, x2, …, xn)
•  Suppose there are m classes C1, C2, …, Cm
•  Classification derives the maximum posteriori, i.e., the maximal P(Ci|X)
•  This can be derived from Bayes’ theorem:
   P(Ci|X) = P(X|Ci) P(Ci) / P(X)
•  Since P(X) is constant for all classes, only P(X|Ci) P(Ci) needs to be maximized
Derivation of Naïve Bayes Classifier
•  A simplified assumption: attributes are conditionally independent given the class (i.e., no dependence relation between attributes):
   P(X|Ci) = Π_{k=1..n} P(xk|Ci) = P(x1|Ci) × P(x2|Ci) × … × P(xn|Ci)
•  This greatly reduces the computation cost: only the class distribution needs to be counted
•  If Ak is categorical, P(xk|Ci) is the # of tuples in Ci having value xk for Ak, divided by |Ci,D| (the # of tuples of Ci in D)
•  If Ak is continuous-valued, P(xk|Ci) is usually computed from a Gaussian distribution with mean μ and standard deviation σ:
   g(x, μ, σ) = (1 / (√(2π) σ)) · exp(−(x − μ)² / (2σ²)),   and   P(xk|Ci) = g(xk, μCi, σCi)
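Under the conditional-independence assumption, training is just counting. Below is a minimal categorical naïve Bayes sketch (illustrative names and data layout, no smoothing):

```python
from collections import Counter, defaultdict

def train_nb(rows):
    """rows: list of (attribute_dict, class_label). Returns P(Ci) and the counts behind P(xk|Ci)."""
    class_counts = Counter(label for _, label in rows)
    cond_counts = defaultdict(Counter)            # (class, attribute) -> Counter over values
    for attrs, label in rows:
        for attr, value in attrs.items():
            cond_counts[(label, attr)][value] += 1
    priors = {c: n / len(rows) for c, n in class_counts.items()}
    return priors, cond_counts, class_counts

def predict_nb(x, priors, cond_counts, class_counts):
    """Return the class Ci maximizing P(Ci) * prod_k P(xk|Ci)."""
    scores = {}
    for c, prior in priors.items():
        p = prior
        for attr, value in x.items():
            p *= cond_counts[(c, attr)][value] / class_counts[c]
        scores[c] = p
    return max(scores, key=scores.get)
```

Applied to the buys_computer data of the next slides, it should reproduce the 0.028 vs. 0.007 comparison worked out there. Note that an attribute value never seen with a class makes the whole product zero, which is exactly the 0-probability problem addressed a few slides below.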
Naïve Bayes Classifier: Training Dataset
Classes:
  C1: buys_computer = ‘yes’
  C2: buys_computer = ‘no’
Data sample to classify:
  X = (age <= 30, income = medium, student = yes, credit_rating = fair)
[The buys_computer training table is shown on the slide.]
Naïve Bayes Classifier: An Example
•  P(Ci):
   P(buys_computer = “yes”) = 9/14 = 0.643
   P(buys_computer = “no”) = 5/14 = 0.357
•  Compute P(X|Ci) for each class:
   P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222
   P(age = “<=30” | buys_computer = “no”) = 3/5 = 0.6
   P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
   P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
   P(student = “yes” | buys_computer = “yes”) = 6/9 = 0.667
   P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
   P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
   P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4
•  X = (age <= 30, income = medium, student = yes, credit_rating = fair)
   P(X | buys_computer = “yes”) = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
   P(X | buys_computer = “no”) = 0.6 × 0.4 × 0.2 × 0.4 = 0.019
   P(X | buys_computer = “yes”) × P(buys_computer = “yes”) = 0.028
   P(X | buys_computer = “no”) × P(buys_computer = “no”) = 0.007
•  Therefore, X belongs to class “buys_computer = yes”
Avoiding the 0-Probability Problem
•  Naïve Bayesian prediction requires each conditional probability to be non-zero; otherwise, the predicted probability will be zero
•  Ex. Suppose a dataset with 1000 tuples: income = low (0 tuples), income = medium (990), and income = high (10)
•  Use the Laplacian correction (or Laplacian estimator)
   •  Add 1 to each case:
      Prob(income = low) = 1/1003
      Prob(income = medium) = 991/1003
      Prob(income = high) = 11/1003
   •  The “corrected” probability estimates are close to their “uncorrected” counterparts
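In code the correction just adds 1 to each count and the number of distinct values to the denominator; a tiny sketch with the income figures from this slide:

```python
def laplace_prob(count, total, n_values):
    """Laplacian-corrected estimate: (count + 1) / (total + number of distinct values)."""
    return (count + 1) / (total + n_values)

# Income example from the slide: 0 low, 990 medium, 10 high out of 1000 tuples
print(laplace_prob(0, 1000, 3))    # 1/1003
print(laplace_prob(990, 1000, 3))  # 991/1003
print(laplace_prob(10, 1000, 3))   # 11/1003
```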
Naïve Bayes Classifier: Comments
•  Advantages
   •  Easy to implement
   •  Good results obtained in most of the cases
•  Disadvantages
   •  Assumption of class conditional independence, therefore loss of accuracy
   •  In practice, dependencies exist among variables
      •  E.g., hospital patients: profile (age, family history, etc.), symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.)
      •  Dependencies among these cannot be modeled by a naïve Bayesian classifier
•  How to deal with these dependencies?
   •  Bayesian Belief Networks
Outline
•  What is classification?
•  Issues regarding classification
•  Classification by decision tree induction
•  Bayesian classification
•  Lazy learners (or learning from your neighbors)
•  Model Evaluation
Lazy vs. Eager Learning
•  Lazy vs. eager learning
   •  Lazy learning (e.g., instance-based learning): simply stores the training data (or does only minor processing) and waits until it is given a test tuple
   •  Eager learning (the methods discussed above): given a training set, constructs a classification model before receiving new (e.g., test) data to classify
•  Lazy: less time in training but more time in predicting
•  Accuracy
   •  Lazy methods effectively use a richer hypothesis space, since they use many local linear functions to form an implicit global approximation to the target function
   •  Eager methods must commit to a single hypothesis that covers the entire instance space
Lazy Learner: Instance-Based Methods
•  Instance-based learning:
   •  Store training examples and delay the processing (“lazy evaluation”) until a new instance must be classified
•  Typical approaches
   •  k-nearest neighbor approach
      •  Instances represented as points in a Euclidean space
   •  Locally weighted regression
      •  Constructs a local approximation
   •  Case-based reasoning
      •  Uses symbolic representations and knowledge-based inference
The k-Nearest Neighbor Algorithm
•  All instances correspond to points in the n-D space
•  The nearest neighbors are defined in terms of Euclidean distance, dist(X1, X2)
•  The target function may be discrete- or real-valued
•  For discrete-valued targets, k-NN returns the most common value among the k training examples nearest to xq
•  Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples
[Figure: a query point xq surrounded by positive (+) and negative (−) training examples, with the Voronoi cells induced by 1-NN.]
Discussion on the k-NN Algorithm
•  k-NN for real-valued prediction for a given unknown tuple
   •  Returns the mean value of the k nearest neighbors
•  Distance-weighted nearest neighbor algorithm
   •  Weight the contribution of each of the k neighbors according to its distance to the query xq
      •  Give greater weight to closer neighbors
•  Robust to noisy data by averaging over the k nearest neighbors
•  Curse of dimensionality: the distance between neighbors can be dominated by irrelevant attributes
   •  To overcome it, stretch axes or eliminate the least relevant attributes
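A compact sketch of the plain and distance-weighted k-NN rules just described (illustrative; Euclidean distance, inverse-squared-distance weights for the weighted variant):

```python
from collections import defaultdict
from math import dist  # Euclidean distance (Python 3.8+)

def knn_classify(query, examples, k=3, weighted=False):
    """examples: list of (point_tuple, label). Majority or distance-weighted vote of the k nearest."""
    neighbors = sorted(examples, key=lambda ex: dist(query, ex[0]))[:k]
    votes = defaultdict(float)
    for point, label in neighbors:
        d = dist(query, point)
        votes[label] += 1.0 / (d * d + 1e-9) if weighted else 1.0
    return max(votes, key=votes.get)
```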
Outline
•  What is classification?
•  Issues regarding classification
•  Classification by decision tree induction
•  Bayesian classification
•  Lazy learners (or learning from your neighbors)
•  Model Evaluation
Model Evaluation
•  Evaluation metrics: how can we measure accuracy? Other metrics to consider?
•  Use a test set of class-labeled tuples instead of the training set when assessing accuracy
•  Methods for estimating a classifier’s accuracy:
   •  Holdout method, random subsampling
   •  Cross-validation
   •  Bootstrap
Classifier Evaluation Metrics: Confusion Matrix
Confusion matrix:

  Actual class \ Predicted class |  C1                   |  ¬C1
  C1                             |  True Positives (TP)  |  False Negatives (FN)
  ¬C1                            |  False Positives (FP) |  True Negatives (TN)

Example of a confusion matrix:

  Actual class \ Predicted class |  buys_computer = yes |  buys_computer = no |  Total
  buys_computer = yes            |  6954                |  46                 |  7000
  buys_computer = no             |  412                 |  2588               |  3000
  Total                          |  7366                |  2634               |  10000

•  Given m classes, an entry CMi,j in a confusion matrix indicates the # of tuples in class i that were labeled by the classifier as class j
•  May have extra rows/columns to provide totals
Classifier Evaluation Metrics: Accuracy, Error Rate, Sensitivity and Specificity

  A \ P |  C   |  ¬C  |
  C     |  TP  |  FN  |  P
  ¬C    |  FP  |  TN  |  N
        |  P’  |  N’  |  All

•  Classifier accuracy, or recognition rate: percentage of test set tuples that are correctly classified
   Accuracy = (TP + TN)/All
•  Error rate: 1 − accuracy, or Error rate = (FP + FN)/All
•  Class imbalance problem:
   •  One class may be rare, e.g. fraud or HIV-positive
   •  Significant majority of the negative class and minority of the positive class
   •  Sensitivity: true positive recognition rate; Sensitivity = TP/P
   •  Specificity: true negative recognition rate; Specificity = TN/N
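These quantities follow directly from the four counts; a small sketch, checked here against the buys_computer confusion matrix from the previous slide:

```python
def basic_metrics(tp, fn, fp, tn):
    """Accuracy, error rate, sensitivity and specificity from confusion-matrix counts."""
    p, n, all_ = tp + fn, fp + tn, tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / all_,
        "error_rate": (fp + fn) / all_,
        "sensitivity": tp / p,   # true positive recognition rate
        "specificity": tn / n,   # true negative recognition rate
    }

# buys_computer example: TP=6954, FN=46, FP=412, TN=2588
print(basic_metrics(6954, 46, 412, 2588))
# accuracy ≈ 0.954, error_rate ≈ 0.046, sensitivity ≈ 0.993, specificity ≈ 0.863
```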
Classifier Evaluation Metrics: Precision and Recall, and F-measures
•  Precision (exactness): what % of tuples that the classifier labeled as positive are actually positive?
   Precision = TP / (TP + FP)
•  Recall (completeness): what % of positive tuples did the classifier label as positive?
   Recall = TP / (TP + FN)
•  A perfect score is 1.0
•  There is an inverse relationship between precision and recall
•  F measure (F1 or F-score): the harmonic mean of precision and recall
   F1 = 2 × Precision × Recall / (Precision + Recall)
•  Fß: a weighted measure of precision and recall
   Fß = (1 + ß²) × Precision × Recall / (ß² × Precision + Recall)
   •  assigns ß times as much weight to recall as to precision
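And a matching sketch for precision, recall and Fß, checked against the cancer example on the next slide:

```python
def precision_recall_f(tp, fp, fn, beta=1.0):
    """Precision, recall and F_beta from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_beta = (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)
    return precision, recall, f_beta

# Cancer example from the next slide: TP=90, FP=140, FN=210
p, r, f1 = precision_recall_f(90, 140, 210)
print(round(p, 4), round(r, 4), round(f1, 4))   # 0.3913, 0.3, 0.3396
```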
Classifier Evaluation Metrics: Example

  Actual class \ Predicted class |  cancer = yes |  cancer = no |  Total  |  Recognition (%)
  cancer = yes                   |  90           |  210         |  300    |  30.00 (sensitivity)
  cancer = no                    |  140          |  9560        |  9700   |  98.56 (specificity)
  Total                          |  230          |  9770        |  10000  |  96.40 (accuracy)

Precision = 90/230 = 39.13%
Recall = 90/300 = 30.00%
Evaluating Classifier Accuracy: Holdout & Cross-Validation Methods
•  Holdout method
   •  The given data is randomly partitioned into two independent sets
      •  Training set (e.g., 2/3) for model construction
      •  Test set (e.g., 1/3) for accuracy estimation
   •  Random subsampling: a variation of holdout
      •  Repeat holdout k times; accuracy = average of the accuracies obtained
•  Cross-validation (k-fold, where k = 10 is most popular)
   •  Randomly partition the data into k mutually exclusive subsets D1, …, Dk, each of approximately equal size
   •  At the i-th iteration, use Di as the test set and the others as the training set
   •  Leave-one-out: k folds where k = # of tuples, for small-sized data
   •  Stratified cross-validation: folds are stratified so that the class distribution in each fold is approximately the same as that in the initial data
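A minimal sketch of (unstratified) k-fold cross-validation; `train` and `evaluate` are placeholders for whatever classifier and accuracy function are being assessed:

```python
import random

def k_fold_cv(rows, k, train, evaluate):
    """Shuffle, split into k roughly equal folds, and average the k test-fold accuracies."""
    rows = rows[:]
    random.shuffle(rows)
    folds = [rows[i::k] for i in range(k)]                 # k mutually exclusive subsets
    accuracies = []
    for i in range(k):
        test = folds[i]
        training = [r for j, fold in enumerate(folds) if j != i for r in fold]
        model = train(training)
        accuracies.append(evaluate(model, test))           # accuracy with fold i as the test set
    return sum(accuracies) / k
```

Leave-one-out is the special case k = number of tuples; stratified cross-validation would additionally balance the class distribution inside each fold.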
Evaluating Classifier Accuracy: Bootstrap
•  Bootstrap
   •  Works well with small data sets
   •  Samples the given training tuples uniformly with replacement
      •  i.e., each time a tuple is selected, it is equally likely to be selected again and re-added to the training set
•  There are several bootstrap methods; a common one is the .632 bootstrap
   •  A data set with d tuples is sampled d times, with replacement, resulting in a training set of d samples. The data tuples that did not make it into the training set end up forming the test set. About 63.2% of the original data end up in the bootstrap sample, and the remaining 36.8% form the test set (since (1 − 1/d)^d ≈ e⁻¹ = 0.368)
   •  Repeat the sampling procedure k times; the overall accuracy of the model is estimated by averaging, over the k iterations, 0.632 × the accuracy on the test set plus 0.368 × the accuracy on the training set
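A sketch of the .632 bootstrap under the same conventions as the cross-validation snippet (illustrative `train`/`evaluate` placeholders):

```python
import random

def bootstrap_632(rows, k, train, evaluate):
    """Average over k resamples of 0.632 * acc(test) + 0.368 * acc(train)."""
    estimates = []
    for _ in range(k):
        sample = [random.choice(rows) for _ in range(len(rows))]    # d draws with replacement
        chosen = set(id(r) for r in sample)
        test = [r for r in rows if id(r) not in chosen]             # ~36.8% of tuples on average
        model = train(sample)
        estimates.append(0.632 * evaluate(model, test) + 0.368 * evaluate(model, sample))
    return sum(estimates) / k
```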
Reference
•  J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2nd ed., 2006 – Chapter 6 (3rd edition, 2011 – Chapters 8 and 9)
•  http://www.cs.uiuc.edu/homes/hanj/bk2/