Classification
Brendan Tierney
Classification vs. Prediction
§  Classification:
§  predicts categorical class labels
§  classifies data (constructs a model) based on the training set and the values (class
labels) in a classifying attribute and uses it in classifying new data
§  Prediction:
§  models continuous-valued functions, i.e., predicts unknown or missing values
§  Typical Applications
§  credit approval
§  target marketing
§  medical diagnosis
§  treatment effectiveness analysis
Classification: Definition
§  Given a collection of records (the training set)
§  Each record contains a set of attributes; one of the attributes is the class.
§  Find a model for the class attribute as a function of the values of other
attributes.
§  Goal: previously unseen records should be assigned a class as accurately
as possible.
§  A test set is used to determine the accuracy of the model. Usually, the given data set is divided
into training and test sets, with training set used to build the model and test set used to validate
it.
Classification—A Two-Step Process
§  Model construction: describing a set of predetermined classes
§  Each tuple/sample is assumed to belong to a predefined class, as determined by the class label
attribute
§  The set of tuples used for model construction: training set
§  The model is represented as classification rules, decision trees, or mathematical formulae
§  Estimate accuracy of the model
§  The known label of test sample is compared with the classified result from the model
§  Accuracy rate is the percentage of test set samples that are correctly classified by the
model
§  Test set is independent of training set, otherwise over-fitting will occur
Classification is about explaining the past
and
predicting the future.
§  Model usage: for classifying future or unknown objects
§  The model is used to classify data tuples whose class labels are not known
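
As a rough illustration of the two steps (not part of the original slides), here is a minimal Python/scikit-learn sketch; the bundled iris data and the 70/30 split are arbitrary illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)                      # attributes and known class labels

# Step 1: model construction on the training set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Estimate accuracy: compare the known labels of the test set with the model's output
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Accuracy rate: {accuracy:.2%}")

# Step 2: model usage - classify records whose class label is not known
print(model.predict(X_test[:3]))
```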
Important Columns
[Diagram: the data assembly operations and the Target column that must be identified in the data]
50 – 50 Split
§  Need to avoid a biased dataset
§  50 – 50 split of target column - if possible
§  Not always possible
[Diagram: the Data Set balanced so that Target 1 and Target 2 each make up 50% of the cases]
Splitting the data set for Training and Testing
[Diagram: the data set divided into a Training Data Set (60%-70%, with Target 1 and Target 2 at 50% each) and a Testing Data Set (30%-40%, target percentages as they fall); see the split sketch below]
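
A sketch of how such a split is commonly produced; scikit-learn's train_test_split with the stratify option keeps the target mix the same in both parts (the data, the 30% test size and the variable names are illustrative assumptions):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative data: 100 cases with a binary target in a 50/50 mix
X = np.random.rand(100, 4)          # the attribute columns
y = np.array([0, 1] * 50)           # the target column (Target 1 / Target 2)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.3,                  # 30%-40% held back for testing, the rest for training
    stratify=y,                     # keep the same target mix in both data sets
    random_state=42,
)
print(y_train.mean(), y_test.mean())    # both stay at 0.50
```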
Data Preparation
§  Can involve lots and lots and lots of coding and data preparation
§  One record per case, aka the analytical record
§  Target value is determined based on defined rules (defined with help from
the business)
§  Sampling of data
§  In some cases one Target value occurs in a very small % of cases
§  Training set needs to be balanced
§  Over-sampling of the rare target cases (see the sketch below)
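
The sketch below shows one simple way to balance a training set by randomly over-sampling the rare target cases (sampling with replacement); this is only an illustrative approach, not the method prescribed in the slides:

```python
import pandas as pd

# Illustrative training set in which the target value 1 is rare
df = pd.DataFrame({"attr": range(20), "target": [0] * 18 + [1] * 2})

majority = df[df["target"] == 0]
minority = df[df["target"] == 1]

# Sample the rare cases with replacement until the two classes are balanced
oversampled = minority.sample(n=len(majority), replace=True, random_state=42)
balanced = pd.concat([majority, oversampled])
print(balanced["target"].value_counts())
```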
Illustrating Classification Task

Training Set:
  Tid  Attrib1  Attrib2  Attrib3  Class
  1    Yes      Large    125K     No
  2    No       Medium   100K     No
  3    No       Small    70K      No
  4    Yes      Medium   120K     No
  5    No       Large    95K      Yes
  6    No       Medium   60K      No
  7    Yes      Large    220K     No
  8    No       Small    85K      Yes
  9    No       Medium   75K      No
  10   No       Small    90K      Yes

Test Set:
  Tid  Attrib1  Attrib2  Attrib3  Class
  11   No       Small    55K      ?
  12   Yes      Medium   80K      ?
  13   Yes      Large    110K     ?
  14   No       Small    95K      ?
  15   No       Large    67K      ?

[Diagram: the Training Set feeds a Learning algorithm (Induction) that learns a Model; the Model is then applied to the Test Set (Apply Model / Deduction) to assign a Class to each record]
Classification Process (1): Model Construction

Training Data:
  NAME  RANK            YEARS  TENURED
  Mike  Assistant Prof  3      no
  Mary  Assistant Prof  7      yes
  Bill  Professor       2      yes
  Jim   Associate Prof  7      yes
  Dave  Assistant Prof  6      no
  Anne  Associate Prof  3      no

[Diagram: the Training Data is fed to the Classification Algorithms to produce a Classifier (Model)]

Example of a Decision Tree rule:
IF rank = ‘professor’ OR years > 6
THEN tenured = ‘yes’
Classification Process (2): Use the Model in Prediction

Testing Data:
  NAME     RANK            YEARS  TENURED
  Tom      Assistant Prof  2      no
  Merlisa  Associate Prof  7      no
  George   Professor       5      yes
  Joseph   Assistant Prof  7      yes

Unseen Data: (Jeff, Professor, 4) Tenured?

[Diagram: the previously developed Classifier is applied to the Testing Data (this is what we have) and then to Unseen Data to predict what the class should be]
Example of a Decision Tree

Training Data (categorical, categorical and continuous attributes; class = Cheat):
  Tid  Refund  Marital Status  Taxable Income  Cheat
  1    Yes     Single          125K            No
  2    No      Married         100K            No
  3    No      Single          70K             No
  4    Yes     Married         120K            No
  5    No      Divorced        95K             Yes
  6    No      Married         60K             No
  7    Yes     Divorced        220K            No
  8    No      Single          85K             Yes
  9    No      Married         75K             No
  10   No      Single          90K             Yes

Decision Tree Model (splitting attributes):
  Refund?
    Yes -> NO
    No  -> MarSt?
             Married          -> NO
             Single, Divorced -> TaxInc?
                                   < 80K -> NO
                                   > 80K -> YES

What attribute do we start with? This is called Information Gain.
Another Example of Decision Tree

Using the same training data, a different tree also fits:
  MarSt?
    Married          -> NO
    Single, Divorced -> Refund?
                          Yes -> NO
                          No  -> TaxInc?
                                   < 80K -> NO
                                   > 80K -> YES

There could be more than one tree that fits the same data!
Decision Tree Classification Task

[Diagram: the same Training Set (Tid 1-10) and Test Set (Tid 11-15) as before; a Tree Induction algorithm learns a Decision Tree from the Training Set (Induction), and the Decision Tree is then applied to the Test Set (Apply Model / Deduction) to assign a Class]
How usable or re-usable is the model
§  Does it work for just this one time or can it be used over time with different data?
Apply Model to Test Data

Test Data:
  Refund  Marital Status  Taxable Income  Cheat
  No      Married         80K             ?

Start from the root of the tree and follow the branch that matches the record at each node:
  Refund = No              -> take the No branch to MarSt
  Marital Status = Married -> take the Married branch
  Leaf reached             -> Assign Cheat to “No”
Decision Tree Classification Task

[Diagram: the same Training Set and Test Set as before; the Tree Induction algorithm learns a Decision Tree (Induction), and the tree is applied to the Test Set (Apply Model / Deduction)]

The machine learning algorithms do all the hard work. Data Mining looks to find what algorithm works best for your data (a black/gray box approach). Different tools come with slightly different sets of algorithms. The main ones are ….
Training Dataset / Resultant Decision Tree
[Slides showing an example training dataset and the decision tree produced from it]
Algorithm for Decision Tree Induction
§  Basic algorithm (a greedy algorithm)
§  Tree is constructed in a top-down recursive divide-and-conquer manner
§  At start, all the training examples are at the root
§  Attributes are categorical (if continuous-valued, they are discretized in advance)
§  Examples are partitioned recursively based on selected attributes
§  Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
§  Conditions for stopping partitioning
§  All samples for a given node belong to the same class
§  There are no remaining attributes for further partitioning – majority voting is employed for
classifying the leaf
§  There are no samples left
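
Most tools run this greedy induction for you. As an illustration (not from the slides), the sketch below fits a scikit-learn decision tree to the Refund / Marital Status / Taxable Income data from the earlier example, using entropy so that split attributes are chosen by information gain; the one-hot encoding of the categorical attributes is my own assumption:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Training data from the earlier "Example of a Decision Tree" slide
data = pd.DataFrame({
    "Refund": ["Yes", "No", "No", "Yes", "No", "No", "Yes", "No", "No", "No"],
    "MarSt":  ["Single", "Married", "Single", "Married", "Divorced",
               "Married", "Divorced", "Single", "Married", "Single"],
    "TaxInc": [125, 100, 70, 120, 95, 60, 220, 85, 75, 90],   # Taxable Income in K
    "Cheat":  ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"],
})

X = pd.get_dummies(data[["Refund", "MarSt", "TaxInc"]])   # categorical -> indicator columns
y = data["Cheat"]

# criterion="entropy" selects the test attribute at each node by information gain
tree = DecisionTreeClassifier(criterion="entropy", random_state=42).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))   # the resultant decision tree
```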
Extracting Classification Rules from Trees
§  Represent the knowledge in the form of IF-THEN rules
§  One rule is created for each path from the root to a leaf
§  Each attribute-value pair along a path forms a conjunction
§  The leaf node holds the class prediction
§  Rules are easier for humans to understand
§  Example
IF age = “<=30” AND student = “no” THEN buys_computer = “no”
IF age = “<=30” AND student = “yes” THEN buys_computer = “yes”
IF age = “31…40”
THEN buys_computer = “yes”
IF age = “>40” AND credit_rating = “excellent” THEN buys_computer = “yes”
IF age = “>40” AND credit_rating = “fair” THEN buys_computer = “no”
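
The example rules above translate directly into code; this tiny sketch just encodes the slide's IF-THEN rules as a Python function (the attribute values are taken verbatim from the slide):

```python
def buys_computer(age, student, credit_rating):
    """IF-THEN rules read off each root-to-leaf path of the decision tree."""
    if age == "<=30":
        return "yes" if student == "yes" else "no"
    if age == "31...40":          # the "31…40" age band from the slide
        return "yes"
    # age = ">40"
    return "yes" if credit_rating == "excellent" else "no"

print(buys_computer("<=30", "no", "fair"))   # -> "no"
```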
Decision Tree in SAS
Decision Tree in Oracle
Classification Accuracy Measures
Classification accuracy can be measured in many different ways
Two of the most popular are:
§  Classification / Misclassification rates
§  Other associated values
§  Primarily used for binary classification
§  Average squared error
§  Typically used with Regression / Prediction / Continuous values
§  Used in conjunction with Classification / Misclassification rates
More on measuring Classification Accuracy later
Random Forests
§  Random forests
§  are an ensemble learning method for classification, regression and other tasks, that
operate by constructing a multitude of decision trees at training time and outputting the
class that is the mode of the classes (classification) or mean prediction (regression) of
the individual trees.
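
A minimal scikit-learn sketch of the idea (the bundled iris data and the 100 trees are arbitrary illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# Build a multitude of decision trees; each prediction is the mode of the trees' votes
forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)
print(forest.predict(X[:3]))
```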
Data Splitting – For Training only
70% / 30%
60% / 40%

Fool’s Gold
[Chart slides]

Model Complexity
[Chart: model behaviour ranging from “not flexible enough” to “too flexible”]

Overfitting
[Charts comparing performance on the Training Set and on the Test Set]
Avoiding Overfitting In Classification
An induced tree may over-fit the training data
§  Too many branches, some may reflect anomalies due to noise or outliers
§  Poor accuracy for unseen samples
Two approaches to avoiding overfitting
§  Prepruning: Halt tree construction early
§  Do not split a node if this would result in a measure of the usefulness of the tree falling below a threshold
§  Difficult to choose an appropriate threshold
§  Application / Software / Algorithm settings
§  Postpruning: Remove branches from a “fully grown” tree to give a sequence of progressively pruned trees
§  Use a set of data different from the training data to decide which is the “best pruned tree”
§  Write code to manage/merge the different nodes/branches
§  Will require some time to investigate the best options for this
(Both approaches can involve a lot of human investigation.)
Approaches to Determine the Final Tree Size
§  Separate training (2/3) and testing (1/3) sets
§  Use cross validation, e.g., 10-fold cross validation
§  Use all the data for training
§  But very difficult to evaluate the models
§  Representative sample of data in Training and Test data sets
§  Not always possible to do
§  Most DM tools will sample the data to give the training and testing data sets
§  Use minimum description length (MDL) principle:
§  halting growth of the tree when the encoding is minimized
Classification in Large Databases
§  Classification—a classical problem extensively studied by statisticians and machine
learning researchers
§  Scalability: Classifying data sets with millions of examples and hundreds of attributes with
reasonable speed
§  Why decision tree induction in data mining?
§  relatively faster learning speed (than other classification methods)
§  convertible to simple and easy to understand classification rules
§  can use SQL queries for accessing databases
§  comparable classification accuracy with other methods
Supervised vs. Unsupervised Learning
§  Supervised learning (classification)
§  Supervision: The training data (observations, measurements, etc.) are accompanied by
labels indicating the class of the observations
§  New data is classified based on the training set
§  Unsupervised learning (clustering)
§  The class labels of the training data are unknown
§  Given a set of measurements, observations, etc. with the aim of establishing the
existence of classes or clusters in the data
Case-Based Reasoning
§ Case based reasoning is a classification technique which uses prior
examples (cases) to classify unknown cases
§ Uses lazy evaluation and analysis of similar instances
§ Methodology
§  Instances represented by rich symbolic descriptions – similarity measures
§  Multiple retrieved cases may be combined
§  Tight coupling between case retrieval, knowledge-based reasoning, and problem
solving
Case-Based Reasoning (cont…)
§ The k-nearest neighbour (k-NN) algorithm is the simplest form of case
based reasoning
§ However, instances are not necessarily “points in a Euclidean space”
§ Lots of active research issues
The Nearest Neighbor Algorithm
§ All instances correspond to points in n-D space
§ The nearest neighbours are defined in terms of Euclidean distance (or other
appropriate measure)
§ The target value can be discrete or real-valued
§ For discrete targets, k-NN returns the most common value among the k
training examples nearest to the query
§ For real-valued targets, k-NN returns a combination (e.g. average) of the
nearest neighbours’ target values
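
A small scikit-learn sketch of this, using the surf data from the example that follows; k = 3 is an arbitrary illustrative choice:

```python
from sklearn.neighbors import KNeighborsClassifier

# Training cases from the surf example below: [wave size (ft), wave period (secs)]
X = [[6, 15], [1, 6], [5, 11], [7, 10], [6, 11], [2, 1], [3, 4], [6, 12], [4, 2]]
y = ["Yes", "No", "Yes", "Yes", "Yes", "No", "No", "Yes", "No"]

# For a discrete target, k-NN returns the most common class among the k nearest cases
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(knn.predict([[10, 10]]))      # the query case; its three nearest cases are all "Yes"
```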
Nearest Neighbour Example

  Wave Size (ft)  Wave Period (secs)  Good Surf?
  6               15                  Yes
  1               6                   No
  5               11                  Yes
  7               10                  Yes
  6               11                  Yes
  2               1                   No
  3               4                   No
  6               12                  Yes
  4               2                   No
  Query:
  10              10                  ?
Nearest Neighbour Example (cont…)

[Slide repeating the table above, comparing the query case against every stored case]

It gets more difficult as the number of cases increases.
Nearest Neighbour Example

When a new case is to be classified:
§  Calculate the distance from the new case to all training cases
§  Put the new case in the same class as its nearest neighbour

[Scatter plot: the cases plotted with Wave Size (f1) against Wave Period (f2), with the query cases marked “?”]
k-Nearest Neighbour Example

What about when it’s too close to call? Use the k-nearest neighbour technique:
§  Determine the k nearest neighbours to the query case
§  Put the new case into the same class as the majority of its nearest neighbours (neighbours vs. neighbour)

[Scatter plot: the query case “?” and its nearest neighbours (labelled 1 and 2) on the Wave Size (f1) / Wave Period (f2) axes]
Nearest Neighbour Distance Measures
§ Any kind of measurement can be used to calculate the distance between cases
§ The measurement most suitable will depend on the type of features in the problem
§ Euclidean distance is the most used technique
d = sqrt( Σ_{i=1}^{n} (t_i − q_i)² )

§  where n is the number of features, t_i is the i-th feature of the training case and q_i is the i-th feature of the query case
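
The formula translates directly into code; a minimal sketch (NumPy is an illustrative choice):

```python
import numpy as np

def euclidean_distance(t, q):
    """d = sqrt( sum over the n features of (t_i - q_i)^2 )."""
    t, q = np.asarray(t, dtype=float), np.asarray(q, dtype=float)
    return np.sqrt(np.sum((t - q) ** 2))

print(euclidean_distance([6, 15], [10, 10]))   # distance from one training case to a query case
```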
Classification by Decision Tree Induction
§  Decision tree
§  Decision trees are the most widely used classification technique in data mining today
§  Formulate problems into a tree composed of decision nodes (or branch nodes) and classification
nodes (or leaf nodes)
§  Internal node denotes a test on an attribute
§  Branch represents an outcome of the test
§  Leaf nodes represent class labels or class distribution
§  Problem is solved by navigating down the tree until we reach an appropriate leaf node
§  The tricky bit is building the most efficient and powerful tree
§  Decision tree generation consists of two phases
§  Tree construction
§  At start, all the training examples are at the root
§  Partition examples recursively based on selected attributes
§  Tree pruning
§  Identify and remove branches that reflect noise or outliers
§  Use of decision tree: Classifying an unknown sample
§  Test the attribute values of the sample against the decision tree
Classification is about explaining the past
and
predicting the future.
Bayesian Classification: Why?
§  Probabilistic learning: Calculate explicit probabilities for hypothesis, among the most
practical approaches to certain types of learning problems
§  Incremental: Each training example can incrementally increase/decrease the probability
that a hypothesis is correct. Prior knowledge can be combined with observed data.
§  Probabilistic prediction: Predict multiple hypotheses, weighted by their probabilities
§  Standard: Even when Bayesian methods are computationally intractable, they can provide
a standard of optimal decision making against which other methods can be measured
Bayesian Theorem
§  Given training data D, posteriori probability of a hypothesis h, P(h|D) follows the Bayes
theorem
P(h | D) = P(D | h) P(h) / P(D)

§  MAP (maximum a posteriori) hypothesis

h_MAP ≡ arg max_{h ∈ H} P(h | D) = arg max_{h ∈ H} P(D | h) P(h)
§  Practical difficulty: require initial knowledge of many probabilities, significant computational
cost
Bayesian classification
§  The classification problem may be formalized using a-posteriori probabilities:
§  P(C|X) = prob. that the sample tuple
X=<x1,…,xk> is of class C.
§  E.g. P(class=N | outlook=sunny,windy=true,…)
§  Idea: assign to sample X the class label C such that P(C|X) is maximal
Estimating a-posteriori probabilities
§  Bayes theorem:
P(C|X) = P(X|C)·P(C) / P(X)
§  P(X) is constant for all classes
§  P(C) = relative freq of class C samples
§  C such that P(C|X) is maximum =
C such that P(X|C)·P(C) is maximum
§  Problem: computing P(X|C) is infeasible!
Naïve Bayesian Classification
§  Naïve assumption: attribute independence
P(x1,…,xk|C) = P(x1|C)·…·P(xk|C)
§  If i-th attribute is categorical:
P(xi|C) is estimated as the relative freq of samples having value xi as i-th attribute in class C
§  If i-th attribute is continuous:
P(xi|C) is estimated thru a Gaussian density function
§  Computationally easy in both cases
Play-tennis example: estimating P(xi|C)
[Slides: the play-tennis training data and the frequency counts used to estimate P(xi|C) for each attribute value and class]
Play-tennis example: classifying X
§  An unseen sample X = <rain, hot, high, false>
§  P(X|p)·P(p) =
P(rain|p)·P(hot|p)·P(high|p)·P(false|p)·P(p) = 3/9·2/9·3/9·6/9·9/14 = 0.010582
§  P(X|n)·P(n) =
P(rain|n)·P(hot|n)·P(high|n)·P(false|n)·P(n) = 2/5·2/5·4/5·2/5·5/14 = 0.018286
§  Sample X is classified in class n (don’t play)
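
The arithmetic is easy to verify; this small sketch just multiplies out the probabilities quoted on the slide:

```python
# Naive Bayes comparison for X = <rain, hot, high, false>, using the slide's estimates
score_play      = (3/9) * (2/9) * (3/9) * (6/9) * (9/14)   # P(X|p) * P(p)
score_dont_play = (2/5) * (2/5) * (4/5) * (2/5) * (5/14)   # P(X|n) * P(n)

print(round(score_play, 6))        # 0.010582
print(round(score_dont_play, 6))   # 0.018286
# The larger value wins, so X is assigned to class n (don't play)
```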
Naïve Bayesian Classifier: Comments
Advantages :
§  Easy to implement
§  Good results obtained in most of the cases
Disadvantages
§  Assumption: class conditional independence, therefore loss of accuracy
§  Practically, dependencies exist among variables
§  E.g., hospitals: patients: Profile: age, family history, etc.
§  Symptoms: fever, cough, etc. Disease: lung cancer, diabetes, etc.
§  Dependencies among these cannot be modeled by a Naïve Bayesian Classifier
How to deal with these dependencies?
§  Bayesian Belief Networks
Neural Networks
•  Very good results
•  Suitable for a large number of problem areas
•  But
•  Very slow at building models. Too much complexity
•  Cannot be used in many “Real World” scenarios
•  Cannot explain how or why it is making its decisions
•  Cannot integrate into applications
•  Cannot do real time
Support Vector Machines (SVM)
§  In classification problems we try to create decision boundaries between
classes
§  A choice must be made between possible boundaries
SVMs (cont…)
§  The decision boundary should be as far away from the data of both classes
as possible
Margins
[Figure: the margin around the decision boundary and the support vectors that define it]
SVMs: The Clever Bit!
§  What about when classes are not linearly separable?
§  Kernel functions and the kernel trick are used to transform data into a
different linearly separable feature space
SVMs: The Clever Bit! (cont...)
§  What if the data is not linearly separable?
§  Project the data to high dimensional space where it is linearly separable and
then we can use linear SVM – (Using Kernels)
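
A minimal scikit-learn sketch of this idea; the RBF kernel and the two-moons data (which a straight line cannot separate) are illustrative choices:

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Two interleaving half-moons: the classes cannot be separated by a straight line
X, y = make_moons(n_samples=200, noise=0.1, random_state=42)

# The RBF kernel implicitly maps the data into a space where a linear boundary works
clf = SVC(kernel="rbf").fit(X, y)
print(clf.score(X, y))
```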
Summary Of SVM Classification
§  Strengths
§  Over-fitting is not common
§  Works well with high dimensional data
§  Fast classification
§  Good generalization capacity
§  Weaknesses
§  Retraining is difficult
§  No explanation capability
§  Slow training
§  At the cutting edge of machine learning
Single Class – Support Vector Machine
§  What if we have data with just one class value
§  Will try to map data into MORE Like and LESS Like
§  Anomaly Detection
§  If the prediction is 1, the case is considered typical.
§  If the prediction is 0, the case is considered anomalous.
§  You can specify the percentage of the data that you expect to be anomalous.
§  If you have some knowledge that the number of "suspicious" cases is a certain
percentage of your population, then you can set the outlier rate to that percentage.
§  The model will identify approximately that many "rare“ cases when applied to the
general population. The default is 10%, which is probably high for many anomaly
detection problems.
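
As an illustration, scikit-learn's OneClassSVM expresses the same idea, although its conventions differ slightly from the 1/0 and outlier-rate description above (which reads like a specific tool's behaviour): the expected outlier fraction is set via the nu parameter and predictions come back as +1 (typical) or -1 (anomalous):

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))            # data containing only one "class"

# nu roughly corresponds to the expected fraction of anomalous cases (10% here)
model = OneClassSVM(nu=0.10, kernel="rbf").fit(X)

labels = model.predict(X)                # +1 = typical, -1 = anomalous
print((labels == -1).mean())             # roughly the chosen outlier rate
```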
Single Class – Support Vector Machine – Fraud Detection
…
…
§  The cases with the highest probability of anomaly contain highly questionable data values when comparing the make and age of the wrecked vehicle with the reported value of that vehicle.
§  This data may indicate the need for further investigation of these particular claims for possible fraud
What Is Prediction?
§  Prediction is similar to classification
§  First, construct a model
§  Second, use the model to predict an unknown value
§  Major method for prediction is regression
§  Linear and multiple regression
§  Non-linear regression
§  Prediction is different from classification
§  Classification predicts a categorical class label
§  Prediction models continuous-valued functions
Linear Regression Prediction Formula
ŷ = ŵ0 + ŵ1·x1 + ŵ2·x2

§  ŷ is the prediction estimate, ŵ0 the intercept estimate, ŵ1 and ŵ2 the parameter estimates, and x1, x2 the input measurements

Choose the intercept and parameter estimates to minimize the squared error function over the training data:

Σ_i (y_i − ŷ_i)²
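
A least-squares fit of exactly this form; a minimal sketch with made-up data (the two input measurements and the true coefficients are invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.random((50, 2))                                   # the two input measurements x1, x2
y = 3.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.1, size=50)

# Fitting chooses the intercept and parameter estimates that minimise the squared error
reg = LinearRegression().fit(X, y)
print(reg.intercept_, reg.coef_)                          # w0_hat and (w1_hat, w2_hat)
```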
Regression Example
§  Engineers concerned about the stability of the Leaning
Tower of Pisa have done extensive studies of its
increasing tilt. The following table gives measurements for
the years 1975 to 1987.
§  The variable “lean” represents the difference between
where a point on the tower would be if the tower were
straight and where it actually is. The lean is measured in
meters.
§  Where will the tower be in 1988 or 1989?
Regression Example
[Chart slides: the lean plotted against year, with the fitted regression line extended to predict the tilt for 1988, 1989, 1990 and 1995, which are marked “?”]
Pros/Cons Of Regression Analysis
§  Strengths
§  Mathematically very well established
§  Can be readily understood
§  Can be extended to handle non-linear problems (polynomial regression)
§  Weaknesses
§  Does not cope with missing data
§  Some problems with categorical data
Logistic Regression
§  Binary Logistic Regression is a type of regression analysis where the
dependent variable is coded as 0 or 1.
§  But
§  With our data this is not always possible, e.g.
§  Life time value
§  Income
§  Profit
§  Usage
§  ….
§  These have a Numeric target/value
Can you think of problems this can be used on?
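
A short scikit-learn sketch of binary logistic regression on a target already coded as 0 or 1 (the bundled breast-cancer data set is an illustrative choice):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)        # the target is already coded as 0 or 1
clf = LogisticRegression(max_iter=5000).fit(X, y)
print(clf.predict_proba(X[:3]))                    # predicted probability of each class
```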
Regression Measures

R-squared is the percentage of the response variable variation that is explained by a linear model.
R-squared = Explained variation / Total variation
Residual = Observed value - Fitted value

R-squared is always between 0 and 100%:
0% indicates that the model explains none of the variability of the response data around its mean.
100% indicates that the model explains all the variability of the response data around its mean.
In general, the higher the R-squared, the better the model fits your data.
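
A small sketch of computing residuals and R-squared from observed and fitted values (the numbers are made up; scikit-learn's r2_score reports the value as a fraction between 0 and 1 rather than a percentage):

```python
from sklearn.metrics import r2_score

observed = [3.0, 4.5, 6.1, 8.2]
fitted   = [2.8, 4.7, 6.0, 8.5]

residuals = [o - f for o, f in zip(observed, fitted)]   # Observed value - Fitted value
print(residuals)
print(r2_score(observed, fitted))                       # explained variation / total variation
```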
Regression Measures
[Chart slide]
Regression Prediction
§  Can be difficult to predict the correct value
§  Does the predicted value make sense
§  Is there a difference between 1001 and 1005
§  Sometimes the continuous value is Binned into ranges
§  Can then use binary classification for this
§  Only if there is a small-ish number of Bins
Concerns Over Classification Techniques
§  When choosing a technique for a specific classification problem we must
consider the following issues:
§  Classification accuracy
§  Training speed
§  Classification speed
§  Danger of over-fitting
§  Generalisation capacity
§  Implications for retraining
§  Explanation capability
Evaluating Classification Accuracy
§  During development, and in testing before deploying a classifier in the wild,
we need to be able to quantify the performance of the classifier
§  How accurate is the classifier?
§  When the classifier is wrong, how is it wrong?
§  Useful to decide on which classifier (which parameters) to use and to
estimate what the performance of the system will be
Evaluating Classifiers (cont…)
§  How we do this depends on how much data is available
§  If there is unlimited data available then there is no problem
§  Usually we have less data than we would like so we have to compromise
§  Use hold-out testing sets
§  Cross validation
§  K-fold cross validation
§  Leave-one-out validation
§  Parallel live test
Hold-Out Testing Sets
§  Split the available data into a training set and a test set
§  Train the classifier on the training set and evaluate based on the test set
§  A couple of drawbacks
§  We may not have enough data
§  We may happen upon an unfortunate split
K-Fold Cross Validation
§  Divide the entire data set into k folds
§  For each of k experiments, use kth fold for testing and everything else for
training
K-Fold Cross Validation (cont…)
§  The accuracy of the system is calculated as the average error across the k
folds
§  The main advantages of k-fold cross validation are that every example is
used in testing at some stage and the problem of an unfortunate split is
avoided
§  Any value can be used for k
§  10 is most common
§  Depends on the data set
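
A minimal sketch of 10-fold cross validation with scikit-learn (the classifier and the bundled data set are illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Every example is used for testing exactly once across the 10 folds
scores = cross_val_score(DecisionTreeClassifier(random_state=42), X, y, cv=10)
print(scores.mean())          # overall accuracy = average across the folds
```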
Leave-One-Out Cross Validation
§  Extreme case of k-fold cross validation
§  With N data examples perform N experiments with N-1 training cases and 1
test case
Classifier Accuracy
§  The accuracy of a classifier on a given test set is the percentage of test set
tuples that are correctly classified by the classifier
§  Often also referred to as recognition rate
§  Error rate (or misclassification rate) is the opposite of accuracy
False Positives Vs False Negatives
§  While it is useful to generate the simple accuracy of a classifier, sometimes
we need more
§  When is the classifier wrong?
§  False positives vs false negatives
§  Related to type I and type II errors in statistics
§  Often there is a different cost associated with false positives and false
negatives
§  Think about diagnosing diseases
Confusion Matrix
§  Device used to illustrate how a classifier is performing in terms of false positives and false
negatives
§  Gives us more information than a
single accuracy figure
§  Allows us to think about the cost
of mistakes
§  Can be extended to any
number of classes
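
A small sketch of producing one for a binary problem with scikit-learn; note its convention of actual classes in rows and predicted classes in columns (the labels here are made up):

```python
from sklearn.metrics import confusion_matrix

actual    = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
predicted = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

# Rows are the actual classes, columns the predicted classes:
# [[true negatives, false positives],
#  [false negatives, true positives]]
print(confusion_matrix(actual, predicted))
```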
Type I & Type II Errors
[Figure: Type I (false positive) vs. Type II (false negative) errors]

Which is Better?
[Figure]
ROC Curves
§  Receiver Operating Characteristic (ROC) curves were originally used to
make sense of noisy radio signals
§  Can be used to help us talk about classifier performance and determine the
best operating point for a classifier
§  Consider how the relationship between true
positives and false positives can change
§  We need to choose the best operating point
ROC Curves (cont…)
§  ROC curves can be used to compare classifiers
§  The greater the area under the curve the more accurate the classifier
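
A minimal sketch of computing a ROC curve and the area under it from a classifier's scores (the actual labels and the scores are made up):

```python
from sklearn.metrics import roc_curve, auc

actual = [0, 0, 1, 1, 0, 1, 1, 0, 1, 0]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.65, 0.3, 0.7, 0.5]   # classifier confidence in class 1

fpr, tpr, thresholds = roc_curve(actual, scores)   # false/true positive rates at each threshold
print(auc(fpr, tpr))                               # area under the curve: larger is better
```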
Over-Fitting
§  When we train a classifier we are trying to learn a function approximated
by the training data we happen to use
§  What if the training data doesn’t cover the whole problem space?
§  We can learn the training data too closely which hampers the ability to
generalise
§  This problem is known as over-fitting
§  Depending on the type of classifier used there are different approaches to
avoiding this
The real test or evaluation of a model
§  Deploy the model
§  Monitor the data based on classifications
§  Does the data generated from the Model reflect the real world?
§  How long to monitor for?
§  Gives a true evaluation
Model Update
§  Do we need to update the models ?
§  New (initial) Model
§  Reflects the current position
[Diagram: Historical Data is used to build the Model; New Data keeps arriving and can be used to update it]
§  We keep gathering new data/cases
§  Update a Model
§  How often do we need to update the models ?
The real test or evaluation of a model
Real World Scenario
§  Decision Tree model selected: based on outcomes of Testing & Evaluating the models
§  Deploy in Production :-)
§  Monitor Production: We are getting lots and lots of mis-classifications :-(
§  Why?
§  One of the other models (Naïve Bayes) gave the best Production results :-)
§  You need to constantly monitor and look out for this behaviour.
Assessing the Deployed Model
§  Design experiments
§  Compare actual results against expectations.
§  Compare the challenger’s results against the champion’s.
§  Did the model find the right people?
§  Did the action affect their behaviour?
§  What are the characteristics of the customers most affected by the intervention?
§  The proof of the pudding is in the eating
Summary
§  Classification is an extensively studied problem
§  Classification is probably one of the most widely used data mining
techniques with a lot of extensions
§  Scalability is still an important issue for database applications: thus
combining classification with database techniques should be a promising topic
§  Research directions: classification of non-relational data, e.g., text, spatial,
multimedia, etc. classification of skewed data sets