We can deactivate (knock-out, KO) genes in mice, and see what happens to their songs…
1- Syllable Extraction
2- Syllable Classification
…1311 … 4521… 13327
…12521 … 12521… 12521
Normal mouse
P53-KO
Somehow I figured out that the first
sound was a ‘1’, and the second was
a ‘2’. How can we do this?
The Classification Problem
(informal definition)
Given a collection of annotated data (in this case 5 instances of Symbol 1
and 5 of Symbol 2), decide what type of sound the unlabeled example is.
Symbol 1 or Symbol 2?
Data Mining/Machine Learning
Machine learning explores the study and construction of algorithms
that can learn from data.
Basic Idea: Instead of trying to create a very complex
program to do X. Use a (relatively) simple program
that can learn to do X.
Example: Instead of trying to program a car to drive
(If light(red) && NOT(pedestrian) || speed(X) <= 12 && .. ),
create a program that watches humans drive, and
learns how to drive*.
*Currently, self driving cars do a bit of both.
Why Machine Learning I
Why do machine learning instead of just writing an explicit program?
• It is often much cheaper, faster and more accurate.
• It may be possible to teach a computer something that we are not sure
how to program. For example:
• We could explicitly write a program to tell if a person is obese:
If (weight_kg / (height_m × height_m)) > 30, printf(“Obese”)
• We would find it hard to write a program to tell if a person is sad.
However, we could easily obtain 1,000 photographs of sad people / not sad
people, and ask a machine learning algorithm to learn to tell them apart.
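The explicit obesity rule is simple enough to write by hand. A minimal sketch, assuming weight in kilograms and height in metres; the function name `is_obese` is only illustrative:

```python
def is_obese(weight_kg: float, height_m: float) -> bool:
    """Explicit rule from the slide: body mass index above 30."""
    bmi = weight_kg / (height_m * height_m)
    return bmi > 30


print(is_obese(100.0, 1.75))  # BMI is roughly 32.7
print(is_obese(70.0, 1.75))   # BMI is roughly 22.9
```

No comparably short rule exists for "sad", which is the case where learning from labeled photographs becomes attractive.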
The Classification Problem
(informal definition)
Given a collection of annotated data (in this case 5 instances of Katydids
and 5 of Grasshoppers), decide what type of insect the unlabeled
example is.
Katydid or Grasshopper?
The Classification Problem
(informal definition)
Given a collection of annotated data (in this case 3 instances of Canadian
and 3 of American), decide which class the unlabeled example
belongs to.
Canadian or American?
For any domain of interest, we can measure features
• Color {Green, Brown, Gray, Other}
• Has Wings?
• Abdomen Length
• Thorax Length
• Antennae Length
• Mandible Size
• Spiracle Diameter
• Leg Length
Sidebar 1
In data mining, we usually don’t have a choice of what features to
measure. The data is not usually collected with data mining in mind.
The features we really want may not be available: Why?
____________________
____________________
We typically have to use (a subset of) whatever data we are given.
Sidebar 2
In data mining, we can sometimes generate new features.
For example, Feature X = Abdomen Length / Antennae Length
We can store features in a database.
The classification problem can now be expressed as:
• Given a training database (My_Collection), predict the class label of a
previously unseen instance

My_Collection
Insect ID   Abdomen Length   Antennae Length   Insect Class
1           2.7              5.5               Grasshopper
2           8.0              9.1               Katydid
3           0.9              4.7               Grasshopper
4           1.1              3.1               Grasshopper
5           5.4              8.5               Katydid
6           2.9              1.9               Grasshopper
7           6.1              6.6               Katydid
8           0.5              1.0               Grasshopper
9           8.3              6.6               Katydid
10          8.1              4.7               Katydid

previously unseen instance =
11          5.1              7.0               ???????
[Figure: scatter plot of Grasshoppers and Katydids, Antenna Length (y-axis, 1 to 10) against Abdomen Length (x-axis, 1 to 10).]
We will also use this larger dataset
as a motivating example…
[Figure: scatter plot of the larger dataset, Antenna Length (y-axis) against Abdomen Length (x-axis), showing Katydids and Grasshoppers.]
These data objects are called…
• exemplars
• (training) examples
• instances
• tuples
We will return to the
previous slide in two minutes.
In the meantime, we are
going to play a quick game.
I am going to show you some
classification problems which
were shown to pigeons!
Let us see if you are as
smart as a pigeon!
Pigeon Problem 1
Each example is a pair of bars, written here as (left bar, right bar).
Examples of class A: (3, 4), (1.5, 5), (6, 8), (2.5, 5)
Examples of class B: (5, 2.5), (5, 2), (8, 3), (4.5, 3)
Unlabeled objects: (8, 1.5) and (4.5, 7)
What class is this object?
What about this one, A or B?
(8, 1.5): This is a B!
Here is the rule.
If the left bar is smaller than the right bar, it is an A,
otherwise it is a B.
Pigeon Problem 2
Each example is a pair of bars, written here as (left bar, right bar).
Examples of class A: (4, 4), (5, 5), (6, 6), (3, 3)
Examples of class B: (5, 2.5), (2, 5), (5, 3), (2.5, 3)
Unlabeled objects: (8, 1.5) and (7, 7)
Oh! This one’s hard!
Even I know this one
The rule is as follows,
if the two bars are equal sizes, it is an A.
Otherwise it is a B.
So this one, (7, 7), is an A.
Pigeon Problem 3
Each example is a pair of bars, written here as (left bar, right bar).
Examples of class A: (4, 4), (1, 5), (6, 3), (3, 7)
Examples of class B: (5, 6), (7, 5), (4, 8), (7, 7)
Unlabeled object: (6, 6)
This one is really hard!
What is this, A or B?
It is a B!
The rule is as follows,
if the square of the sum of the two bars is
less than or equal to 100, it is an A.
Otherwise it is a B.
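The three pigeon rules can be written down directly as small functions. This is only a sketch: each object is assumed to be a (left bar, right bar) pair of heights, and the function names are illustrative.

```python
def rule_1(left, right):
    """Pigeon Problem 1: A if the left bar is smaller than the right bar."""
    return "A" if left < right else "B"


def rule_2(left, right):
    """Pigeon Problem 2: A if the two bars are the same size."""
    return "A" if left == right else "B"


def rule_3(left, right):
    """Pigeon Problem 3: A if the square of the sum is at most 100."""
    return "A" if (left + right) ** 2 <= 100 else "B"


print(rule_1(8, 1.5))  # the unlabeled object from Problem 1
print(rule_2(7, 7))    # the unlabeled object from Problem 2
print(rule_3(6, 6))    # the unlabeled object from Problem 3
```

The point of the game is that a learner has to recover rules like these from the labeled examples alone.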
Why did we spend so much
time with this stupid game?
Because we wanted to
show that almost all
classification problems
have a geometric
interpretation, check out
the next 3 slides…
[Figure: Pigeon Problem 1, the examples of class A and class B plotted as points, Left Bar (y-axis, 1 to 10) against Right Bar (x-axis, 1 to 10).]
Here is the rule again.
If the left bar is smaller than the right bar, it is
an A, otherwise it is a B.
[Figure: Pigeon Problem 2, the examples of class A and class B plotted as points, Left Bar (y-axis, 1 to 10) against Right Bar (x-axis, 1 to 10).]
Let me look it up… here it is..
the rule is, if the two bars
are equal sizes, it is an A.
Otherwise it is a B.
[Figure: Pigeon Problem 3, the examples of class A and class B plotted as points, Left Bar (y-axis, 10 to 100) against Right Bar (x-axis, 10 to 100).]
The rule again:
if the square of the sum of the
two bars is less than or equal
to 100, it is an A. Otherwise it
is a B.
previously unseen instance = Insect ID 11, Abdomen Length 5.1, Antennae Length 7.0, Insect Class ???????
• We can “project” the previously unseen instance into the same space as the
database.
• We have now abstracted away the details of our particular problem. It will
be much easier to talk about points in space.
[Figure: Antenna Length (y-axis) against Abdomen Length (x-axis), showing the Grasshoppers, the Katydids and the previously unseen instance.]
Simple Linear Classifier
R.A. Fisher
1890-1962
If previously unseen instance above the line
then
class is Katydid
else
class is Grasshopper
[Figure: the Katydids and Grasshoppers plot with a straight line drawn between the two classes.]
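The slides do not say how the line is fitted, so the sketch below makes an assumption: it uses the perpendicular bisector of the segment joining the two class centroids. That is one simple way to get a straight decision line and is not necessarily the method the lecturer had in mind. The data are the ten insects from the My_Collection table, as (abdomen length, antennae length).

```python
grasshoppers = [(2.7, 5.5), (0.9, 4.7), (1.1, 3.1), (2.9, 1.9), (0.5, 1.0)]
katydids = [(8.0, 9.1), (5.4, 8.5), (6.1, 6.6), (8.3, 6.6), (8.1, 4.7)]


def mean_point(points):
    """Centroid of a list of 2-D points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def fit_line(class_below, class_above):
    """Assumed fitting rule: perpendicular bisector between the centroids.

    Returns (w, b) with w[0]*x + w[1]*y + b > 0 on the class_above side.
    """
    lo, hi = mean_point(class_below), mean_point(class_above)
    w = (hi[0] - lo[0], hi[1] - lo[1])
    mid = ((hi[0] + lo[0]) / 2, (hi[1] + lo[1]) / 2)
    b = -(w[0] * mid[0] + w[1] * mid[1])
    return w, b


def classify(point, w, b):
    """Katydid if the point lies on the Katydid side of the line."""
    side = w[0] * point[0] + w[1] * point[1] + b
    return "Katydid" if side > 0 else "Grasshopper"


w, b = fit_line(grasshoppers, katydids)
print(classify((5.1, 7.0), w, b))  # the previously unseen instance
```

Fitting touches every training instance once, while classifying a new point is a single dot product.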
Simple Quadratic Classifier
Simple Cubic Classifier
Simple Quartic Classifier
Simple Quintic Classifier
Simple…..
If previously unseen instance above the line
then
class is Katydid
else
class is Grasshopper
[Figure: the Katydids and Grasshoppers plot with the decision boundary drawn.]
The simple linear classifier is defined for higher dimensional spaces…
… we can visualize it as
being an n-dimensional
hyperplane
It is interesting to think about what would happen in this example if
we did not have the 3rd dimension…
We can no longer get perfect
accuracy with the simple linear
classifier…
We could try to solve this
problem by using a simple
quadratic classifier or a simple
cubic classifier..
However, as we will later see,
this is probably a bad idea…
Which of the “Pigeon Problems” can be
solved by the Simple Linear Classifier?
1) Perfect
2) Useless
3) Pretty Good
Problems that can be solved by a linear
classifier are called linearly separable.
[Figure: the three Pigeon Problem plots.]
Revisiting Sidebar 2
What would happen if we created a new feature Z, where:
Z = abs(X.value - Y.value)
All blue points are perfectly
aligned, so we can only see one
[Figure: the original two-dimensional plot and a one-dimensional plot of the new feature Z.]
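To see why a generated feature can help, here is a small sketch that takes Z as the absolute difference of the two bar heights and applies it to the Pigeon Problem 2 examples. Using that dataset here is an assumption based on the "equal bars" rule.

```python
class_a = [(4, 4), (5, 5), (6, 6), (3, 3)]
class_b = [(5, 2.5), (2, 5), (5, 3), (2.5, 3)]


def z_feature(x, y):
    """New feature: absolute difference between the two original features."""
    return abs(x - y)


z_a = [z_feature(x, y) for x, y in class_a]
z_b = [z_feature(x, y) for x, y in class_b]
print(z_a)  # every class A example collapses onto the same value
print(z_b)
```

On the single axis Z, a threshold just above zero separates the two classes, although no straight line does so in the original two features.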
A Famous Problem
R. A. Fisher’s Iris Dataset.
• 3 classes
• 50 of each class
The task is to classify Iris plants
into one of 3 varieties using the
Petal Length and Petal Width.
[Figure: Iris Setosa, Iris Versicolor and Iris Virginica, with a plot of the Setosa, Versicolor and Virginica classes.]
We can generalize the piecewise linear classifier to N classes, by
fitting N-1 lines. In this case we first learned the line to (perfectly)
discriminate between Setosa and Virginica/Versicolor, then we
learned to approximately discriminate between Virginica and
Versicolor.
[Figure: plot of the three Iris classes, Virginica, Setosa and Versicolor.]
If petal width > 3.272 – (0.325 * petal length) then class = Virginica
Elseif petal width…
We have now seen one classification
algorithm, and we are about to see more.
How should we compare them?
• Predictive accuracy
• Speed and scalability
– time to construct the model
– time to use the model
– efficiency in disk-resident databases
• Robustness
– handling noise, missing values and irrelevant features,
streaming data
• Interpretability:
– understanding and insight provided by the model
Predictive Accuracy I
Hold Out Data
• How do we estimate the accuracy of our classifier?
We can use Hold Out data
We divide the dataset into 2 partitions, called train and test. We build our models
on train, and see how well we do on test.
[Figure: the ten-insect table (Insect ID, Abdomen Length, Antennae Length, Insect Class) divided into train and test partitions, next to the scatter plot.]
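A hold-out evaluation can be sketched in a few lines. Two assumptions are made here because the slide residue does not settle them: insects 1 to 5 are used as train and insects 6 to 10 as test, and the classifier is the nearest neighbor rule that the slides introduce later, chosen only because it is short.

```python
# (abdomen length, antennae length, class) for insects 1 to 10.
insects = [
    (2.7, 5.5, "Grasshopper"), (8.0, 9.1, "Katydid"),
    (0.9, 4.7, "Grasshopper"), (1.1, 3.1, "Grasshopper"),
    (5.4, 8.5, "Katydid"), (2.9, 1.9, "Grasshopper"),
    (6.1, 6.6, "Katydid"), (0.5, 1.0, "Grasshopper"),
    (8.3, 6.6, "Katydid"), (8.1, 4.7, "Katydid"),
]


def nearest_label(query, train):
    """Label of the training instance closest to query (squared Euclidean)."""
    best = min(
        train,
        key=lambda r: (r[0] - query[0]) ** 2 + (r[1] - query[1]) ** 2,
    )
    return best[2]


# Assumed split: first half for building the model, second half held out.
train, test = insects[:5], insects[5:]
correct = sum(nearest_label(r[:2], train) == r[2] for r in test)
accuracy = correct / len(test)
print(accuracy)
```

The test instances play no part in building the model, which is what makes the estimate honest.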
Predictive Accuracy II
• How do we estimate the accuracy of our classifier?
We can use K-fold cross validation
We divide the dataset into K equal sized sections. The algorithm is tested K times,
each time leaving out one of the K sections from building the classifier, but using it
to test the classifier instead
Accuracy = (Number of correct classifications) / (Number of instances in our database)
[Figure: the ten-insect table divided into K=5 sections, next to the scatter plot.]
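K-fold cross validation with K=5 on the ten insects might look like the following sketch. The folds are taken as contiguous blocks of the table and the classifier is a nearest neighbor rule; both choices are assumptions made for brevity.

```python
# (abdomen length, antennae length, class) for insects 1 to 10.
insects = [
    (2.7, 5.5, "Grasshopper"), (8.0, 9.1, "Katydid"),
    (0.9, 4.7, "Grasshopper"), (1.1, 3.1, "Grasshopper"),
    (5.4, 8.5, "Katydid"), (2.9, 1.9, "Grasshopper"),
    (6.1, 6.6, "Katydid"), (0.5, 1.0, "Grasshopper"),
    (8.3, 6.6, "Katydid"), (8.1, 4.7, "Katydid"),
]


def nearest_label(query, train):
    """Label of the training instance closest to query (squared Euclidean)."""
    best = min(
        train,
        key=lambda r: (r[0] - query[0]) ** 2 + (r[1] - query[1]) ** 2,
    )
    return best[2]


def k_fold_accuracy(data, k):
    """Accuracy over k contiguous folds; assumes len(data) is divisible by k."""
    fold_size = len(data) // k
    correct = 0
    for i in range(k):
        test = data[i * fold_size:(i + 1) * fold_size]
        train = data[:i * fold_size] + data[(i + 1) * fold_size:]
        correct += sum(nearest_label(r[:2], train) == r[2] for r in test)
    return correct / len(data)


print(k_fold_accuracy(insects, k=5))
```

Every instance is used for testing exactly once, so the denominator is the size of the whole database.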
The Default Rate
How accurate can we be if we use no features?
The answer is called the Default Rate, the size of the most common
class, over the size of the full dataset.
Default Rate = size(most common class) / size(dataset)
Examples:
I want to predict the sex of some pregnant friends’ babies.
The most common class is ‘boy’, so I will always say ‘boy’.
101 / (101 + 100) = 50.25%
I do just a tiny bit better than random guessing.
I want to predict the sex of the nurse that will give me a
flu shot next week.
The most common class is ‘female’, so I will say ‘female’.
266634 / (266634 + 45971) = 85.29%
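The default rate is easy to compute from the class labels alone. A small sketch, using the class counts quoted in the two examples; the function name is illustrative:

```python
from collections import Counter


def default_rate(labels):
    """Fraction of the dataset that belongs to the most common class."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


babies = ["boy"] * 101 + ["girl"] * 100
nurses = ["female"] * 266634 + ["male"] * 45971
print(f"{default_rate(babies):.2%}")
print(f"{default_rate(nurses):.2%}")
```

Any classifier that uses features should be judged against this baseline rather than against zero.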
Predictive Accuracy III
• Using K-fold cross validation is a good way to set any parameters we may need to
adjust in (any) classifier.
• We can do K-fold cross validation for each possible setting, and choose the model
with the highest accuracy. Where there is a tie, we choose the simpler model.
• Actually, we should probably penalize the more complex models, even if they are more
accurate, since more complex models are more likely to overfit (discussed later).
[Figure: three plots of the same kind of data, labeled Accuracy = 94%, Accuracy = 99% and Accuracy = 100%.]
Predictive Accuracy III
Accuracy = (Number of correct classifications) / (Number of instances in our database)
Accuracy is a single number, we may be better off looking at a
confusion matrix. This gives us additional useful information…

                    True label is...
Classified as a…    Cat    Dog    Pig
Cat                 100      0      0
Dog                   9     90      1
Pig                  45     45     10
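Overall accuracy can be recovered from a confusion matrix as the sum of the diagonal over the sum of all cells. The sketch below uses the Cat/Dog/Pig counts from the slide. The result does not depend on whether rows are read as true labels or as predicted labels.

```python
labels = ["Cat", "Dog", "Pig"]
confusion = [
    [100, 0, 0],
    [9, 90, 1],
    [45, 45, 10],
]

# Correct classifications sit on the diagonal.
correct = sum(confusion[i][i] for i in range(len(labels)))
total = sum(sum(row) for row in confusion)
accuracy = correct / total
print(f"{accuracy:.1%}")
```

The single accuracy number hides what the matrix shows: nearly all of the errors involve the Pig class.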
Speed and Scalability I
We need to consider the time and space requirements
for the two distinct phases of classification:
• Time to construct the classifier
• In the case of the simple linear classifier, this is the time taken to fit the line,
which is linear in the number of instances.
• Time to use the model
• In the case of the simple linear classifier, this is the time taken to test which side
of the line the unlabeled instance is on. This can be done in constant time.
As we shall see, some classification
algorithms are very efficient in one aspect,
and very poor in the other.
Speed and Scalability II
For learning with small datasets, the two phases above are
the whole picture.
However, for data mining with
massive datasets, it is not so much the
(main memory) time complexity that
matters, rather it is how many times
we have to scan the database.
This is because for most data mining operations, disk access times
completely dominate the CPU times.
For data mining, researchers often report the number of times you
must scan the database.
Robustness I
We need to consider what happens when we have:
• Noise
• For example, a person’s age could have been mistyped as 650
instead of 65, how does this affect our classifier? (This is important
only for building the classifier, if the instance to be classified is noisy we can
do nothing).
• Missing values
• For example suppose we want to classify
an insect, but we only know the abdomen
length (X-axis), and not the antennae
length (Y-axis), can we still classify the
instance?
Robustness II
We need to consider what happens when we have:
• Irrelevant features
For example, suppose we want to classify people as either
• Suitable_Grad_Student
• Unsuitable_Grad_Student
And it happens that scoring more than 5 on a particular
test is a perfect indicator for this problem…
If we also use
“hair_length” as a
feature, how will this
affect our classifier?
Robustness III
We need to consider what happens when we have:
• Streaming data
For many real world problems, we don’t have a single fixed
dataset. Instead, the data continuously arrives, potentially
forever… (stock market, weather data, sensor data etc)
Can our classifier handle streaming data?
Interpretability
Some classifiers offer a bonus feature. The structure
of the learned classifier tells us something about the
domain.
As a trivial example, if we try to classify
people’s health risks based on just their
height and weight, we could gain the
following insight (based on the observation
that a single linear classifier does not work
well, but two linear classifiers do).
There are two ways to be unhealthy,
being obese and being too skinny.
[Figure: plot with Weight and Height axes.]
Nearest Neighbor Classifier
Evelyn Fix, 1904-1965
Joe Hodges, 1922-2000
If the nearest instance to the previously
unseen instance is a Katydid
then
class is Katydid
else
class is Grasshopper
[Figure: Katydids and Grasshoppers, Antenna Length (y-axis) against Abdomen Length (x-axis).]
We can visualize the nearest neighbor algorithm in terms of
a decision surface…
Note that we don’t actually have
to construct these surfaces, they
are simply the implicit
boundaries that divide the
space into regions “belonging”
to each instance.
This division of space is called
Dirichlet Tessellation (or Voronoi
diagram, or Thiessen regions).
The nearest neighbor algorithm is sensitive to outliers…
The solution is to…
We can generalize the nearest neighbor algorithm to
the K-nearest neighbor (KNN) algorithm.
We measure the distance to the nearest K instances, and let
them vote. K is typically chosen to be an odd number.
[Figure: decision surfaces for K=1 and K=3.]
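The voting scheme can be sketched as follows, using the ten insects from the My_Collection table and the previously unseen instance (5.1, 7.0). Euclidean distance is assumed, and the function name `knn_classify` is illustrative.

```python
from collections import Counter

# (abdomen length, antennae length, class) for insects 1 to 10.
insects = [
    (2.7, 5.5, "Grasshopper"), (8.0, 9.1, "Katydid"),
    (0.9, 4.7, "Grasshopper"), (1.1, 3.1, "Grasshopper"),
    (5.4, 8.5, "Katydid"), (2.9, 1.9, "Grasshopper"),
    (6.1, 6.6, "Katydid"), (0.5, 1.0, "Grasshopper"),
    (8.3, 6.6, "Katydid"), (8.1, 4.7, "Katydid"),
]


def knn_classify(query, data, k):
    """Majority vote among the k training instances nearest to query."""
    by_distance = sorted(
        data,
        key=lambda r: (r[0] - query[0]) ** 2 + (r[1] - query[1]) ** 2,
    )
    votes = Counter(label for _, _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


print(knn_classify((5.1, 7.0), insects, k=1))
print(knn_classify((5.1, 7.0), insects, k=3))
```

With two classes, an odd K avoids tied votes, which is one reason K is usually chosen to be odd.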
The nearest neighbor algorithm is sensitive to irrelevant features…
Suppose the following is true, if
an insect’s antenna is longer than
5.5 it is a Katydid, otherwise it
is a Grasshopper.
[Figure: training data plotted along the antenna length axis, 1 to 10.]
Using just the antenna length we
get perfect classification!
Suppose however, we add in
an irrelevant feature, for
example the insect’s mass.
[Figure: the same data plotted with the irrelevant feature as a second axis.]
Using both the antenna length
and the insect’s mass with the
1-NN algorithm we get the
wrong classification!
How do we mitigate the nearest neighbor
algorithm’s sensitivity to irrelevant features?
• Use more training instances
• Ask an expert what features are relevant to the task
• Use statistical tests to try to determine which
features are useful
• Search over feature subsets (in the next slide we
will see why this is hard)
Why searching over feature subsets is hard
Suppose you have the following classification problem, with 100
features, where it happens that Features 1 and 2 (the X and Y
below) give perfect classification, but all 98 of the other features are
irrelevant…
[Figure: the data plotted using both features, using Only Feature 1, and using Only Feature 2.]
Using all 100 features will give poor results, but so will using only
Feature 1, and so will using Feature 2! Of the 2^100 – 1 possible
subsets of the features, only one really works.
[Figure: lattice of the subsets of four features: 1; 2; 3; 4; 1,2; 1,3; 2,3; 1,4; 2,4; 3,4; 1,2,3; 1,2,4; 1,3,4; 2,3,4; 1,2,3,4.]
• Forward Selection
• Backward Elimination
• Bi-directional Search
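The size of the search space is easy to check by enumeration for a small number of features. A sketch using four features, as in the subset lattice; the helper name is illustrative:

```python
from itertools import combinations


def nonempty_subsets(features):
    """Yield every non-empty subset of the feature list, smallest first."""
    for size in range(1, len(features) + 1):
        yield from combinations(features, size)


subsets = list(nonempty_subsets([1, 2, 3, 4]))
print(len(subsets))   # 2**4 - 1 subsets for four features
print(2 ** 100 - 1)   # the count for 100 features
```

Exhaustive enumeration is fine for four features and hopeless for 100, which is why greedy strategies such as forward selection and backward elimination are used instead.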
The nearest neighbor algorithm is sensitive to the units of measurement
X axis measured in centimeters
Y axis measured in dollars
The nearest neighbor to the pink
unknown instance is red.
X axis measured in millimeters
Y axis measured in dollars
The nearest neighbor to the pink
unknown instance is blue.
One solution is to normalize the units to pure numbers. Typically the
features are Z-normalized to have a mean of zero and a standard
deviation of one. X = (X – mean(X))/std(X)
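Z-normalization removes the dependence on units, as this sketch suggests. The population standard deviation is used, which is an assumption, since the text does not say which variant is meant. The sample values are abdomen lengths taken from the insect table, treated as centimeters for illustration.

```python
from statistics import mean, pstdev


def z_normalize(values):
    """Rescale values to mean zero and (population) standard deviation one."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


abdomen_cm = [2.7, 8.0, 0.9, 1.1, 5.4]
abdomen_mm = [10 * v for v in abdomen_cm]  # same lengths in millimeters

z_cm = z_normalize(abdomen_cm)
z_mm = z_normalize(abdomen_mm)
print(z_cm)
print(z_mm)
```

Up to floating point rounding the two normalized lists agree, so the nearest neighbor no longer changes when a unit is switched.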
We can speed up the nearest neighbor algorithm by “throwing
away” some data. This is called data editing.
Note that this can sometimes improve accuracy!
One possible approach.
Delete all instances that are
surrounded by members of
their own class.
We can also speed up classification with indexing
Up to now we have assumed that the nearest neighbor algorithm uses
the Euclidean Distance, however this need not be the case…
$D(Q, C) = \sqrt{\sum_{i=1}^{n} (q_i - c_i)^2}$
$D(Q, C) = \sqrt[p]{\sum_{i=1}^{n} |q_i - c_i|^p}$
Max (p=inf)
Manhattan (p=1)
Weighted Euclidean
Mahalanobis
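The family of distances indexed by p can be sketched in one function. The name `minkowski` is the usual one for this family, though the slides do not use it. The p=inf case is handled separately as the maximum coordinate difference.

```python
import math


def minkowski(q, c, p):
    """Distance between points q and c for a given p (p=2 is Euclidean)."""
    diffs = [abs(qi - ci) for qi, ci in zip(q, c)]
    if math.isinf(p):
        return max(diffs)  # Max distance (p=inf)
    return sum(d ** p for d in diffs) ** (1 / p)


q, c = (0, 0), (3, 4)
print(minkowski(q, c, 1))         # Manhattan
print(minkowski(q, c, 2))         # Euclidean
print(minkowski(q, c, math.inf))  # Max
```

Weighted Euclidean and Mahalanobis distances are not covered by this sketch, since they need per-feature weights or a covariance matrix.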
…In fact, we can use the nearest neighbor algorithm with
any distance/similarity function
For example, is “Faloutsos” Greek or Irish? We
could compare the name “Faloutsos” to a
database of names using string edit distance…
edit_distance(Faloutsos, Keogh) = 8
edit_distance(Faloutsos, Gunopulos) = 6
Hopefully, the similarity of the name (particularly
the suffix) to other Greek names would mean the
nearest neighbor is also a Greek name.

ID   Name           Class
1    Gunopulos      Greek
2    Papadopoulos   Greek
3    Kollios        Greek
4    Dardanos       Greek
5    Keogh          Irish
6    Gough          Irish
7    Greenhaugh     Irish
8    Hadleigh       Irish

Specialized distance measures exist for DNA strings, time series,
images, graphs, videos, sets, fingerprints etc…
Edit Distance Example
It is possible to transform any string Q into
string C, using only Substitution, Insertion
and Deletion.
Assume that each of these operators has a
cost associated with it.
The similarity between two strings can be
defined as the cost of the cheapest
transformation from Q to C.
Note that for now we have ignored the issue of how we can find this cheapest
transformation
How similar are the names
“Peter” and “Piotr”?
Assume the following cost function
Substitution: 1 Unit
Insertion: 1 Unit
Deletion: 1 Unit
D(Peter,Piotr) is 3
Peter
Substitution (i for e)
Piter
Insertion (o)
Pioter
Deletion (e)
Piotr
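The slides leave open how the cheapest transformation is found. One standard way, shown here as a sketch and not necessarily the lecturer's own approach, is dynamic programming over prefixes of the two strings, with unit costs for all three operators.

```python
def edit_distance(q, c):
    """Unit-cost edit distance between strings q and c (dynamic programming)."""
    # previous[j] holds the distance between the processed prefix of q
    # and the first j characters of c.
    previous = list(range(len(c) + 1))
    for i, q_char in enumerate(q, start=1):
        current = [i]
        for j, c_char in enumerate(c, start=1):
            substitution = previous[j - 1] + (q_char != c_char)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


print(edit_distance("Peter", "Piotr"))
print(edit_distance("Faloutsos", "Keogh"))
print(edit_distance("Faloutsos", "Gunopulos"))
```

A function like this can be plugged into a nearest neighbor classifier in place of Euclidean distance, for instance to compare "Faloutsos" against a table of labeled names.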
Setting parameters and overfitting
You need to classify widgets, you get a training set..
Model Selection
• You could use a Linear Classifier or Nearest Neighbor …
Parameter Selection (or parameter tuning, tweaking)
• Nearest Neighbor
• You could use 1NN, 3NN, 5NN…
• You could use Euclidean Distance, LP1, Lpinf, Mahalanobis…
• You could do some data editing…
• You could do some feature weighting…
• You could ….
• “Linear Classifier”
• You could use a Constant classifier
• You could use a Linear Classifier
• You could use a Quadratic Classifier
• You could….
Overfitting
Overfitting occurs when a statistical model describes random
error or noise instead of the underlying relationship.
Overfitting generally occurs when a model is excessively
complex, such as having too many parameters relative to the
number of observations.
A model which has been overfit will generally have
poor predictive performance, as it can exaggerate minor
fluctuations in the data.
Suppose we need to solve a classification problem
We are not sure if we should use the..
• Simple linear classifier
or the
• Simple quadratic classifier
How do we decide which to use?
We do cross validation or
leave-one out and choose the
best one.
• Simple linear classifier gets 81% accuracy
• Simple quadratic classifier 99% accuracy
• Simple linear classifier gets 96% accuracy
• Simple quadratic classifier 97% accuracy
This problem is greatly exacerbated by
having too little data
• Simple linear classifier gets 90% accuracy
• Simple quadratic classifier 95% accuracy
What happens as we have more and more training examples?
The accuracy for all models goes up!
The chance of making a mistake (choosing the wrong model) goes down
Even if we make a mistake, it will not matter too much (because we would
learn a degenerate quadratic that is basically a straight line)
• Simple linear 70% accuracy
• Simple quadratic 90% accuracy
• Simple linear 90% accuracy
• Simple quadratic 95% accuracy
• Simple linear 99.999999% accuracy
• Simple quadratic 99.999999% accuracy
One Solution: Charge Penalty for complex models
• For example, for the simple {polynomial} classifier, we could
“charge” 1% for every increase in the degree of the polynomial
• Simple linear classifier gets 90.5% accuracy, minus 0, equals 90.5%
• Simple quadratic classifier 97.0% accuracy, minus 1, equals 96.0%
• Simple cubic classifier 97.05% accuracy, minus 2, equals 95.05%
[Figure: three plots of the same data, labeled Accuracy = 90.5%, Accuracy = 97.0% and Accuracy = 97.05%.]
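The penalty scheme amounts to a one-line adjustment before picking the best model. A sketch using the three accuracies quoted for the linear, quadratic and cubic classifiers; the 1% charge per extra degree is the illustrative figure from the slide and not a recommended constant.

```python
# Polynomial degree -> measured accuracy in percent (figures from the slide).
accuracies = {1: 90.5, 2: 97.0, 3: 97.05}
PENALTY_PER_DEGREE = 1.0  # percent charged for each degree above linear


def penalized(degree, accuracy):
    """Accuracy after charging for model complexity."""
    return accuracy - PENALTY_PER_DEGREE * (degree - 1)


scores = {d: penalized(d, a) for d, a in accuracies.items()}
best_degree = max(scores, key=scores.get)
print(scores)
print(best_degree)
```

Without the penalty the cubic would win by 0.05%; with it, the quadratic is preferred.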
• There are more principled ways to charge penalties
• In particular, there is a technique called Minimum
Description Length (MDL)