Data Science Workshop
Introduction to Machine Learning
Instructor: Dr Eamonn Keogh
Computer Science & Engineering Department
318 Winston Chung Hall
University of California - Riverside
Riverside, CA 92521
[email protected]
Get the slides now!
www.cs.ucr.edu/~eamonn/public/DSW.pdf
www.cs.ucr.edu/~eamonn/public/DSW.ppt
Some slides adapted from Tan, Steinbach and Kumar, and from Chris Clifton
Machine Learning
Machine learning explores the study and construction of algorithms
that can learn from data.
Basic Idea: Instead of trying to create a very complex program to do X, use a (relatively) simple program that can learn to do X.
Example: Instead of trying to program a car to drive
(If light(red) && NOT(pedestrian) || speed(X) <= 12 && .. ),
create a program that watches humans drive, and
learns how to drive*.
*Currently, self-driving cars do a bit of both.
Why Machine Learning I
Why do machine learning instead of just writing an explicit program?
• It is often much cheaper, faster and more accurate.
• It may be possible to teach a computer something that we are not sure
how to program. For example:
• We could explicitly write a program to tell if a person is obese
If (weight_kg / (height_m * height_m)) > 30, printf("Obese")
• We would find it hard to write a program to tell if a person is sad.
However, we could easily obtain 1,000
photographs of sad people / not-sad
people, and ask a machine learning
algorithm to learn to tell them apart.
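The explicit rule above is easy to state in code. Here is a minimal Python sketch of it; the function name and the example weights and heights are ours, for illustration only:

# A minimal sketch of the explicit rule above (BMI > 30), assuming
# weight in kilograms and height in meters.
def is_obese(weight_kg: float, height_m: float) -> bool:
    bmi = weight_kg / (height_m * height_m)
    return bmi > 30

print(is_obese(95.0, 1.70))  # True:  BMI is about 32.9
print(is_obese(70.0, 1.80))  # False: BMI is about 21.6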
What kind of data do you want to work with?
• Insects
• Stars
• Books
• Mice
• Counties
• Emails
• Historical manuscripts
• People
  – As potential terrorists
  – As potential voters for your candidate
  – As potential heart attack victims
  – As potential tax cheats
  – etc
What kind of data do you want to work with? (same list as above)
No matter what kind of data you want to work with, it is best if you can "massage" it into a rectangular flat file. This may be easy, or…
Tid | Refund | Marital Status | Taxable Income | Cheat
 1  | Yes | Single   | 125K | No
 2  | No  | Married  | 100K | No
 3  | No  | Single   |  70K | No
 4  | Yes | Married  | 120K | No
 5  | No  | Divorced |  95K | Yes
 6  | No  | Married  |  60K | No
 7  | Yes | Divorced | 220K | No
 8  | No  | Single   |  85K | Yes
 9  | No  | Married  |  75K | No
10  | No  | Single   |  90K | Yes
What is Data?
A collection of objects and their attributes.
An attribute is a property or characteristic of an object.
• Examples: eye color of a person, temperature, etc.
• An attribute is also known as a variable, field, characteristic, or feature.
A collection of attributes describes an object.
• Objects are also known as records, points, cases, samples, entities, exemplars or instances.
• Objects could be a customer, a patient, a car, a country, a novel, a drug, a movie, etc.
(The tax-cheat table from the previous slide is shown again.)
Data Dimensionality and Numerosity
The number of attributes is the dimensionality of a dataset.
The number of objects is the numerosity (or just size) of a dataset.
Some of the algorithms we want to use may scale badly in the dimensionality, or scale badly in the numerosity (or both).
As we will see, reducing the dimensionality and/or numerosity of data is a common task in data mining.
(The tax-cheat table is shown again.)
The Classification Problem (informal definition)
Given a collection of annotated data (in this case, five instances of Katydids and five of Grasshoppers), decide what type of insect the unlabeled example is.
Katydid or Grasshopper?
The Classification Problem (informal definition)
Given a collection of annotated data (in this case, three instances of Canadian coins and three of American), decide what type of coin the unlabeled example is.
Canadian or American?
For any domain of interest, we can measure features:
• Color {Green, Brown, Gray, Other}
• Has Wings?
• Abdomen Length
• Thorax Length
• Antennae Length
• Mandible Size
• Spiracle Diameter
• Leg Length
Sidebar 1
In data mining, we usually don't have a choice of what features to measure. The data is not usually collected with data mining in mind.
The features we really want may not be available: Why?
____________________
____________________
We typically have to use (a subset of) whatever data we are given.
Sidebar 2
In data mining, we can sometimes generate new features.
For example, Feature X = Abdomen Length / Antennae Length
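As a concrete illustration, here is a minimal Python sketch of generating such a ratio feature (numpy assumed; the measurements are taken from the My_Collection table on the next slide):

import numpy as np

# Generating the ratio feature from Sidebar 2, given the two
# measured features as arrays.
abdomen_length  = np.array([2.7, 8.0, 0.9, 1.1, 5.4])
antennae_length = np.array([5.5, 9.1, 4.7, 3.1, 8.5])

# Feature X = Abdomen Length / Antennae Length
feature_x = abdomen_length / antennae_length
print(feature_x)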
We can store features in a database.
The classification problem can now be expressed as:
• Given a training database (My_Collection), predict the class label of a previously unseen instance.

My_Collection
Insect ID | Abdomen Length | Antennae Length | Insect Class
 1 | 2.7 | 5.5 | Grasshopper
 2 | 8.0 | 9.1 | Katydid
 3 | 0.9 | 4.7 | Grasshopper
 4 | 1.1 | 3.1 | Grasshopper
 5 | 5.4 | 8.5 | Katydid
 6 | 2.9 | 1.9 | Grasshopper
 7 | 6.1 | 6.6 | Katydid
 8 | 0.5 | 1.0 | Grasshopper
 9 | 8.3 | 6.6 | Katydid
10 | 8.1 | 4.7 | Katydids

previously unseen instance = 11 | 5.1 | 7.0 | ???????
(Scatter plot: the Grasshoppers and Katydids plotted by Antenna Length vs. Abdomen Length, both axes 1-10.)
We will also use this larger dataset as a motivating example…
(Scatter plot: Grasshoppers and Katydids, Antenna Length vs. Abdomen Length, both axes 1-10.)
These data objects are called…
• exemplars
• (training) examples
• instances
• tuples
We will return to the
previous slide in two minutes.
In the meantime, we are
going to play a quick game.
I am going to show you some
classification problems which
were shown to pigeons!
Let us see if you are as
smart as a pigeon!
Pigeon Problem 1
Examples of class A (left bar, right bar): (3, 4), (1.5, 5), (6, 8), (2.5, 5)
Examples of class B (left bar, right bar): (5, 2.5), (5, 2), (8, 3), (4.5, 3)
Pigeon Problem 1
Examples of class A: (3, 4), (1.5, 5), (6, 8), (2.5, 5)
Examples of class B: (5, 2.5), (5, 2), (8, 3), (4.5, 3)
What class is this object? (8, 1.5)
What about this one, A or B? (4.5, 7)
Pigeon Problem 1
Examples of class A: (3, 4), (1.5, 5), (6, 8), (2.5, 5)
Examples of class B: (5, 2.5), (5, 2), (8, 3), (4.5, 3)
(8, 1.5): This is a B!
Here is the rule. If the left bar is smaller than the right bar, it is an A, otherwise it is a B.
Pigeon Problem 2
Examples of class A: (4, 4), (5, 5), (6, 6), (3, 3)
Examples of class B: (5, 2.5), (2, 5), (5, 3), (2.5, 3)
Oh! This one's hard! (7, 7)
Even I know this one. (8, 1.5)
Pigeon Problem 2
Examples of class A: (4, 4), (5, 5), (6, 6), (3, 3)
Examples of class B: (5, 2.5), (2, 5), (5, 3), (2.5, 3)
The rule is as follows: if the two bars are equal sizes, it is an A. Otherwise it is a B.
So this one, (7, 7), is an A.
Pigeon Problem 3
Examples of class A: (4, 4), (1, 5), (6, 3), (3, 7)
Examples of class B: (5, 6), (7, 5), (4, 8), (7, 7)
This one is really hard! What is this, A or B? (6, 6)
Pigeon Problem 3
Examples of class A: (4, 4), (1, 5), (6, 3), (3, 7)
Examples of class B: (5, 6), (7, 5), (4, 8), (7, 7)
(6, 6): It is a B!
The rule is as follows: if the square of the sum of the two bars is less than or equal to 100, it is an A. Otherwise it is a B.
Why did we spend so much time with this stupid game?
Because we wanted to show that almost all classification problems have a geometric interpretation; check out the next three slides…
Pigeon Problem 1
Examples of class A: (3, 4), (1.5, 5), (6, 8), (2.5, 5)
Examples of class B: (5, 2.5), (5, 2), (8, 3), (4.5, 3)
(Scatter plot: Left Bar vs. Right Bar, both axes 1-10; the two classes fall on opposite sides of the diagonal.)
Here is the rule again. If the left bar is smaller than the right bar, it is an A, otherwise it is a B.
Pigeon Problem 2
Examples of class A: (4, 4), (5, 5), (6, 6), (3, 3)
Examples of class B: (5, 2.5), (2, 5), (5, 3), (2.5, 3)
(Scatter plot: Left Bar vs. Right Bar, both axes 1-10; class A lies exactly on the diagonal.)
Let me look it up… here it is.. the rule is, if the two bars are equal sizes, it is an A. Otherwise it is a B.
Pigeon Problem 3
Examples of class A: (4, 4), (1, 5), (6, 3), (3, 7)
Examples of class B: (5, 6), (7, 5), (4, 8), (7, 7)
(Scatter plot: Left Bar vs. Right Bar, axes labeled 10-100.)
The rule again: if the square of the sum of the two bars is less than or equal to 100, it is an A. Otherwise it is a B.
previously unseen instance = 11 | 5.1 | 7.0 | ???????
• We can "project" the previously unseen instance into the same space as the database.
• We have now abstracted away the details of our particular problem. It will be much easier to talk about points in space.
(Scatter plot: Grasshoppers and Katydids, Antenna Length vs. Abdomen Length, with the unseen instance projected in.)
Simple Linear Classifier
R. A. Fisher (1890-1962)
(Scatter plot with a straight line separating the Katydids from the Grasshoppers.)
If previously unseen instance above the line
  then class is Katydid
  else class is Grasshopper
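The slides do not say how the line is fit. One minimal way to obtain a linear decision boundary is the nearest-centroid rule sketched below in Python/numpy, using the My_Collection data; Fisher's linear discriminant is another classic choice. This is a sketch under those assumptions, not the slide's exact method:

import numpy as np

# A simple linear classifier via the nearest-centroid rule: the
# decision boundary between two class centroids is a straight line.
katydids     = np.array([[8.0, 9.1], [5.4, 8.5], [6.1, 6.6], [8.3, 6.6], [8.1, 4.7]])
grasshoppers = np.array([[2.7, 5.5], [0.9, 4.7], [1.1, 3.1], [2.9, 1.9], [0.5, 1.0]])

mu_k = katydids.mean(axis=0)       # Katydid centroid
mu_g = grasshoppers.mean(axis=0)   # Grasshopper centroid

def classify(point):
    # Which centroid is the point closer to?
    if np.linalg.norm(point - mu_k) < np.linalg.norm(point - mu_g):
        return "Katydid"
    return "Grasshopper"

print(classify(np.array([5.1, 7.0])))   # the previously unseen instance -> Katydid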
Simple Quadratic Classifier
Simple Cubic Classifier
Simple Quartic Classifier
Simple Quintic Classifier
Simple…..
(Scatter plot with a curved separating boundary.)
If previously unseen instance above the line
  then class is Katydid
  else class is Grasshopper
The simple linear classifier is defined for higher dimensional spaces… we can visualize it as being an n-dimensional hyperplane.
It is interesting to think about what would happen in this example if we did not have the 3rd dimension…
We can no longer get perfect accuracy with the simple linear classifier…
We could try to solve this problem by using a simple quadratic classifier or a simple cubic classifier..
However, as we will later see, this is probably a bad idea…
Which of the "Pigeon Problems" can be solved by the Simple Linear Classifier?
1) Perfect
2) Useless
3) Pretty Good
(The three pigeon-problem scatter plots are shown again.)
Problems that can be solved by a linear classifier are called linearly separable.
Revisiting Sidebar 2
What would happen if we created a new feature Z, where:
Z = abs(X.value - Y.value)
(Plot of the data on the new one-dimensional feature Z.)
All blue points are perfectly aligned, so we can only see one.
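A minimal numpy sketch of this generated feature, using the class A and class B bar pairs of Pigeon Problem 2 as reconstructed above:

import numpy as np

# The new feature Z = |left bar - right bar| makes Pigeon Problem 2
# trivially separable.
class_a = np.array([[4, 4], [5, 5], [6, 6], [3, 3]])       # equal bars
class_b = np.array([[5, 2.5], [2, 5], [5, 3], [2.5, 3]])   # unequal bars

z_a = np.abs(class_a[:, 0] - class_a[:, 1])
z_b = np.abs(class_b[:, 0] - class_b[:, 1])

print(z_a)  # [0 0 0 0]        -> all class A points collapse onto Z = 0
print(z_b)  # [2.5 3. 2. 0.5]  -> all class B points have Z > 0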
A Famous Problem
R. A. Fisher's Iris Dataset.
• 3 classes
• 50 of each class
The task is to classify Iris plants into one of 3 varieties (Iris Setosa, Iris Versicolor, Iris Virginica) using the Petal Length and Petal Width.
We can generalize the piecewise linear classifier to N classes, by fitting N-1 lines. In this case we first learned the line to (perfectly) discriminate between Setosa and Virginica/Versicolor, then we learned to approximately discriminate between Virginica and Versicolor.

If petal width > 3.272 - (0.325 * petal length) then class = Virginica
Elseif petal width…
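A minimal Python sketch applying the quoted line to Fisher's Iris data (scikit-learn's copy of the dataset assumed; only the first, Virginica-vs-rest test is given on the slide, the remaining "Elseif" branches are elided there, so the sketch stops at that single test):

from sklearn.datasets import load_iris

# Apply the quoted decision line: petal width > 3.272 - 0.325 * petal length.
iris = load_iris()
petal_length, petal_width = iris.data[:, 2], iris.data[:, 3]

correct = 0
for pl, pw, label in zip(petal_length, petal_width, iris.target):
    predicted_virginica = pw > 3.272 - 0.325 * pl
    correct += predicted_virginica == (iris.target_names[label] == "virginica")
print(f"Virginica-vs-rest accuracy: {correct / len(iris.target):.2%}")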
We have now seen one classification
algorithm, and we are about to see more.
How should we compare them?
• Predictive accuracy
• Speed and scalability
– time to construct the model
– time to use the model
• Robustness
– handling noise, missing values and irrelevant features,
streaming data
• Interpretability:
– understanding and insight provided by the model
Predictive Accuracy I
Hold Out Data
• How do we estimate the accuracy of our classifier? We can use Hold Out data.
We divide the dataset into 2 partitions, called train and test. We build our models on train, and see how well we do on test.

Insect ID | Abdomen Length | Antennae Length | Insect Class
 1 | 2.7 | 5.5 | Grasshopper
 2 | 8.0 | 9.1 | Katydid
 3 | 0.9 | 4.7 | Grasshopper
 4 | 1.1 | 3.1 | Grasshopper
 5 | 5.4 | 8.5 | Katydid
 6 | 2.9 | 1.9 | Grasshopper
 7 | 6.1 | 6.6 | Katydid
 8 | 0.5 | 1.0 | Grasshopper
 9 | 8.3 | 6.6 | Katydid
10 | 8.1 | 4.7 | Katydids

(The slide shows these ten rows divided into a train partition and a test partition, with the test points plotted against the model built on train.)
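A minimal hold-out sketch in Python, using the insect table above (scikit-learn assumed; the 70/30 split proportion and the 1-NN model are our illustrative choices):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Features: abdomen length, antennae length.
X = np.array([[2.7, 5.5], [8.0, 9.1], [0.9, 4.7], [1.1, 3.1], [5.4, 8.5],
              [2.9, 1.9], [6.1, 6.6], [0.5, 1.0], [8.3, 6.6], [8.1, 4.7]])
y = np.array(["Grasshopper", "Katydid", "Grasshopper", "Grasshopper", "Katydid",
              "Grasshopper", "Katydid", "Grasshopper", "Katydid", "Katydid"])

# Build the model on train, measure accuracy on the held-out test split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
print("hold-out accuracy:", model.score(X_test, y_test))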
Predictive Accuracy II
• How do we estimate the accuracy of our classifier? We can use K-fold cross validation.
We divide the dataset into K equal sized sections. The algorithm is tested K times, each time leaving out one of the K sections from building the classifier, but using it to test the classifier instead.

Accuracy = Number of correct classifications / Number of instances in our database

K = 5
(The insect table is shown again, divided into K folds.)
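A minimal K-fold sketch (K = 5) on the same insect data, again with scikit-learn assumed; each instance is used for testing exactly once:

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[2.7, 5.5], [8.0, 9.1], [0.9, 4.7], [1.1, 3.1], [5.4, 8.5],
              [2.9, 1.9], [6.1, 6.6], [0.5, 1.0], [8.3, 6.6], [8.1, 4.7]])
y = np.array(["Grasshopper", "Katydid", "Grasshopper", "Grasshopper", "Katydid",
              "Grasshopper", "Katydid", "Grasshopper", "Katydid", "Katydid"])

# Five folds: each fold is held out once for testing.
scores = cross_val_score(KNeighborsClassifier(n_neighbors=1), X, y, cv=5)
print("per-fold accuracy:", scores)
print("overall estimate :", scores.mean())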
The Default Rate
How accurate can we be if we use no features?
The answer is called the Default Rate: the size of the most common class, over the size of the full dataset.

Default Rate = size(most common class) / size(dataset)

Examples:
I want to predict the sex of a pregnant friend's unborn baby. The most common class is 'boy', so I will always say 'boy'.
101 / (101 + 100) ≈ 50.25%
I do just a tiny bit better than random guessing.

I want to predict the sex of the nurse that will give me a flu shot next week. The most common class is 'female', so I will say 'female'.
266,634 / (266,634 + 45,971) ≈ 85.29%
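A minimal Python sketch of the default rate (the function name is ours):

from collections import Counter

# The Default Rate: always predict the most common class.
def default_rate(labels):
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)

print(default_rate(["boy"] * 101 + ["girl"] * 100))  # ~0.5025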
Predictive Accuracy III
• Using K-fold cross validation is a good way to set any parameters we may need to adjust in (any) classifier.
• We can do K-fold cross validation for each possible setting, and choose the model with the highest accuracy. Where there is a tie, we choose the simpler model.
• Actually, we should probably penalize the more complex models, even if they are more accurate, since more complex models are more likely to overfit (discussed later).
(Three decision boundaries of increasing complexity: Accuracy = 94%, Accuracy = 99%, Accuracy = 100%.)
Predictive Accuracy III
Accuracy = Number of correct classifications / Number of instances in our database
Accuracy is a single number; we may be better off looking at a confusion matrix. This gives us additional useful information…

                      True label is…
                      Cat | Dog | Pig
Classified as a… Cat | 100 |   0 |  0
                 Dog |   9 |  90 |  1
                 Pig |  45 |  45 | 10
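A minimal sketch of computing such a matrix with scikit-learn (assumed available). Note that sklearn's convention is rows = true labels and columns = predicted, the transpose of the slide's layout, so we transpose to match:

from sklearn.metrics import confusion_matrix

# Label lists constructed to reproduce the counts in the slide's matrix.
y_true = ["cat"] * 154 + ["dog"] * 135 + ["pig"] * 11
y_pred = (["cat"] * 100 + ["dog"] * 9 + ["pig"] * 45 +   # the true cats
          ["dog"] * 90 + ["pig"] * 45 +                  # the true dogs
          ["dog"] * 1 + ["pig"] * 10)                    # the true pigs

cm = confusion_matrix(y_true, y_pred, labels=["cat", "dog", "pig"]).T
print(cm)   # rows = classified as, columns = true label, as in the slide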
Speed and Scalability I
We need to consider the time and space requirements for the two distinct phases of classification:
• Time to construct the classifier
  • In the case of the simple linear classifier, this is the time taken to fit the line, which is linear in the number of instances.
• Time to use the model
  • In the case of the simple linear classifier, this is the time taken to test which side of the line the unlabeled instance is on. This can be done in constant time.
As we shall see, some classification algorithms are very efficient in one aspect, and very poor in the other.
Robustness I
We need to consider what happens when we have:
• Noise
  • For example, a person's age could have been mistyped as 650 instead of 65; how does this affect our classifier? (This is important only for building the classifier; if the instance to be classified is noisy we can do nothing.)
• Missing values
  • For example, suppose we want to classify an insect, but we only know the abdomen length (X-axis), and not the antennae length (Y-axis). Can we still classify the instance?
Robustness II
We need to consider what happens when we have:
• Irrelevant features
For example, suppose we want to classify people as either
• Suitable_Grad_Student
• Unsuitable_Grad_Student
And it happens that scoring more than 5 on a particular test is a perfect indicator for this problem…
If we also use "hair_length" as a feature, how will this affect our classifier?
Robustness III
We need to consider what happens when we have:
• Streaming data
For many real world problems, we don't have a single fixed dataset. Instead, the data continuously arrives, potentially forever… (stock market, weather data, sensor data etc)
Can our classifier handle streaming data?
Interpretability
Some classifiers offer a bonus feature: the structure of the learned classifier tells us something about the domain.
As a trivial example, if we try to classify people's health risks based on just their height and weight, we could gain the following insight (based on the observation that a single linear classifier does not work well, but two linear classifiers do):
There are two ways to be unhealthy, being obese and being too skinny.
(Plot: Weight vs. Height, with two linear boundaries.)
Nearest Neighbor Classifier
Evelyn Fix (1904-1965), Joe Hodges (1922-2000)
(Scatter plot: Grasshoppers and Katydids, Antenna Length vs. Abdomen Length.)
If the nearest instance to the previously unseen instance is a Katydid
  then class is Katydid
  else class is Grasshopper
We can visualize the nearest neighbor algorithm in terms of a decision surface…
Note that we don't actually have to construct these surfaces; they are simply the implicit boundaries that divide the space into regions "belonging" to each instance.
This division of space is called the Dirichlet tessellation (or Voronoi diagram, or Thiessen regions).
The nearest neighbor algorithm is sensitive to outliers…
The solution is to…
We can generalize the nearest neighbor algorithm to the K-nearest neighbor (KNN) algorithm.
We measure the distance to the nearest K instances, and let them vote. K is typically chosen to be an odd number.
(Compare the decision surfaces for K = 1 and K = 3.)
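A minimal Python sketch of K-NN voting, using the insect data from earlier (the function name and the choice of Euclidean distance are ours):

import numpy as np
from collections import Counter

X = np.array([[2.7, 5.5], [8.0, 9.1], [0.9, 4.7], [1.1, 3.1], [5.4, 8.5],
              [2.9, 1.9], [6.1, 6.6], [0.5, 1.0], [8.3, 6.6], [8.1, 4.7]])
y = np.array(["Grasshopper", "Katydid", "Grasshopper", "Grasshopper", "Katydid",
              "Grasshopper", "Katydid", "Grasshopper", "Katydid", "Katydid"])

def knn_classify(query, X_train, y_train, k=3):
    distances = np.linalg.norm(X_train - query, axis=1)  # Euclidean distance
    nearest = np.argsort(distances)[:k]                  # indices of the K closest
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]                    # majority vote

print(knn_classify(np.array([5.1, 7.0]), X, y, k=3))     # -> Katydid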
The nearest neighbor algorithm is sensitive to irrelevant features…
Suppose the following is true: if an insect's antenna is longer than 5.5 it is a Katydid, otherwise it is a Grasshopper.
Using just the antenna length we get perfect classification!
Suppose however, we add in an irrelevant feature, for example the insect's mass.
Using both the antenna length and the insect's mass with the 1-NN algorithm we get the wrong classification!
(The training data shown on one axis, then on two axes with the irrelevant feature added.)
How do we mitigate the nearest neighbor algorithm's sensitivity to irrelevant features?
• Use more training instances
• Ask an expert what features are relevant to the task
• Use statistical tests to try to determine which features are useful
• Search over feature subsets (in the next slide we will see why this is hard)
Why searching over feature subsets is hard
Suppose you have the following classification problem, with 100 features, where it happens that Features 1 and 2 (the X and Y below) give perfect classification, but all 98 of the other features are irrelevant…
(Two plots: the data using Only Feature 1, and using Only Feature 2.)
Using all 100 features will give poor results, but so will using only Feature 1, and so will using only Feature 2! Of the 2^100 - 1 possible subsets of the features, only one really works.
(The lattice of all subsets of four features: {1}, {2}, {3}, {4}, {1,2}, {1,3}, {1,4}, {2,3}, {2,4}, {3,4}, {1,2,3}, {1,2,4}, {1,3,4}, {2,3,4}, {1,2,3,4}.)
• Forward Selection (sketched below)
• Backward Elimination
• Bi-directional Search
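A minimal greedy forward-selection sketch in Python (scikit-learn assumed; the function, the 1-NN scorer and the toy data are our illustrative choices). Note that greedy search can miss subsets like {Feature 1, Feature 2} that only work together, which is exactly the difficulty described above:

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Grow the feature subset one feature at a time, keeping whichever
# single addition most improves cross-validated 1-NN accuracy.
def forward_selection(X, y, cv=5):
    remaining = set(range(X.shape[1]))
    selected, best_score = [], 0.0
    while remaining:
        scored = [(cross_val_score(KNeighborsClassifier(n_neighbors=1),
                                   X[:, selected + [f]], y, cv=cv).mean(), f)
                  for f in remaining]
        score, f = max(scored)
        if score <= best_score:
            break                     # no single feature helps any more
        selected.append(f)
        remaining.remove(f)
        best_score = score
    return selected, best_score

# Toy usage: only feature 0 actually matters in this synthetic data.
rng = np.random.default_rng(0)
y_demo = np.array([0, 1] * 20)
X_demo = rng.random((40, 5))
X_demo[:, 0] = y_demo + 0.1 * rng.random(40)   # feature 0 encodes the class
print(forward_selection(X_demo, y_demo))        # -> ([0], 1.0)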
The nearest neighbor algorithm is sensitive to the units of measurement
X axis measured in centimeters, Y axis measured in dollars: the nearest neighbor to the pink unknown instance is red.
X axis measured in millimeters, Y axis measured in dollars: the nearest neighbor to the pink unknown instance is blue.
One solution is to normalize the units to pure numbers. Typically the features are Z-normalized to have a mean of zero and a standard deviation of one: X' = (X - mean(X)) / std(X)
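A minimal numpy sketch of Z-normalization, demonstrating that the choice of units (centimeters vs. millimeters) no longer matters afterward; the example values are ours:

import numpy as np

# Each feature (column) is rescaled to mean 0 and standard deviation 1.
def z_normalize(X):
    return (X - X.mean(axis=0)) / X.std(axis=0)

X_cm_dollars = np.array([[170.0, 20.0], [180.0, 25.0], [160.0, 22.0]])
X_mm_dollars = X_cm_dollars * [10.0, 1.0]   # same data, X axis in millimeters
print(np.allclose(z_normalize(X_cm_dollars), z_normalize(X_mm_dollars)))  # True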
We can speed up the nearest neighbor algorithm by "throwing away" some data. This is called data editing.
Note that this can sometimes improve accuracy!
One possible approach: delete all instances that are surrounded by members of their own class.
We can also speed up classification with indexing.
Up to now we have assumed that the nearest neighbor algorithm uses the Euclidean Distance; however, this need not be the case…

D(Q,C) = \sqrt{ \sum_{i=1}^{n} (q_i - c_i)^2 }    (Euclidean)

D(Q,C) = \left( \sum_{i=1}^{n} |q_i - c_i|^p \right)^{1/p}    (Minkowski)

• Max (p = inf)
• Manhattan (p = 1)
• Weighted Euclidean
• Mahalanobis
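Minimal numpy sketches of these distance functions (the function names are ours; Chebyshev is the standard name for the p = infinity limit):

import numpy as np

def euclidean(q, c):
    return np.sqrt(np.sum((q - c) ** 2))

def minkowski(q, c, p):
    return np.sum(np.abs(q - c) ** p) ** (1.0 / p)

def chebyshev(q, c):                  # the p = infinity limit ("Max")
    return np.max(np.abs(q - c))

q, c = np.array([5.1, 7.0]), np.array([6.1, 6.6])
print(euclidean(q, c), minkowski(q, c, 1), chebyshev(q, c))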
…In fact, we can use the nearest neighbor algorithm with any distance/similarity function.
For example, is "Faloutsos" Greek or Irish? We could compare the name "Faloutsos" to a database of names using string edit distance…
edit_distance(Faloutsos, Keogh) = 8
edit_distance(Faloutsos, Gunopulos) = 6
Hopefully, the similarity of the name (particularly the suffix) to other Greek names would mean the nearest neighbor is also a Greek name.

ID | Name | Class
1 | Gunopulos | Greek
2 | Papadopoulos | Greek
3 | Kollios | Greek
4 | Dardanos | Greek
5 | Keogh | Irish
6 | Gough | Irish
7 | Greenhaugh | Irish
8 | Hadleigh | Irish

Specialized distance measures exist for DNA strings, time series, images, graphs, videos, sets, fingerprints etc…
Edit Distance Example
It is possible to transform any string Q into string C, using only Substitution, Insertion and Deletion.
Assume that each of these operators has a cost associated with it.
The similarity between two strings can be defined as the cost of the cheapest transformation from Q to C.
Note that for now we have ignored the issue of how we can find this cheapest transformation.

How similar are the names "Peter" and "Piotr"?
Assume the following cost function:
Substitution: 1 Unit
Insertion: 1 Unit
Deletion: 1 Unit
D(Peter, Piotr) is 3

Peter
  Substitution (i for e): Piter
  Insertion (o): Pioter
  Deletion (e): Piotr
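The cheapest transformation can be found with the classic dynamic-programming recurrence; a minimal Python sketch with unit costs (the function name is ours):

# Edit distance with unit costs for substitution, insertion and deletion.
def edit_distance(q, c):
    m, n = len(q), len(c)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                   # delete everything
    for j in range(n + 1):
        d[0][j] = j                   # insert everything
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if q[i - 1] == c[j - 1] else 1       # substitution cost
            d[i][j] = min(d[i - 1][j] + 1,                # deletion
                          d[i][j - 1] + 1,                # insertion
                          d[i - 1][j - 1] + cost)         # match/substitute
    return d[m][n]

print(edit_distance("Peter", "Piotr"))   # 3, as claimed above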
Setting Parameters and Overfitting
You need to classify widgets, you get a training set..
Model Selection:
• You could use a Linear Classifier or Nearest Neighbor …
Parameter Selection (or parameter tuning, tweaking):
• Nearest Neighbor
  • You could use 1NN, 3NN, 5NN…
  • You could use Euclidean Distance, LP1, LPinf, Mahalanobis…
  • You could do some data editing…
  • You could do some feature weighting…
  • You could ….
• "Linear Classifier"
  • You could use a Constant classifier
  • You could use a Linear Classifier
  • You could use a Quadratic Classifier
  • You could….
Overfitting
Overfitting occurs when a statistical
model describes random error or noise instead of the
underlying relationship.
Overfitting generally occurs when a model is
excessively complex, such as having too many
parameters relative to the number of observations.
A model which has been overfit will generally have
poor predictive performance, as it can exaggerate
minor fluctuations in the data.
Suppose we need to solve a classification problem.
We are not sure if we should use the..
• Simple linear classifier
or the
• Simple quadratic classifier
How do we decide which to use? We do cross validation or leave-one-out and choose the best one.
• Simple linear classifier gets 81% accuracy
• Simple quadratic classifier gets 99% accuracy
(Two plots of the same dataset, with the linear and quadratic decision boundaries.)
• Simple linear classifier gets 96% accuracy
• Simple quadratic classifier gets 97% accuracy
This problem is greatly exacerbated by having too little data
• Simple linear classifier gets 90% accuracy
• Simple quadratic classifier gets 95% accuracy
What happens as we have more and more training examples?
The accuracy for all models goes up!
The chance of making a mistake (choosing the wrong model) goes down.
Even if we make a mistake, it will not matter too much (because we would learn a degenerate quadratic that is basically a straight line).
• Simple linear 70% accuracy; Simple quadratic 90% accuracy
• Simple linear 90% accuracy; Simple quadratic 95% accuracy
• Simple linear 99.999999% accuracy; Simple quadratic 99.999999% accuracy
One Solution: Charge Penalty for complex models
• For example, for the simple {polynomial} classifier, we could "charge" 1% for every increase in the degree of the polynomial
• Simple linear classifier gets 90.5% accuracy, minus 0%, equals 90.5%
• Simple quadratic classifier gets 97.0% accuracy, minus 1%, equals 96.0%
• Simple cubic classifier gets 97.05% accuracy, minus 2%, equals 95.05%
(Three decision boundaries of increasing degree: Accuracy = 90.5%, Accuracy = 97.0%, Accuracy = 97.05%.)
One Solution: Charge Penalty for complex models
• For example, for the simple {polynomial} classifier, we could charge 1% for every
increase in the degree of the polynomial.
• There are more principled ways to charge penalties
• In particular, there is a technique called Minimum
Description Length (MDL)
Appendix
Types of Attributes
• There are different types of attributes
– Nominal (includes Boolean)
• Examples: ID numbers, eye color, zip codes, sex
– Ordinal
• Examples: rankings (e.g., taste of potato chips on a scale
from 1-10), grades, height in {tall, medium, short}
– Interval
• Examples: calendar dates, temperatures in Celsius or
Fahrenheit.
– Ratio
• Examples: temperature in Kelvin, length, time, counts
Properties of Attribute Values
• The type of an attribute depends on which of the following properties it possesses:
  – Distinctness: =, ≠
  – Order: <, >
  – Addition: +, -
  – Multiplication: *, /
  – Nominal attribute: distinctness
  – Ordinal attribute: distinctness & order
  – Interval attribute: distinctness, order & addition
  – Ratio attribute: all 4 properties
Properties of Attribute Values
– Nominal attribute: distinctness
– We can say
  • Jewish = Jewish
  • Catholic ≠ Muslim
– We cannot say
  • Jewish < Buddhist (even though, with the key below, 2 < 3)
  • (Jewish + Muslim)/2
  • Sqrt(Atheist) (even though Sqrt(1) is 1)

Key: Atheist: 1, Jewish: 2, Buddhist: 3

Name | Religion | Age
Joe | 1 | 12
Sue | 2 | 61
Cat | 1 | 34
Bob | 3 | 65
Tim | 1 | 54
Jin |   | 44
Properties of Attribute Values
– Ordinal attribute: distinctness & order
– We can say {newborn, infant, toddler, child, teen, adult}
  • infant = infant
  • newborn < toddler
– We cannot say
  • newborn + child
  • infant / newborn
  • log(child)

Key: newborn: 1, infant: 2, toddler: 3, etc.

Name | Lifestage | Age
Joe | 1 | 12
Sue | 2 | 61
Properties of Attribute Values
• There are a handful of tricky cases….
– Ordinal attribute: distinctness & order
– If we have {Sunday, Monday, Tuesday, Wednesday, Thursday, Friday, Saturday}
– Then we can clearly say
• Sunday = Sunday
• Sunday != Tuesday
– But can we say Sunday < Tuesday?
– A similar problem occurs with degree of an angle…
Properties of Attribute Values
– Interval attribute: distinctness, order & addition
– Suppose it is 10 degrees Celsius
– We can say it is not 11 degrees Celsius
  • 10 ≠ 11
– We can say it is colder than 15 degrees Celsius
  • 10 < 15
– We can say closing a window will make it two degrees hotter
  • NewTemp = 10 + 2
– We cannot say that it is twice as hot as 5 degrees Celsius
  • 10 / 2 = 5   No!
Properties of Attribute Values
• The type of an attribute depends on which of the following properties it possesses:
  – Ratio attribute: all 4 properties
  – We can do anything!
    • So 10 Kelvin really is twice as hot as 5 Kelvin
  – Of course, distinctness is tricky to define with real numbers.
    • Is 3.1415926535897 = 3.141592653589?
Attribute Type | Description | Examples | Operations
Nominal  | The values of a nominal attribute are just different names, i.e., nominal attributes provide only enough information to distinguish one object from another. (=, ≠) | zip codes, employee ID numbers, eye color, sex: {male, female} | mode, entropy, contingency correlation, χ² test
Ordinal  | The values of an ordinal attribute provide enough information to order objects. (<, >) | hardness of minerals, {good, better, best}, grades, street numbers | median, percentiles, rank correlation, run tests, sign tests
Interval | For interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists. (+, -) | calendar dates, temperature in Celsius or Fahrenheit | mean, standard deviation, Pearson's correlation, t and F tests
Ratio    | For ratio variables, both differences and ratios are meaningful. (*, /) | temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current | geometric mean, harmonic mean, percent variation
Attribute Level | Transformation | Comments
Nominal  | Any permutation of values | If all employee ID numbers were reassigned, would it make any difference?
Ordinal  | An order preserving change of values, i.e., new_value = f(old_value), where f is a monotonic function. | An attribute encompassing the notion of good, better, best can be represented equally well by the values {1, 2, 3} or by {0.5, 1, 10}, or by {A, B, C}
Interval | new_value = a * old_value + b, where a and b are constants | Thus, the Fahrenheit and Celsius temperature scales differ in terms of where their zero value is and the size of a unit (degree).
Ratio    | new_value = a * old_value | Length can be measured in meters or feet.
Discrete and Continuous Attributes
• Discrete Attribute
– Has only a finite or countably infinite set of values
– Examples: zip codes, counts, or the set of words in a collection of
documents
– Often represented as integer variables.
– Note: binary attributes are a special case of discrete attributes
• Continuous Attribute
– Has real numbers as attribute values
– Examples: temperature, height, or weight.
– As a practical matter, real values can only be measured and represented
using a finite number of digits.
– Continuous attributes are typically represented as floating-point
variables.
Discrete and Continuous Attributes
• We can convert between Continuous and Discrete variables.
  – For example, below we have converted real-valued heights to ordinal {short, medium, tall}
• Conversions of Discrete to Continuous are less common, but possible.
• Why convert? Sometimes the algorithms we want to use are only defined for a certain type of data. For example, hashing or Bloom filters are best defined for Discrete data.
• Conversion may involve making choices, for example, how many "heights", where do we place the cutoffs (equal width, equal bin counts etc.). These choices may affect the performance of the algorithms.

6'3'' → 3, 5'1'' → 1, 5'7'' → 2, 5'3'' → 1, where {short, medium, tall} = {1, 2, 3}
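A minimal numpy sketch of such a discretization; the cutoff values are our arbitrary choice, exactly the kind of decision the bullet above warns about:

import numpy as np

# Heights (in inches) cut into ordinal {short, medium, tall} = {1, 2, 3}.
heights_in = np.array([75, 61, 67, 63])        # 6'3'', 5'1'', 5'7'', 5'3''
cutoffs = [66, 72]                             # our (arbitrary) bin edges
print(np.digitize(heights_in, cutoffs) + 1)    # [3 1 2 1], matching the slide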
Discrete and Continuous Attributes
• We can convert between Discrete and Continuous variables.
  – For example, below we have converted discrete words to a real-valued time series

("In the beginning God created the heaven and the earth. And the earth was without form, and void; and darkness was upon the face of the deep. And the Spirit of God moved upon the face of the waters. And God said :: … :: With you all. Amen.")

There are 783,137 words in the King James Bible.
There are 12,143 unique words in the King James Bible.
(Plot: local frequency of "God" in the King James Bible, over word positions 0 to 8 × 10^5.)
Even if the data is given to you as continuous, it might really be intrinsically discrete

partID | size | Ad
12323 | 7.801 | 12
 5324 | 7.802 | 61
75654 | 32.09 | 34
34523 | 32.13 | 65
  424 | 47.94 | 54
25324 | 62.07 | 44
Even if the data is given to you as continuous, it might really be intrinsically discrete
(Plot: a swimming time series, 0 to 20,000 data points, segmented into repeated "push off", "glide", "stroke" phases.)
Bing Hu, Thanawin Rakthanmanon, Yuan Hao, Scott Evans, Stefano Lonardi, and Eamonn Keogh (2011). Discovering the Intrinsic Cardinality and Dimensionality of Time Series using MDL. ICDM 2011.
Data can be Missing
• Data can be missing for many reasons.
  – Someone may decline-to-state
  – The attribute may be the result of an expensive test
  – Sensor failure
  – etc
Handling missing values
• Eliminate Data Objects
• Estimate Missing Values
• Ignore the Missing Value
Data can be "Missing": Special Case
• In some cases we expect most of the data to be missing.
• Consider a dataset containing people's rankings of movies (or books, or music etc)
• The dimensionality is very high; there are lots of movies
• However, most people have only seen a tiny fraction of these
• So the movie ranking database will be very sparse.
• Some platforms/languages explicitly support sparse matrices (including Matlab)
• Here, inferring a missing value is equivalent to asking a question like "How much would Joe like the movie MASH?" See "Collaborative filtering" / "Recommender Systems"

(A sparse ratings matrix: rows Joe, May, Van, Sue, Ted; columns Jaws, E.T., MASH, Argo, Brave, OZ, Bait; most cells are empty. For example, Joe rated Jaws 4 and E.T. 1.)
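A minimal sparse-matrix sketch in Python using scipy (assumed available); only the observed ratings are stored, not the vast majority of missing cells. The row/column indices below are illustrative viewer/movie positions, not values taken from the slide:

from scipy.sparse import csr_matrix

rows = [0, 0, 1, 2]          # Joe, Joe, May, Van
cols = [0, 1, 3, 2]          # Jaws, E.T., Argo, MASH
vals = [4, 1, 3, 4]          # the observed ratings
ratings = csr_matrix((vals, (rows, cols)), shape=(5, 7))
print(ratings.nnz, "ratings stored out of", ratings.shape[0] * ratings.shape[1], "cells")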
Document Data is also Sparse
• Each document is a 'term' vector (vector-space representation)
  – each term is a component (attribute) of the vector,
  – the value of each component is the number of times the corresponding term occurs in the document.

(A document-term matrix: rows Doc1-Doc5, columns the, harry, rebel, god, cat, dog, help, near. For example, "the" occurs 42, 22, 32, 29 and 9 times in Doc1 through Doc5; most of the remaining cells are 0 or small counts.)
Graph Data is also Typically Sparse
• The elements of the matrix indicate whether pairs of vertices are connected or not in the graph.
Not all datasets naturally fit neatly into a rectangular matrix…
We may have to deal with such data as special cases.
DNA Data
First 100 base pairs of the chimp’s mitochondrial DNA:
gtttatgtagcttaccccctcaaagcaatacactgaaaatgtttcgacgggtttacatcaccccataaacaaacaggtttggtcctagcctttctattag
First 100 base pairs of the human’s mitochondrial DNA:
gatcacaggtctatcaccctattaaccactcacgggagctctccatgcatttggtattttcgtctggggggtgtgcacgcgatagcattgcgagacgctg
Transaction Data
TID | Items
1 | Bread, Coke, Milk
2 | Beer, Bread
3 | Beer, Coke, Diaper, Milk
4 | Beer, Bread, Diaper, Milk
5 | Coke, Diaper, Milk

Spatio-Temporal Data
Data Quality
• What kinds of data quality problems?
• How can we detect problems with the data?
• What can we do about these problems?
• Examples of data quality problems:
  – Redundancy
  – Noise and outliers
  – Missing values
  – Duplicate data
Redundancy
• Various subsets of the features are often related or correlated in some way. They are partly redundant with each other.
• For problems in signal processing, this redundancy is typically helpful. But for data mining, redundancy almost always hurts.

   | Height F/I | Height Meters | Weight
1 | 4'10'' | 1.47 | 166
2 | 6'3''  | 1.90 | 210
3 | 5'11'' | 1.80 | 215
4 | 5'4''  | 1.62 | 150
Why Redundancy Hurts
• Some data mining algorithms scale poorly in dimensionality, say O(2^D). For the problem below, this means we take O(2^3) time, when we really only needed O(2^2) time.
• We can see some data mining algorithms as counting evidence across a row (Nearest Neighbor Classifier, Bayes Classifier etc). If we have redundant features, we will "overcount" evidence.
• It is probable that the redundant features will add errors.
  • For example, suppose that person 1 really is exactly 4'10''. Then they are exactly 1.4732m, but the system recorded them as 1.47m. So we have introduced 0.0032m of error. This is a tiny amount, but if we had 100s of such attributes, we would be introducing a lot of error.
• The curse of dimensionality (discussed later in the quarter)
As we will see in the course, we can try to fix this issue with data aggregation, dimensionality reduction techniques, feature selection, feature generation etc.
(The height/weight table from the previous slide is shown again.)
Detecting Redundancy
• By creating a scatterplot of "Height F/I" vs. "Height Meters" we can see the redundancy, and measure it with correlation.
• However, if we have 100 features, we clearly cannot visually check 100^2 scatterplots.
• Note that two features can have zero correlation, but still be related/redundant. There are more sophisticated tests of "relatedness".
(The height/weight table is shown again, with its scatterplot.)
Noise
• Noise refers to modification of original values
  – Example: distortion of a person's voice when talking on a poor quality phone.
(Two ECG traces from one man, taken about an hour apart. The differences are mostly due to sensor noise. MIT-BIH Atrial Fibrillation Database, record afdb/08405.)