NEAREST NEIGHBORS
Class 16
CSC 600: Data Mining
Today…
• Measures of Similarity
• Distance Measures
• Nearest Neighbors
Similarity and Dissimilarity Measures
• Used by a number of data mining techniques:
  • Nearest neighbors
  • Clustering
  • …
• How to measure "proximity"?
• Proximity: similarity or dissimilarity between two objects
  • Similarity: numerical measure of the degree to which two objects are alike
    • Usually in range [0,1]: 0 = no similarity, 1 = complete similarity
  • Dissimilarity: measure of the degree to which two objects are different
Feature Space
• Abstract n-dimensional space
• Each instance is plotted in the feature space
• One axis for each descriptive feature
  • Difficult to visualize when # of features > 3
• As the differences between the values of the descriptive features grow, so too does the distance between the points in the feature space that represent these instances.
Distance Metric
• Distance Metric: a function that returns the distance between two instances a and b.
• Must, mathematically, satisfy the following criteria:
  1. Non-negativity: $\mathrm{metric}(a, b) \ge 0$
  2. Identity: $\mathrm{metric}(a, b) = 0 \iff a = b$
  3. Symmetry: $\mathrm{metric}(a, b) = \mathrm{metric}(b, a)$
  4. Triangle Inequality: $\mathrm{metric}(a, c) \le \mathrm{metric}(a, b) + \mathrm{metric}(b, c)$
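As an illustration (not from the slides), here is a minimal Python sketch that spot-checks the four criteria for the Euclidean distance; the function name euclidean and the sample points are assumptions made for the example.

import math

def euclidean(a, b):
    # Euclidean distance between two equal-length numeric vectors.
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

a, b, c = (0.0, 2.0), (2.0, 0.0), (3.0, 1.0)   # made-up sample points

assert euclidean(a, b) >= 0                                   # 1. non-negativity
assert euclidean(a, a) == 0                                   # 2. identity
assert euclidean(a, b) == euclidean(b, a)                     # 3. symmetry
assert euclidean(a, c) <= euclidean(a, b) + euclidean(b, c)   # 4. triangle inequality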
Dissimilarities between Data Objects
• A common measure for the proximity between two objects is the Euclidean Distance:
  $d(x, y) = \sqrt{\sum_{k=1}^{n} (x_k - y_k)^2}$
  • x and y are two data objects, n dimensions
• In high school, we typically used this for calculating the distance between two objects, when there were only two dimensions.
• Defined for one dimension, two dimensions, three dimensions, …, any n-dimensional space

Example
$d(id_{12}, id_{5}) = \sqrt{(5.00 - 2.75)^2 + (2.50 - 7.50)^2} = \sqrt{30.0625} = 5.4829$
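A minimal Python sketch (not part of the original slides) that reproduces the worked example above; the two feature vectors are read directly off the numbers in the example.

import math

def euclidean(x, y):
    # Euclidean distance over n dimensions.
    return math.sqrt(sum((xk - yk) ** 2 for xk, yk in zip(x, y)))

id12 = (5.00, 2.50)   # feature values taken from the example above
id5 = (2.75, 7.50)

print(round(euclidean(id12, id5), 4))   # 5.4829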
Dissimilarities between Data Objects
• Typically, the Euclidean Distance is used as a first choice when applying Nearest Neighbors and Clustering:
  $d(x, y) = \sqrt{\sum_{k=1}^{n} (x_k - y_k)^2}$
  • x and y are two data objects, n dimensions
• Other distance metrics:
  • Generalized by the Minkowski distance metric:
    $d(x, y) = \left( \sum_{k=1}^{n} |x_k - y_k|^r \right)^{1/r}$
    • r is a parameter
Dissimilarities between Data Objects
• Minkowski Distance Metric:
  $d(x, y) = \left( \sum_{k=1}^{n} |x_k - y_k|^r \right)^{1/r}$
  • r = 1:
    • L1 norm
    • "Manhattan", "taxicab" Distance
  • r = 2:
    • Euclidean Distance
    • L2 norm
• The larger the value of r, the more emphasis is placed on the features with large differences in values, because these differences are raised to the power of r.
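To make the role of r concrete, here is a short sketch (not from the slides) of the Minkowski distance written directly from the formula; it reuses the pair of instances from the earlier Euclidean example, and the function name minkowski is an assumption.

def minkowski(x, y, r):
    # Minkowski distance: r = 1 gives Manhattan (L1), r = 2 gives Euclidean (L2).
    return sum(abs(xk - yk) ** r for xk, yk in zip(x, y)) ** (1 / r)

x, y = (5.00, 2.50), (2.75, 7.50)       # same pair as the earlier example
print(minkowski(x, y, r=1))             # L1: |5.00 - 2.75| + |2.50 - 7.50| = 7.25
print(round(minkowski(x, y, r=2), 4))   # L2: 5.4829, matching the Euclidean result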
Distance Matrix
• Once a distance metric is chosen, the proximity between all of the objects in the dataset can be computed
• Can be represented in a distance matrix
  • Pairwise distances between points
Distance Matrix
[Figure: four points p1–p4 plotted in a two-dimensional feature space; x-axis 0–6, y-axis 0–3.]

L1 norm distance ("Manhattan" Distance):

L1    p1    p2    p3    p4
p1    0     4     4     6
p2    4     0     2     4
p3    4     2     0     2
p4    6     4     2     0

L2 norm distance (Euclidean Distance):

L2    p1     p2     p3     p4
p1    0      2.828  3.162  5.099
p2    2.828  0      1.414  3.162
p3    3.162  1.414  0      2
p4    5.099  3.162  2      0
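A short sketch (not from the slides) that reproduces the two matrices above with scipy; the coordinates of p1–p4 are assumptions read off the scatter plot (p1 = (0, 2), p2 = (2, 0), p3 = (3, 1), p4 = (5, 1)), chosen because they reproduce the values shown.

import numpy as np
from scipy.spatial.distance import cdist

# Assumed coordinates of the four points, inferred from the plot.
points = np.array([[0, 2], [2, 0], [3, 1], [5, 1]], dtype=float)

print(cdist(points, points, metric="cityblock"))               # L1 / Manhattan matrix
print(np.round(cdist(points, points, metric="euclidean"), 3))  # L2 / Euclidean matrix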
Using Weights
• So far, all attributes treated equally when computing proximity
• In some situations: some features are more important than others
  • Decision up to the analyst
• Modified Minkowski distance definition to include weights:
  $d(x, y) = \left( \sum_{k=1}^{n} w_k \, |x_k - y_k|^r \right)^{1/r}$
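A minimal sketch (not from the slides) of the weighted version; the weight values are made up to show how a larger w_k emphasizes feature k.

def weighted_minkowski(x, y, w, r):
    # Weighted Minkowski distance: w_k scales the contribution of feature k.
    return sum(wk * abs(xk - yk) ** r for wk, xk, yk in zip(w, x, y)) ** (1 / r)

x, y = (5.00, 2.50), (2.75, 7.50)
w = (2.0, 1.0)   # hypothetical weights: the first feature counts double
print(round(weighted_minkowski(x, y, w, r=2), 4))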
Eager Learner Models
• So far in this course, we've performed prediction by:
  1. Downloading or constructing a dataset
  2. Learning a model
  3. Using the model to classify/predict test instances
• Sometimes called eager learners:
  • Designed to learn a model that maps the input attributes to the class label as soon as training data becomes available.
Lazy Learner Models
• Opposite strategy:
  • Delay the process of modeling the training data until it is necessary to classify/predict a test instance.
  • Example: Nearest neighbors
Nearest Neighbors
• k-nearest neighbors
  • k = parameter, chosen by analyst
  • For a given test instance, use the k "closest" points (nearest neighbors) for performing classification
  • "closest" points: defined by some proximity metric, such as Euclidean Distance
Unknown record
• Requires three things:
  1. The set of stored records
  2. Distance Metric to compute distance between records
  3. The value of k, the number of nearest neighbors to retrieve
• To classify an unknown record:
  1. Compute distance to the other training records
  2. Identify the k nearest neighbors
  3. Use class labels of nearest neighbors to determine the class label of the unknown record (e.g., by taking a majority vote; see the sketch below)
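A minimal Python sketch (not from the slides) of the three classification steps above: compute distances, identify the k nearest training records, and take a majority vote. The tiny training set, the query, and the function names are made up for illustration.

import math
from collections import Counter

def euclidean(x, y):
    return math.sqrt(sum((xk - yk) ** 2 for xk, yk in zip(x, y)))

def knn_classify(train, query, k):
    # train is a list of (feature_vector, label) pairs.
    neighbors = sorted(train, key=lambda rec: euclidean(rec[0], query))[:k]   # k closest records
    votes = Counter(label for _, label in neighbors)                          # majority vote
    return votes.most_common(1)[0][0]

train = [((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((5.0, 5.0), "B"), ((6.0, 5.5), "B")]
print(knn_classify(train, query=(1.2, 1.5), k=3))   # -> "A"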
Definition of Nearest Neighbor
• The k-nearest neighbors of a given example x are the k points that are closest to x.
• Classification changes depending on the chosen k
• Majority Voting
[Figure: (a) 1-nearest neighbor; (b) 2-nearest neighbor; (c) 3-nearest neighbor, each around a query point marked X.]
• Tie Scenario:
  • Randomly choose classification?
  • For binary problems, usually an odd k is used to avoid ties.
Voronoi Tessellations and Decision Boundaries
• When k-NN is searching for the nearest neighbor, it is partitioning the abstract feature space into a Voronoi tessellation
  • Each region belongs to an instance
  • Contains all the points in the space whose distance to that instance is less than the distance to any other instance
• Decision Boundary: the boundary between regions of the feature space in which different target levels will be predicted.
• Generate the decision boundary by aggregating the neighboring regions that make the same prediction.
[Figure: (a) the Voronoi tessellation of the feature space for the dataset, with the position of the query represented by the ? marker; (b) the decision boundary (k = 1) created by aggregating the neighboring Voronoi regions that belong to the same target level.]
• One of the great things about nearest neighbor algorithms is that we can add new data to update the model very easily.
What’s up with the top-right instance?
• Is it noise?
• The decision boundary is likely not ideal because of id13.
• k-NN is a set of local models, each defined by a single instance
  • Sensitive to noise!
• How to mitigate noise?
  • Choose a higher value of k.
• Which is the ideal value of k?
Different Values of k
[Figures: the decision boundary using majority classification of the nearest neighbors for k = 1, k = 3, k = 5, and k = 15; k = 3 gives the ideal decision boundary.]
• Setting k to a high value is riskier with an imbalanced dataset.
  • Majority target level begins to dominate the feature space.
• Choose k by running evaluation experiments on a training or validation set (see the sketch below).
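One way to run those evaluation experiments, sketched below with scikit-learn (not part of the slides): cross-validate a grid of candidate k values and keep the best. The synthetic dataset and the candidate values are arbitrary choices for illustration.

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Made-up dataset standing in for the training data.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [1, 3, 5, 7, 9, 15]},   # candidate k values
    cv=5,                                              # 5-fold cross-validation
)
search.fit(X, y)
print(search.best_params_)   # the k with the best validation accuracy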
Choosing the right k
• If k is too small, sensitive to noise points in the training data
  • Susceptible to overfitting
• If k is too large, the neighborhood may include points from other classes
  • Susceptible to misclassification
Computational Issues?
• Computation can be costly if the number of training examples is large.
• Efficient indexing techniques are available to reduce the amount of computation needed when finding the nearest neighbors of a test example
  • Sorting training instances?
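As one example of such an indexing technique (a sketch, not from the slides), a k-d tree can be built once over the training data and then queried per test instance; the random data is just for illustration.

import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.default_rng(0)
X_train = rng.random((1000, 3))   # 1000 made-up training instances, 3 features
query = rng.random((1, 3))        # one made-up test instance

tree = KDTree(X_train)                # build the index once
dist, ind = tree.query(query, k=5)    # distances and indices of the 5 nearest neighbors
print(ind[0], dist[0])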
Majority Voting
• Every neighbor has the same impact on the classification:
  • Majority Voting: $y' = \arg\max_{v} \sum_{(x_i, y_i) \in D_z} I(v = y_i)$
• Distance-weighted:
  • Far away neighbors have a weaker impact on the classification.
  • Distance-Weighted Voting: $y' = \arg\max_{v} \sum_{(x_i, y_i) \in D_z} w_i \times I(v = y_i)$
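A sketch of distance-weighted voting (not from the slides): closer neighbors contribute more to the vote. The weight w_i = 1 / d(query, x_i)^2 is one common choice assumed here, and the tiny dataset is made up.

import math
from collections import defaultdict

def euclidean(x, y):
    return math.sqrt(sum((xk - yk) ** 2 for xk, yk in zip(x, y)))

def weighted_knn_classify(train, query, k):
    neighbors = sorted(train, key=lambda rec: euclidean(rec[0], query))[:k]
    scores = defaultdict(float)
    for features, label in neighbors:
        d = euclidean(features, query)
        scores[label] += 1.0 / (d ** 2 + 1e-12)   # small constant guards against d == 0
    return max(scores, key=scores.get)

train = [((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((5.0, 5.0), "B"), ((6.0, 5.5), "B")]
print(weighted_knn_classify(train, query=(4.5, 4.5), k=3))   # -> "B"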
Scaling Issues
• Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes
• Example, with three dimensions:
  • height of a person may vary from 1.5m to 1.8m
  • weight of a person may vary from 90lb to 300lb
  • income of a person may vary from $10K to $1M
  • Income will dominate if these variables aren't standardized.
Standardization
• Treat all features "equally" so that one feature doesn't dominate the others
• Common treatments applied to all variables:
  • Standardize each variable:
    • Mean = 0
    • Standard Deviation = 1
  • Normalize each variable:
    • Max = 1
    • Min = 0
• Query / test instance to classify (see the sketch below):
  • Salary = 56000
  • Age = 35
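A sketch of both treatments with scikit-learn (not from the slides); the small salary/age training table is made up, and the query instance is the one shown above (Salary = 56000, Age = 35).

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up training data: columns are salary and age.
train = np.array([[45000, 28],
                  [52000, 41],
                  [78000, 35],
                  [61000, 50]], dtype=float)
query = np.array([[56000, 35]], dtype=float)   # the query instance from the slide

std = StandardScaler().fit(train)   # standardize: mean 0, standard deviation 1
mm = MinMaxScaler().fit(train)      # normalize: min 0, max 1

print(std.transform(query))   # query on the standardized scale
print(mm.transform(query))    # query on the [0, 1] scale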
Final Thoughts on Nearest Neighbors
• Nearest-neighbors classification is part of a more general technique called instance-based learning
  • Use specific instances for prediction, rather than a model
• Nearest-neighbors is a lazy learner
  • Performing the classification can be relatively computationally expensive
  • No model is learned up-front
Classifier Comparison
• Eager Learners
  • Decision Trees, SVMs
  • Model Building: potentially slow
  • Classifying Test Instance: fast
• Lazy Learners
  • Nearest Neighbors
  • Model Building: fast (because there is none!)
  • Classifying Test Instance: slow
Classifier Comparison
• Eager Learners
  • Decision Trees, SVMs
  • Finding a global model that fits the entire input space
• Lazy Learners
  • Nearest Neighbors
  • Classification decisions are made locally (small k values), and are more susceptible to noise
References
• Fundamentals of Machine Learning for Predictive Data Analytics, 1st edition, Kelleher et al.
• Data Science from Scratch, 1st edition, Grus
• Data Mining and Business Analytics with R, 1st edition, Ledolter
• An Introduction to Statistical Learning, 1st edition, James et al.
• Discovering Knowledge in Data, 2nd edition, Larose et al.
• Introduction to Data Mining, 1st edition, Tan et al.