Fast Similarity Metric Based Data Mining
Techniques Using P-trees:
k-Nearest Neighbor Classification

• Distance metric based computation using P-trees
• A new distance metric, called HOBbit distance
• Some useful properties of P-trees
• A new P-tree nearest neighbor classification method, called Closed-KNN
Data Mining
Data mining: extracting knowledge from a large amount of data.
(Figure: information pyramid with raw data at the base, data mining in the middle, and useful information (sometimes 1 bit: Y/N) at the top; more data volume = less information.)
Functionalities: feature selection, association rule mining, classification & prediction, cluster analysis, outlier analysis.
Classification
Predicting the class of a data object; also called supervised learning.
Training data: class labels are known and supervise the learning.

  Feature1   Feature2   Feature3   Class
  a1         b1         c1         A
  a2         b2         c2         A
  a3         b3         c3         B

Sample with unknown class: <a, b, c>  →  Classifier  →  predicted class of the sample

Eager classifier: builds a classifier model in advance,
  e.g. decision tree induction, neural network.
Lazy classifier: uses the raw training data directly,
  e.g. k-nearest neighbor.
Clustering (unsupervised learning – chapter 8)
The process of grouping objects into classes, with the objective that the data objects are
• similar to the objects in the same cluster
• dissimilar to the objects in the other clusters.
(Figure: a two-dimensional space showing 3 clusters.)
Clustering is often called unsupervised learning or unsupervised classification:
the class labels of the data objects are unknown.
Distance Metric (used in both classification and clustering)
Measures the dissimilarity between two data points.
A metric is a function, d, of two n-dimensional points X and Y, such that:
  d(X, Y) is positive definite:  if X ≠ Y then d(X, Y) > 0;  if X = Y then d(X, Y) = 0
  d(X, Y) is symmetric:  d(X, Y) = d(Y, X)
  d(X, Y) satisfies the triangle inequality:  d(X, Y) + d(Y, Z) ≥ d(X, Z)
Various Distance Metrics
Minkowski distance, or Lp distance:
  d_p(X, Y) = ( Σ_{i=1..n} |x_i - y_i|^p )^{1/p}

Manhattan distance (p = 1):
  d_1(X, Y) = Σ_{i=1..n} |x_i - y_i|

Euclidean distance (p = 2):
  d_2(X, Y) = ( Σ_{i=1..n} (x_i - y_i)^2 )^{1/2}

Max distance (p = ∞):
  d_∞(X, Y) = max_{i=1..n} |x_i - y_i|
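These formulas translate directly into code. The following Python sketch (ours, not part of the original slides; the function names are illustrative) computes the Manhattan, Euclidean, max, and general Lp distances, and reproduces the worked example that follows:

```python
# Sketch of the Lp (Minkowski) family of distances; helper names are ours.

def minkowski(x, y, p):
    """d_p(X, Y) = (sum_i |x_i - y_i|^p)^(1/p), for finite p >= 1."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def manhattan(x, y):                      # p = 1
    return sum(abs(a - b) for a, b in zip(x, y))

def euclidean(x, y):                      # p = 2
    return minkowski(x, y, 2)

def max_distance(x, y):                   # p = infinity
    return max(abs(a - b) for a, b in zip(x, y))

X, Y = (2, 1), (6, 4)
print(manhattan(X, Y), euclidean(X, Y), max_distance(X, Y))   # 7 5.0 4
```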
An Example
A two-dimensional space with X = (2, 1) and Y = (6, 4); Z = (6, 1) is the corner of the right triangle, so XZ = 4 and ZY = 3:
  Manhattan:  d_1(X, Y) = XZ + ZY = 4 + 3 = 7
  Euclidean:  d_2(X, Y) = XY = 5
  Max:        d_∞(X, Y) = max(XZ, ZY) = XZ = 4
So d_1 ≥ d_2 ≥ d_∞, and in general, for any positive integer p, d_p ≥ d_{p+1}.
Some Other Distances
Canberra distance:
  d_c(X, Y) = Σ_{i=1..n} |x_i - y_i| / (x_i + y_i)

Squared cord distance:
  d_sc(X, Y) = Σ_{i=1..n} ( sqrt(x_i) - sqrt(y_i) )^2

Squared chi-squared distance:
  d_chi(X, Y) = Σ_{i=1..n} (x_i - y_i)^2 / (x_i + y_i)
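A similar sketch for these three distances (ours; it assumes non-negative coordinates, and terms with a zero denominator are skipped, which is a convention we add rather than one stated in the slides):

```python
from math import sqrt

def canberra(x, y):
    # d_c = sum_i |x_i - y_i| / (x_i + y_i); zero-denominator terms skipped
    return sum(abs(a - b) / (a + b) for a, b in zip(x, y) if a + b != 0)

def squared_cord(x, y):
    # d_sc = sum_i (sqrt(x_i) - sqrt(y_i))^2, coordinates assumed non-negative
    return sum((sqrt(a) - sqrt(b)) ** 2 for a, b in zip(x, y))

def squared_chi_squared(x, y):
    # d_chi = sum_i (x_i - y_i)^2 / (x_i + y_i); zero-denominator terms skipped
    return sum((a - b) ** 2 / (a + b) for a, b in zip(x, y) if a + b != 0)
```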
HOBbit Similarity
Higher Order Bit (HOBbit) similarity:
  HOBbitS(A, B) = max{ s : 0 ≤ s ≤ m and ∀i (1 ≤ i ≤ s ⇒ a_i = b_i) }
where A and B are two scalars (integers), a_i and b_i are the ith bits of A and B (counted from the left), and m is the number of bits.

  Bit position:  1 2 3 4 5 6 7 8
  x1:            0 1 1 0 1 0 0 1
  y1:            0 1 1 1 1 1 0 1      HOBbitS(x1, y1) = 3
  x2:            0 1 0 1 1 1 0 1
  y2:            0 1 0 1 0 0 0 0      HOBbitS(x2, y2) = 4
HOBbit Distance (related to Hamming distance)
The HOBbit distance between two scalar values A and B:
  d_v(A, B) = m - HOBbitS(A, B)
For the previous example (m = 8):
  d_v(x1, y1) = 8 - 3 = 5
  d_v(x2, y2) = 8 - 4 = 4
The HOBbit distance between two points X and Y:
  d_H(X, Y) = max_{i=1..n} d_v(x_i, y_i) = max_{i=1..n} ( m - HOBbitS(x_i, y_i) )
In our example (considering 2-dimensional data, with X = (x1, x2) and Y = (y1, y2)):
  d_H(X, Y) = max(5, 4) = 5
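A small Python sketch (ours) of HOBbitS, d_v, and d_H for m-bit integers; it reproduces the example values above by comparing the leading bits via XOR:

```python
def hobbit_similarity(a, b, m=8):
    """HOBbitS(A, B): number of identical most-significant bits of two m-bit integers."""
    return m - (a ^ b).bit_length()          # all bits above the highest differing bit match

def d_v(a, b, m=8):
    """HOBbit distance between two scalars: d_v(A, B) = m - HOBbitS(A, B)."""
    return m - hobbit_similarity(a, b, m)

def d_H(x, y, m=8):
    """HOBbit distance between two points: the maximum of the per-dimension distances."""
    return max(d_v(a, b, m) for a, b in zip(x, y))

x1, y1 = 0b01101001, 0b01111101              # HOBbitS = 3, d_v = 5
x2, y2 = 0b01011101, 0b01010000              # HOBbitS = 4, d_v = 4
print(hobbit_similarity(x1, y1), hobbit_similarity(x2, y2))    # 3 4
print(d_H((x1, x2), (y1, y2)))                                 # max(5, 4) = 5
```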
HOBbit Distance Is a Metric
HOBbit distance is positive definite:
  if X = Y, then d_H(X, Y) = 0;  if X ≠ Y, then d_H(X, Y) > 0
HOBbit distance is symmetric:
  d_H(X, Y) = d_H(Y, X)
HOBbit distance satisfies the triangle inequality:
  d_H(X, Y) + d_H(Y, Z) ≥ d_H(X, Z)
Neighborhood of a Point
The neighborhood of a target point T is a set of points S such that X ∈ S if and only if d(T, X) ≤ r.
If X is a point on the boundary, then d(T, X) = r.
(Figure: the neighborhoods of a target point T, each of width 2r, under the Manhattan, Euclidean, Max, and HOBbit metrics.)
Decision Boundary
The decision boundary between points A and B is the locus of points X satisfying d(A, X) = d(B, X); it separates the region R1 of points closer to A from the region R2 of points closer to B.
(Figure: decision boundaries between A and B under the Manhattan, Euclidean, and Max metrics, drawn for the cases where the line AB makes an angle greater than 45° and less than 45° with the axis.)
The decision boundary for the HOBbit distance is perpendicular to the axis that yields the maximum distance.
Minkowski Metrics
Lp-metrics (aka Minkowski metrics):
  d_p(X, Y) = ( Σ_{i=1..n} w_i |x_i - y_i|^p )^{1/p}      (weights w_i assumed = 1)

(Figure: unit disks and dividing lines for p = 1 (Manhattan), p = 2 (Euclidean), p = 3, 4, …, p = max (chessboard), and p = ½, ⅓, ¼, …)

Max distance: d_max(X, Y) ≡ max_i |x_i - y_i| ≡ d_∞(X, Y) ≡ lim_{p→∞} d_p(X, Y).
Proof (sketch): consider lim_{p→∞} ( Σ_{i=1..n} a_i^p )^{1/p} and let b ≡ max_i(a_i). For p large enough, the other terms a_i^p become negligible compared with b^p, since (a_i / b)^p → 0 whenever a_i < b. So
  Σ_{i=1..n} a_i^p ≈ k·b^p      (k = multiplicity of b in the sum),
hence ( Σ_{i=1..n} a_i^p )^{1/p} ≈ k^{1/p}·b → b, because k^{1/p} → 1.
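The limit can also be checked numerically; this brief sketch (ours) reproduces a few values from the spreadsheet tables that follow, for X = (0.5, 0.5) against the origin:

```python
def d_p(x, y, p):
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

X, Y = (0.5, 0.5), (0.0, 0.0)
for p in (2, 4, 9, 100):
    print(p, round(d_p(X, Y, p), 4))    # 0.7071, 0.5946, 0.54, 0.5035 -> approaches d_max = 0.5
```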
P>1 Minkowski Metrics
Spreadsheet examples of d_p(X, Y) for increasing p, with Y at the origin in every case. In each row d_p decreases toward the max (p = ∞) distance:

  X = (0.5, 0.5):        d_2 = 0.7071   d_4 = 0.5946   d_9 = 0.5400    d_100 = 0.5035    d_max = 0.5
  X ≈ (0.7071, 0.7071):  d_2 = 1        d_3 = 0.8909   d_7 = 0.7807    d_100 = 0.7120    d_max = 0.7071
  X = (0.99, 0.99):      d_2 = 1.4001   d_8 = 1.0796   d_100 = 0.9969  d_1000 = 0.9907   d_max = 0.99
  X = (1, 1):            d_2 = 1.4142   d_9 = 1.0801   d_100 = 1.0070  d_1000 = 1.0007   d_max = 1
  X = (0.9, 0.1):        d_2 = 0.9055   d_9 = 0.9000   d_100 = 0.9     d_1000 = 0.9      d_max = 0.9
  X = (3, 3):            d_2 = 4.2426   d_3 = 3.7798   d_8 = 3.2715    d_100 = 3.0209    d_max = 3
  X = (90, 45):          d_6 = 90.2329  d_9 = 90.0195  d_100 = 90                        d_max = 90
P<1 Minkowski Metrics
  d_{1/p}(X, Y) = ( Σ_{i=1..n} |x_i - y_i|^{1/p} )^p
p = 0 (the limit as p → 0) does not exist: for P < 1 the values do not converge.
Spreadsheet examples of d_p(X, Y) for decreasing p < 1, with Y at the origin. The values grow without bound as p → 0 (the p = 2 value is shown for comparison):

  X = (0.1, 0.1):  d_1 = 0.2   d_0.8 = 0.2378   d_0.4 = 0.5657   d_0.2 = 3.2     d_0.1 = 102.4
                   d_0.04 = 3.36e+06   d_0.02 = 1.13e+14   d_0.01 = 1.27e+29   (d_2 = 0.1414)
  X = (0.5, 0.5):  d_1 = 1     d_0.8 = 1.1892   d_0.4 = 2.8284   d_0.2 = 16      d_0.1 = 512
                   d_0.04 = 1.68e+07   d_0.02 = 5.63e+14   d_0.01 = 6.34e+29   (d_2 = 0.7071)
  X = (0.9, 0.1):  d_1 = 1     d_0.8 = 1.0980   d_0.4 = 2.1445   d_0.2 = 10.821  d_0.1 = 326.27
                   d_0.04 = 1.03e+07   d_0.02 = 3.42e+14   d_0.01 = 3.83e+29   (d_2 = 0.9055)
Min dissimilarity function
The d_min function, d_min(X, Y) = min_{i=1..n} |x_i - y_i|, is strange: it is not even a pseudometric.
(Figure: the unit disk of d_min, and the neighborhood of the blue point relative to the red point, i.e., the region of points closer to the blue point than to the red one. Major bifurcations!)
http://www.cs.ndsu.nodak.edu/~serazi/research/Distance.html
Other Interesting Metrics
Canberra metric:
  d_c(X, Y) = Σ_{i=1..n} |x_i - y_i| / (x_i + y_i)      (a normalized Manhattan distance)
Squared cord metric:
  d_sc(X, Y) = Σ_{i=1..n} ( sqrt(x_i) - sqrt(y_i) )^2    (already discussed as Lp with p = 1/2)
Squared chi-squared metric:
  d_chi(X, Y) = Σ_{i=1..n} (x_i - y_i)^2 / (x_i + y_i)
HOBbit metric (Higher Order Binary bit):
  d_H(X, Y) = max_{i=1..n} { m - HOB(x_i, y_i) }
  where, for m-bit integers A = a_1…a_m and B = b_1…b_m,
  HOB(A, B) = max{ s : 0 ≤ s ≤ m and ∀i (1 ≤ i ≤ s ⇒ a_i = b_i) }
  (related to the Hamming distance in coding theory)
Scalar product:
  d(X, Y) = X • Y = Σ_{i=1..n} x_i · y_i
Hyperbolic metrics (which map infinite space one-to-one onto a sphere).
Which of these are rotationally invariant? Translationally invariant? Other?
Notations
  rc(P) : root count of P-tree P
  P1 & P2 : P1 AND P2 (also written P1 ^ P2)
  P1 | P2 : P1 OR P2
  P′ : complement P-tree of P
  Pi,j : basic P-tree for band i, bit j
  Pi(v) : value P-tree for value v of band i
  Pi([v1, v2]) : interval P-tree for the interval [v1, v2] of band i
  P0 : pure0-tree, a P-tree whose root node is pure 0
  P1 : pure1-tree, a P-tree whose root node is pure 1
  N : number of pixels      n : number of bands      m : number of bits
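To make the notation concrete, here is a minimal sketch (ours, not from the slides) that stands in for P-trees with plain, uncompressed Python bit lists: Pi,j is the bit-j slice of band i, Pi(v) is the AND of basic P-trees (complemented where the corresponding bit of v is 0), Pi([v1, v2]) is the OR of the value P-trees over the interval, and rc() is simply a count of 1s. Actual P-trees store this information as compressed quadrant trees; the stand-ins only illustrate the logic, not the data structure.

```python
# Bit-vector stand-ins for P-trees (uncompressed lists; real P-trees are compressed quadrant trees).
m = 8                                                    # bits per band value

def basic_ptree(band, j):
    """P_{i,j}: 1 for each pixel whose j-th bit (1 = most significant) of this band is 1."""
    return [(v >> (m - j)) & 1 for v in band]

def p_and(p, q):  return [a & b for a, b in zip(p, q)]   # P1 & P2
def p_or(p, q):   return [a | b for a, b in zip(p, q)]   # P1 | P2
def p_not(p):     return [1 - a for a in p]              # P'
def rc(p):        return sum(p)                          # root count

def value_ptree(band, v):
    """P_i(v): pixels whose value in this band equals v."""
    result = [1] * len(band)
    for j in range(1, m + 1):
        pij = basic_ptree(band, j)
        result = p_and(result, pij if (v >> (m - j)) & 1 else p_not(pij))
    return result

def interval_ptree(band, lo, hi):
    """P_i([lo, hi]): pixels whose value in this band lies in [lo, hi]."""
    result = [0] * len(band)
    for v in range(lo, hi + 1):
        result = p_or(result, value_ptree(band, v))
    return result
```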
Properties of P-trees
1. a) rc(P) = 0  ⇔  P = P0          b) rc(P) = N  ⇔  P = P1
2. a) P & P = P     b) P & P1 = P     c) P & P′ = P0     d) P & P0 = P0
3. a) P | P = P     b) P | P1 = P1    c) P | P′ = P1     d) P | P0 = P
4. rc(P1 | P2) = 0  iff  rc(P1) = 0 and rc(P2) = 0
5. v1 ≠ v2  ⇒  rc{Pi(v1) & Pi(v2)} = 0
6. rc(P1 | P2) = rc(P1) + rc(P2) - rc(P1 & P2)
7. rc{Pi(v1) | Pi(v2)} = rc{Pi(v1)} + rc{Pi(v2)}, where v1 ≠ v2
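Using the bit-vector stand-ins sketched after the Notations slide, a couple of these root-count identities can be spot-checked directly (our illustration, not part of the slides):

```python
band = [105, 93, 105, 200, 93, 17]                     # toy band values for 6 pixels
P105, P93 = value_ptree(band, 105), value_ptree(band, 93)

# Property 5: different values of the same band never share pixels.
assert rc(p_and(P105, P93)) == 0

# Property 6: inclusion-exclusion on root counts.
assert rc(p_or(P105, P93)) == rc(P105) + rc(P93) - rc(p_and(P105, P93))
```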
k-Nearest Neighbor Classification and Closed-KNN
1) Select a suitable value for k.
2) Determine a suitable distance metric.
3) Find the k nearest neighbors of the sample using the selected metric.
4) Find the plurality class of the nearest neighbors by voting on the class labels of the NNs.
5) Assign the plurality class to the sample to be classified.

(Figure: a target pixel T and its neighborhood.)
T is the target pixel. With k = 3, to find the third nearest neighbor, traditional KNN arbitrarily selects one point from the boundary line of the neighborhood; Closed-KNN instead includes all points on the boundary (see the sketch below).
Closed-KNN yields higher classification accuracy than traditional KNN.
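The difference between the two selections can be illustrated with a short sketch (ours): traditional KNN keeps exactly k points and breaks ties on the boundary arbitrarily, while closed-KNN keeps every point whose distance does not exceed the k-th smallest distance.

```python
def knn(distances, k):
    """Traditional KNN: exactly k indices; ties on the boundary are broken arbitrarily."""
    order = sorted(range(len(distances)), key=lambda i: distances[i])
    return order[:k]

def closed_knn(distances, k):
    """Closed-KNN: every index whose distance is <= the k-th smallest distance."""
    kth = sorted(distances)[k - 1]
    return [i for i, d in enumerate(distances) if d <= kth]

dists = [0, 1, 2, 2, 2, 5]       # three training points tie on the boundary at distance 2
print(knn(dists, 3))             # [0, 1, 2]       : one tied boundary point kept, two dropped
print(closed_knn(dists, 3))      # [0, 1, 2, 3, 4] : all boundary points included
```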
Searching Nearest Neighbors
We begin the search by finding the exact matches.
Let the target sample be T = <v1, v2, v3, …, vn>.
The initial neighborhood is the single point T.
We then expand the neighborhood along each dimension: along dimension i, [vi] is expanded to the interval [vi - ai, vi + bi], for some positive integers ai and bi.
Expansion continues until the neighborhood contains at least k points.
HOBbit Similarity Method for KNN
In this method, we match bits of the target to the training data.
First, find the pixels matching the target in all 8 bits of each band (the exact matches).
Let bi,j = the jth bit of the ith band of the target pixel.
Define the target-P-tree Pt by
  Pti,j = Pi,j     if bi,j = 1
  Pti,j = P′i,j    otherwise
and the precision-value-P-tree
  Pvi,j = Pti,1 & Pti,2 & Pti,3 & … & Pti,j
(A bit-vector sketch of these definitions follows below.)
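In terms of the bit-vector stand-ins from the Notations slide, the target-P-tree and precision-value-P-tree might look like this (a sketch with our own helper names):

```python
def target_ptree(band, target_value, j):
    """Pt_{i,j}: the basic P-tree P_{i,j} if bit j of the target is 1, else its complement."""
    pij = basic_ptree(band, j)
    return pij if (target_value >> (m - j)) & 1 else p_not(pij)

def precision_value_ptree(band, target_value, j):
    """Pv_{i,j} = Pt_{i,1} & ... & Pt_{i,j}: pixels that match the target
    in the j most significant bits of this band."""
    result = [1] * len(band)
    for bit in range(1, j + 1):
        result = p_and(result, target_ptree(band, target_value, bit))
    return result
```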
An Analysis of the HOBbit Method
Let the ith band value of the target T be vi = 105 = 01101001b.
  Exact match:     [01101001]   = [105, 105]
  1st expansion:   [0110100-]   = [01101000, 01101001] = [104, 105]
  2nd expansion:   [011010--]   = [01101000, 01101011] = [104, 107]
• The neighborhood does not expand evenly on both sides:
  the target is 105, but the center of [104, 107] is (104 + 107) / 2 = 105.5.
• And it expands by powers of 2.
• But it is computationally very cheap: each expansion just frees one more low-order bit (see the sketch below).
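The j-th expansion simply masks off the j low-order bits of the target value, which is why the method is so cheap; a tiny sketch (ours) reproducing the example above:

```python
def hobbit_interval(v, j):
    """Band interval after the j-th HOBbit expansion: the j low-order bits become free."""
    mask = (1 << j) - 1
    return (v & ~mask, v | mask)

print(hobbit_interval(105, 1))   # (104, 105)
print(hobbit_interval(105, 2))   # (104, 107)
```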
Perfect Centering Method
The max distance metric provides a better neighborhood by
  - keeping the target in the center, and
  - expanding by 1 on both sides.
Initial neighborhood P-tree (exact matching):
  Pnn = P1(v1) & P2(v2) & P3(v3) & … & Pn(vn)
If rc(Pnn) < k:
  Pnn = P1([v1-1, v1+1]) & P2([v2-1, v2+1]) & … & Pn([vn-1, vn+1])
If rc(Pnn) < k:
  Pnn = P1([v1-2, v1+2]) & P2([v2-2, v2+2]) & … & Pn([vn-2, vn+2])
and so on, until rc(Pnn) ≥ k.
This is computationally costlier than the HOBbit similarity method, but it gives slightly better classification accuracy.
Let Pc(i) be the value P-tree for class i. Then
  plurality class = arg max_i rc( Pc(i) & Pnn )
(A bit-vector sketch of this loop follows below.)
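Putting the pieces together with the same bit-vector stand-ins, the perfect-centering loop and the plurality vote might be sketched as follows (our helper names; clamping the intervals to the 0-255 band range is our addition):

```python
def perfect_centering_classify(bands, labels, target, k, classes):
    """bands: one list of pixel values per band; labels: per-pixel class labels;
    target: tuple of band values for the sample to classify."""
    assert 1 <= k <= len(labels)
    radius = 0
    while True:
        # Pnn = AND over all bands of the interval P-trees [v - radius, v + radius]
        pnn = [1] * len(labels)
        for band, v in zip(bands, target):
            lo, hi = max(0, v - radius), min(255, v + radius)   # clamp to the 8-bit range
            pnn = p_and(pnn, interval_ptree(band, lo, hi))
        if rc(pnn) >= k:                 # enough neighbors found
            break
        radius += 1                      # otherwise expand by 1 on both sides and retry

    # plurality class = arg max_i rc(Pc(i) & Pnn)
    def votes(c):
        pc = [1 if lab == c else 0 for lab in labels]           # stand-in for the class P-tree Pc(c)
        return rc(p_and(pc, pnn))
    return max(classes, key=votes)
```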
Performance
Experiments were run on two sets of aerial photographs of the Best Management Plot (BMP) of the Oakes Irrigation Test Area (OITA), ND.
The data contain 6 bands: red, green, and blue reflectance values, soil moisture, nitrate, and yield (the class label).
Band values range from 0 to 255 (8 bits).
We consider 8 classes, or levels, of yield values: 0 to 7.
Performance – Accuracy
1997 dataset:
(Figure: classification accuracy (%) vs. training set size (256 to 262144 pixels) for KNN-Manhattan, KNN-Euclidean, KNN-Max, KNN-HOBS, P-tree Perfect Centering (closed-KNN), and P-tree HOBS (closed-KNN).)
Performance – Accuracy (cont.)
1998 dataset:
(Figure: classification accuracy (%) vs. training set size (256 to 262144 pixels) for the same six methods.)
Performance – Time
1997 dataset (both axes on a logarithmic scale):
(Figure: per-sample classification time (seconds) vs. training set size (256 to 262144 pixels) for KNN-Manhattan, KNN-Euclidean, KNN-Max, KNN-HOBS, P-tree Perfect Centering (closed-KNN), and P-tree HOBS (closed-KNN).)
Performance – Time (cont.)
1998 dataset (both axes on a logarithmic scale):
(Figure: per-sample classification time (seconds) vs. training set size (256 to 262144 pixels) for the same six methods.)
Association for Computing Machinery (ACM) KDD-Cup-02
NDSU Team