AN ADAPTIVE METRIC MACHINE FOR PATTERN CLASSIFICATION
Classification Problem
• Given training data $\{(x_n, y_n)\}_{n=1}^{N}$, with $x_n \in \mathbb{R}^q$ and $y_n \in \{+, -\}$
[Figure: labeled training points (+ and -) in the $(x_1, x_2)$ plane, together with a query point $x_0$]
• Predict the class label of a given query point $x_0$
Classification Problem
• Unknown probability distribution $P(x, y)$
• We need to estimate:
$$P(+ \mid x_0) = f_+(x_0), \qquad P(- \mid x_0) = f_-(x_0)$$
The Bayesian Classifier
• Loss function: $\lambda(j \mid k)$, the loss incurred by predicting class $j$ when the true class is $k$
• Expected loss (conditional risk) associated with class $j$:
$$R(j \mid x) = \sum_{k=1}^{J} \lambda(j \mid k)\, P(k \mid x)$$
• Bayes rule:
$$j^* = \arg\min_{1 \le j \le J} R(j \mid x)$$
• Zero-one loss function:
$$\lambda(j \mid k) = \begin{cases} 0 & \text{if } j = k \\ 1 & \text{if } j \neq k \end{cases}$$
$$j^* = \arg\max_{1 \le j \le J} P(j \mid x) \qquad \text{(Bayes rule)}$$
The Bayesian Classifier
$$j^* = \arg\max_{1 \le j \le J} P(j \mid x)$$
• Bayes rule achieves the minimum error rate
• How do we estimate the posterior probabilities $P(j \mid x)$, $j = 1, \dots, J$?
$$\hat{j}(x) = \arg\max_{1 \le j \le J} \hat{P}(j \mid x)$$
Density estimation
• Use Bayes theorem to estimate the posterior probability values:
$$P(j \mid x) = \frac{p(x \mid j)\, P(j)}{\sum_{k=1}^{J} p(x \mid k)\, P(k)}$$
$p(x \mid j)$ is the probability density function of $x$ given class $j$
$P(j)$ is the prior probability of class $j$
Naïve Bayes Classifier
• Makes the assumption of independence of features
given the class:
p x | j   px1 , x2 ,, xq | j    pxi | j 
q
i 1
• The task of estimating a q-dimensional density function
is reduced to the estimation of q one-dimensional density
functions. Thus, the complexity of the task is drastically
reduced.
• The use of Bayes theorem becomes much simpler.
• Proven to be effective in practice.
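As an illustration (a sketch, not taken from the original slides), here is a Gaussian naïve Bayes classifier in Python: each one-dimensional density $p(x_i \mid j)$ is modeled as a Gaussian, and the class posterior follows from Bayes theorem. The class and function names are ours.

```python
import numpy as np

class GaussianNaiveBayes:
    """Naive Bayes with one Gaussian per feature and per class."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.priors_ = np.array([np.mean(y == c) for c in self.classes_])       # P(j)
        self.means_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        self.vars_ = np.array([X[y == c].var(axis=0) + 1e-9 for c in self.classes_])
        return self

    def predict(self, X):
        X = np.atleast_2d(X)
        # log p(x|j) = sum_i log p(x_i|j), using the independence assumption
        log_lik = -0.5 * (np.log(2 * np.pi * self.vars_)[None, :, :]
                          + (X[:, None, :] - self.means_[None, :, :]) ** 2
                          / self.vars_[None, :, :]).sum(axis=2)
        log_post = log_lik + np.log(self.priors_)[None, :]     # + log P(j)
        return self.classes_[np.argmax(log_post, axis=1)]      # argmax_j P(j|x)

# Tiny usage example on synthetic 2-D data
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(+1, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
print(GaussianNaiveBayes().fit(X, y).predict([[-1.2, -0.8], [1.0, 1.3]]))
```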
Nearest-Neighbor Methods
• Predict the class label of $x_0$ as the most frequent one occurring among its K nearest neighbors
[Figure: labeled training points (+ and -) in the $(x_1, x_2)$ plane; the query point is assigned the majority label of its K nearest neighbors]
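A minimal K-NN sketch in Python (illustrative, not part of the original slides): the query $x_0$ is assigned the majority label among its K nearest training points under the Euclidean distance.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x0, K=5):
    """Predict the label of query x0 by majority vote over its K nearest neighbors."""
    dists = np.linalg.norm(X_train - x0, axis=1)     # Euclidean distances to x0
    nearest = np.argsort(dists)[:K]                  # indices of the K closest points
    votes = Counter(y_train[i] for i in nearest)     # label counts among the neighbors
    return votes.most_common(1)[0][0]                # most frequent class label

# Usage on toy 2-D data with classes '-' and '+'
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1, 1, (30, 2)), rng.normal(+1, 1, (30, 2))])
y = np.array(['-'] * 30 + ['+'] * 30)
print(knn_predict(X, y, np.array([0.8, 0.9]), K=5))   # most likely '+'
```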
Basic assumption:
$$f_+(x + \delta x) \approx f_+(x), \qquad f_-(x + \delta x) \approx f_-(x) \quad \text{for small } \delta x$$
Example: Letter Recognition
[Figure: letters plotted as points in a 2-D feature space with axes "first statistical moment" and "edge count"]
Asymptotic Properties of
K-NN Methods
$$\lim_{N \to \infty} \hat{f}_j(x) = f_j(x) \quad \text{if} \quad \lim_{N \to \infty} K = \infty \ \text{ and } \ \lim_{N \to \infty} K/N = 0$$
• The first condition reduces the variance by making the estimation
independent of the accidental characteristics of the K nearest
neighbors.
• The second condition reduces the bias by assuring that the K
nearest neighbors are arbitrarily close to the query point.
Asymptotic Properties of
K-NN Methods
$$\lim_{N \to \infty} E_1 \le 2E$$
$E_1$ = classification error rate of the 1-NN rule
$E$ = classification error rate of the Bayes rule
In the asymptotic limit no decision rule is more
than twice as accurate as the 1-NN rule
Finite-sample settings
• How well does the 1-NN rule work in finite-sample settings?
• If the number of training samples N is large and the number of input features q is small, the asymptotic results may still be valid.
• However, for a moderate to large number of input features, the sample size required for their validity is beyond feasibility.
Curse-of-Dimensionality
• This phenomenon is known as
the curse-of-dimensionality
• It refers to the fact that in high dimensional
spaces data become extremely sparse and
are far apart from each other
• It affects any estimation problem with
high dimensionality
Curse of Dimensionality
[Figure: distances DMAX (to the farthest point) and DMIN (to the closest point) from a given point, for a sample of size N = 500 uniformly distributed in $[0, 1]^q$, plotted against the dimension]
• The distribution of the ratio DMAX/DMIN converges to 1 as the dimensionality increases.
• The variance of distances from a given point converges to 0 as the dimensionality increases.
• Distance values from a given point flatten out as the dimensionality increases.
Computing radii of nearest neighborhoods
[Figure: median radius of a nearest neighborhood as a function of the dimension, for a uniform distribution in the unit cube $[-0.5, 0.5]^q$]
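The behavior above is easy to reproduce empirically. A small Python sketch (illustrative; it uses N = 500 uniform points in $[0, 1]^q$ as in the slides) prints, for increasing q, the ratio DMAX/DMIN and the spread of the distances relative to their mean:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 500   # sample size used in the slides

for q in (2, 5, 10, 20, 50, 100):
    X = rng.uniform(0.0, 1.0, size=(N, q))    # N points uniform in [0, 1]^q
    x0 = rng.uniform(0.0, 1.0, size=q)        # a given reference point
    d = np.linalg.norm(X - x0, axis=1)        # distances from x0 to every point
    print(f"q={q:4d}  DMAX/DMIN={d.max() / d.min():8.2f}  "
          f"relative spread={d.std() / d.mean():.3f}")
# As q grows, DMAX/DMIN approaches 1 and the relative spread of the distances
# shrinks: all points look almost equally far from x0.
```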
Curse-of-Dimensionality
As the dimensionality increases, the distance to the closest point increases rapidly.
• Random sample of size $N$ drawn from a uniform distribution in the $q$-dimensional unit hypercube
• Diameter of a $K = 1$ neighborhood using the Euclidean distance: $d(q, N) = O(N^{-1/q})$
q     N               d(q, N)
4     100             0.42
4     1,000           0.23
6     100             0.71
6     1,000           0.48
10    1,000           0.91
10    10,000          0.72
20    10,000          1.51
20    10^6            1.20
20    10^10           0.76

Large $d(q, N)$ $\Rightarrow$ highly biased estimates
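A short sketch for getting a feel for these numbers (an assumption on our part, not stated in the slides: here $d(q, N)$ is taken as the diameter of a $q$-dimensional ball whose volume is $1/N$ of the unit hypercube, one concrete way to obtain an $O(N^{-1/q})$ estimate; it yields values close to those in the table):

```python
from math import gamma, pi

def neighborhood_diameter(q, N):
    """Diameter of a q-dimensional ball of volume 1/N (illustrative d(q, N))."""
    # Volume of the unit ball in q dimensions: pi^(q/2) / Gamma(q/2 + 1)
    unit_ball_volume = pi ** (q / 2) / gamma(q / 2 + 1)
    radius = (1.0 / (N * unit_ball_volume)) ** (1.0 / q)
    return 2.0 * radius

for q, N in [(4, 100), (4, 1_000), (6, 100), (6, 1_000), (10, 1_000),
             (10, 10_000), (20, 10_000), (20, 10**6), (20, 10**10)]:
    print(f"q={q:3d}  N={N:>14,}  d(q,N)={neighborhood_diameter(q, N):.2f}")
```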
Curse-of-Dimensionality
• It is a serious problem in many real-world applications
• Microarray data: 3,000-4,000 genes;
• Documents: 10,000-20,000 words in the dictionary;
• Images, face recognition, etc.
How can we deal with
the curse of dimensionality?
7.68 92.2 
92.2 1912.5


 x1 
 1  1 N
x    μ      xi
 x2 
  2  N i 1
2  2 covariance matrix :


E  x  μ  x  μ  
T
 x   

E  1 1  x1  1 , x2   2  
 x2   2 

  x1  1 2
E
 x1  1  x2   2 
1
N


x1  1 x2   2  

2
 x2   2  
2
i

x1  1


i
i
i 1 
 x1  1 x2   2
N



x
i
1


 1 x2i   2 

2
i
x2   2



1
N


2

x1i  1
 i

i
x


x
i 1 
1
2  2
 1
N



x
i
1


 1 x2i   2 

2
i
x2   2



variance
covariance
2
1 N i

x1  1


N
 N i 1
i
i
1
x


x

1
1
2  2
 N i 1


covariance



1
N
 x   x
N
i 1
1
N
i
1

N
i 1




1
2 

2

x2i   2

i
2

variance
0.94 0.93
0.93 1.03


0.97 0.49
0.49 1.04 


0.93 0.01
0.01 1.05


 0.99  0.5
 0.5 1.06 


 1.04  1.05
 1.05 1.15 


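Following the formulas above, a short illustrative sketch of estimating the sample mean and covariance matrix with NumPy (the data are synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic correlated 2-D data: x2 depends on x1 plus noise.
N = 1000
x1 = rng.normal(0.0, 1.0, N)
x2 = 0.9 * x1 + rng.normal(0.0, 0.5, N)
X = np.column_stack([x1, x2])                 # shape (N, 2)

mu = X.mean(axis=0)                           # sample mean: (1/N) sum_i x^i
centered = X - mu
cov = centered.T @ centered / N               # (1/N) sum_i (x^i - mu)(x^i - mu)^T
print(cov)
print(np.cov(X, rowvar=False, bias=True))     # same result with NumPy's built-in
```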
Dimensionality Reduction
• Many dimensions are often
interdependent (correlated);
We can:
• Reduce the dimensionality of problems;
• Transform interdependent coordinates
into significant and independent ones;
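As an illustration of the last point (a sketch, not part of the original slides), principal component analysis (PCA) diagonalizes the sample covariance matrix: the new coordinates are uncorrelated, and keeping only the leading components reduces the dimensionality.

```python
import numpy as np

def pca(X, n_components):
    """Project X onto its top principal components (new, uncorrelated coordinates)."""
    mu = X.mean(axis=0)
    Xc = X - mu
    cov = Xc.T @ Xc / X.shape[0]                 # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigen-decomposition (ascending eigenvalues)
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order]               # directions of largest variance
    return Xc @ components, components

# Usage: compress correlated 2-D data down to one dimension.
rng = np.random.default_rng(0)
x1 = rng.normal(0.0, 1.0, 500)
X = np.column_stack([x1, 0.9 * x1 + rng.normal(0.0, 0.3, 500)])
Z, W = pca(X, n_components=1)
print(Z.shape, W.shape)        # (500, 1) (2, 1)
```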