Distances and Kernels Based on Cumulative Distribution Functions
Hongjun Su and Hong Zhang*
Department of Computer Science and Information Technology
Armstrong Atlantic State University
Savannah, GA 31419 USA
E-mail: [email protected], [email protected]
*contact author
Conference: IPCV'14
Keywords: Cumulative Distribution Function, Distance, Kernel, Similarity
Abstract
Similarity and dissimilarity measures such as kernels
and distances are key components of classification and
clustering algorithms. We propose a novel technique to
construct distances and kernel functions between
probability distributions based on cumulative
distribution functions. The proposed distance measures
incorporate global discriminating information and can
be computed efficiently.
1. Introduction
A kernel is a similarity measure that is a key
component of support vector machines ([4]) and other
machine learning techniques. More generally, a distance
(a metric) is a function that represents the dissimilarity
between objects.
In many pattern classification and clustering
applications, it is useful to measure the similarity
between probability distributions. A large number of
divergence and affinity measures on distributions has
already been defined in traditional statistics. These
measures are typically based on the probability density
functions and are not effective in detecting global
changes.
In this paper, we propose a family of distances and
kernels that are defined on the cumulative distribution
functions, instead of densities.
This paper is organized as follows. Section 2 introduces
kernels and distances commonly defined on probability
distributions. In Section 3, a new family of distance and
kernel functions based on cumulative distribution
functions is proposed. Experimental results on Gaussian
mixture distributions are presented in Section 4. In
Section 5 we provide conclusions and future work.
2. Distance and Similarity Measures Between Distributions
Given two probability distributions, there are well-known measures of the difference or similarity between them.
The Bhattacharyya affinity ([1]) is a measure of
similarity between two distributions:
$$B(p, q) = \int \sqrt{p(x)\, q(x)}\, dx$$
In [7], the probability product kernel is defined as a
generalization of Bhattacharyya affinity:
$$k_{prob}(p, q) = \int p(x)^{\rho}\, q(x)^{\rho}\, dx$$
The Bhattacharyya distance is a dissimilarity measure
related to the Bhattacharyya affinity:
$$D_B(p, q) = -\ln\left( \int \sqrt{p(x)\, q(x)}\, dx \right)$$
The Hellinger distance ([6]) is another metric on
distributions:
$$D_H(p, q) = \sqrt{\frac{1}{2} \int \left( \sqrt{p(x)} - \sqrt{q(x)} \right)^2 dx}$$
The Kullback-Leibler divergence ([8]) is defined as:
$$D_{KL}(p, q) = \int \ln\!\left( \frac{p(x)}{q(x)} \right) p(x)\, dx$$
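As an illustration (not from the paper), the following minimal Python sketch evaluates these density based measures for two densities discretized on a common grid; the helper names are our own:

```python
import numpy as np

def bhattacharyya(p, q, dx):
    """B(p, q): integral of sqrt(p * q) over the grid with spacing dx."""
    return np.sum(np.sqrt(p * q)) * dx

def bhattacharyya_distance(p, q, dx):
    """D_B(p, q) = -ln B(p, q)."""
    return -np.log(bhattacharyya(p, q, dx))

def hellinger(p, q, dx):
    """D_H(p, q) = sqrt(0.5 * integral of (sqrt(p) - sqrt(q))^2)."""
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2) * dx)

def kl_divergence(p, q, dx, eps=1e-12):
    """D_KL(p, q) = integral of p * ln(p / q); eps guards against log(0)."""
    return np.sum(p * np.log((p + eps) / (q + eps))) * dx
```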
All these similarity/dissimilarity measures are based on point-wise comparisons of the probability density functions. As a result, they are inherently local comparison measures of the density functions. They perform well on smooth, Gaussian-like distributions. However, on discrete and multimodal distributions, they may fail to reflect the similarities and can be sensitive to noise and small perturbations in the data.

Example. Let p be the simple discrete distribution with a single point mass at the origin and q the perturbed version with the mass shifted by a (Figure 1):
$$p(x) = \delta(x), \qquad q(x) = \delta(x - a)$$

Figure 1. Distributions p(x) and q(x)

The Bhattacharyya affinity and divergence values are easy to calculate:
$$B(p, q) = \int \sqrt{p(x)\, q(x)}\, dx = 0$$
$$D_B(p, q) = -\ln\left( \int \sqrt{p(x)\, q(x)}\, dx \right) = \infty$$
$$D_H(p, q) = \sqrt{\frac{1}{2} \int \left( \sqrt{p(x)} - \sqrt{q(x)} \right)^2 dx} = 1 \quad \text{(its maximum value)}$$
$$D_{KL}(p, q) = \int \ln\!\left( \frac{p(x)}{q(x)} \right) p(x)\, dx = \infty$$
All these values are independent of a. They indicate minimal similarity and maximal dissimilarity.

The earth mover's distance (EMD), also known as the Wasserstein metric ([3]), is defined as
$$W_p(\mu, \nu) = \left( \inf_{\gamma \in \Gamma(\mu, \nu)} \int d(x, y)^p\, d\gamma \right)^{1/p}$$
where $\Gamma(\mu, \nu)$ denotes the set of all couplings of $\mu$ and $\nu$. The EMD does measure the global movement between distributions. However, computing the EMD involves solving an optimization problem and is much more complex than the density based divergence measures.

Related to the distance measures are the statistical tests that determine whether two samples are drawn from different distributions. Examples of such tests include the Kolmogorov-Smirnov statistic ([10]) and the kernel based tests ([5]).
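As a practical aside (not part of the paper), both the one-dimensional EMD and the Kolmogorov-Smirnov two-sample test are available off the shelf in SciPy; a small Python example for two samples:

```python
import numpy as np
from scipy.stats import wasserstein_distance, ks_2samp

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=1000)   # sample from N(0, 1)
y = rng.normal(0.5, 1.0, size=1000)   # sample from N(0.5, 1)

print(wasserstein_distance(x, y))     # 1-D earth mover's (Wasserstein-1) distance
print(ks_2samp(x, y))                 # Kolmogorov-Smirnov statistic and p-value
```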
3. Distances on Cumulative Distribution
Functions
A cumulative distribution function (CDF) of a random
variable X is defined as
$$F(x) = P(X < x)$$
Let F and G be the CDFs of random variables with bounded ranges (i.e. their density functions have bounded supports). For $p \ge 1$, we define the distance between the CDFs as
$$d_p(F, G) = \left( \int |F(x) - G(x)|^p\, dx \right)^{1/p}$$
It is easy to verify that $d_p(F, G)$ is a metric. It is symmetric and satisfies the triangle inequality. Because CDFs are left-continuous, $d_p(F, G) = 0$ implies that $F = G$.
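To make the definition concrete, here is a minimal Python sketch (our own illustration, not the authors' code) that approximates $d_p(F, G)$ by numerical integration on a grid, shown for two Gaussian CDFs:

```python
import numpy as np
from scipy.stats import norm

def d_p(F, G, a, b, p=2, grid_size=4000):
    """Approximate d_p(F, G) = (integral over [a, b] of |F(x) - G(x)|^p dx)^(1/p)."""
    xs = np.linspace(a, b, grid_size)
    dx = (b - a) / (grid_size - 1)
    return (np.sum(np.abs(F(xs) - G(xs)) ** p) * dx) ** (1.0 / p)

# Example: CDFs of two unit-variance Gaussians whose means differ by 2
F = lambda x: norm.cdf(x, loc=0.0, scale=1.0)
G = lambda x: norm.cdf(x, loc=2.0, scale=1.0)
print(d_p(F, G, a=-10.0, b=12.0, p=2))
```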
When $p = 2$, a kernel can be derived from the distance $d_2(F, G)$:
$$k(F, G) = e^{-\alpha\, d_2(F, G)^2}$$
To show that k is indeed a kernel, consider a kernel matrix $M = [k(F_i, F_j)]$, $1 \le i, j \le n$. Let $[a, b]$ be a finite interval that covers the supports of all density functions $p_i(x)$, $1 \le i \le n$. Then
$$d_2(F_i, F_j) = \left( \int_a^b |F_i(x) - F_j(x)|^2\, dx \right)^{1/2}$$
This metric is induced by the norm of the Hilbert space $L^2([a, b])$. Consequently the kernel matrix M is positive semi-definite, since it is the kernel matrix of the Gaussian kernel on $L^2([a, b])$. Therefore, k is a kernel.
Remarks. The formula for $d_p(F, G)$ resembles the metric induced by the norm in $L^p(\mathbb{R})$. However, a CDF F cannot be an element of $L^p(\mathbb{R})$ because $\lim_{x \to \infty} F(x) = 1$. The condition of bounded support guarantees the convergence of the integral. In practical applications, this is unlikely to be a limitation. Without this constraint, however, the integral could diverge. For example, let F be the step function at 0 and $G(x) = x/(x+1)$, $x \ge 0$. Then
$$d_1(F, G) = \int_0^\infty \frac{1}{x+1}\, dx = \infty$$
Given a data sample $(X_1, X_2, \ldots, X_n)$, an empirical CDF can be constructed as
$$F_n(x) = \frac{1}{n} \sum_{k=1}^{n} I_{X_k < x}$$
which can be used to approximate the distance $d_p(F, G)$.
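A minimal sketch of this construction in Python (illustrative only; the function name is ours):

```python
import numpy as np

def empirical_cdf(sample):
    """F_n(x) = (1/n) * #{X_k < x}, using the strict-inequality convention above."""
    xs = np.sort(np.asarray(sample, dtype=float))
    n = len(xs)
    # searchsorted with side='left' counts the sample points strictly below x
    return lambda x: np.searchsorted(xs, x, side='left') / n

# Two empirical CDFs built this way can be passed directly to the d_p routine sketched earlier.
```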
When $p = \infty$, we have
$$d_\infty(F, G) = \max_x |F(x) - G(x)|$$
The distance $d_\infty$ is similar to the Kolmogorov-Smirnov statistic ([10]).
Example. Consider the same example as in the previous section. The CDFs are illustrated in Figure 2.

Figure 2. CDFs

The proposed distance function has the value
$$d_p(F, G) = \left( \int_0^a 1\, dx \right)^{1/p} = a^{1/p}$$
For $p < \infty$, the distance value depends on a. The kernel value is
$$k(F, G) = e^{-\alpha\, d_2(F, G)^2} = e^{-\alpha a}$$
The computation of the distance $d_p(F, G)$ is straightforward. For a discrete dataset of size n, the complexity of computing the distance is $O(n)$. In contrast, computing the earth mover's distance requires the Hungarian algorithm ([9]), with a complexity of $O(n^3)$.
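As one illustration of the linear-time claim (our sketch, assuming the two samples are already sorted), $d_1$ between two empirical CDFs can be computed in a single merge-style pass:

```python
def d1_sorted(xs, ys):
    """d_1 between the empirical CDFs of two pre-sorted samples, in one O(n + m) pass."""
    n, m = len(xs), len(ys)
    i = j = 0
    fx = gy = 0.0          # current heights of the two step-function CDFs
    prev = None            # left end of the current constant segment
    total = 0.0
    while i < n or j < m:
        # next point at which either CDF jumps
        t = xs[i] if (j >= m or (i < n and xs[i] <= ys[j])) else ys[j]
        if prev is not None:
            total += abs(fx - gy) * (t - prev)   # contribution of the segment [prev, t)
        while i < n and xs[i] == t:
            fx += 1.0 / n
            i += 1
        while j < m and ys[j] == t:
            gy += 1.0 / m
            j += 1
        prev = t
    return total

# The paper's example: point mass at 0 vs. point mass at a = 2.5
print(d1_sorted([0.0], [2.5]))   # 2.5, i.e. d_1 = a
```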
4. Experimental Results and Discussions
The CDF based kernels and distances can be effective
on continuous distributions as well.
A Gaussian mixture distribution ([2]) and its variations,
shown in Figure 3, are used to test the kernel functions.
The first chart shows the original Gaussian mixture. The
other two distributions are obtained by moving the
middle mode. Clearly the second distribution is much
closer to the original distribution than the third one.
Figure 3. A Gaussian mixture and variations

Indexed in the same order as Figure 3, the Bhattacharyya kernel matrix for the three distributions is:
$$\begin{pmatrix} 1 & 0.775 & 0.715 \\ 0.775 & 1 & 0.715 \\ 0.715 & 0.715 & 1 \end{pmatrix}$$
The Bhattacharyya kernel did not clearly distinguish the second and the third distributions when compared to the original. There is no significant difference between the kernel values $k_{12}$ and $k_{13}$, which measure the similarities between the original distribution and the two varied distributions.

The kernel matrix of our proposed kernel is:
$$\begin{pmatrix} 1 & 0.123 & 1.67 \times 10^{-6} \\ 0.123 & 1 & 4.68 \times 10^{-5} \\ 1.67 \times 10^{-6} & 4.68 \times 10^{-5} & 1 \end{pmatrix}$$
The CDF based kernel performed much better in this example. The kernel values $k_{12}$ and $k_{13}$ clearly reflect the larger deviation (less similarity) of the third distribution from the original.

This is because the density based Bhattacharyya kernel does not capture the global variations. The CDF based kernel is much more effective in detecting global changes in the distributions.
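For readers who want to reproduce the flavor of this experiment, here is a hedged Python sketch; the mixture means, weights, and widths below are our own guesses purely for illustration, since the paper does not list the exact settings, and only the plotted x-range of Figure 3 is taken from the figure:

```python
import numpy as np
from scipy.stats import norm

xs = np.linspace(0.0, 350.0, 3500)          # grid matching the plotted range in Figure 3
dx = xs[1] - xs[0]

def mixture_cdf(means, weights, sigma=15.0):
    """CDF of a Gaussian mixture evaluated on the grid xs."""
    return sum(w * norm.cdf(xs, loc=m, scale=sigma) for m, w in zip(means, weights))

def cdf_kernel(F, G, alpha=0.01):
    """k(F, G) = exp(-alpha * d_2(F, G)^2), with d_2 approximated on the grid."""
    return np.exp(-alpha * np.sum((F - G) ** 2) * dx)

weights = [0.3, 0.4, 0.3]                    # assumed component weights
F1 = mixture_cdf([100, 175, 250], weights)   # "original" mixture
F2 = mixture_cdf([100, 190, 250], weights)   # middle mode moved slightly
F3 = mixture_cdf([100, 235, 250], weights)   # middle mode moved much further

print(cdf_kernel(F1, F2), cdf_kernel(F1, F3))   # expect k12 > k13
```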
5. Conclusions and Future Work
In this paper, we presented a new family of distance and
kernel functions on probability distributions based on
the cumulative distribution functions. The distance
function was shown to be a metric and the kernel
function was shown to be a positive definite kernel.
Compared to the traditional density based divergence
functions, our proposed distance measures are more
effective in detecting global discrepancy in
distributions. Experimental results on generated
distributions were discussed.
This method can be extended to high dimensional distributions, and the advantages of the CDF based approach carry over to the high dimensional case. However, directly computing high dimensional CDFs incurs a significant cost. We plan to investigate specifically the 2D extension, which could yield useful results for image processing.
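To illustrate where that cost comes from (a naive sketch of ours, not the authors' planned approach), a direct 2D empirical CDF evaluated on a grid already requires work proportional to the sample size times the number of grid points:

```python
import numpy as np

def empirical_cdf_2d(points, grid_x, grid_y):
    """Naive 2-D empirical CDF: F(x, y) = (1/n) * #{k : X_k < x and Y_k < y}."""
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    F = np.empty((len(grid_x), len(grid_y)))
    for i, x in enumerate(grid_x):
        for j, y in enumerate(grid_y):
            F[i, j] = np.count_nonzero((pts[:, 0] < x) & (pts[:, 1] < y)) / n
    return F   # O(n * |grid_x| * |grid_y|); a 2-D d_p would then integrate |F - G|^p over the grid
```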
Acknowledgment. The authors wish to thank the referees
for their extremely helpful comments and suggestions.
6. References
[1] Bhattacharyya, A., "On a measure of divergence between
two statistical populations defined by their probability
distributions". Bulletin of the Calcutta Mathematical Society
35: 99–109, (1943).
[2] Bishop, Christopher, Pattern recognition and machine
learning. New York: Springer, (2006).
[3] Bogachev, V.I.; Kolesnikov, A.V., "The Monge-Kantorovich problem: achievements, connections, and perspectives". Russian Math. Surveys 67: 785–890, (2012).
[4] Boser, B. E.; Guyon, I. M.; Vapnik, V. N., "A training algorithm for optimal margin classifiers". Proceedings of the Fifth Annual Workshop on Computational Learning Theory - COLT '92, p. 144, (1992).
[5] Gretton, A.; Borgwardt, K.M.; Rasch, M.J.; Scholkopf, B.;
Smola, A., “A kernel two-sample test”, J. Machine Learning
Research, 13, 723-773, (2012).
[6] Hellinger, Ernst, "Neue Begründung der Theorie
quadratischer Formen von unendlichvielen Veränderlichen",
Journal für die reine und angewandte Mathematik (in German)
136: 210–271, (1909).
[7] Jebara, T.; Kondor, R.; Howard, A., "Probability Product
Kernels," J. Machine Learning Research, 5, 819-844, (2004).
[8] Kullback, S.; Leibler, R.A., "On Information and
Sufficiency". Annals of Mathematical Statistics 22 (1): 79–86,
(1951).
[9] Munkres, J., "Algorithms for the Assignment and
Transportation Problems", Journal of the Society for Industrial
and Applied Mathematics, 5(1):32–38, (1957).
[10] Smirnov, N.V., "Approximate distribution laws for random variables, constructed from empirical data". Uspekhi Mat. Nauk, 10: 179–206 (in Russian), (1944).