Distances and Kernels Based on Cumulative Distribution Functions

Hongjun Su and Hong Zhang*
Department of Computer Science and Information Technology
Armstrong Atlantic State University
Savannah, GA 31419 USA
E-mail: [email protected], [email protected]
*contact author

Conference: IPCV'14

Keywords: Cumulative Distribution Function, Distance, Kernel, Similarity

Abstract

Similarity and dissimilarity measures such as kernels and distances are key components of classification and clustering algorithms. We propose a novel technique for constructing distances and kernel functions between probability distributions based on cumulative distribution functions. The proposed distance measures incorporate global discriminating information and can be computed efficiently.

1. Introduction

A kernel is a similarity measure that is the key component of the support vector machine ([4]) and other machine learning techniques. More generally, a distance (a metric) is a function that represents the dissimilarity between objects. In many pattern classification and clustering applications, it is useful to measure the similarity between probability distributions. A large number of divergence and affinity measures on distributions have been defined in traditional statistics. These measures are typically based on probability density functions and are not effective in detecting global changes. In this paper, we propose a family of distances and kernels that are defined on cumulative distribution functions instead of densities.

This paper is organized as follows. Section 2 reviews kernels and distances commonly defined on probability distributions. In Section 3, a new family of distance and kernel functions based on cumulative distribution functions is proposed. Experimental results on Gaussian mixture distributions are presented in Section 4. Section 5 provides conclusions and future work.

2. Distance and Similarity Measures Between Distributions

Given two probability distributions p and q, there are well-known measures of the difference or similarity between them.

The Bhattacharyya affinity ([1]) is a measure of similarity between two distributions:

$$B(p, q) = \int \sqrt{p(x)\, q(x)}\, dx$$

In [7], the probability product kernel is defined as a generalization of the Bhattacharyya affinity:

$$k_{prob}(p, q) = \int p(x)^{\rho}\, q(x)^{\rho}\, dx$$

The Bhattacharyya distance is a dissimilarity measure related to the Bhattacharyya affinity:

$$D_B(p, q) = -\ln\!\left( \int \sqrt{p(x)\, q(x)}\, dx \right)$$

The Hellinger distance ([6]) is another metric on distributions:

$$D_H(p, q) = \sqrt{ \frac{1}{2} \int \left( \sqrt{p(x)} - \sqrt{q(x)} \right)^2 dx }$$

The Kullback-Leibler divergence ([8]) is defined as:

$$D_{KL}(p, q) = \int \ln\frac{p(x)}{q(x)}\, p(x)\, dx$$

The earth mover's distance (EMD), also known as the Wasserstein metric ([3]), is defined as

$$W_p(\mu, \nu) = \left( \inf_{\gamma \in \Gamma(\mu, \nu)} \int d(x, y)^p\, d\gamma \right)^{1/p}$$

where $\Gamma(\mu, \nu)$ denotes the set of all couplings of $\mu$ and $\nu$.

With the exception of the EMD, all of these similarity/dissimilarity measures are based on point-wise comparisons of the probability density functions. As a result, they are inherently local comparison measures of the density functions. They perform well on smooth, Gaussian-like distributions. However, on discrete and multimodal distributions, they may fail to reflect similarity and can be sensitive to noise and small perturbations in the data.

Example. Let p be the simple discrete distribution with a single point mass at the origin and q the perturbed version with the mass shifted by a (Figure 1):

$$p(x) = \delta(x), \qquad q(x) = \delta(x - a)$$

Figure 1. The distributions p(x) and q(x).

The Bhattacharyya affinity and the divergence values are easy to calculate:

$$B(p, q) = \int \sqrt{p(x)\, q(x)}\, dx = 0$$

$$D_B(p, q) = -\ln\!\left( \int \sqrt{p(x)\, q(x)}\, dx \right) = \infty$$

$$D_H(p, q) = 1 \quad \text{(its maximal value)}$$

$$D_{KL}(p, q) = \int \ln\frac{p(x)}{q(x)}\, p(x)\, dx = \infty$$

All these values are independent of a: they indicate minimal similarity and maximal dissimilarity no matter how small the shift is.

The EMD, by contrast, does measure the global movement between the distributions. However, computing the EMD involves solving an optimization problem and is much more expensive than the density-based divergence measures.

Related to these distance measures are statistical tests that determine whether two samples are drawn from different distributions. Examples of such tests include the Kolmogorov-Smirnov statistic ([10]) and kernel-based two-sample tests ([5]).
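As a concrete illustration (not part of the original paper), the short Python sketch below discretizes the point-mass example onto a grid of bins and evaluates the Bhattacharyya affinity, the Bhattacharyya distance and the KL divergence. The bin grid, the shift of five bins, and the small eps guard are our own illustrative choices; the point is only that the values do not change as the shift a grows.

```python
import numpy as np

def bhattacharyya_affinity(p, q):
    """B(p, q) = sum over bins of sqrt(p * q) for discrete densities."""
    return np.sum(np.sqrt(p * q))

def bhattacharyya_distance(p, q, eps=1e-300):
    """D_B(p, q) = -ln B(p, q); diverges when the supports are disjoint."""
    return -np.log(max(bhattacharyya_affinity(p, q), eps))

def kl_divergence(p, q):
    """D_KL(p, q) = sum of p * ln(p / q); infinite when q = 0 where p > 0."""
    mask = p > 0
    if np.any(q[mask] == 0):
        return np.inf
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

# Two point masses on a grid of bins: p at the origin, q shifted by a bins.
n_bins, a = 100, 5
p = np.zeros(n_bins); p[0] = 1.0
q = np.zeros(n_bins); q[a] = 1.0

print(bhattacharyya_affinity(p, q))   # 0.0, no matter how small a is
print(bhattacharyya_distance(p, q))   # effectively infinite, independent of a
print(kl_divergence(p, q))            # inf, independent of a
```

Changing a to 1 or to 50 leaves every printed value unchanged, which is exactly the locality problem described above.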
3. Distances on Cumulative Distribution Functions

The cumulative distribution function (CDF) of a random variable X is defined as

$$F(x) = P(X < x)$$

Let F and G be the CDFs of random variables with bounded ranges (i.e., their density functions have bounded supports). For $p \geq 1$, we define the distance between the CDFs as

$$d_p(F, G) = \left( \int |F(x) - G(x)|^p\, dx \right)^{1/p}$$

It is easy to verify that $d_p(F, G)$ is a metric. It is symmetric and satisfies the triangle inequality. Because CDFs are left-continuous, $d_p(F, G) = 0$ implies that $F = G$.

When $p = 2$, a kernel can be derived from the distance $d_2(F, G)$:

$$k(F, G) = e^{-\alpha\, d_2(F, G)^2}$$

To show that k is indeed a kernel, consider a kernel matrix $M = [k(F_i, F_j)]$, $1 \leq i, j \leq n$. Let $[a, b]$ be a finite interval that covers the supports of all density functions $p_i(x)$, $1 \leq i \leq n$. Then

$$d_2(F_i, F_j) = \left( \int_a^b |F_i(x) - F_j(x)|^2\, dx \right)^{1/2}$$

This metric is induced by the norm of the Hilbert space $L^2([a, b])$. Consequently the kernel matrix M is positive semi-definite, since it is the kernel matrix of the Gaussian kernel on $L^2([a, b])$. Therefore k is a kernel.

Remarks. The formula for $d_p(F, G)$ resembles the metric induced by the norm in $L^p(\mathbb{R})$. However, a CDF F cannot be an element of $L^p(\mathbb{R})$ because $\lim_{x \to \infty} F(x) = 1$. The condition of bounded support guarantees the convergence of the integral. In practical applications this is unlikely to be a limitation, but without the constraint the integral can diverge. For example, let F be the step function at 0 and $G(x) = x/(x + 1)$, $x \geq 0$. Then

$$d_1(F, G) = \int_0^{\infty} \frac{1}{x + 1}\, dx = \infty$$

Given a data sample $(X_1, X_2, \ldots, X_n)$, an empirical CDF can be constructed as

$$F_n(x) = \frac{1}{n} \sum_{i=1}^{n} I_{X_i < x}$$

which can be used to approximate the distance $d_p(F, G)$.

When $p = \infty$, we have

$$d_\infty(F, G) = \max_x |F(x) - G(x)|$$

The distance $d_\infty$ is similar to the Kolmogorov-Smirnov statistic ([10]).

Example. Consider the same example as in the previous section. The CDFs are illustrated in Figure 2.

Figure 2. The CDFs F(x) and G(x).

The proposed distance has the value

$$d_p(F, G) = \left( \int_0^{a} 1\, dx \right)^{1/p} = a^{1/p}$$

For $p < \infty$, the distance value depends on a. The kernel value is

$$k(F, G) = e^{-\alpha\, d_2(F, G)^2} = e^{-\alpha a}$$

The computation of the distance $d_p(F, G)$ is straightforward. For a discrete dataset of size n, the complexity of computing the distance is $O(n)$. In contrast, the computation of the earth mover's distance requires the Hungarian algorithm ([9]), with a complexity of $O(n^3)$.
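To make the construction concrete, the following Python sketch (ours, not from the paper) approximates $d_p(F, G)$ from two finite samples using empirical CDFs evaluated on a common grid over the samples' range, and builds the kernel $k(F, G) = e^{-\alpha\, d_2(F, G)^2}$. The grid resolution, the sample sizes and the value of $\alpha$ are illustrative assumptions.

```python
import numpy as np

def empirical_cdf(sample, grid):
    """F_n(x) = (1/n) * #{X_i < x}, evaluated at every grid point."""
    sample = np.sort(sample)
    return np.searchsorted(sample, grid, side="left") / len(sample)

def cdf_distance(x, y, p=2.0, n_grid=1000):
    """Approximate d_p(F, G) = (integral of |F - G|^p dx)^(1/p) on the joint range."""
    lo, hi = min(x.min(), y.min()), max(x.max(), y.max())
    grid = np.linspace(lo, hi, n_grid)
    dx = (hi - lo) / (n_grid - 1)
    diff = np.abs(empirical_cdf(x, grid) - empirical_cdf(y, grid))
    if np.isinf(p):
        return diff.max()                      # d_infinity, Kolmogorov-Smirnov style
    return (np.sum(diff ** p) * dx) ** (1.0 / p)

def cdf_kernel(x, y, alpha=1.0):
    """k(F, G) = exp(-alpha * d_2(F, G)^2)."""
    return np.exp(-alpha * cdf_distance(x, y, p=2.0) ** 2)

# Toy usage: two samples whose means differ by a shift.
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=500)
y = rng.normal(0.5, 1.0, size=500)
print(cdf_distance(x, y, p=2.0))          # grows with the shift between the samples
print(cdf_distance(x, y, p=np.inf))       # sup-norm variant
print(cdf_kernel(x, y, alpha=1.0))        # similarity decreases as the shift grows
```

Sorting dominates the cost of building the empirical CDFs, so the whole computation stays near-linear in the sample size, in line with the complexity comparison above.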
4. Experimental Results and Discussions

The CDF-based kernels and distances can be effective on continuous distributions as well. A Gaussian mixture distribution ([2]) and its variations, shown in Figure 3, are used to test the kernel functions. The first chart shows the original Gaussian mixture; the other two distributions are obtained by moving the middle mode. Clearly the second distribution is much closer to the original distribution than the third one.

Figure 3. A Gaussian mixture and variations.

Indexed in the same order as Figure 3, the Bhattacharyya kernel matrix for the three distributions is

$$\begin{bmatrix} 1 & 0.775 & 0.715 \\ 0.775 & 1 & 0.715 \\ 0.715 & 0.715 & 1 \end{bmatrix}$$

The Bhattacharyya kernel did not clearly distinguish the second and the third distributions when compared with the original: there is no significant difference between the kernel values $k_{12}$ and $k_{13}$, which measure the similarities between the original distribution and the two varied distributions.

The kernel matrix of our proposed kernel is

$$\begin{bmatrix} 1 & 0.123 & 1.67 \times 10^{-6} \\ 0.123 & 1 & 4.68 \times 10^{-5} \\ 1.67 \times 10^{-6} & 4.68 \times 10^{-5} & 1 \end{bmatrix}$$

The CDF-based kernel performed much better in this example. The kernel values $k_{12}$ and $k_{13}$ clearly reflect the larger deviation (lower similarity) of the third distribution from the original. The density-based Bhattacharyya kernel does not capture the global variations, whereas the CDF-based kernel is much more effective in detecting global changes in the distributions.
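A rough sketch of how such a comparison might be reproduced is given below (our code, not the authors' implementation). The mixture parameters, the grid, the size of the mode shifts and $\alpha$ are guesses made for illustration, so the resulting matrices will not match the values reported above exactly, although the qualitative gap between $k_{12}$ and $k_{13}$ should be similar.

```python
import numpy as np

def gaussian_mixture(grid, components):
    """Evaluate a Gaussian mixture density on a uniform grid and renormalize it.

    components: list of (mean, sigma, weight) tuples."""
    pdf = np.zeros_like(grid)
    for m, s, w in components:
        pdf += w * np.exp(-0.5 * ((grid - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))
    dx = grid[1] - grid[0]
    return pdf / (pdf.sum() * dx)

def bhattacharyya_kernel(p, q, dx):
    """B(p, q) = integral of sqrt(p(x) q(x)) dx, as a Riemann sum."""
    return np.sum(np.sqrt(p * q)) * dx

def cdf_kernel(p, q, dx, alpha=1.0):
    """k(F, G) = exp(-alpha * d_2(F, G)^2), with CDFs as cumulative sums."""
    F, G = np.cumsum(p) * dx, np.cumsum(q) * dx
    d2_squared = np.sum((F - G) ** 2) * dx
    return np.exp(-alpha * d2_squared)

grid = np.linspace(0.0, 350.0, 3500)
dx = grid[1] - grid[0]
# Original three-mode mixture plus two variants obtained by moving the middle mode.
variants = [
    [(80, 15, 0.4), (170, 15, 0.3), (260, 15, 0.3)],   # original
    [(80, 15, 0.4), (190, 15, 0.3), (260, 15, 0.3)],   # middle mode moved slightly
    [(80, 15, 0.4), (240, 15, 0.3), (260, 15, 0.3)],   # middle mode moved far
]
densities = [gaussian_mixture(grid, v) for v in variants]

K_bhatt = np.array([[bhattacharyya_kernel(p, q, dx) for q in densities] for p in densities])
K_cdf = np.array([[cdf_kernel(p, q, dx) for q in densities] for p in densities])
print(np.round(K_bhatt, 3))   # off-diagonal entries stay close to each other
print(np.round(K_cdf, 6))     # k13 drops far below k12 as the mode moves away
```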
5. Conclusions and Future Work

In this paper, we presented a new family of distance and kernel functions on probability distributions based on cumulative distribution functions. The distance function was shown to be a metric, and the kernel function was shown to be a positive definite kernel. Compared with traditional density-based divergence functions, the proposed distance measures are more effective in detecting global discrepancies between distributions. Experimental results on generated distributions were discussed.

The method can be extended to high-dimensional distributions, where the advantages of the CDF are maintained. However, directly computing high-dimensional CDFs carries a significant cost. We plan to investigate the 2D extension in particular, which could yield useful results for image processing.

Acknowledgment. The authors wish to thank the referees for their extremely helpful comments and suggestions.

6. References

[1] Bhattacharyya, A., "On a measure of divergence between two statistical populations defined by their probability distributions", Bulletin of the Calcutta Mathematical Society, 35: 99–109, (1943).
[2] Bishop, C., Pattern Recognition and Machine Learning, Springer, New York, (2006).
[3] Bogachev, V.I.; Kolesnikov, A.V., "The Monge-Kantorovich problem: achievements, connections, and perspectives", Russian Math. Surveys, 67: 785–890.
[4] Boser, B.E.; Guyon, I.M.; Vapnik, V.N., "A training algorithm for optimal margin classifiers", Proceedings of the Fifth Annual Workshop on Computational Learning Theory (COLT '92), p. 144, (1992).
[5] Gretton, A.; Borgwardt, K.M.; Rasch, M.J.; Schölkopf, B.; Smola, A., "A kernel two-sample test", J. Machine Learning Research, 13: 723–773, (2012).
[6] Hellinger, E., "Neue Begründung der Theorie quadratischer Formen von unendlichvielen Veränderlichen", Journal für die reine und angewandte Mathematik (in German), 136: 210–271, (1909).
[7] Jebara, T.; Kondor, R.; Howard, A., "Probability product kernels", J. Machine Learning Research, 5: 819–844, (2004).
[8] Kullback, S.; Leibler, R.A., "On information and sufficiency", Annals of Mathematical Statistics, 22(1): 79–86, (1951).
[9] Munkres, J., "Algorithms for the assignment and transportation problems", Journal of the Society for Industrial and Applied Mathematics, 5(1): 32–38, (1957).
[10] Smirnov, N.V., "Approximate distribution laws for random variables, constructed from empirical data", Uspekhi Mat. Nauk, 10: 179–206 (in Russian), (1944).