Isograph: Neighbourhood Graph Construction Based on Geodesic Distance for Semi-Supervised Learning

Marjan Ghazvininejad, Mostafa Mahdieh, Hamid R. Rabiee, Parisa Khanipour Roshan and Mohammad Hossein Rohban
DML Research Lab, Department of Computer Engineering, Sharif University of Technology, Tehran, Iran
Contact Email: [email protected]

Abstract - Semi-supervised learning based on manifolds has been the focus of extensive research in recent years. Appropriate neighbourhood graph construction is a key component of a successful semi-supervised classification method. Previous graph construction methods fail when there are pairs of data points that have a small Euclidean distance but are far apart on the manifold. To overcome this problem, we start with an arbitrary neighbourhood graph and iteratively update the edge weights using estimates of the geodesic distances between points. Moreover, we provide theoretical bounds on the values of the estimated geodesic distances. Experimental results on real-world data show significant improvement over previous graph construction methods.

Keywords - Semi-supervised Learning, Manifold, Geodesic Distance, Graph Construction

I. INTRODUCTION

A. Semi-supervised Learning

The costly and time-consuming process of data labeling, together with the large amount of relatively cheap unlabeled data at hand, are two characteristics of real-world applications that have caused a recent surge of interest in Semi-Supervised Learning (SSL) methods. Text and image mining are common examples of applications in which SSL plays an important role ([1], [2], [3]). SSL methods utilize both labeled and unlabeled data to improve the generalization ability of the learner in such applications. Using the unlabeled data, one can estimate the distribution of the data in feature space, which can greatly improve classification.
In order to use the unlabeled data for label inference more efficiently, certain assumptions must be made about the general geometric properties of the data. In many applications, high-dimensional data points are actually samples from a low-dimensional subspace of the feature space. In these cases, we can make use of the Manifold/Cluster assumption, which is among the most practical assumptions in SSL [4]. The Manifold/Cluster assumption holds in many real-world datasets in general, and in image datasets in particular [5].

Weighted discrete graphs are a suitable representation of manifolds. A manifold can be represented by sampling a finite number of points from it as the graph vertices and putting edges between nearby points on the manifold. Since the underlying manifold is unknown in real data, and smoothness estimation of the labeling function relies heavily on the manifold model, graph construction plays an important role in this problem. Therefore, several methods have been proposed to construct a suitable graph representing the manifold. These graph construction methods output a weighted graph in which each data point is a vertex and each edge weight represents the distance between its endpoints. The constructed graph is then used to infer the labels of the unlabeled data points. Hence, appropriate construction of the neighbourhood graph plays a key role in manifold-based SSL. This argument is further discussed in Subsection II-C.

Some recent work in the Semi-Supervised Learning literature has focused on proposing graph construction methods that best represent the manifold structure. k-NN and eps-ball are two classical methods of graph construction [6]. Several schemes have been proposed to improve the k-NN graph construction method. Moreover, Jebara et al. proposed the b-matching algorithm [7] which, unlike the k-NN method, produces a balanced graph, i.e. a graph in which all nodes have the same number of neighbours.
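For concreteness, the two classical constructions just mentioned can be sketched as follows. This is a minimal illustration, not code from the cited work; the function and parameter names (`build_knn_graph`, `build_eps_ball_graph`, `k`, `eps`) are our own.

```python
# Minimal sketches of the two classical neighbourhood graph
# constructions; infinity marks "no edge", matching the paper's
# convention w(u, v) = inf for missing edges.
import numpy as np

def build_knn_graph(X, k):
    """Symmetric k-NN graph: w[i, j] is the Euclidean distance if j is
    among the k nearest neighbours of i (or vice versa), else infinity."""
    n = X.shape[0]
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    w = np.full((n, n), np.inf)
    for i in range(n):
        # skip the point itself (distance 0) via indices 1..k of the sorted row
        nbrs = np.argsort(d[i])[1:k + 1]
        w[i, nbrs] = d[i, nbrs]
    # add reverse edges so the graph is symmetric, as the paper notes;
    # this is why some vertices can end up with degree greater than k
    return np.minimum(w, w.T)

def build_eps_ball_graph(X, eps):
    """eps-ball graph: connect every pair of points closer than eps."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    return np.where((d < eps) & (d > 0), d, np.inf)
```

Both constructions use only Euclidean distances, which is exactly the weakness that shortcut edges exploit later in the paper.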
The effectiveness of b-matching has been corroborated by both theoretical and experimental justifications [8]. However, many graph construction methods use the Euclidean distance between data points as their distance measure. Unfortunately, this approach is misleading at times, since two points with a small Euclidean distance may be situated far apart on the manifold. In this case the two points are connected even though they are in fact distant from each other on the manifold. Such edges are called shortcut edges [9], and a graph containing them does not represent the manifold structure correctly. This situation can be prevented by using a distance measure that reflects the true distances between data points more faithfully. Approximating the correct distances enables us to determine the neighbourhood of a data point more precisely.

Cukierski et al. have proposed a method to identify shortcut edges via the Betweenness Centrality measure [9]. The Betweenness Centrality of an edge is related to the number of graph shortest paths (between any two vertices) that pass through it. They intuitively argue that shortcut edges are likely to have high Betweenness Centrality, and use this fact to remove such edges. However, to the best of our knowledge, this argument has not been justified theoretically.

In this paper, we introduce a novel algorithm that detects shortcut edges and adjusts their weights with the aim of approaching the true distances between points on the manifold. Since the graph constructed by this algorithm is based on the intrinsic distances between points, we name it Isograph. We provide a solid theoretical foundation for our work; in fact, the algorithm follows directly from the theorems. Experiments on benchmark datasets show promising results compared to previous state-of-the-art work.
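The betweenness-centrality heuristic of [9] can be sketched roughly as follows, assuming a networkx graph whose edges carry a `weight` attribute. The top-quantile threshold is our own simplification for illustration, not the exact rule used in [9].

```python
# Hedged sketch of the prior betweenness-centrality heuristic:
# edges that many shortest paths pass through are suspected shortcuts.
import networkx as nx

def remove_suspected_shortcuts(G, quantile=0.99):
    """Drop edges whose betweenness centrality is above the given quantile."""
    # missing 'weight' attributes default to 1 in networkx
    bc = nx.edge_betweenness_centrality(G, weight="weight")
    cutoff = sorted(bc.values())[int(quantile * (len(bc) - 1))]
    H = G.copy()
    H.remove_edges_from([e for e, c in bc.items() if c > cutoff])
    return H
```

On a graph of two dense clusters joined by a single bridge, the bridge carries all cross-cluster shortest paths and is the edge removed, which is the intuition the heuristic relies on.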
It is worth noting that our method can be initialized from any arbitrary graph; therefore it can easily be combined with many previous graph construction methods. Finally, note that the well-known Isomap algorithm also tries to estimate the geodesic distances between points, with considerable results [10]. However, Isomap cannot find the geodesic distance between nearby points properly; it concentrates on finding the geodesic distance between far-apart points, and therefore cannot be used in graph construction methods.

The remainder of the paper is organized as follows. In Section II we introduce the notation used throughout the paper and provide basic definitions; this section can be safely skipped by the experienced reader. Next, in Sections III and IV, we explain the motivation of our algorithm and the basic idea of geodesic distance. This is followed by the precise problem setting in Section V. In Section VI we present our proposed method and give theoretical justifications for it. Finally, in Section VII, the experimental results of applying our method on both synthetic and real-world datasets are presented.

II. BASICS AND NOTATIONS

A. Classification Setting

Consider the set of possible classes {-1, +1}, and let the feature space be the D-dimensional real-valued space R^D. We denote the labeled data as X_L = {x_1, ..., x_l} with corresponding labels y = (y_1, ..., y_l), where x_i in R^D and y_i in {-1, +1}. The unlabeled dataset is given as X_U = {x_{l+1}, ..., x_{l+u}}, where x_i in R^D. In this setting X_L, X_U and y are given, and the goal is to find an estimate f = (f_1, ..., f_{l+u}) of the labels of all the points. Each f_i is a real value between -1 and 1, and larger values of f_i correspond to stronger membership in class +1. In practice these values are mapped to {-1, +1} after inference is done. This classification problem can be generalized to the case of possible classes {1, 2, ..., m}, where m >= 2, using the one-against-all method.

B. Manifold/Cluster Assumption

In Figure 1 A the goal is to predict the label of the unlabeled point. (Figure 1. Four of five given data points are labeled. The goal is to predict the label of the unlabeled data point A) without any prior knowledge, B) knowing the Manifold/Cluster assumption [4].) Without any other prior knowledge, the best prediction is class -, as the two points nearest to the unknown point are from this class. Now suppose we somehow know that the data points are distributed only on the curve shown in Figure 1 B, and we expect the labels of adjacent points on this curve to be similar. In this new setting the label + is a better candidate, as the two adjacent points of the unlabeled point are both from class +. The assumption just mentioned is called the Manifold/Cluster assumption and is generalized to d-dimensional spaces. The Manifold/Cluster assumption in fact consists of two parts: the Manifold assumption and the Cluster assumption. The Manifold assumption states that the data points lie on a d-dimensional manifold (denoted by M) in the D-dimensional feature space (d << D). The Cluster assumption states that the labels of the points vary smoothly on M. We will use the term "Manifold assumption" instead of Manifold/Cluster assumption in the rest of this paper.

Suppose f is a labeling function on M, i.e. a function from M to R. The smoothness of f is formally defined as [11]:

S(f) = \int_M \|\nabla f\|^2 \, dM    (1)

Actually, S(f) captures the concept of roughness rather than smoothness, but we will call it smoothness, as previous authors have done. The Manifold assumption also states that S(f) must have a small value.

C. Neighbourhood Graph

In SSL algorithms, we must use a discrete representation of the manifold, as we only have a finite number of data points. Graphs are therefore a suitable representation of manifolds.
To build a graph from the given data points, we consider one vertex for each data point and add edges between data points that are adjacent on the manifold; hence we call it a neighbourhood graph. If the underlying manifold is known, constructing such a graph is straightforward. The challenging problem occurs when we do not know the manifold, which is the case in real-world problems. This problem is called "graph construction". An example of a complex manifold together with the neighbourhood graph constructed on it is shown in Figure 2. (Figure 2. A curved 2D manifold in the 3D feature space. The data points are shown as black dots, and the neighbourhood graph edges are shown as lines connecting data points [12].)

We denote the constructed neighbourhood graph by G = (V, E), where V = V(G) is the set of vertices and E = E(G) is the set of edges of the graph. Each edge of the graph represents a neighbourhood relationship, and the edge weights are the distances between the corresponding endpoints. The weight of edge e = (u, v) is denoted by w(e) or w(u, v) throughout the paper. For simplicity we assume w(u, v) = inf if no edge exists between u and v in G. We choose the neighbours of each vertex and the weights of the edges so that they approximate the manifold structure. Several methods have been proposed for graph construction, each trying to present an appropriate approximation of the manifold structure. We introduce some of these methods in the following.

1) Classical graph construction methods: k-NN and eps-ball are two classical methods of graph construction [6]. In the k-NN graph construction method, each data point is connected to its k nearest neighbours, and the edge weights are the Euclidean distances between the endpoints. As we always add the reverse edges to make the graph symmetric, the degree of some vertices may become much greater than k. In the eps-ball method, each data point is connected to all data points at distance less than eps, and the weight of each edge is again the distance between its endpoints. There is no constraint on the degree of vertices in this method. If eps is too small the resulting graph will be too sparse, and a large eps will result in too many irrelevant edges; finding a suitable eps is therefore a hard task. Hence k-NN is used more frequently in practice.

2) b-matching: b-matching is a well-known state-of-the-art graph construction method that has seen active research in recent years [8], [7]. b-matching creates a balanced graph in which all vertices have equal degree b. This method works well when samples are distributed non-uniformly in the feature space. Theoretical foundations for this method have been presented, and it is reported to improve on k-NN graph construction when tested on digit recognition and text classification tasks. In the experimental results section, we will show that Isograph can improve the graph generated by b-matching.

D. Label Inference

We have used distances as edge weights, but a related concept, namely similarity, is needed for semi-supervised label inference. Similarity is the converse of distance: when the distance between two data points is low, their similarity is high, and vice versa. Similarity can be derived from distance in a few ways, among them the Gaussian similarity. Let W be the similarity matrix corresponding to graph G, that is, W_ij is the similarity between vertices i and j.
Then the Gaussian similarity is defined as

W_ij = exp(-d(i, j)^2 / \sigma^2)    (2)

We mentioned that smoothness is defined as S(f) = \int_M \|\nabla f\|^2 \, dM. As we only have a finite number of points on the manifold and need to infer f only on these points, we can approximate the smoothness restricted to these points as [11]:

\hat{S}(f) = \sum_{i,j=1}^{l+u} W_ij (f_i - f_j)^2    (3)

The label inference process is based on finding an f which minimizes a mixture of \hat{S}(f) and the error of f on the labeled data. It is easy to show that \hat{S}(f) can be written in the quadratic form

\hat{S}(f) = f^T L f    (4)

with L = D - W, where D is the diagonal degree matrix (i.e. D_ii = \sum_{j=1}^{l+u} W_ij). L is known as the graph Laplacian. The inference minimization problem is formally defined as

f^* = argmin_f \|Cf - y\|^2 + \gamma f^T L f    (5)

where C = (I_{l x l}  0_{l x u}) is a selection matrix, i.e. Cf contains only the labeled indices of f, so \|Cf - y\|^2 represents the difference between y and f on the labeled points. Setting the gradient of this objective to zero shows that f^* solves the linear system (C^T C + \gamma L) f = C^T y; an algorithm of running time O(n^3) can compute the solution, where n is the number of data points.

III. MOTIVATION

As previously mentioned, shortcut edges connect points of the graph that are close to each other according to the Euclidean distance but have a large geodesic distance on the manifold. An example of such an edge and the underlying manifold is shown in Figure 3. (Figure 3. Part of a one-dimensional manifold showing the shortcut edge between u and v.) These edges may be disastrous to the label inference process. According to the Manifold assumption, we expect close data points on the manifold to have similar labels. This expectation is violated in the case of shortcut edges, since the adjacent data points are actually far from each other on the manifold. Therefore, it is crucial to find such edges and reduce their impact on the inference process. We expect a graph with fewer shortcut edges to perform better in classification, so shortcut edge detection is a key problem in neighbourhood graph construction. This paper aims at detecting such edges and removing them or adjusting their weights in an appropriate manner.

IV. GEODESIC DISTANCE

In the plane, the shortest path between two points is the straight line connecting them, but on general manifolds such as the sphere, this line does not lie on the manifold. Therefore, we need a new concept to define the distance between points on a manifold. Geodesic curves are curves lying on the manifold that connect points with the shortest path (Figure 4). (Figure 4. Geodesic curve between two points on a manifold [13].)

Definition 1. For any two points a and b on the manifold M, we define d_M(a, b) as the length of the shortest curve between a and b lying on M.

Proposition 1. For any a, b in M: d(a, b) <= d_M(a, b), where d(., .) is the metric of the ambient space. This is intuitively clear, but can be proven rigorously using straight line segments for length estimation.

V. PROBLEM SETTING

In this section, we introduce the assumptions on which we have based our algorithms. These consist of the Manifold assumption and a sampling condition. The Manifold assumption is a popular assumption in Semi-Supervised Learning, and the sampling condition is a reasonable condition which is common in the manifold learning literature [14]. In our problem setting, we have the following assumptions:

1) The data points lie on a d-dimensional manifold, denoted by M. M is assumed to be bounded; this is reasonable because, in a machine representation, the feature space is usually finite and M is a subset of the feature space.

2) Sampling condition: the manifold M is sampled as follows. There exists delta in R such that for any point p in M, there exists a data point q among the labeled or unlabeled data points such that d_M(p, q) <= delta. We refer to the least such delta as delta(M).

VI. PROPOSED METHOD

In this section, we first convey the intuition behind the proposed algorithm through a basic algorithm. We then add details and practical modifications and introduce our final algorithm. One major improvement of the final algorithm over the baseline is that it adjusts the weights of shortcut edges instead of naively removing every edge suspected of being a shortcut.

A. The Baseline Algorithm

In the baseline algorithm, we mainly try to detect the shortcut edges. An edge (u, v) of the neighbourhood graph is a shortcut edge if and only if d(u, v) << d_M(u, v). Looking back at Figure 3, we observe an important feature of the shortcut edge (u, v): when we remove this edge, the shortest path between u and v that contains only small edges must pass near the curved manifold, and therefore this path has many edges. This is the key intuition behind our algorithm, which we now explain more precisely.

Suppose we start with an initial graph G obtained by any graph construction method. For any edge e = (u, v) in E(G) with weight w, we consider the subgraph containing only the edges with weights less than w (the small edges mentioned above). Suppose the shortest path between u and v in this subgraph is a long path. All edges on this path have smaller weights than the edge e, so we expect this path to represent the geodesic curve between u and v better than e does. As a result, a better estimate of the geodesic distance may be obtained from this path. Moreover, if the number of edges on such a path is large enough, e.g. larger than two, it is probable that (u, v) connects points that are far apart on the manifold, and therefore (u, v) is probably a shortcut edge. In the following, we prove that a threshold length of two is an appropriate criterion for detecting shortcuts.

This procedure is not performed on the edges of the Minimum Spanning Tree of the initial graph G, denoted MST(G), because preserving the edges of MST(G) is necessary for graph construction in the proposed algorithm: a disconnected graph is disastrous to the label inference process. To ensure the connectivity of G we do not remove any of the edges in MST(G). The MST is a suitable choice because it prefers smaller edges, which are less likely to be shortcut edges.

Require: An initial graph G built with a graph construction method (e.g. k-NN)
Ensure: Shortcuts of graph G are removed
1: Let G_f be the full graph on the sampling, i.e. the graph containing the edge e = (u, v) for all u, v in V(G)
2: for all e = (u, v) in E(G) - E(MST(G)) in ascending order of weight do
3:   G_{u,v} <- the subgraph of G_f with edge weights less than w(u, v)
4:   P <- length of the shortest path between u and v in G_{u,v}
5:   if P > 2 then
6:     Remove edge e from E(G)
7:   end if
8: end for
Algorithm 1: The baseline algorithm

To justify the correctness of our algorithm, we should show that the baseline algorithm preserves an edge (u, v) in E(G) if d(u, v) is close enough to d_M(u, v), and removes it otherwise. We already know from Proposition 1 that d(u, v) <= d_M(u, v) always holds. In the following theorems, we first show that if d_M(u, v) is not much larger than d(u, v), the edge is not removed by the baseline algorithm.

Theorem 1. If d_M(u, v) < 2 d(u, v) - 2 delta(M), where delta(M) is defined in the sampling condition of Section V, then the baseline algorithm preserves the edge (u, v).

Proof: This theorem is a special case of Theorem 3, which is proved in Appendix A.

To complete the justification, we further show that if d_M(u, v) is much larger than d(u, v), the edge (u, v) is removed by the baseline algorithm. To do so, we need to define some concepts first.
Definition 2.
1) Consider all unit-speed geodesic curves C completely lying on M. The minimum radius of curvature r_0 = r_0(M) is defined by 1/r_0 = max_{C,t} \|\ddot{C}(t)\|, where \ddot{C}(t) denotes the second derivative of C with respect to t [14].
2) The minimum branch separation s_0 = s_0(M) is defined as the largest positive number for which d(x, y) < s_0 implies d_M(x, y) <= \pi r_0 for every x, y in M, where r_0 is the minimum radius of curvature [14].

(Figure 5. The geodesic path between the two endpoints of edge e = (u, v) is shown by the dashed line. The geodesic paths between the pairs u, w and w, v are shown by solid curves.)

Definition 3. A manifold M is called geodesically convex if there exists a mathematical geodesic curve C between any two arbitrary points x, y in M with length d_M(x, y) [14]. A mathematical geodesic curve C on a manifold M is a curve whose geodesic curvature is zero at all points of the curve [15]. This condition is only needed for the next theorem.

Theorem 2. If M is a geodesically convex manifold and there exist u, v in M with d(u, v) < s_0 and d_M(u, v) >= (2 / (1 - \lambda_0)) d(u, v), then the baseline algorithm removes the edge e = (u, v), where \lambda_0 is a constant for a given manifold M and can be computed as \lambda_0 = \pi^2 s_0(M)^2 / (96 r_0(M)^2).

Proof: Suppose the baseline algorithm does not remove edge e. Then, according to the baseline algorithm, the length of the shortest path between u and v in G_{u,v} equals two (it cannot be one, because we omit (u, v) itself), and hence there exist edges e_1 = (u, w) and e_2 = (w, v) such that d(u, w) < d(u, v) and d(w, v) < d(u, v) (Figure 5).

From [14], we know that for any 0 < \lambda < 1, if the points x, y of a geodesically convex manifold M satisfy the conditions

d(x, y) <= (2/\pi) r_0 \sqrt{24 \lambda},    d(x, y) < s_0    (6)

then we have

d_M(x, y) >= d(x, y) >= (1 - \lambda) d_M(x, y).

Taking x = u, y = v and \lambda = \lambda_0, it is easily verified that the conditions in Equation 6 are satisfied in our case. Since d(u, w) < d(u, v), the conditions in (6) also hold for x = u, y = w, \lambda = \lambda_0. Therefore we have d(u, w) >= (1 - \lambda_0) d_M(u, w). Combining this with the relation d(u, v) > d(u, w), we conclude that

d(u, v) > d(u, w) >= (1 - \lambda_0) d_M(u, w).

A similar conclusion holds for x = w, y = v and \lambda = \lambda_0:

d(u, v) > d(w, v) >= (1 - \lambda_0) d_M(w, v).

Summing these two relations and using the triangle inequality d_M(u, w) + d_M(w, v) >= d_M(u, v), we obtain

d(u, v) > (1/2)(1 - \lambda_0)(d_M(u, w) + d_M(w, v)) >= ((1 - \lambda_0)/2) d_M(u, v).

This contradicts the assumption that d(u, v) <= ((1 - \lambda_0)/2) d_M(u, v). Therefore the supposition is false, the baseline algorithm removes edge e = (u, v), and the proof is complete.

B. Shortcomings

The baseline algorithm has two drawbacks. First, if d_M(u, v) is small (i.e. close to 2 delta(M)), Theorem 1 cannot guarantee that the edge (u, v) is preserved, even when d(u, v) is approximately equal to d_M(u, v). If d_M(u, v) < 2 delta(M), the delta(M) term in the inequality of Theorem 1 has greater influence than d_M(u, v); consequently, the algorithm may remove correct edges, because the precondition of the theorem is at risk of being violated. An example of this situation occurs on a plane-shaped manifold, where d(u, v) is exactly equal to d_M(u, v) for all u, v in V(G). Even though no shortcut edge exists in this case, the baseline algorithm may remove some of the edges of G. Secondly, although the baseline algorithm is able to detect the large difference between d(u, v) and d_M(u, v) for a shortcut edge, it naively removes such edges.
The classification result will improve if these edges have a very small effect on inference, rather than being removed; that is, adjusting the edge weights in an appropriate manner is a better solution. This way, we can estimate the structure of the manifold more accurately. These shortcomings are overcome in the proposed algorithm, named Isograph, which is described in the next section.

C. An Improved Algorithm: Isograph

We now propose the Isograph algorithm to overcome the shortcomings described in the previous section. This algorithm is a modified version of the baseline algorithm with two improvements:

- To overcome the problem with small values of d_M(u, v), Isograph leaves all edges with d(u, v) <= eps unchanged. If we choose eps such that eps > 2 delta(M), then since d_M(u, v) >= d(u, v), we have d_M(u, v) > 2 delta(M), which solves the first problem.
- To overcome the second shortcoming, Isograph maintains an estimate \hat{d}_M(u, v) for each edge (u, v) in E(G), and if \hat{d}_M(u, v) is too far from d_M(u, v), instead of removing the shortcut edge it increases the edge weight \hat{d}_M(u, v) to make it a better estimate of the geodesic distance. The same graph structure is thus preserved with better edge weights, which may in turn cause other edges to be updated in subsequent iterations. In Theorem 3 we show that updating over multiple iterations increases the edge weights towards the geodesic distances, yielding increasingly accurate estimates.

As mentioned in Theorem 1, it can be proven that for any edge (u, v) detected as a shortcut by the baseline algorithm, we have

d_M(u, v) >= 2 (d(u, v) - delta(M)).

Therefore, we may use the following update rule for the edge weights:

\hat{d}_M(u, v) <- 2 (d(u, v) - delta(M)).

Later, in Theorem 4, we show that this is indeed an appropriate update rule which gives a better estimate of d_M(u, v).
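The per-edge detection test that Isograph inherits from the baseline (a shortest path of more than two edges in the subgraph of strictly smaller edges) can be sketched as follows. The identifiers are our own, and the graph is assumed to be a networkx graph with weighted edges.

```python
# Sketch of the detection test applied to a single edge (u, v):
# build the subgraph of edges strictly shorter than w(u, v) and check
# whether the shortest u-v path there has more than two edges.
import networkx as nx

def is_suspected_shortcut(G, u, v):
    w_uv = G[u][v]["weight"]
    H = nx.Graph()
    H.add_edges_from((a, b, d) for a, b, d in G.edges(data=True)
                     if d["weight"] < w_uv)
    try:
        path = nx.shortest_path(H, u, v, weight="weight")
    except (nx.NetworkXNoPath, nx.NodeNotFound):
        return False  # no alternative path of smaller edges exists
    return len(path) - 1 > 2  # number of edges on the path
```

On a chain 0-1-2-3 of unit edges with an extra edge (0, 3) of weight 2.5, the extra edge is flagged (the small-edge path has three edges) while the chain edges are not.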
Using the eps constraint and updating the edge weights iteratively with the above update rule, we arrive at Isograph.

Require: An initial graph G built with a graph construction method (e.g. k-NN)
Ensure: Adjusted edge weights \hat{d}_M^t(u, v), for all (u, v) in E(G)
1: for all e = (u, v) in E(G) do
2:   \hat{d}_M^{(1)}(u, v) <- d(u, v)
3: end for
4: for t = 1 ... NumberOfIterations do
5:   for all e = (u, v) in E(G) - E(MST(G)) do
6:     if \hat{d}_M^{(t)}(u, v) >= eps then
7:       G_{u,v} <- the subgraph of G with edge weights less than \hat{d}_M^{(t)}(u, v) *
8:       P <- length of the shortest path between u and v in G_{u,v}
9:       if P > 2 then
10:        \hat{d}_M^{(t+1)}(u, v) <- 2 (\hat{d}_M^{(t)}(u, v) - delta(M))
11:      end if
12:    end if
13:  end for
14: end for
Algorithm 2: Isograph (the proposed algorithm)

* In fact, we also add to G_{u,v} any pair (x, y) not in E(G) such that d(x, y) <= \hat{d}_M^{(t)}(u, v).

In the following theorems, we prove that the loop invariant

d(u, v) <= \hat{d}_M^t(u, v) <= d_M(u, v),  for all (u, v) in E(G)    (7)

holds throughout the procedure of Isograph. In addition, we show that the difference between the true and estimated values of d_M(u, v) decreases as the edge weights are updated in each iteration. The theorems show that the estimated geodesic distance always lies between the Euclidean distance and the true geodesic distance; therefore, we may increase the edge weights iteratively without worrying about exceeding the true distance.

Theorem 3. Assuming the loop invariant (Equation 7) holds at some time instance, if d_M(u, v) < 2 \hat{d}_M(u, v) - 2 delta(M), then Isograph will not update edge e = (u, v) (line 10 of Algorithm 2).

Theorem 4. At any point throughout the procedure of Isograph, Equation 7 holds.

The proofs of these theorems are included in Appendix A.

Lemma 1.
If edge e = (u, v) is updated at iteration t, then

\hat{d}_M^{(t+1)}(u, v) > \hat{d}_M^t(u, v)    (8)

Proof: Edge e is updated, so \hat{d}_M^{(t+1)}(u, v) = 2 (\hat{d}_M^t(u, v) - delta(M)). We know that \hat{d}_M^t(u, v) >= eps > 2 delta(M). Adding \hat{d}_M^t(u, v) to both sides of this relation gives 2 \hat{d}_M^t(u, v) > 2 delta(M) + \hat{d}_M^t(u, v). Therefore 2 (\hat{d}_M^t(u, v) - delta(M)) > \hat{d}_M^t(u, v), i.e. \hat{d}_M^{(t+1)}(u, v) > \hat{d}_M^t(u, v).

D. Practical Modifications

Isograph explicitly uses delta(M) and eps. From a sampling of the manifold M, we cannot exactly determine its underlying geometry: there are many manifolds passing through the same sample points, each with a different delta(M) value, so computing delta(M) is inherently an ill-posed problem. However, Lemma 2 gives a lower bound on delta(M).

Lemma 2. If the maximum-weight edge in the minimum spanning tree (MST) of the neighbourhood graph has weight w_mst, then delta(M) >= w_mst / 2, where delta(M) is defined in Section V.

The proof of this lemma is given in Appendix A. As Lemma 2 indicates, delta(M) >= max_{e in MST(G)} w(e) / 2. However, as argued above, estimating delta(M) is an ill-posed problem; therefore, in order to estimate delta(M), we must suppose that the data provided to our algorithms lies on a manifold satisfying some intuitively reasonable constraints, i.e. we must assume some prior knowledge about delta(M).

Another issue is that in many cases the sampling may be sparse in only a small region of M. As delta(M) is defined as a global parameter of the sampling, this results in a large value of delta(M), even though the local value of delta may be much smaller in many regions. In Theorem 3, where delta(M) enters our formulation, we do not need a global bound on delta(M), so we can use a local bound instead. We assume that the local value of delta has the same order of magnitude as w_mst in Lemma 2, i.e. delta = alpha * w_mst / 2.
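Combining Algorithm 2 with the practical estimate delta = alpha * w_mst / 2 just described gives the following condensed sketch. All identifiers are our own; MST edges are skipped as in the text, and the footnote's extra point pairs are omitted for brevity.

```python
# Condensed sketch of Isograph (Algorithm 2) with the practical
# delta estimate; assumes a networkx graph with 'weight' edge attributes.
import networkx as nx

def isograph(G, eps, alpha=0.5, n_iterations=3):
    H = G.copy()
    T = nx.minimum_spanning_tree(H, weight="weight")
    mst_edges = {frozenset(e) for e in T.edges()}
    w_mst = max(d["weight"] for _, _, d in T.edges(data=True))
    delta = alpha * w_mst / 2  # local-delta heuristic from the text
    for _ in range(n_iterations):
        for u, v in list(H.edges()):
            if frozenset((u, v)) in mst_edges:
                continue  # MST edges are preserved to keep H connected
            w_uv = H[u][v]["weight"]
            if w_uv < eps:
                continue  # short edges are left unchanged
            # subgraph of edges strictly smaller than the current estimate
            S = nx.Graph()
            S.add_edges_from((a, b, d) for a, b, d in H.edges(data=True)
                             if d["weight"] < w_uv)
            try:
                hops = len(nx.shortest_path(S, u, v, weight="weight")) - 1
            except (nx.NetworkXNoPath, nx.NodeNotFound):
                continue
            if hops > 2:
                # suspected shortcut: apply the update rule
                # d_hat <- 2 * (d_hat - delta)
                H[u][v]["weight"] = 2 * (w_uv - delta)
    return H
```

On a five-point chain of unit edges with a shortcut edge (0, 4) of weight 2.0, one iteration raises the shortcut's weight to 2 * (2.0 - 0.25) = 3.5, moving it towards the geodesic length of 4, while the chain edges stay untouched.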
Sparsity at some parts of the manifold can be more than other parts, and an adaptive method of estimating πΏ, will clearly help Isograph. The πΌ π2mst estimation of πΏ is not adaptive, therefore to overcome this problem we will try to solve it in an indirect manner. We already know that all edges in the ππ -NN graph, where ππ βͺ π are rarely shortcut edges. Therefore its reasonable if we do not modify these edges at all. This showed to be effective in practice. VII. E XPERIMENTAL R ESULTS In this section we present the experimental results and demonstrate the effectiveness of our neighbourhood graph construction method when applied on the π-NN graph and on the output of b-matching. A. Synthetic Datasets To evaluate our proposed method, we generated three synthetic datasets. We are able to evaluate the effectiveness of our algorithm by illustrating the edges which are detected as shortcuts. Each dataset lies on a different 2D manifold shape embedded in a 3D space: Swiss roll, Step and Ellipsoid (Figure 6). The data points are generated (200 points) by a uniform i.i.d sampling on the manifold surface and each point is translated by a independent random offset. The πNN method is used to construct the neighbourhood graph where π is selected as the smallest value for which a considerable number of shortcut edges emerge. The parameter ππ (section VI-D), is set to 5 for all graphs. An effective shortcut edge detection algorithm eliminates the edges connecting two irrelevant points (i.e. edges with πβ³ β« π), while maintaining edges lying on the manifold, no matter how long the length of such edges may be. These properties are pursuant to Theorem 2 and 3 respectively. In these figures, it is easy to observe that our algorithm has both of these properties, and therefore is effective. For a better illustration, the graph edges of each manifold are partitioned and shown in two separate figures: edges which are detected as shortcuts, and those preserved. B. 
B. Real World Experiments

To evaluate the proposed method, four standard datasets consistent with the manifold assumption are selected: MNIST, USPS, Caltech 101 and Corel. USPS and MNIST are digit recognition datasets; the other two are image categorization datasets. For the Caltech and Corel datasets, a subset of classes was selected and the CEDD feature set introduced by [16] was extracted; for MNIST and USPS, the (low-resolution) image itself is the feature vector. Principal Component Analysis (PCA) is applied to all datasets for noise removal. For each dataset, ten random samples of 2000 points were drawn from the whole data, and cross-validation was used to partition each sample into labeled and unlabeled points such that there were ten labeled points per class on average. The values of $\gamma$, $k_0$, $\alpha$ and the number of iterations were set to 0.02, 3, 0.5 and 3, respectively, for all experiments.

In the first experiment, we applied Isograph on the 10-NN graph.

Figure 8. Charts comparing the accuracy of Isograph applied on the $k$-NN graph with plain $k$-NN graph construction on MNIST, USPS, Caltech and Corel (methods: 0-1 kNN, 0-1 kNN+Isograph, kNN+ML, kNN+ML+Isograph; accuracy (%) versus $K$).

Figure 6. Shortcut detection in $k$-NN graphs of three noisy synthetic datasets: Ellipsoid ($k = 20$), Step ($k = 22$) and Swiss roll ($k = 13$). Figures in the right column illustrate the edges detected as shortcuts, and therefore updated by the Isograph algorithm; edges in the left column are those maintained.

Figure 7. Shortcut edges detected by Isograph and the path found by the algorithm.

To illustrate the effectiveness of Isograph in detecting shortcut edges, we selected some of the updated edges and plotted the path found by Isograph between their endpoints (Figure 7). The first and last pictures in each row represent the endpoints of an edge.
Although this edge was in the 10-NN graph, its endpoints belong to two different classes; our algorithm therefore improves the graph structure by updating the edge between them.

In the second experiment, we applied Isograph on the $k$-NN graph and measured the accuracy of the classifier built from the resulting neighbourhood graph. The results are presented in Figure 8 for the datasets mentioned above. We ran our algorithm in two settings:

1) Binary: In this setting we use only unit edge weights. The $k$-NN approach to graph construction in the binary setting connects each vertex to its $k$ nearest neighbours with unit weight; we call this the "0-1 $k$-NN graph" to distinguish it from the weighted $k$-NN graph. Isograph can be applied to binary graph construction as follows: build the weighted $k$-NN graph, run Isograph on it, and after all iterations have finished, remove every edge that Isograph updated. We call the result "0-1 $k$-NN+Isograph". Note that this differs from the baseline algorithm: edges are not removed during Isograph itself, so they can still influence the geodesic distance estimates of other edges, and hence potentially more shortcut edges are updated.

2) Weighted: In this setting we compare Isograph with the "$k$-NN+ML" graph. The $k$-NN+ML graph is constructed by creating the weighted $k$-NN graph using the similarity of Equation 2. To build "$k$-NN+ML+Isograph", we applied Isograph to the weighted $k$-NN graph and then used Equation 2. In both weighted methods, Marginal Likelihood (ML) was used to find the best value of the similarity parameter for creating the similarity matrix W.

All four graph constructions above can be repeated with any graph construction method in place of $k$-NN. For instance, we combined Isograph with b-matching and found that the classification accuracy is superior to plain b-matching on all four datasets mentioned above (Figure 9).

Figure 10.
The shortest curve lying on $\mathcal{M}$ connecting $u$ and $v$; $m$ is the midpoint of the curve.

Figure 9. b-matching combined with Isograph in the 0-1 and ML settings on MNIST, USPS, Caltech and Corel (methods: 0-1 bMatching, 0-1 bMatching+Isograph, bMatching+ML, bMatching+ML+Isograph; accuracy (%) versus $K$).

These figures show a steady improvement from Isograph in all the settings presented, and the improvements are robust to $k$. On MNIST and USPS, $k$-NN+Isograph considerably improved the results, showing that Isograph detects the shortcut edges effectively. On these two datasets, $k$-NN+ML+Isograph works better for small values of $k$, but performance degrades slightly for larger values of $k$. This can be explained as follows: for small $k$, the $k$-NN graph uses shorter edges, which tend to have a smaller gap between $d$ and $d_\mathcal{M}$, so a maximum of three iterations suffices to reach their correct weight. When $k$ increases, we still detect shortcut edges correctly, but this number of iterations is no longer sufficient to update their weights fully, since the change per iteration is limited to a factor of roughly two.

On the Corel dataset, ML improves the results of both 0-1 $k$-NN and 0-1 $k$-NN+Isograph. Here the weights play an important role in inferring labels correctly, so the difference between the weighted and 0-1 variants is considerable. In contrast, on Caltech, ML weighting does not have a positive effect, and we see the best results with 0-1 $k$-NN+Isograph. This might be due to an unequal number of labeled points per class; note, however, that Isograph still improved the 0-1 graph. Furthermore, we combined Isograph with b-matching and showed that the classification accuracy is superior to plain b-matching on all datasets, with robustness w.r.t. the parameter $b$ (Figure 9).

VIII.
CONCLUSIONS

In this paper, we showed that using geodesic distance instead of Euclidean distance improves the neighbourhood graph, and we proposed an unsupervised method, Isograph, to estimate the geodesic distance between points. We provided bounds on the values of the geodesic estimates produced by Isograph. As Isograph can be combined with other graph construction methods, we combined it with $k$-NN and b-matching and presented results on real-world datasets, which show the steady effectiveness of Isograph. The effectiveness of using geodesic distance in the graph construction procedure and the convergence of the Isograph algorithm are subjects of future theoretical analysis. Better local estimation of $\delta$ may lead to better geodesic distance estimation. Furthermore, labeled data may be employed to improve the shortcut detection procedure.

IX. APPENDIX A: PROOF OF SOME OF THE THEOREMS

Theorem 3. Assuming the loop invariant (Equation 7) holds at some time instance, if $d_\mathcal{M}(u, v) < 2\hat{d}_\mathcal{M}(u, v) - 2\delta(\mathcal{M})$, then Isograph will preserve edge $e = (u, v)$.

Proof: Consider a shortest curve $C$ on $\mathcal{M}$ from $u$ to $v$ (Figure 10), and let $m$ be the midpoint of $C$, that is, the point that halves the length of $C$, so that
$d_\mathcal{M}(u, m) = d_\mathcal{M}(m, v) = d_\mathcal{M}(u, v)/2.$
By the sampling condition there exists a point $w$ in the sampling such that $d_\mathcal{M}(m, w) \le \delta(\mathcal{M})$. We first show that $w$ can coincide with neither $u$ nor $v$. The loop invariant gives $\hat{d}_\mathcal{M}(u, v) \le d_\mathcal{M}(u, v)$, and combining this with the assumption $d_\mathcal{M}(u, v)/2 + \delta(\mathcal{M}) < \hat{d}_\mathcal{M}(u, v)$ yields $\delta(\mathcal{M}) < d_\mathcal{M}(u, v)/2 = d_\mathcal{M}(u, m) = d_\mathcal{M}(m, v)$. Hence $w$ can be neither $u$ nor $v$, because $d_\mathcal{M}(m, w) \le \delta(\mathcal{M})$.
Now, by the triangle inequality,
$d_\mathcal{M}(u, w) \le d_\mathcal{M}(u, m) + d_\mathcal{M}(m, w).$
Substituting $d_\mathcal{M}(u, m) = d_\mathcal{M}(u, v)/2$ and $d_\mathcal{M}(m, w) \le \delta(\mathcal{M})$ gives
$d_\mathcal{M}(u, w) \le d_\mathcal{M}(u, v)/2 + \delta(\mathcal{M}).$
Finally, plugging this inequality into the assumption $d_\mathcal{M}(u, v)/2 + \delta(\mathcal{M}) < \hat{d}_\mathcal{M}(u, v)$, and using the loop invariant $\hat{d}_\mathcal{M}(u, w) \le d_\mathcal{M}(u, w)$, we reach
$\hat{d}_\mathcal{M}(u, w) \le d_\mathcal{M}(u, w) < \hat{d}_\mathcal{M}(u, v).$
In a similar way, $\hat{d}_\mathcal{M}(v, w) < \hat{d}_\mathcal{M}(u, v)$. Therefore edges $(u, w)$ and $(v, w)$ are both in $E(G_{u,v})$, and Isograph preserves edge $(u, v)$ due to the point $w$.

Theorem 4. At any point throughout the procedure of Isograph, Equation 7 holds.

Proof: We show that Equation 7 is a loop invariant, that is: 1) Equation 7 is true when $\hat{d}_\mathcal{M}$ is initialized at the beginning of the algorithm; 2) assuming Equation 7 holds at some time instance, it still holds after an edge is updated. Item 1) is true because $\hat{d}^{(1)}_\mathcal{M}(u, v) = d(u, v)$, and by Proposition 1 we have $d(u, v) \le d_\mathcal{M}(u, v)$. We now prove item 2). Suppose that at some time $t$ the loop invariant holds; we must show that $\hat{d}^{t+1}_\mathcal{M}(u, v) \le d_\mathcal{M}(u, v)$. According to Theorem 3, an edge is updated only if $\hat{d}^{t}_\mathcal{M}(u, v) \le d_\mathcal{M}(u, v)/2 + \delta(\mathcal{M})$, so
$\hat{d}^{t+1}_\mathcal{M}(u, v) = 2(\hat{d}^{t}_\mathcal{M}(u, v) - \delta(\mathcal{M})) \le d_\mathcal{M}(u, v).$

Lemma 2. If the maximum-weight edge in the minimum spanning tree (MST) of the neighbourhood graph has weight $e_{mst}$, we have $\delta(\mathcal{M}) \ge e_{mst}/2$, where $\delta(\mathcal{M})$ is defined in Section V.

Proof: Let $(u, v)$ be the edge with maximum weight in the MST, and suppose that removing $(u, v)$ splits the tree into two connected components $C_1$ and $C_2$. Define $d_\mathcal{M}(x, C_1) = \min_{y \in C_1} d_\mathcal{M}(x, y)$, the minimum distance from point $x$ to the points of $C_1$, and define $d_\mathcal{M}(x, C_2)$ similarly. Now let $C$ be any curve between $u$ and $v$; for any point $x$ on this curve we compute $f(x) = d_\mathcal{M}(x, C_1) - d_\mathcal{M}(x, C_2)$.
We know that $f(u) < 0$ and $f(v) > 0$ and that $f$ is continuous, so by the intermediate value theorem there exists a point $x^*$ on curve $C$ such that $f(x^*) = 0$. Let $x_1$ be the point of $C_1$ with minimum distance from $x^*$, and define $x_2$ similarly. By the sampling condition and $f(x^*) = 0$ we have $\delta(\mathcal{M}) \ge d_\mathcal{M}(x^*, x_1) = d_\mathcal{M}(x^*, x_2)$, so
$2\delta(\mathcal{M}) \ge d_\mathcal{M}(x^*, x_1) + d_\mathcal{M}(x^*, x_2) \ge d_\mathcal{M}(x_1, x_2) \ge \hat{d}_\mathcal{M}(x_1, x_2).$
Since $(u, v)$ is an edge of the MST, by the cut property $\hat{d}_\mathcal{M}(u, v)$ is the minimum weight among edges crossing the cut between $C_1$ and $C_2$. Therefore
$2\delta(\mathcal{M}) \ge \hat{d}_\mathcal{M}(x_1, x_2) \ge \hat{d}_\mathcal{M}(u, v) = e_{mst}.$

REFERENCES

[1] R. Ando and T. Zhang, "A high-performance semi-supervised learning method for text chunking," in Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pp. 1–9, 2005.
[2] S. Basu, M. Bilenko, and R. Mooney, "A probabilistic framework for semi-supervised clustering," in Proceedings of the Tenth ACM International Conference on Knowledge Discovery and Data Mining, pp. 59–68, 2004.
[3] S. Hoi and M. Lyu, "A semi-supervised active learning framework for image retrieval," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 302–309, 2005.
[4] O. Chapelle, B. Schölkopf, and A. Zien, Semi-Supervised Learning, vol. 2. MIT Press, Cambridge, MA, 2006.
[5] M. Belkin and P. Niyogi, "Semi-supervised learning on Riemannian manifolds," Machine Learning, vol. 56, no. 1, pp. 209–239, 2004.
[6] X. Zhu, J. Lafferty, and R. Rosenfeld, Semi-Supervised Learning with Graphs. PhD thesis, 2005.
[7] T. Jebara, J. Wang, and S. Chang, "Graph construction and b-matching for semi-supervised learning," in Proceedings of the 26th Annual International Conference on Machine Learning, pp. 441–448, ACM, 2009.
[8] B. Huang and T. Jebara, "Loopy belief propagation for bipartite maximum weight b-matching," Artificial Intelligence and Statistics, 2007.
[9] W. Cukierski and D.
Foran, "Using betweenness centrality to identify manifold shortcuts," in IEEE International Conference on Data Mining Workshops, pp. 949–958, 2008.
[10] J. Tenenbaum, V. de Silva, and J. Langford, "A global geometric framework for nonlinear dimensionality reduction," Science, vol. 290, no. 5500, p. 2319, 2000.
[11] M. Belkin and P. Niyogi, "Problems of learning on manifolds," The University of Chicago, 2003.
[12] M. Hein and U. von Luxburg, "Introduction to graph-based semi-supervised learning."
[13] J. Odegard, "Dimensionality reduction methods for molecular motion."
[14] M. Bernstein, V. de Silva, J. Langford, and J. Tenenbaum, "Graph approximations to geodesics on embedded manifolds," Technical Report, Department of Psychology, Stanford University, 2000.
[15] M. do Carmo, Riemannian Geometry. Birkhäuser, 1992.
[16] Y. Chen and J. Wang, "Image categorization by learning and reasoning with regions," The Journal of Machine Learning Research, vol. 5, pp. 913–939, 2004.