Non-Uniform Distribution of Nodes in the Spatial Preferential Attachment Model

Jeannette Janssen∗ Pawel Pralat† Rory Wilson‡

October 14, 2015

Abstract

The spatial preferential attachment (SPA) is a model for complex networks. In the SPA model, nodes are embedded in a metric space, and each node has a sphere of influence whose size increases if the node gains an in-link, and otherwise decreases with time. In this paper, we study the behaviour of the SPA model when the distribution of the nodes is non-uniform. Specifically, the space is divided into dense and sparse regions, where it is assumed that the dense regions correspond to coherent communities. We prove precise theoretical results regarding the degree of a node, the number of common neighbours, and the average out-degree in a region. Moreover, we show how these theoretically derived results about the graph properties of the model can be used to formulate a reliable estimator for the distance between certain pairs of nodes, and to estimate the density of the region containing a given node.

Keywords: Spatial Random Graphs, Spatial Preferential Attachment Model, Preferential Attachment, Complex Networks, Web Graph, Co-citation, Common Neighbours

1 Introduction

There has been a great deal of recent interest in modelling complex networks, a result of the increasing connectedness of our world. The hyperlinked structure of the Web, citation patterns, friendship relationships, infectious disease spread: these are seemingly disparate linked data sets which have fundamentally very similar natures. Many models of complex networks have a common weakness: the ‘uniformity’ of the nodes; other than link structure there is no way to distinguish the nodes.
One family of models which overcomes this deficiency is the family of spatial (or geometric) models, wherein the nodes are embedded in a metric space. A node’s position, especially in relation to the others, has real-world meaning: the character of the node is encoded in its location. Similar nodes are closer in the space than dissimilar nodes. This metric space has many potential meanings: in communication networks, perhaps physical distance; in a friendship graph, an interest space; in the World Wide Web, a topic space. As an illustration, a node representing a webpage on pet food would be closer in the metric space to one on general pet care than to one on travel.

∗ Department of Mathematics and Statistics, Dalhousie University, Halifax, NS, Canada
† Department of Mathematics, Ryerson University, Toronto, ON, Canada
‡ Department of Mathematics and Statistics, Dalhousie University, Halifax, NS, Canada

The Spatial Preferential Attachment Model [1], designed as a model for the World Wide Web, is one such spatial model. Indeed, as its name suggests, the SPA Model combines geometry and preferential attachment. Setting the SPA Model apart is the incorporation of ‘spheres of influence’ to accomplish preferential attachment: the greater the degree of the node, the larger its sphere of influence, and hence the higher the likelihood of the node gaining more neighbours. The SPA model produces scale-free networks, which exhibit many of the characteristics of real-life networks (see [1, 4]). In [12], it was shown that the SPA model gave the best fit, in terms of graph structure, for a series of social networks derived from Facebook.

As the motivation behind spatial models is the ‘second layer of meaning’, that is, the character of the nodes as represented by their positions in the metric space, we hope to uncover this layer through examination of the link structure.
In particular, estimating the distance between nodes in the metric space forms the basis for two important link mining tasks: finding entities that are similar (represented by nodes that are close together in the metric space) and finding communities (represented by spatial clusters of nodes in the metric space). We show how a theoretical analysis of a spatial model can lead to reliable tools to extract the ‘second layer of meaning’.

The majority of the spatial models to this point have used a uniform random distribution of nodes in the space. However, considering the real-world networks these models represent, this assumption fails to capture an essential aspect of real-life data. On a basic level, if the metric space represents actual physical space, and the nodes people, then we note that people cluster in cities and towns, rather than being uniformly spread across the land. More abstractly, there are more webpages on a popular topic, corresponding to a small area of our metric space, than on a more obscure topic. The development of spatial network models naturally then begins to incorporate varying densities of node distribution: both ‘clumps’ of higher/lower density, as well as gradually changing densities, are possibilities. One of the more important goals is that of community recognition: the discovery and quantification of characteristically (semantically) similar nodes.

In this work we generalize the SPA model to an inhomogeneous distribution of nodes within the space. We assume distinct regions of different densities, where the dense regions are the ‘clusters’. We find that the local regions behave almost as if generated by independent SPA models with parameters derived from the densities. Many earlier results from the SPA Model then translate easily to this inhomogeneous version and we begin the process of uncovering the geometry using link analysis.
In the remainder of this section, we first review related work, and then we give a formal definition of the SPA model. In Section 2 we state our main theoretical results. In particular, we give the typical behaviour of the in-degree of a node, and use this to derive a relationship between spatial distance and number of common neighbours of a pair of nodes. The proofs of the theorems are given in Section 4. In Section 3 we verify the asymptotic results from Section 2 through a simulation of the SPA model to generate large graphs. Specifically, we show how the relationship between spatial distance and common neighbours can be used to devise a distance estimator which gives precise results. We also use the theoretical results to estimate the local density around a node. Our simulations show that these estimators give reliable results on the simulated data.

1.1 Background and Related Work

Efforts to extract node information through link analysis began with a heuristic quantification of entity similarity: numerical values, obtained from the graph structure, indicating the relatedness of two nodes. Early simple measures of entity similarity, such as the Jaccard coefficient [19], gave way to iterative graph theoretic measures, in which two objects are similar if they are related to similar objects, such as SimRank [15]. Many such measures also incorporate co-citation, the number of common neighbours of two nodes, as proposed in the context of bibliographic research in an early paper by Small [20]. In [8], the authors make inferences on the social space for nodes in a social network, using Bayesian methods and maximum likelihood.

Generative spatial models were proposed in a more general setting, where the main objective was to generate graphs with properties that correspond to those observed in real-life networks. Different approaches were explored, for example in [2] using thresholds, or in [5, 6] using a geometric variant of the preferential attachment.
Graph properties of this model were analyzed by Jordan in [16]; follow-up work on this model can be found in [7, 18]. In [17], a non-uniform distribution of the points in space is considered. In [10], Jacob and Mörters propose a probabilistic spatial model where the link probability is a function decreasing with distance. The setting is general, and includes the SPA model as a special case. Follow-up work on this model can be found in [9].

The SPA model was first proposed in [1] as a model for the World Wide Web. In [1] and [4], it was proved that the SPA model produces graphs with certain graph properties that correspond to those observed in real-life networks. The authors’ previous paper, [14], used common neighbours to explore the underlying geometry of the SPA model and quantify node similarity based on distance in the space. However, the distribution of nodes in space was assumed to be uniform. The approach used in this paper is similar to that in [14], but we investigate the complications that arise when the distribution is non-uniform, which is clearly a more realistic setting.

An earlier version of this work, containing no proofs, was presented at the workshop WAW 2013. An extended abstract can be found in [13].

1.2 The Inhomogeneous SPA Model

We begin with a brief description of our inhomogeneous SPA model. The model presented here is a generalization of the SPA model introduced in [1], the main difference being that we allow for an inhomogeneous distribution of nodes in the space.

Let $S$ be the unit hypercube in $\mathbb{R}^m$, equipped with the torus metric derived from the Euclidean norm, or any equivalent metric. The nodes $\{v_t\}_{t=1}^{n}$ of the graphs produced by the SPA model are points in $S$ chosen via an $m$-dimensional point process.
Most generally, the process is given by a probability density function $\rho$; $\rho$ is a measurable function such that $\int_S \rho\,d\mu = 1$, where $\mu$ is the Lebesgue measure. Precisely, for any measurable set $A \subseteq S$ and any $t$ such that $1 \le t \le n$,
$$P(v_t \in A) = \int_A \rho\,d\mu.$$

In fact, we will restrict ourselves to probability density functions that are locally constant. Precisely, we assume that the space $S = [0, 1)^m$ is divided into $L^m$ equal sized hypercubes, where $L$ is a constant natural number. Each hypercube is of the form $I_{j_1} \times I_{j_2} \times \cdots \times I_{j_m}$ ($0 \le j_1, j_2, \ldots, j_m < L$), where $I_j = [j/L, (j+1)/L)$. Note that any density function $\rho$ can be approximated by such a locally constant function, so that this restriction is justified.

To keep notation as simple as possible, we assume that each hypercube is labelled $R_\ell$, $1 \le \ell \le L^m$. Let $\rho_\ell$ be the density of $R_\ell$, so the density function has value $\rho_\ell$ on $R_\ell$. For any node $v$, let $R(v)$ be the hypercube containing $v$, and let $\rho(v)$ be the density of $R(v)$. Clearly, every hypercube has volume $L^{-m}$. Then the probability that a node $v_t$, introduced at time $t$, falls in $R_\ell$ equals $q_\ell = \rho_\ell L^{-m}$, and the expected number of points in $R_\ell$ equals $q_\ell n = \rho_\ell L^{-m} n$. It is easy to see that $\sum_\ell q_\ell = 1$. Thus we may model the point process as follows: at each time step $t$, one of the regions is chosen as the destination of $v_t$; region $R_\ell$ is chosen with probability $q_\ell$. Then, a location for $v_t$ is chosen uniformly at random from the chosen region $R_\ell$.

The SPA model generates stochastic sequences of graphs $\{G_t\}_{t \ge 0}$; for each $t \ge 0$, $G_t = (V_t, E_t)$, where $E_t$ is an edge set, and $V_t \subseteq S$ is a node set. The in-degree of a node $v$ at time $t$ is given by $\deg^-(v, t)$. Likewise the out-degree is given by $\deg^+(v, t)$. The sphere of influence $S(v, t)$ of a node $v$ at time $t$ is defined as the ball, centred at $v$, with total volume
$$|S(v, t)| = \frac{A_1 \deg^-(v, t) + A_2}{t},$$
where $A_1, A_2 > 0$ are given parameters. If $(A_1 \deg^-(v, t) + A_2)/t \ge 1$, then $S(v, t) = S$ and so $|S(v, t)| = 1$.
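The two-stage point process and the volume of the sphere of influence can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the flat, row-by-row indexing of the hypercubes are hypothetical choices that the paper does not prescribe.

```python
import random


def region_probabilities(densities, L, m):
    """Return q_l = rho_l * L**(-m) for each of the L**m hypercubes."""
    if len(densities) != L ** m:
        raise ValueError("need exactly one density per hypercube")
    q = [rho * L ** (-m) for rho in densities]
    if abs(sum(q) - 1.0) > 1e-9:
        raise ValueError("densities must average to 1 so that sum(q) == 1")
    return q


def sample_node(q, L, m, rng):
    """Choose region R_l with probability q_l, then a uniform point inside it."""
    ell = rng.choices(range(len(q)), weights=q, k=1)[0]
    # Hypothetical labelling: decode the flat index into cell coordinates.
    coords = []
    for _ in range(m):
        coords.append(ell % L)
        ell //= L
    return tuple((j + rng.random()) / L for j in coords)


def sphere_volume(in_degree, t, A1, A2):
    """|S(v, t)| = (A1 * deg^-(v, t) + A2) / t, capped at 1 (the whole space)."""
    return min(1.0, (A1 * in_degree + A2) / t)
```

For instance, `region_probabilities([1.6, 0.4], L=2, m=1)` splits the unit interval into a dense half and a sparse half, and `sample_node` then places about four fifths of the nodes in the dense half.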
The generation of a SPA model graph begins at time $t = 0$ with $G_0$ being the null graph. At each time step $t \ge 1$ (defined to be the transition from $G_{t-1}$ to $G_t$), a node $v_t$ is chosen from $S$ according to the given spatial distribution, and added to $V_{t-1}$ to form $V_t$. Next, independently, for each node $u \in V_{t-1}$ such that $v_t \in S(u, t-1)$, a directed link $(v_t, u)$ is created with probability $p$, $p \in (0, 1)$ being another parameter of the model.

We impose the additional restriction that $pA_1 \max_\ell \rho_\ell < 1$; this avoids regions becoming too dense. This property will always be assumed.

Let $\delta(v)$ be the distance from $v$ to the boundary of $R(v)$. Let $r(v, t)$ be the radius of the sphere of influence of node $v$ at time $t$. So if $r(v, t) \le \delta(v)$, then $S(v, t)$ is completely contained in $R(v)$ at time $t$. We see that
$$r(v, t) = \left(|S(v, t)|/c_m\right)^{1/m} = \left(\frac{A_1 \deg^-(v, t) + A_2}{c_m t}\right)^{1/m},$$
where $c_m$ is the volume of the unit ball; for example, in 2 dimensions with the Euclidean metric, $c_2 = \pi$.

As is typical in random graph theory, we shall consider only asymptotic properties of $G_n$ as $n \to \infty$. We say that an event in a probability space holds asymptotically almost surely (a.a.s.) if its probability tends to one as $n$ goes to infinity. We emphasize that the notations $o(\cdot)$ and $O(\cdot)$ refer to functions of $n$, not necessarily positive, whose growth is bounded. Since we aim for results that hold a.a.s., we will always assume that $n$ is large enough.

2 Graph properties of the SPA model

In this section we investigate typical properties of graphs produced by the inhomogeneous SPA model, aiming to use the results to infer the spatial distances between the nodes.

A central observation is that in the inhomogeneous SPA model with a locally constant density function, the probability of an edge forming from a new node $v_t$ to an existing node $v$ at time $t$ equals
$$P\big((v_t, v) \in E(G_n)\big) = p \int_{S(v,t)} \rho\,d\mu = p \sum_{\ell} \rho_\ell\, |S(v, t) \cap R_\ell|.$$

In the analysis of the original SPA model from [1], we find that spheres of influence of nodes that are born early typically shrink rapidly, while nodes born late start with small spheres of influence. A node would have to be quite close to the boundary of its region with another one for the effect of any other region to be felt. With this assumption, the expression for the link probability is very similar to that of the link probability of the original SPA model. Therefore, it seems reasonable to expect that the graph formed by nodes in a region $R_\ell$ with local density $\rho_\ell$ behaves like an independent SPA model of density $\rho_\ell$. Our results will show that this expectation is justified and can be made rigorous.

To be specific, assume that nodes in the SPA model do not arrive at fixed, discrete, time instances $t$, but instead arrive according to a homogeneous Poisson process with rate 1. (This will not significantly change the analysis but is a convenient assumption.) Then, the process inside a region $R$ with density $\rho$ will behave like the SPA model with the same parameters $A_1$, $A_2$ and $p$, but with points arriving according to a Poisson process with rate $\rho$. This means that in each time interval we expect $\rho$ points to arrive, and the expected time interval between arrivals equals $1/\rho$. If we use $v_t$ to denote the $t$-th node arriving, then the arrival time $a(t)$ of $v_t$ is approximately $t/\rho$, and thus the volume of the sphere of influence of an existing node $v$ at the time that $v_t$ is born equals
$$|S(v, a(t))| = \frac{A_1 \deg^-(v, a(t)) + A_2}{a(t)} \approx \frac{\rho A_1 \deg^-(v, a(t)) + \rho A_2}{t}.$$

Thus, in the analysis of the degree of an individual node, we expect a node $v$ in the inhomogeneous SPA model to behave like a node in the original SPA model with parameters $\rho(v)A_1$, $\rho(v)A_2$ instead of $A_1$, $A_2$, where the degree of node $v$ at time $t$ in the inhomogeneous SPA model corresponds to the degree of a node at time $a(t)$ in the corresponding SPA model.
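The generation process of Section 1.2 can be sketched as a naive simulation, which is one way to experiment with the heuristics of this section. The code below is an illustrative sketch and not the authors' implementation: it assumes $m = 2$, takes the density as a hypothetical `density_grid` argument, runs in quadratic time, and ignores the small distortion that arises when a ball of radius larger than $1/2$ wraps around the torus.

```python
import math
import random


def torus_distance(a, b):
    """Euclidean distance between two points on the unit torus."""
    total = 0.0
    for x, y in zip(a, b):
        d = abs(x - y)
        d = min(d, 1.0 - d)  # wrap around the boundary
        total += d * d
    return math.sqrt(total)


def simulate_spa(n, density_grid, p, A1, A2, seed=0):
    """Naive O(n^2) sketch of the inhomogeneous SPA model for m = 2.

    density_grid[a][b] is the density of [a/L, (a+1)/L) x [b/L, (b+1)/L).
    Returns (positions, in_degree, edges); each edge is (new_node, old_node),
    with nodes numbered from 0 in order of arrival.
    """
    rng = random.Random(seed)
    L = len(density_grid)
    cells = [(a, b) for a in range(L) for b in range(L)]
    weights = [density_grid[a][b] / L ** 2 for a, b in cells]  # q_l
    positions, in_degree, edges = [], [], []
    for t in range(1, n + 1):
        a, b = rng.choices(cells, weights=weights, k=1)[0]
        new = ((a + rng.random()) / L, (b + rng.random()) / L)
        # Spheres of influence are evaluated at time t - 1. The loop is empty
        # when t = 1, so the division by t - 1 never sees zero.
        for u, pos in enumerate(positions):
            volume = min(1.0, (A1 * in_degree[u] + A2) / (t - 1))
            radius = math.sqrt(volume / math.pi)  # c_2 = pi
            if volume >= 1.0 or torus_distance(new, pos) <= radius:
                if rng.random() < p:
                    edges.append((t - 1, u))
                    in_degree[u] += 1
        positions.append(new)
        in_degree.append(0)
    return positions, in_degree, edges
```

A call such as `simulate_spa(2000, [[1.6, 0.4], [0.4, 1.6]], p=0.6, A1=0.7, A2=2.0)` produces a small graph whose in-degrees can be compared per region with the rescaled parameters $\rho A_1$, $\rho A_2$.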
The following theorems show that this is indeed the case.

Theorem 2.1. Let $\omega = \omega(n)$ be any function tending to infinity together with $n$, and let $\varepsilon > 0$. The following holds with probability $1 - o(n^{-1})$. For every node $v$ for which $\deg^-(v, n) = k = k(n) \ge \omega^2 \log n$ and for which
$$\delta(v) \ge (1 + \varepsilon)\left(\frac{A_1 k + A_2}{c_m n}\right)^{1/m} = (1 + \varepsilon)\, r(v, n), \qquad (1)$$
it holds that for all values of $t$ such that $\max\{t_v, T_v\} \le t \le n$,
$$\deg^-(v, t) = (1 + o(1))\, k \left(\frac{t}{n}\right)^{p\rho(v)A_1}. \qquad (2)$$
Times $T_v$ and $t_v$ are defined as follows:
$$T_v = n\left(\frac{\omega \log n}{k}\right)^{\frac{1}{p\rho(v)A_1}}, \qquad t_v = \left((1 + \varepsilon)\,\frac{A_1 k}{\delta(v)^m\, c_m\, n^{p\rho(v)A_1}}\right)^{\frac{1}{1 - p\rho(v)A_1}}. \qquad (3)$$

Condition (1) on $\delta(v)$ ensures that at time $n$, $S(v, n)$ is completely contained in $R(v)$ (deterministically). In fact, due to the additional multiplicative factor of $(1 + \varepsilon)$, $S(v, n)$ is some distance removed from the boundary of $R(v)$. The expression for $T_v$ is chosen so that at this time node $v$ has a.a.s. at least $\omega \log n$ neighbours. Likewise, $t_v$ is chosen such that at this time a.a.s. the sphere of influence has shrunk so that its radius is sufficiently smaller than $\delta(v)$, again with some extra room to spare.

The implication of this theorem is that once a node accumulates at least $\omega \log n$ neighbours and its sphere of influence has shrunk so that it does not intersect neighbouring regions, its behaviour can be predicted with high probability until the end of the process, and is completely governed by its region, and no others. In particular, it follows that from time $\max\{t_v, T_v\}$ onwards the sphere of influence is completely contained in $R(v)$.

For most vertices, the moment when their sphere of influence has shrunk so that it is well contained in the region ($t_v$) will come before the moment when they first achieve $\omega \log n$ neighbours ($T_v$). Indeed, consider a vertex $v$ of degree at least $\omega \log n$ for which this is not the case. Let $T$ be the moment when the vertex reaches in-degree $\omega \log n$. By definition, the sphere of influence of $v$ at this time $T$ has a radius of order $\left(\frac{\omega \log n}{T}\right)^{1/m}$. If $T \gg \omega \log n$, then the radius is $o(1)$, and the probability that $v$ is this close to the border is also $o(1)$. The only vertices for which potentially the radius at time $T$ could be fairly large are those vertices for which $T = O(\omega \log n)$. Thus, these are the oldest vertices. These vertices do have high degree, but their spheres of influence still tend to shrink over time, so most of their edges will be acquired after time $t_v$, that is, when their sphere of influence has shrunk to be contained in the region.

We can use the results on the degree to show that each graph induced by one of the regions $R_\ell$ has a power law degree distribution. Let $N_\ell(j, n)$ denote the number of nodes of degree $j$ at time $n$ in the region $R_\ell$. The proof of the following result is a straightforward adaptation of the differential equations method used to prove the counterpart result for the uniform model (see [1]). Since this theorem is not needed to prove the main result of this paper, the proof is omitted here.

Theorem 2.2. A.a.s. the graph induced by the nodes in region $R_\ell$ has a power law degree distribution with coefficient $1 + 1/(p\rho_\ell A_1)$. Precisely, a.a.s. for any $1 \le \ell \le L^m$ there exists a constant $c_\ell$ such that for any $1 \ll j \le j_f = \left(n/\log^8 n\right)^{\frac{p\rho_{\max}A_1}{4p\rho_{\max}A_1 + 2}}$,
$$N_\ell(j, n) = (1 + o(1))\, c_\ell\, j^{-\left(1 + \frac{1}{p\rho_\ell A_1}\right)} q_\ell n.$$
Moreover, a.a.s. the entire graph generated by the inhomogeneous SPA model has a degree distribution whose tail follows a power law with coefficient $1 + 1/(p\rho_{\max}A_1)$.

The number of edges also validates our hypothesis that a region of a certain density behaves almost as a uniform SPA model with adjusted parameters. In the original SPA model with parameters $A_1$ and $A_2$ replaced by $\rho A_1$, $\rho A_2$ and $p$, the average out-degree is approximately $\frac{p\rho A_2}{1 - p\rho A_1}$, as per [1, Theorem 1.3]. The following theorem shows that the subgraph induced by one of the regions has the equivalent expected number of edges. This theorem also shows that a.a.s.
the number of edges that cross the boundary of a region is of smaller order than the number of edges completely contained in that region. Thus, almost all edges have both endpoints in the same region.

Theorem 2.3. A.a.s., for all regions $R_\ell$ of density $\rho_\ell$, $|V(G_n) \cap R_\ell| = (1 + o(1))\, q_\ell n$. Moreover,
$$E\big(|\{(u, v) \in E(G_n) : u, v \in R_\ell\}|\big) = (1 + o(1))\, \frac{p\rho_\ell A_2}{1 - p\rho_\ell A_1}\, q_\ell n.$$
Furthermore, a.a.s. $|\{(u, v) \in E(G_n) : R(u) \ne R(v)\}| = o(n)$.

Here we see that we need the condition $p\rho_{\max}A_1 < 1$. If $p\rho_{\max}A_1 \ge 1$, then the number of edges would grow superlinearly.

Our ultimate goal is to derive the pairwise distances between the nodes in the metric space through an analysis of the graph. The following theorem, obtained using the approach of [14], provides an important tool. Namely, it links the number of common in-neighbours of a pair of nodes to their (metric) distance. Using this theorem, we can then infer the distance from the number of common in-neighbours. The theorem distinguishes three cases. If $u$ and $v$ are relatively far from each other, then they will have no common neighbours. If the nodes are very close, then the number of common neighbours is approximately equal to a fraction $p$ of the degree of the node of smallest degree. The third case provides a ‘sweet spot’ where the number of common neighbours is a direct function of the metric distance and the degrees of the nodes.

For any two nodes $u$ and $v$, let $cn(u, v, t)$ denote the number of common in-neighbours of $u$ and $v$ at time $t$.

Theorem 2.4. Let $\omega = \omega(n)$ be any function tending to infinity together with $n$, and let $\varepsilon > 0$. The following holds a.a.s. Let $u$ and $v$ be nodes of final degrees $\deg(u, n) = k$ and $\deg(v, n) = j$ such that $R = R(u) = R(v)$, and $k \ge j \ge \omega^2 \log n$. Let $\rho = \rho(v) = \rho(u)$ and let $T_v = n\left(\frac{\omega \log n}{j}\right)^{\frac{1}{p\rho A_1}}$, and assume that
$$\delta(v)^m \ge cj \quad\text{and}\quad \delta(u)^m \ge ck, \qquad\text{where } c = (1 + \varepsilon)\,\frac{A_1}{c_m\, n^{p\rho A_1}\, T_v^{1 - p\rho A_1}}.$$
Let $d(u, v)$ be the distance between $u$ and $v$ in the metric space.
Then, we have the following result about the number of common in-neighbours of $u$ and $v$:

Case 1. If $d(u, v) \ge \varepsilon\left(\frac{(\omega \log n)(k/j)}{T_v}\right)^{1/m}$, then $cn(u, v, n) = O(\omega \log n)$.

Case 2. If $d(u, v) \le \left(\frac{A_1 k + A_2}{c_m n}\right)^{1/m} - \left(\frac{A_1 j + A_2}{c_m n}\right)^{1/m}$, then $cn(u, v, n) = (1 + o(1))\, pj$.

Case 3. If
$$\left(\frac{A_1 k + A_2}{c_m n}\right)^{1/m} - \left(\frac{A_1 j + A_2}{c_m n}\right)^{1/m} < d(u, v) < \varepsilon\left(\frac{(\omega \log n)(k/j)}{T_v}\right)^{1/m},$$
then
$$cn(u, v, n) = C\, j\, n^{-\alpha} k^{\alpha} d^{-m\alpha}\left(1 + o(1) + O\!\left(\left(\frac{j}{k}\right)^{1/m}\right)\right), \qquad (4)$$
where $d = d(u, v)$,
$$\alpha = \frac{p\rho A_1}{1 - p\rho A_1} \quad\text{and}\quad C = p A_1^{\alpha} c_m^{-\alpha}.$$

Note that, if $j \ll k$, then we have a precise asymptotic formula for $cn(u, v, n)$. If $j$ and $k$ are approximately equal, then the formula only states that $cn(u, v, n) = \Theta(j n^{-\alpha} k^{\alpha} d^{-m\alpha})$.

3 Reconstruction of Geometry

We set out to discover the character of nodes in a network purely through link structure, and to quantify the similarities. Spatial models allow us a convenient definition of similarity: distances between nodes. In examining the SPA model, the number of common neighbours allows us to uncover a good approximation of pairwise distances, a first step in the reconstruction of the geometry.

Description of Model Used: For simulations, we use an inhomogeneous SPA model that we call a diagonal layout, which has 4 ‘clusters’ of identical high density, with $m = 2$. In the diagonal layout, $L = 4$ and the 4 regions $(x, x)$, $1 \le x \le 4$, are dense, with the others sparse. We will use ‘dense region’ and ‘sparse region’ to denote the union of all regions with densities $\rho_d$ and $\rho_s$, respectively. For ease of notation, we note that $\frac{1}{16}(4\rho_d + 12\rho_s) = 1$, so $\rho_s = 4/3 - \rho_d/3$. Thus it is enough to provide the value of $\rho_d$ only.

In Figure 1 we see an example of the diagonal layout with nodes and edges, and we also see evidence that the densest region does dominate the power law degree distribution. The yellow line is the prediction for the degree distribution with the power law exponent based on the maximum density, as in Theorem 2.2.
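The diagonal layout and the exponent predicted by Theorem 2.2 are easy to compute directly. The following Python sketch is illustrative only; the function names are hypothetical, and the side length of the grid is exposed as a parameter that defaults to the 4 × 4 grid described above.

```python
def diagonal_layout(rho_d, side=4):
    """Density grid with rho_d on the diagonal cells and rho_s elsewhere.

    rho_s is fixed by requiring the average density over all cells to be 1.
    For side = 4 this gives rho_s = 4/3 - rho_d/3, as in the text.
    """
    rho_s = (side * side - side * rho_d) / (side * side - side)
    if rho_s <= 0:
        raise ValueError("rho_d too large: sparse density would not be positive")
    grid = [[rho_d if a == b else rho_s for b in range(side)] for a in range(side)]
    return grid, rho_s


def power_law_exponent(p, rho, A1):
    """Exponent 1 + 1/(p * rho * A1) of Theorem 2.2; needs p * rho * A1 < 1."""
    a = p * rho * A1
    if not 0 < a < 1:
        raise ValueError("the model assumes 0 < p * rho * A1 < 1")
    return 1.0 + 1.0 / a
```

With the parameters of the right panel of Figure 1 ($p = 0.7$, $\rho_d = 1.2$, $A_1 = 0.7$), `power_law_exponent(0.7, 1.2, 0.7)` evaluates $1 + 1/0.588$, roughly 2.7, which is the slope that the densest region is expected to impose on the tail.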
Our estimator for the distance is derived from Case 3 of Theorem 2.4, and in particular Equation (4), ignoring the error term. This leads to the following formula for the estimated distance $\hat{d}$. For a pair of nodes $u$, $v$ with $\deg^-(u, n) = k$ and $\deg^-(v, n) = j$, $k \ge j$, whose distance is such that Case 3 applies, this estimate is given by:
$$\hat{d}(u, v) = \left(\frac{p^{\gamma} A_1\, j^{\gamma}\, k}{n\, c_m\, cn(u, v, n)^{\gamma}}\right)^{1/m}, \qquad (5)$$
where $\gamma = \frac{1 - p\rho(u)A_1}{p\rho(u)A_1}$ (note that $\gamma = \frac{1}{\alpha}$ with $\alpha$ as in Equation (4)). If the density is uniform, that is, if $\rho(u) = 1$ for all $u$, then the above estimator is the same as our original estimator: Equation (7) from [14].

Since a relationship between the spatial distance and the number of common neighbours only exists in Case 3 of Theorem 2.4, we try to eliminate pairs to which one of the other cases applies. Pairs which are in Case 2 are very close, and for such pairs, the expected number of common neighbours is $p \deg(v, n) = pj$. In an attempt to avoid this case, we filter out all pairs where the number of common neighbours is greater than $pj/2$. Pairs that are in Case 1 are so far apart that their spheres of influence overlap for a very short time, if at all. We try to avoid this case by eliminating pairs with 10 or fewer common neighbours.

Figure 1: Left: diagonal layout, n = 1,000, p = 0.6, ρd = 1.6, A1 = 0.7, A2 = 2.0; Right: degree distribution (logarithm of number of vertices against logarithm of degree, observed and theoretical slope), n = 1,000,000, p = 0.7, ρd = 1.2, A1 = 0.7, A2 = 1.0.

To see the effect of the non-uniform density, we first apply the original estimator to our diagonal layout. In other words, we define the estimated distance as in Equation (5), but taking $\rho(u) = 1$ for all $u$, and we apply this estimator to the points obtained from a non-uniform distribution.
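Equation (5) and the two filters translate directly into code. The sketch below is an illustration under stated assumptions rather than the code used for the experiments: the function names are hypothetical, the default $c_m = \pi$ corresponds to $m = 2$, and `predicted_common_neighbours` keeps only the leading term of Equation (4), so that it can serve as a consistency check for the estimator.

```python
import math


def predicted_common_neighbours(k, j, d, n, p, A1, rho=1.0, m=2, c_m=math.pi):
    """Leading term of Equation (4): C * j * n^-alpha * k^alpha * d^(-m*alpha)."""
    a = p * rho * A1
    alpha = a / (1.0 - a)
    C = p * A1 ** alpha * c_m ** (-alpha)
    return C * j * n ** (-alpha) * k ** alpha * d ** (-m * alpha)


def estimate_distance(k, j, cn, n, p, A1, rho=1.0, m=2, c_m=math.pi):
    """Equation (5), with gamma = (1 - p*rho*A1) / (p*rho*A1).

    k and j are the final in-degrees of the pair and cn their number of
    common in-neighbours. rho = 1 recovers the uniform-density estimator.
    """
    if k < j:
        k, j = j, k  # the formula assumes k >= j
    a = p * rho * A1
    gamma = (1.0 - a) / a
    value = p ** gamma * A1 * j ** gamma * k / (n * c_m * cn ** gamma)
    return value ** (1.0 / m)


def passes_filters(j, cn, p, min_common=10):
    """Keep a pair only if it has more than 10 and at most p*j/2 common neighbours."""
    return min_common < cn <= p * j / 2.0
```

Feeding the output of `predicted_common_neighbours` back into `estimate_distance` with the same parameters returns the original distance, which confirms that Equation (5) is the algebraic inverse of the leading term of Equation (4).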
The motivation of this experiment is that, when applying our techniques to real-life data, we are not likely to know the local density of a node. Figure 2 (left side) gives the estimated versus real distance for a graph with n = 100,000 nodes, generated via the SPA model from the diagonal layout with parameters p = 0.7, ρd = 1.6, A1 = 0.7, A2 = 2.0. After filtering as described above, 2,270 pairs are left. The figure shows that the approach of assuming uniform density leads to a consistent overestimate of the distance for the nodes. This may seem counterintuitive. The trouble lies with the estimator’s assumption about a node’s age, which is based on its final in-degree. A node in the dense region has more neighbours than is expected when one assumes uniform density, and thus the node is thought to be much older than it actually is. This confounds the distance estimator.

Using the same simulation results, we now apply the estimator from Equation (5), and use our knowledge about $\rho(u)$. The right side of Figure 2 shows the estimated distance using Equation (5) vs. actual node distance. The results indicate that our new estimator is significantly more accurate in predicting distances for the pairs of nodes in the dense region.

Let us mention that the estimation for pairs in the sparse region is still not accurate, while the estimation for cross-border pairs appears to be even worse. This is likely caused by the fact that nodes that are involved in cross-border pairs, and in sparse region pairs, and that have enough common neighbours to qualify to be included, are likely the older nodes, i.e. they are born near the beginning of the process. For such nodes, in the early stages there is likely some overlap between their sphere of influence and the bordering, dense regions. Thus, the degree likely does not follow the prediction from Theorem 2.1, which, in turn, affects the performance of the distance estimator for those pairs. Better performance for cross-border pairs could possibly be obtained by using a linear combination of the densities in Equation (5). However, we will see in what follows that better performance for all pairs occurs when we use the data itself to estimate the density. Also, we point out that the pairs in the dense region constitute the large majority of all pairs. Moreover, the dense regions are those that are most likely to correspond to communities of interest. Therefore, accurate prediction for pairs in these regions is most important.

Figure 2: SPA model, n = 100,000, diagonal layout, p = 0.7, ρd = 1.6, A1 = 0.7, A2 = 2.0, actual vs. estimated distances for pairs of nodes (cross pairs, sparse region, dense region, and the perfect estimate); Left: using original estimator; Right: using new estimator, density known.

Estimating the density: In real-world situations, we cannot assume to know the density of the region containing a given node. In fact, the density of the region containing a node is an important part of the ‘second layer of meaning’ which we aim to extract from the graph. Here we will show that our theoretical results give us a tool for estimating the local density around a node, using only its neighbourhood. We also apply our distance estimator once more, this time using the estimated density for our formula.

Using the theoretical results obtained from the previous section, we can estimate the density of the region $R(v)$ containing a given node $v$ from the average out-degree of the in-neighbours of $v$. As per Theorem 2.3, the average out-degree in $R_\ell$ is approximately
$$\frac{p\rho_\ell A_2}{1 - p\rho_\ell A_1}. \qquad (6)$$
If we have a large enough set of nodes from the same region, then we can use the formula above to estimate the density of the region.
Consider a node $v$, and make two assumptions: (i) almost all neighbours of $v$ are contained in $R(v)$, and (ii) the neighbours of $v$ form a representative sample of all nodes of $R(v)$. Simulations show that these assumptions are justified and allow us to make an estimate for $\rho(v)$. Assumption (i) is additionally justified by the second part of Theorem 2.3, which states that the number of edges crossing the border is negligible compared to the total number of edges.

Set $\overline{\deg}^+(v)$ to be the average out-degree of the in-neighbours of $v$. Specifically,
$$\overline{\deg}^+(v) = \frac{1}{\deg^-(v)} \sum_{u \in N^-(v)} \deg^+(u),$$
where $N^-(v)$ is the set of in-neighbours of $v$. Given our assumptions, an estimator for the density in $R(v)$, denoted by $\hat{\rho}(v)$, can be derived from this average out-degree, using Equation (6):
$$\hat{\rho}(v) = \frac{\overline{\deg}^+(v)}{pA_2 + pA_1\, \overline{\deg}^+(v)}.$$

The left side of Figure 3 shows a histogram of the values of $\overline{\deg}^+(v)$ for our simulated graph. Displayed are the results for nodes with $\deg^-(v) \ge 10$. The graph is obtained from the SPA model where points have the previously described diagonal layout, with density $\rho_d = 1.6$ in the dense region, and consequently density $\rho_s = 0.8$ in the sparse region. For these parameters, Equation (6) gives a theoretical value of 5.85 for $\overline{\deg}^+(v)$ if node $v$ lies in the dense region, and a value of 1.45 if $v$ lies in the sparse region.

We see in Figure 3 (left side) that the values of $\overline{\deg}^+(v)$ in the dense region are quite accurate, with peaks occurring around the calculated value of 5.85. For the sparse region, the peaks occur around 2.5, giving an estimate for the average out-degree which is higher than expected. Likely, this is caused by nodes in the sparse region that are located close to the border, and thus are likely to have neighbours in the dense region. Such nodes also tend to have high degree, and our condition on the minimum degree favours the ‘rich’ sparse region nodes.

Figure 3 (right side) gives a histogram of the estimated densities of the nodes.
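The density estimator is a one-line inversion of Equation (6), which the following Python sketch makes explicit. The function names and the dictionary-based graph representation are hypothetical; the parameter values in the usage note are those stated in the caption of Figure 3 ($p = 0.6$, $A_1 = 0.7$, $A_2 = 2.0$).

```python
def average_out_degree(p, rho, A1, A2):
    """Equation (6): p * rho * A2 / (1 - p * rho * A1)."""
    a = p * rho * A1
    if not 0 < a < 1:
        raise ValueError("the model assumes 0 < p * rho * A1 < 1")
    return p * rho * A2 / (1.0 - a)


def estimate_density(mean_out_degree, p, A1, A2):
    """Invert Equation (6): rho_hat = d / (p * A2 + p * A1 * d)."""
    return mean_out_degree / (p * A2 + p * A1 * mean_out_degree)


def mean_out_degree_of_in_neighbours(v, in_neighbours, out_degree):
    """Average out-degree over N^-(v).

    in_neighbours maps a node to a collection of its in-neighbours and
    out_degree maps a node to its out-degree (both assumed representations).
    """
    neighbours = in_neighbours[v]
    if not neighbours:
        raise ValueError("node has no in-neighbours")
    return sum(out_degree[u] for u in neighbours) / len(neighbours)
```

With these parameters, `average_out_degree(0.6, 1.6, 0.7, 2.0)` is about 5.85 and `average_out_degree(0.6, 0.8, 0.7, 2.0)` is about 1.45, matching the theoretical values quoted in the text, and `estimate_density` maps each value back to 1.6 and 0.8 respectively.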
For nodes in the dense region, the true value is 1.6, and we see a good estimation of this value for these nodes. For nodes in the sparse region, the true value is 0.8, while the peak of the estimated densities occurs around 1.15, and almost all values are greater than 0.8. Again, this is likely caused by nodes whose sphere of influence overlapped with the dense region.

To obtain better performance for nodes in the sparse region, we propose to base our estimated density for a node $v$ in the sparse region only on the out-degree of neighbours of $v$ of low in-degree; such neighbours are young and so the sphere of influence of $v$ had shrunk, and thus was more likely to be fully inside the sparse region, when the neighbours were born. To obtain density estimates for nodes with small in-degree, we can take the second neighbourhood to compute the average out-degree. Nodes with small in-degrees are young, so even second neighbours are likely to be close. We plan to explore these possibilities in future work.

Figure 3: Diagonal layout, n = 100,000, ρd = 1.6, A1 = 0.7, A2 = 2.0, p = 0.6; Left: average out-degree of the in-neighbours; Right: calculated density from average out-degree. Both panels show the number of vertices for the dense and sparse regions.

Finally, we use $\hat{\rho}$, and known values of all other parameters, to calculate the distance between the nodes based on the number of common neighbours, Equation (5), using the same simulation results as those we used earlier. Here we use the calculated density of the node of higher degree in the distance formula. (Using the lower degree node gives similar results.) The results are seen in Figure 4. The figure shows that there is very good agreement between actual and estimated distances.
In fact, we see that the agreement is greatly improved for the cross-border pairs, and also better for the sparse region pairs. This can be understood as follows. The distance estimator is derived indirectly from Theorem 4.1, which predicts the approximate degree of a node throughout the process, based on its final degree and the density of its region. For nodes in the sparse region which have a sizeable number of neighbours in the dense region, the degree will be larger than predicted using this method, but the density estimator will also predict a higher density. So the estimated density is a better indicator of the behaviour of the degree than the real density, and thus the distance estimator gives better performance. This indicates that this last variation of the distance estimator is the most robust against local fluctuations in density. Thus we have a good prognosis for the applicability of the estimator on real data, where such fluctuations are to be expected.

Figure 4: Diagonal layout, $n = 100{,}000$, $\rho_d = 1.6$, $A_1 = 0.7$, $A_2 = 2.0$, $p = 0.7$: Distance estimation using estimated density from the node of greater final degree, all other parameters known. (Estimated distance is plotted against actual distance for cross pairs, sparse region pairs and dense region pairs, together with the line of perfect estimation.)

4 Proofs

In this section, we give the proofs of the main theorems. Our results all refer to typical behaviour of the random SPA model process, and are asymptotic in $n$, the number of vertices. We will sometimes use the stronger notion of w.e.p. in favour of the more commonly used a.a.s., since it simplifies some of our proofs. We say that an event holds with extreme probability (w.e.p.) if it holds with probability at least $1 - \exp(-\omega(n) \log n)$ as $n \to \infty$, where $\omega(n)$ is any function tending to infinity together with $n$. Thus, if we consider a polynomial number of events that each holds w.e.p., then w.e.p. (and hence also a.a.s.) all events hold.
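The closing remark on polynomially many w.e.p. events is a union bound. A short worked version is given here as a sketch, under the simplifying assumption (not stated in the text) that all $n^C$ events fail with probability at most $\exp(-\omega(n)\log n)$ for one common function $\omega$ and a fixed constant $C$:

```latex
% Assumption for this sketch: events E_1, ..., E_N with N <= n^C, C constant,
% each failing with probability at most exp(-omega(n) log n).
\[
\mathbb{P}\Big(\bigcup_{s=1}^{N} \overline{E_s}\Big)
  \le \sum_{s=1}^{N} \mathbb{P}\big(\overline{E_s}\big)
  \le n^{C} \exp\big(-\omega(n)\log n\big)
  = \exp\big(-(\omega(n) - C)\log n\big).
\]
% Since omega(n) - C still tends to infinity with n, the intersection of all
% E_s again holds w.e.p., with omega'(n) = omega(n) - C.
```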
First we state and prove a theorem that bounds the in-degree of any node, regardless of its distance from the boundary.

Theorem 4.1. Let $\omega = \omega(t)$ be any function tending to infinity together with $t$. The expected in-degree at time $t$ of a node $v_i$ born at time $i \ge \omega$ satisfies
\[
\mathbb{E}(\deg^-(v_i,t)) \le (1+o(1)) \frac{A_2}{A_1}\left(\frac{t}{i}\right)^{p\rho_{\max}A_1} - \frac{A_2}{A_1},
\]
\[
\mathbb{E}(\deg^-(v_i,t)) \ge (1+o(1)) \frac{A_2}{A_1}\left(\frac{t}{i}\right)^{p(2^{-m}\rho(v_i)+(1-2^{-m})\rho_{\min})A_1} - \frac{A_2}{A_1}.
\]
Moreover, for any node $v_i$ born at time $i \ge 1$ we have
\[
\mathbb{E}(\deg^-(v_i,t)) \le \frac{eA_2}{A_1}\left(\frac{t}{i}\right)^{p\rho_{\max}A_1} - \frac{A_2}{A_1}.
\]

Proof. In order to simplify calculations, we make the following substitution:
\[
Y(v_i,t) = \deg^-(v_i,t) + \frac{A_2}{A_1} = \frac{t}{A_1}\,|S(v_i,t)|.
\]
It follows immediately from the definition of the process that $Y(v_i,i) = A_2/A_1$ and for $t \ge i$
\[
Y(v_i,t+1) =
\begin{cases}
Y(v_i,t) + 1, & \text{with probability at most } p\rho_{\max}\frac{A_1}{t}\,Y(v_i,t),\\
Y(v_i,t), & \text{otherwise.}
\end{cases}
\]
We couple $Y(v_i,t)$ with another random variable $X(v_i,t)$ so that $Y(v_i,t) \le X(v_i,t)$ for $t \ge i$. The random variable $X(v_i,t)$ is defined as follows: $X(v_i,i) = A_2/A_1$ and for $t \ge i$
\[
X(v_i,t+1) =
\begin{cases}
X(v_i,t) + 1, & \text{with probability } p\rho_{\max}\frac{A_1}{t}\,X(v_i,t),\\
X(v_i,t), & \text{otherwise.}
\end{cases}
\]
Finding the conditional expectation,
\[
\mathbb{E}(X(v_i,t+1) \mid X(v_i,t)) = X(v_i,t)\left(1 + \frac{p\rho_{\max}A_1}{t}\right).
\]
Taking expectations, we get
\[
\mathbb{E}(X(v_i,t+1)) = \mathbb{E}(X(v_i,t))\left(1 + \frac{p\rho_{\max}A_1}{t}\right),
\]
and, since $X(v_i,i) = A_2/A_1$,
\begin{align*}
\mathbb{E}(X(v_i,t)) &= \frac{A_2}{A_1}\prod_{j=i}^{t-1}\left(1 + \frac{p\rho_{\max}A_1}{j}\right)\\
&\le \frac{A_2}{A_1}\exp\left(\sum_{j=i}^{t-1}\frac{p\rho_{\max}A_1}{j}\right)\\
&\le \frac{A_2}{A_1}\exp\left(p\rho_{\max}A_1\left(\log\frac{t}{i} + \frac{1}{i}\right)\right)\\
&\le \frac{eA_2}{A_1}\left(\frac{t}{i}\right)^{p\rho_{\max}A_1}.
\end{align*}
If $i \ge \omega$, we have
\[
\mathbb{E}(X(v_i,t)) = (1+o(1))\frac{A_2}{A_1}\exp\left(\sum_{j=i}^{t-1}\frac{p\rho_{\max}A_1}{j}\right) = (1+o(1))\frac{A_2}{A_1}\left(\frac{t}{i}\right)^{p\rho_{\max}A_1}.
\]
This shows the upper bound.

For the lower bound, we first observe that for all nodes $v$, $|S(v,t) \cap R(v)| \ge (1/2)^m |S(v,t)|$. Thus the node $v_t$ links to $v_i$ with probability at least $p\big(2^{-m}\rho(v_i) + (1-2^{-m})\rho_{\min}\big)|S(v_i,t)|$. Using this, we can use the exact same approach to bound the expectation of $Y(v_i,t)$ from below. This gives the lower bound.
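The upper-bound chain in the proof above can be checked numerically. The following Python sketch evaluates the exact product for the expectation of the coupled variable $X(v_i,t)$ and compares it with the bound $\frac{eA_2}{A_1}(t/i)^{a}$ and with the leading term $\frac{A_2}{A_1}(t/i)^{a}$, where $a = p\rho_{\max}A_1$. The function names are hypothetical, and the parameter values are simply those of the simulation in Section 3, with $\rho_{\max} = 1.6$ and $p = 0.6$ assumed.

```python
import math


def expected_x(i, t, a, a1, a2):
    """Exact E[X(v_i, t)] = (A2/A1) * prod_{j=i}^{t-1} (1 + a/j), a = p*rho_max*A1."""
    value = a2 / a1
    for j in range(i, t):
        value *= 1.0 + a / j
    return value


def crude_upper_bound(i, t, a, a1, a2):
    """Bound valid for every birth time i >= 1: (e*A2/A1) * (t/i)^a."""
    return math.e * (a2 / a1) * (t / i) ** a


def leading_term(i, t, a, a1, a2):
    """Leading term for birth times i tending to infinity: (A2/A1) * (t/i)^a."""
    return (a2 / a1) * (t / i) ** a


a = 0.6 * 1.6 * 0.7  # p * rho_max * A1 for the simulated parameters (assumed)
old_node = expected_x(1, 1000, a, 0.7, 2.0)
young_ratio = expected_x(1000, 100000, a, 0.7, 2.0) / leading_term(1000, 100000, a, 0.7, 2.0)
```

For a node born at time 1 the exact value stays below the crude bound, while for a node born late the ratio to the leading term is close to 1, in line with the $(1+o(1))$ factor in the theorem.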
Proof of Theorem 2.1

Here we show that, once a vertex has reached an in-degree of $\omega \log n$ and its area of influence is well contained within the region, its degree can be closely predicted with high probability. We will be using the following version of the Chernoff bound, as seen in e.g. [11, p. 27, Corollary 2.3].

Lemma 4.2. Let $X$ be a random variable that can be expressed as a sum of independent random indicator variables, $X = \sum_{i=1}^{n} X_i$, where $X_i \in \mathrm{Be}(p_i)$ with (possibly) different $p_i = \mathbb{P}(X_i = 1) = \mathbb{E}X_i$. If $\varepsilon \le 3/2$, then
\[
\mathbb{P}(|X - \mathbb{E}X| \ge \varepsilon\,\mathbb{E}X) \le 2\exp\left(-\frac{\varepsilon^2\,\mathbb{E}X}{3}\right). \tag{7}
\]

Let us start with the following key lemma.

Lemma 4.3. Let $\omega = \omega(n)$ be any function tending to infinity together with $n$, and let $\varepsilon > 0$. For a given node $v$, suppose that $\deg^-(v,T) = d \ge \omega \log n$ and that
\[
\delta(v) \ge (1+\varepsilon)\left(\frac{A_1 d + A_2}{c_m T}\right)^{1/m} = (1+\varepsilon)\,r(v,T).
\]
Then, with probability $1 - O(n^{-3})$, for every value of $t$, $T \le t \le 2T$,
\[
\left|\deg^-(v,t) - d\cdot\left(\frac{t}{T}\right)^{p\rho(v)A_1}\right| \le \frac{3}{p\rho(v)A_1}\cdot\frac{t}{T}\sqrt{d\log n}.
\]

Proof. Let $\rho = \rho(v)$. Our goal is to estimate $\deg^-(v,t) - d\cdot(t/T)^{p\rho A_1}$. We will show that the upper bound holds; the lower bound can be obtained by using an analogous, symmetric, argument. Note that the assumption on $\delta(v)$ implies that $S(v,T) \subseteq R(v)$. We use the following stopping time:
\[
T_0 = \min\left\{t \ge T : \left(\deg^-(v,t) > d\cdot\left(\frac{t}{T}\right)^{p\rho A_1} + \frac{3}{p\rho A_1}\cdot\frac{t}{T}\sqrt{d\log n}\right) \vee (t = 2T+1)\right\}.
\]
Note that if $T_0 = 2T+1$, then the in-degree of $v$ remained bounded as required during the entire time interval $T \le t \le 2T$. Hence, in order to prove the bound, we need to show that with probability $1 - O(n^{-3})$ we have $T_0 = 2T+1$.

Suppose that $T_0 \le 2T$. Note that for $t \ge T$ up to and including time-step $T_0 - 1$, the random variable $\deg^-(v,t)$ is (deterministically) bounded from above. Moreover, it is straightforward to see that this upper bound, together with the assumption on $\delta(v)$ (note the additional multiplicative $(1+\varepsilon)$ term), implies that $S(v,t) \subseteq R(v)$ for all $T \le t \le T_0 - 1$.
Hence, the number of new neighbours accumulated during this phase of the process, $\deg^-(v,T_0) - d$, can be (stochastically) bounded from above by the sum $X = \sum_{t=T}^{T_0-1} X_t$ of independent indicator random variables $X_t$, where
\[
\mathbb{P}(X_t = 1) = p\rho\,\frac{A_1\left(d\left(\frac{t}{T}\right)^{p\rho A_1} + \frac{3}{p\rho A_1}\cdot\frac{t}{T}\sqrt{d\log n}\right) + A_2}{t}.
\]
Clearly, since $p\rho A_1 < 1$,
\begin{align*}
\mathbb{E}X = \sum_{t=T}^{T_0-1}\mathbb{E}X_t &= p\rho A_1\, d\, T^{-p\rho A_1}\left(\sum_{t=T}^{T_0-1} t^{p\rho A_1 - 1}\right) + \frac{T_0 - T}{T}\,3\sqrt{d\log n} + O(1)\\
&= d\left(\frac{T_0}{T}\right)^{p\rho A_1} - d + \frac{T_0 - T}{T}\,3\sqrt{d\log n} + O(1).
\end{align*}
Since $T_0 \le 2T$, the in-degree of $v$ at time $T_0$ failed the desired condition, which implies that
\begin{align*}
X \ge \deg^-(v,T_0) - d &\ge \left(d\cdot\left(\frac{T_0}{T}\right)^{p\rho A_1} + \frac{3}{p\rho A_1}\cdot\frac{T_0}{T}\sqrt{d\log n}\right) - d\\
&= \mathbb{E}X + \frac{3}{p\rho A_1}\cdot\frac{T_0}{T}\sqrt{d\log n} - \frac{T_0 - T}{T}\,3\sqrt{d\log n} + O(1)\\
&\ge \mathbb{E}X + 3\sqrt{d\log n},
\end{align*}
using again that it is assumed that $p\rho A_1 < 1$. It follows from the Chernoff bound (7) that
\[
\mathbb{P}\big(|X - \mathbb{E}X| \ge 3\sqrt{d\log n}\big) \le 2\exp\big(-\varepsilon\sqrt{d\log n}\big),
\]
where $\varepsilon = 3\sqrt{d\log n}/\mathbb{E}X$. The maximum value of $\mathbb{E}X$ corresponds to $T_0 = 2T$ and so
\[
\mathbb{E}X \le d\left(\frac{2T}{T}\right)^{p\rho A_1} - d + \frac{2T - T}{T}\,3\sqrt{d\log n} + O(1) = d\,(2^{p\rho A_1} - 1)(1+o(1)) \le d.
\]
So $\varepsilon \ge 3\sqrt{d^{-1}\log n}$. Therefore, the probability that $T_0 \le 2T$ is at most $2\exp(-3\log n)$ and the proof is finished.

Now, with Lemma 4.3 in hand we can get Theorem 2.1.

Proof of Theorem 2.1. Let $\omega = \omega(n)$ be a function going to infinity with $n$, and let $\varepsilon > 0$. Let $v$ be a vertex with final degree $k \ge \omega\log n$, let $\rho = \rho(v)$, and assume that $\delta = \delta(v) \ge (1+\varepsilon)\,r(v,n)$. Let $\hat{T}_v$ be the first time that the in-degree of $v$ exceeds $(\omega/2)\log n$, and let $\hat{t}_v$ be the first time $t$ that the radius of influence satisfies $r(v,t) \le \delta(1+\varepsilon/2)^{-(1-p\rho A_1)/m}$. Moreover, let $T = \max\{\hat{T}_v, \hat{t}_v\}$ be the first time that both events hold. Finally, let $d = \deg^-(v,T)$. We obtain from Lemma 4.3 that, with probability $1 - O(n^{-3})$,
\[
d\left(\frac{t}{T}\right)^{p\rho A_1}\left(1 - \frac{3}{p\rho A_1}\sqrt{d^{-1}\log n}\right) \le \deg^-(v,t) \le d\left(\frac{t}{T}\right)^{p\rho A_1}\left(1 + \frac{3}{p\rho A_1}\sqrt{d^{-1}\log n}\right)
\]
for $T \le t \le 2T$.
It follows that the degree tends to grow but the sphere of influence tends to shrink between $T$ and $2T$, and thus that the conditions of Lemma 4.3 again hold at time $2T$. We can now keep applying the same lemma for times $2T, 4T, 8T, 16T, \ldots$, using the final value as the initial one for the next period, to get the statement for all values of $t$ from $T$ up to and including time $n$. Precisely, for $1 \le i < i_{\max} = \lfloor\log n\rfloor + 1$, let $d_i = \deg^-(v, 2^i T)$, and let $d_0 = d$. Then by Lemma 4.3, we have for $i \ge 1$ that $d_i \le d_{i-1}\,2^{p\rho A_1}(1+\varepsilon_i)$, where $\varepsilon_i = \frac{3}{p\rho A_1}\sqrt{d_{i-1}^{-1}\log n}$. Since we apply the lemma $O(\log n)$ times (for a given vertex $v$), the following statement holds with probability $1 - o(n^{-2})$ from time $T$ on: for any $2^{i-1}T \le t < 2^i T$, we have that
\[
\deg^-(v,t) \le d\left(\frac{t}{T}\right)^{p\rho A_1}\prod_{j=1}^{i}(1+\varepsilon_j).
\]
It remains to make sure that the accumulated multiplicative error term is still only $(1+o(1))$. For that, let us note that
\begin{align*}
\prod_{j=1}^{i}(1+\varepsilon_j) &= \prod_{j=1}^{i}\left(1 + \frac{3}{p\rho A_1}\sqrt{d^{-1}\,2^{-p\rho A_1 j}\log n}\right)\\
&= (1+o(1))\exp\left(\frac{3}{p\rho A_1}\sqrt{d^{-1}\log n}\,\sum_{j=1}^{i} 2^{-p\rho A_1 j/2}\right)\\
&= (1+o(1))\exp\left(O\big(\sqrt{d^{-1}\log n}\big)\right) = 1 + o(1),
\end{align*}
since $d$ grows faster than $\log n$. A symmetric argument can be used to show a lower bound for the error term and so the result holds. It follows that we have the desired behaviour from time $T$. Precisely, for times $T \le t \le n$, we have that
\[
\deg^-(v,t) = d\left(\frac{t}{T}\right)^{p\rho A_1}(1+o(1)),
\]
where $d = \deg^-(v,T) \ge \deg^-(v,\hat{T}_v) \ge (\omega/2)\log n$.

As $T = \max\{\hat{T}_v, \hat{t}_v\}$, we need to consider two cases. Suppose first that $T = \hat{T}_v$. Setting $t = n$ and $\deg^-(v,n) = k$, we obtain that
\[
T = (1+o(1))\left(\frac{d}{k}\right)^{1/p\rho A_1} n = (1+o(1))\left(\frac{\omega\log n}{2k}\right)^{1/p\rho A_1} n = (1+o(1))\left(\frac{1}{2}\right)^{1/p\rho A_1} T_v.
\]
Therefore, for large enough $n$, we have that $T < T_v \le \max\{t_v, T_v\}$.

Suppose then that $T = \hat{t}_v$. By definition, $r(v,T) = (1+o(1))\,\delta(1+\varepsilon/2)^{-(1-p\rho A_1)/m}$ and, since $d > \omega\log n$,
\[
r(v,T) = (1+o(1))\left(\frac{A_1 d}{c_m T}\right)^{1/m} = (1+o(1))\left(\frac{A_1 k}{c_m T^{1-p\rho A_1}\,n^{p\rho A_1}}\right)^{1/m},
\]
and so
\[
T = (1+o(1))(1+\varepsilon/2)\left(\frac{A_1 k}{\delta^m c_m n^{p\rho A_1}}\right)^{1/(1-p\rho A_1)} = (1+o(1))\,\frac{1+\varepsilon/2}{1+\varepsilon}\,t_v.
\]
Again, for large enough $n$, we have that $T < t_v \le \max\{t_v, T_v\}$.

In either case, $T < \max\{t_v, T_v\}$. As a result, we obtain that, for $\max\{t_v, T_v\} \le t \le n$,
\[
\deg^-(v,t) = k\left(\frac{t}{n}\right)^{p\rho A_1}(1+o(1)).
\]
Finally, since the statement holds for any vertex $v$ with probability $1 - o(n^{-2})$, with probability $1 - o(n^{-1})$ the statement holds for all vertices. The proof of the theorem is finished.

Let us note that Theorem 2.1 immediately implies the following two corollaries.

Corollary 4.4. Let $\omega = \omega(n)$ be any function tending to infinity together with $n$, and let $\varepsilon > 0$. The following holds with probability $1 - o(n^{-1})$. For every node $v$, and for every time $T$ so that $\deg^-(v,T) \ge \omega\log n$ and $(1+\varepsilon)\,r(v,T) \le \delta(v)$, for all times $t$, $T \le t \le n$,
\[
\deg^-(v,t) = \deg^-(v,T)\left(\frac{t}{T}\right)^{p\rho(v)A_1}(1+o(1)).
\]

Corollary 4.5. Let $\omega = \omega(n)$ be any function tending to infinity together with $n$. The following holds with probability $1 - o(n^{-1})$. For any node $v_i$ born at time $i \ge 1$, and $i \le t \le n$ we have that
\[
\deg^-(v_i,t) \le \omega\log n\left(\frac{t}{i}\right)^{p\rho_{\max}A_1}. \tag{8}
\]

Proof. Write $v = v_i$. The statement is trivially true if $\deg^-(v,t) \le \omega\log n$, so we may assume that $t$ is such that $\deg^-(v,t) > \omega\log n$. Let $T$ be the first time that the in-degree of $v$ exceeds $(\omega/2)\log n$; clearly, $\deg^-(v,T) = (1/2+o(1))\,\omega\log n$. First assume that $(1+\varepsilon)\,r(v,T) < \delta(v)$ for some $\varepsilon > 0$. By Corollary 4.4,
\[
\deg^-(v,t) = \deg^-(v,T)\left(\frac{t}{T}\right)^{p\rho(v)A_1}(1+o(1)) \le \omega\log n\left(\frac{t}{i}\right)^{p\rho_{\max}A_1},
\]
where the last step follows since we may assume that $n$ is large enough so that the $(1+o(1))$ term is less than 2, and from the fact that $T > i$ and $\rho(v) \le \rho_{\max}$. However, even if $v$ is close to the border of the region, that is, $(1+o(1))\,r(v,T) \ge \delta(v)$, this argument applies. Namely, in this case the degree of $v$ is stochastically bounded above by the degree of a node with the same birth time, but born in a region with density $\rho_{\max}$, and positioned far from the border. Therefore, the argument above still applies.

Proof of Theorem 2.3

Let $Z_j$ be the indicator variable of the event $\{v_j \in R_\ell\}$.
By definition of the process, $Z_j$ is a Bernoulli variable with expectation $q_\ell$, and the variables $Z_j$ are independent for different values of $j$. Thus, $|V(G_n) \cap R_\ell| = \sum_{j=1}^{n} Z_j$ is the sum of $n$ independent Bernoulli random variables. Thus $\mathbb{E}(|V(G_n) \cap R_\ell|) = q_\ell n$, and it follows from Lemma 4.2 that for every $\ell$, a.a.s. $\big||V(G_n) \cap R_\ell| - q_\ell n\big| = O(\omega\sqrt{n}) = o(n)$, where $\omega = \omega(n)$ is any function tending to infinity together with $n$. Since the number of regions is assumed to be a constant independent of $n$, the desired property holds a.a.s. for all regions.

For the second statement, we examine the number of edges whose endpoints are in the same region, that is, the number of edges that do not cross a boundary. We set $M_t^\ell = |\{(u,v) \in E(G_t) \mid u, v \in R_\ell\}|$. We create the indicator variable $X_{i,j}$, such that
\[
X_{i,j} =
\begin{cases}
1, & \text{if } v_i, v_j \in R_\ell \text{ and } (v_i,v_j) \in E(G_n),\\
0, & \text{otherwise.}
\end{cases}
\]
Thus,
\begin{align*}
\mathbb{E}(M_{t+1}^\ell \mid G_t) &= M_t^\ell + \sum_{v_j \in R_\ell,\, j \le t} \mathbb{E}(X_{t+1,j} \mid G_t)\\
&= M_t^\ell + \sum_{v_j \in R_\ell,\, j \le t} p\rho_\ell\,|S(v_j,t) \cap R_\ell|\\
&= M_t^\ell + \sum_{v_j \in R_\ell,\, j \le t} p\rho_\ell\,|S(v_j,t)| - p\rho_\ell\,Y_t^\ell,
\end{align*}
where
\[
Y_t^\ell = \sum_{v_j \in R_\ell,\, j \le t} |S(v_j,t) \setminus R_\ell|. \tag{9}
\]
Using the definition of $S(v_j,t)$, we obtain
\begin{align*}
\mathbb{E}(M_{t+1}^\ell \mid G_t) &= M_t^\ell + \sum_{v_j \in R_\ell,\, j \le t} p\rho_\ell\,\frac{A_1\deg^-(v_j,t) + A_2}{t} - p\rho_\ell\,Y_t^\ell\\
&= M_t^\ell + \frac{p\rho_\ell A_1 (M_t^\ell + Z_t^\ell)}{t} + \frac{p\rho_\ell A_2}{t}\,|V_t \cap R_\ell| - p\rho_\ell\,Y_t^\ell, \tag{10}
\end{align*}
where $Z_t^\ell = |\{(u,v) \in E(G_t) \mid v \in R_\ell,\ u \notin R_\ell\}|$.

Let $\alpha = p\rho_{\max}A_1$. By Corollary 4.5, for any function $\omega = \omega(n)$ tending to infinity together with $n$, with probability $1 - o(n^{-1})$, for all vertices $v_i$ and for all times $i \le t \le n$,
\[
\deg^-(v_i,t) \le \omega\log n\left(\frac{t}{i}\right)^{\alpha}.
\]
Let $G$ be the event that we have these upper bounds for all vertices $v_i$ and for all times $i \le t \le n$. Assume first that $G$ holds. It follows (deterministically) that for all vertices $v_i$ and for all times $i \le t \le n$,
\[
|S(v_i,t)| = \frac{A_1\deg^-(v_i,t) + A_2}{t} \le (\omega^2\log n)\,t^{\alpha-1}\,i^{-\alpha},
\]
and thus
\[
r(v_i,t) \le \big(c_m^{-1/m}\,\omega^{2/m}\log^{1/m} n\big)\,t^{(\alpha-1)/m}\,i^{-\alpha/m} \le (\omega^3\log^{1/m} n)\,t^{(\alpha-1)/m}\,i^{-\alpha/m},
\]
as usual, assuming that $n$ is large enough.
Let $X_i$ be the indicator variable of the event $\{S(v_i,t) \not\subseteq R_\ell\}$. Using the bound on $|S(v_i,t)|$, we obtain
\[
Y_t^\ell \le \sum_{i \le t} X_i\,|S(v_i,t)| \le \sum_{i \le t} X_i\,(\omega^2\log n)\,t^{\alpha-1}\,i^{-\alpha}.
\]
If $X_i = 1$ then $v_i$ has distance at most $(\omega^3\log^{1/m} n)\,t^{(\alpha-1)/m}\,i^{-\alpha/m}$ from the boundary of $R_\ell$. Since the length of the boundary of $R_\ell$ is at most 4, the boundary strip in which $v_i$ must be located has area at most $(4\omega^3\log^{1/m} n)\,t^{(\alpha-1)/m}\,i^{-\alpha/m}$. Thus
\[
\mathbb{E}(X_i \mid G) \le (4\omega^3\log^{1/m} n)\,t^{(\alpha-1)/m}\,i^{-\alpha/m}.
\]
Combining this with the bound on $Y_t^\ell$, we obtain
\begin{align*}
\mathbb{E}(Y_t^\ell \mid G) &\le \sum_{i \le t} (4\omega^3\log^{1/m} n)\,t^{(\alpha-1)/m}\,i^{-\alpha/m}\,(\omega^2\log n)\,t^{\alpha-1}\,i^{-\alpha}\\
&= (4\omega^5\log^{1+1/m} n)\,t^{-(1-\alpha)(1+\frac{1}{m})}\sum_{i \le t} i^{-\alpha(1+\frac{1}{m})}\\
&\le c\,(\omega^5\log^{2+1/m} n)\,t^{-\beta},
\end{align*}
where $\beta = \min\{\frac{1}{m}, (1-\alpha)(1+\frac{1}{m})\}$ and $c$ is an appropriate constant independent of $t$ and $n$. Note that $\beta > 0$, and thus, for $t \ge \log^{4/\beta} n$, the expectation of $Y_t^\ell$ is $o(1)$ and so negligible (as $\omega$ can be tending to infinity arbitrarily slowly).

Next, consider the number of cross-border edges, $Z_t^\ell$. At time $t$, an edge from $v_t$ to $v_i$, $i < t$, which contributes to $Z_t^\ell$ can only be created if $v_t \in S(v_i,t) \setminus R_\ell$. The probability that such an edge is created is at most $p\rho_{\max}|S(v_i,t) \setminus R_\ell|$. Thus we have that
\[
\mathbb{E}(Z_t^\ell \mid Z_{t-1}^\ell) = Z_{t-1}^\ell + \sum_{v_i \in R_\ell} p\rho_{\max}\,|S(v_i,t) \setminus R_\ell| = Z_{t-1}^\ell + p\rho_{\max}\,Y_t^\ell.
\]
Therefore,
\[
\mathbb{E}(Z_t^\ell \mid G) = \sum_{\tau=1}^{t} p\rho_{\max}\,\mathbb{E}(Y_\tau^\ell \mid G) \le c'(\log^4 n)\,t^{1-\beta},
\]
for some constant $c'$ independent of $t$ and $n$. By taking the expectation of (10) and setting $m_t^\ell = \mathbb{E}(M_t^\ell \mid G)$, we obtain the following recurrence for the (conditional) expected value $m_t^\ell$ and $t \ge \log^{1+4/\beta} n$:
\[
m_{t+1}^\ell = m_t^\ell\left(1 + \frac{p\rho_\ell A_1}{t}\right) + p\rho_\ell\,q_\ell A_2 + o(1). \tag{11}
\]
To solve this recurrence, we use the following lemma on real sequences, which is Lemma 3.1 from [3].

Lemma 4.6. If $(\alpha_t)$, $(\beta_t)$ and $(\gamma_t)$ are real sequences satisfying the relation
\[
\alpha_{t+1} = \left(1 - \frac{\beta_t}{t}\right)\alpha_t + \gamma_t,
\]
and $\lim_{t\to\infty}\beta_t = \beta > 0$ and $\lim_{t\to\infty}\gamma_t = \gamma$, then $\lim_{t\to\infty}\frac{\alpha_t}{t}$ exists and equals $\frac{\gamma}{1+\beta}$.

Using this lemma, we see that $\lim_{t\to\infty}\frac{m_t^\ell}{t} = \frac{p\rho_\ell A_2}{1 - p\rho_\ell A_1}\,q_\ell$, and thus the (first-order) solution of the recurrence (11) is:
\[
m_n^\ell = (1+o(1))\,\frac{p\rho_\ell A_2}{1 - p\rho_\ell A_1}\,q_\ell\,n. \tag{12}
\]
Here we use that $p\rho_\ell A_1 < 1$ as given. Note that the $o(1)$ term in the recurrence and the lower bound on $t$ only affect the $(1+o(1))$ term of the solution. Finally, since $\mathbb{P}(\bar{G}) = o(n^{-1})$ and (deterministically) $M_n^\ell \le n^2$, we get
\[
\mathbb{E}(M_n^\ell) = \mathbb{P}(G)\,m_n^\ell + \mathbb{P}(\bar{G})\,\mathbb{E}(M_n^\ell \mid \bar{G}) = (1+o(1))\,m_n^\ell + o(n) = (1+o(1))\,\frac{p\rho_\ell A_2}{1 - p\rho_\ell A_1}\,q_\ell\,n.
\]
This completes the proof.

Proof of Theorem 2.4

Fix $\omega$ and $\varepsilon$ as in the statement of the theorem. Let $u$ and $v$ be nodes of final degrees $\deg^-(u,n) = k$ and $\deg^-(v,n) = j$ such that $k \ge j \ge \omega^2\log n$, where both $u$ and $v$ are located in a region $R$ with density $\rho$. Let $t_v$ and $T_v$ be defined as in (3). Assume that the conditions of the theorem hold, that is,
\[
\delta(v)^m \ge cj \quad\text{and}\quad \delta(u)^m \ge ck, \quad\text{where } c = (1+\varepsilon)\,\frac{A_1}{c_m\,n^{p\rho A_1}\,T_v^{1-p\rho A_1}}. \tag{13}
\]
Since $T_v = n\,(\omega\log n/j)^{1/p\rho A_1} \le n\,\omega^{-1/p\rho A_1} = o(n)$, it follows that
\[
\delta(v) \ge (1+\varepsilon)\left(\frac{A_1 j + A_2}{c_m n}\right)^{1/m} \quad\text{and}\quad \delta(u) \ge (1+\varepsilon)\left(\frac{A_1 k + A_2}{c_m n}\right)^{1/m}.
\]
Therefore, the conditions of Theorem 2.1 are satisfied. Thus, from time $\max\{T_v, t_v\}$ until the end of the process we have concentration of the degree of node $v$, as given by (2). Recall that
\[
t_v = (1+\varepsilon)\left(\frac{A_1 j}{c_m\,n^{p\rho A_1}\,\delta(v)^m}\right)^{1/(1-p\rho A_1)}
\]
as in (3). Rewriting condition (13), we obtain that
\[
T_v \ge (1+\varepsilon)^{1/(1-p\rho A_1)}\left(\frac{A_1 j}{c_m\,n^{p\rho A_1}\,\delta(v)^m}\right)^{1/(1-p\rho A_1)} > t_v.
\]
Thus $\max\{T_v, t_v\} = T_v$ and so, in fact, the degree of $v$ is concentrated from time $T_v$ until time $n$. Similarly, let $t_u$ and $T_u$ be defined as in (3). The argument above, with $j$ replaced by $k$, shows that $\max\{t_u, T_u\} = T_u$. By definition, as $k \ge j$, it follows that $T_u \le T_v$. Hence, by Theorem 2.1, we have concentration of the degrees of both $u$ and $v$ from time $T_v$ on. Precisely, we have that, with probability $1 - o(n^{-1})$, for all pairs $u$ and $v$ satisfying the conditions as stated above, and for all times $t$, $T_v \le t \le n$,
\[
\deg^-(u,t) = (1+o(1))\,k\left(\frac{t}{n}\right)^{p\rho A_1} \quad\text{and}\quad \deg^-(v,t) = (1+o(1))\,j\left(\frac{t}{n}\right)^{p\rho A_1}. \tag{14}
\]
The rest of the proof proceeds as the proof of Theorem 3.1 in [14].

Case 1.
By (14) above and the definition of $T_v$ and $T_u$, we have that $\deg^-(v,T_v) = (1+o(1))\,\omega\log n$, and $\deg^-(u,T_v) = (1+o(1))(\omega\log n)(k/j)$. This implies that the radius of influence of $u$ satisfies:
\[
r(u,T_v) = \Theta\left(\left(\frac{(\omega\log n)(k/j)}{T_v}\right)^{1/m}\right).
\]
The condition assumed in this case is that
\[
d(u,v) \ge \varepsilon\left(\frac{(\omega\log n)(k/j)}{T_v}\right)^{1/m}.
\]
Thus, at time $T_v$, both radii $r(u,T_v)$ and $r(v,T_v)$ are $O(d(u,v))$. Moreover, both radii decrease (in order) from time $T_v$ onwards, whereas $d(u,v)$ is independent of time. Thus there must exist a constant $c$ (depending on $\varepsilon$ but not on $n$) such that, for all times $t$, $cT_v \le t \le n$, the areas of influence of $u$ and $v$ are disjoint. Then the total number of common neighbours could only be, at most, $\min\{\deg^-(u,cT_v), \deg^-(v,cT_v)\} = O(\omega\log n)$.

Case 2. Suppose $d(u,v)$ satisfies
\[
d(u,v) \le \left(\frac{A_1 k + A_2}{c_m n}\right)^{1/m} - \left(\frac{A_1 j + A_2}{c_m n}\right)^{1/m} = r(u,n) - r(v,n).
\]
Note that this condition implies (deterministically) that at time $n$ the sphere of influence of $v$ is contained in the sphere of influence of $u$. We now see that this situation occurs in approximate form throughout the process. From (14), we see that, for $T_v \le t \le n$, $\deg^-(v,t) = (1+o(1))\,\frac{j}{k}\,\deg^-(u,t)$, and thus $r(v,t) = (1+o(1))\left(\frac{j}{k}\right)^{1/m} r(u,t)$.

If $j \le \alpha k$ for some constant $\alpha \in (0,1)$, then the difference between $r(u,t)$ and $r(v,t)$ behaves as $(1+o(1))(1-c)\,r(u,t)$, where $c = \left(\frac{j}{k}\right)^{1/m} \le \alpha^{1/m} < 1$. In particular, this implies that this difference tends to shrink over time. Thus we have that, for $T_v \le t \le n$,
\[
d(u,v) \le r(u,n) - r(v,n) \le (1+o(1))\big(r(u,t) - r(v,t)\big).
\]
Therefore, all but a negligible fraction of the sphere of influence of $v$ lies inside the sphere of influence of $u$ during the whole process.

If $k = (1+o(1))\,j$, so $\left(\frac{j}{k}\right)^{1/m} = 1 + o(1)$, the difference between $r(v,t)$ and $r(u,t)$ is $o(r(u,t))$ for $T_v \le t \le n$, and specifically at time $t = n$. From the condition of this case, we know that at time $n$, $S(v,n)$ is completely contained inside $S(u,n)$.
This, combined with the fact that $r(u,t) - r(v,t) = o(r(u,t))$, implies that the spheres of influence of $u$ and $v$ overlap in all but a negligible part during the entire process.

Any vertex $w$ that links to $v$ must lie inside the sphere of influence of $v$. Since most of the sphere of influence of $v$ is contained in that of $u$ in both cases mentioned above, this means that it is likely that $w$ lies inside the sphere of influence of $u$ as well, and thus has a probability $p$ of also linking to $u$. Accounting for the small variation in the size of the spheres of influence, we have that the probability that a neighbour of $v$, added between times $T_v$ and $n$, is also a neighbour of $u$ is $(1+o(1))\,p$. The number of common neighbours accumulated until time $T_v$ is at most $\deg^-(v,T_v) = O(\omega\log n)$, which is of smaller order than $j$, the final degree of $v$. Therefore,
\[
\mathbb{E}\,\mathrm{cn}(u,v,n) = (1+o(1))\,pj.
\]
Finally, note that the number of common neighbours is a sum of independent random indicator variables with Bernoulli distribution; each variable corresponds to a situation when a neighbour of $v$ falls into a sphere of influence of $u$. It follows that $\mathrm{cn}(u,v,n) \in \mathrm{Bin}((1+o(1))\,j, p)$. The concentration follows from the bound in Lemma 4.2.

Case 3. Suppose that $d = d(u,v)$ satisfies
\[
r(u,n) - r(v,n) < d(u,v) < \varepsilon\left(\frac{(\omega\log n)(k/j)}{T_v}\right)^{1/m} = (1+o(1))\,\varepsilon A_1^{-1/m}\,r(u,T_v).
\]
The analysis for this case is based on the assumption that, at time $T_v$, the sphere of influence of $v$ is contained in that of $u$, and at time $n$ the spheres are disjoint. Thus, in some narrow time interval around a time $t_0$ between times $T_v$ and $n$, the spheres become separated, and after this no more common neighbours can be formed. However, the conditions of this case do not guarantee that we have this situation: it may be that the conditions hold, but the sphere of influence of $v$ is not completely contained in that of $u$ at time $T_v$, or that the spheres are not completely disjoint at time $n$.
However, the conditions guarantee that the asymptotic behaviour still applies in this case. Let $t^-$ be the first time instance after $T_v$ when $S(v,t)$ is not completely contained in $S(u,t)$. Let $t^+$ be the last time when the spheres overlap, or $t^+ = n$ if the spheres overlap at time $n$. (So $T_v \le t^- \le t^+ \le n$.) From time $T_v$ until time $t^-$, each neighbour of $v$ will be a common neighbour of $v$ and $u$ with probability $p$. From time $t^+$ to $n$, no common neighbours can be created. From time $t^-$ until time $t^+$, the probability that a neighbour of $v$ becomes a neighbour of $u$ is at most $p$. Thus, $p\deg^-(v,t^-)$ and $p\deg^-(v,t^+)$ form a lower and an upper bound, respectively, on the expected number of common neighbours of $u$ and $v$.

Since the centres of $S(u,t)$ and $S(v,t)$ are at distance $d$ from each other, the definitions of $t^-$ and $t^+$ translate into the following conditions on the radii of the spheres of influence:
\[
r(u,t^-) - r(v,t^-) = (1+o(1))\,d, \qquad r(v,t^+) + r(u,t^+) = (1+o(1))\,d.
\]
(The factor $(1+o(1))$ is caused by the fact that the spheres of influence increase or decrease in discrete amounts.) Using Equation (14) for the degree of $u$ and $v$ from time $T_v$, and translating this into conditions on the radius of the sphere of influence, we obtain
\begin{align*}
r(u,t^-) - r(v,t^-) &= (1+o(1))\left[\left(\frac{A_1 k\left(\frac{t^-}{n}\right)^{p\rho A_1} + A_2}{c_m t^-}\right)^{1/m} - \left(\frac{A_1 j\left(\frac{t^-}{n}\right)^{p\rho A_1} + A_2}{c_m t^-}\right)^{1/m}\right]\\
&= (1+o(1))\left(\frac{A_1}{c_m}\,k\,n^{-p\rho A_1}\,(t^-)^{p\rho A_1 - 1}\right)^{1/m}\left(1 - \left(\frac{j}{k}\right)^{1/m}\right).
\end{align*}
A similar argument shows that
\[
r(v,t^+) + r(u,t^+) = (1+o(1))\left(\frac{A_1}{c_m}\,k\,n^{-p\rho A_1}\,(t^+)^{p\rho A_1 - 1}\right)^{1/m}\left(1 + \left(\frac{j}{k}\right)^{1/m}\right).
\]
Define $t_0 = (t^+ + t^-)/2$. Then, by the above, we get that $t^- = t_0\big(1 - O((j/k)^{1/m})\big)$, $t^+ = t_0\big(1 + O((j/k)^{1/m})\big)$, and
\[
t_0 = (1+o(1))\left(\frac{A_1 k}{c_m}\,n^{-p\rho A_1}\,d^{-m}\right)^{\frac{1}{1-p\rho A_1}}.
\]
It follows from the discussion from Case 2 that a.a.s. the number of common neighbours of $u$ and $v$ is bounded from below by $(1+o(1))\,p\deg^-(v,t^-)$, and from above by $(1+o(1))\,p\deg^-(v,t^+)$.
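The expression for $t_0$ above is the time at which the radius of $u$, predicted by (14) with the $A_2$ term neglected, equals $d$. The Python sketch below makes this explicit and also evaluates the leading-order count $p\deg^-(v,t_0) = pj(t_0/n)^{p\rho A_1}$. All function names are hypothetical; the choice $m = 2$ with $c_m = \pi$ (Euclidean unit disc) and the sample values of $k$, $j$ and $d$ are assumptions made only for illustration, and the sketch does not check the conditions of Case 3.

```python
import math


def radius_from_final_degree(final_degree, t, n, a, a1, c_m, m):
    """Radius at time t of a node with the given final in-degree.

    Uses deg(t) = final_degree * (t/n)^a from (14), with a = p*rho*A1,
    and neglects the lower-order A2 term (an assumption of this sketch).
    """
    degree_at_t = final_degree * (t / n) ** a
    return (a1 * degree_at_t / (c_m * t)) ** (1.0 / m)


def separation_time(k, d, n, a, a1, c_m, m):
    """Leading-order t_0 = ((A1*k/c_m) * n^(-a) * d^(-m))^(1/(1-a))."""
    return (a1 * k / c_m * n ** (-a) * d ** (-m)) ** (1.0 / (1.0 - a))


def common_neighbours_leading_term(j, t0, n, a, p):
    """Leading-order count p * j * (t0/n)^a, i.e. p * deg^-(v, t0)."""
    return p * j * (t0 / n) ** a


# Illustrative values: simulation parameters from Section 3, dense region.
n, p, a1 = 100_000, 0.6, 0.7
a = p * 1.6 * a1
t0 = separation_time(k=500, d=0.05, n=n, a=a, a1=a1, c_m=math.pi, m=2)
r_u = radius_from_final_degree(500, t0, n, a, a1, math.pi, 2)  # equals d by construction
cn = common_neighbours_leading_term(j=200, t0=t0, n=n, a=a, p=p)
```

Solving the same relation for $d$ instead of $t_0$ is the idea behind using the number of common neighbours as a distance estimator.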
Using our knowledge about the behaviour of the in-degree of $v$ given by (14), this leads to the following expression:
\begin{align*}
\mathrm{cn}(u,v,n) &= pj\left(\frac{t_0}{n}\right)^{p\rho A_1}\left(1 + o(1) + O\left(\left(\frac{j}{k}\right)^{1/m}\right)\right)\\
&= pj\,n^{-\frac{p\rho A_1}{1-p\rho A_1}}\left(\frac{A_1}{c_m}\,k\,d^{-m}\right)^{\frac{p\rho A_1}{1-p\rho A_1}}\left(1 + o(1) + O\left(\left(\frac{j}{k}\right)^{1/m}\right)\right).
\end{align*}
This concludes the proof.

References

[1] W. Aiello, A. Bonato, C. Cooper, J. Janssen, and P. Pralat. A spatial web graph model with local influence regions. Internet Mathematics, 5(1-2):175–196, 2008.

[2] M. Bradonjić, A. Hagberg, and A. Percus. The structure of geographical threshold graphs. Internet Mathematics, 5:113–140, 2008.

[3] F. Chung and L. Lu. Complex Graphs and Networks. American Mathematical Society, August 2006.

[4] C. Cooper, A. Frieze, and P. Pralat. Some typical properties of the spatial preferred attachment model. Internet Mathematics, 10:27–47, 2014.

[5] A. Flaxman, A. Frieze, and J. Vera. A geometric preferential attachment model of networks. Internet Mathematics, 3(2):187–206, 2006.

[6] A. Flaxman, A. Frieze, and J. Vera. A geometric preferential attachment model of networks II. Internet Mathematics, 4(1):87–112, 2007.

[7] J. Haslegrave and J. Jordan. Preferential attachment with choice. Random Structures & Algorithms, 2015. To appear.

[8] P. Hoff, A. Raftery, and M. Handcock. Latent space approaches to social network analysis. Journal of the American Statistical Association, 97:1090–1098, 2001.

[9] E. Jacob and P. Mörters. Robustness of scale-free spatial networks. arXiv:1504.00618, 2015.

[10] E. Jacob and P. Mörters. Spatial preferential attachment networks: Power laws and clustering coefficients. Annals of Applied Probability, 25(2):632–662, 2015.

[11] S. Janson, T. Luczak, and A. Ruciński. Random Graphs. Wiley, New York, USA, 2000.

[12] J. Janssen, M. Hurshman, and N. Kalyaniwalla. Model selection for social networks using graphlets. Internet Mathematics, 8(4):338–363, 2012.

[13] J. Janssen, P. Pralat, and R. Wilson. Asymmetric distribution of nodes in the spatial preferred attachment model. In Proceedings of the 10th Workshop on Algorithms and Models for the Web Graph (WAW 2013), volume 8305 of Lecture Notes in Computer Science, pages 1–13. Springer, 2013.

[14] J. Janssen, P. Pralat, and R. Wilson. Geometric graph properties of the spatial preferred attachment model. Advances in Applied Mathematics, 50(2):243–267, 2013.

[15] G. Jeh and J. Widom. SimRank: a measure of structural-context similarity. In Knowledge Discovery and Data Mining, pages 538–543, 2002.

[16] J. Jordan. Degree sequences of geometric preferential attachment graphs. Advances in Applied Probability, 42:319–330, 2010.

[17] J. Jordan. Geometric preferential attachment in non-uniform metric spaces. Electronic Journal of Probability, 18(8):1–15, 2013.

[18] J. Jordan and A. Wade. Phase transitions for random geometric preferential attachment graphs. Advances in Applied Probability, 47(2):565–588, 2015.

[19] M. Kobayakawa, S. Kinjo, M. Hoshi, T. Ohmori, and A. Yamamoto. Fast computation of similarity based on Jaccard coefficient for composition-based image retrieval. In Proceedings of the 10th Pacific Rim Conference on Multimedia: Advances in Multimedia Information Processing, PCM '09, pages 949–955. Springer-Verlag, 2009.

[20] H. Small. Co-citation in the scientific literature: A new measure of the relationship between two documents. Journal of the American Society for Information Science, 24(4):265–269, 1973.