Download Non-Uniform Distribution of Nodes in the Spatial Preferential

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Ars Conjectandi wikipedia , lookup

Probability interpretations wikipedia , lookup

Stochastic geometry models of wireless networks wikipedia , lookup

Transcript
Non-Uniform Distribution of Nodes in the Spatial
Preferential Attachment Model
Jeannette Janssen∗
Pawel Pralat†
Rory Wilson‡
October 14, 2015
Abstract
The spatial preferential attachment (SPA) is a model for complex networks.
In the SPA model, nodes are embedded in a metric space, and each node has a
sphere of influence whose size increases if the node gains an in-link, and otherwise
decreases with time. In this paper, we study the behaviour of the SPA model when
the distribution of the nodes is non-uniform. Specifically, the space is divided into
dense and sparse regions, where it is assumed that the dense regions correspond to
coherent communities. We prove precise theoretical results regarding the degree
of a node, the number of common neighbours, and the average out-degree in
a region. Moreover, we show how these theoretically derived results about the
graph properties of the model can be used to formulate a reliable estimator for
the distance between certain pairs of nodes, and to estimate the density of the
region containing a given node.
Keywords— Spatial Random Graphs, Spatial Preferential Attachment Model,
Preferential Attachment, Complex Networks, Web Graph, Co-citation, Common Neighbours
1
Introduction
There has been a great deal of recent interest in modelling complex networks, a result
of the increasing connectedness of our world. The hyperlinked structure of the Web,
citation patterns, friendship relationships, infectious disease spread, these are seemingly
disparate linked data sets which have fundamentally very similar natures.
Many models of complex networks have a common weakness: the ‘uniformity’ of the
nodes; other than link structure there is no way to distinguish the nodes. One family of
∗
Department of Mathematics and Statistics, Dalhousie University, Halifax, NS, Canada
Department of Mathematics, Ryerson University, Toronto, ON, Canada
‡
Department of Mathematics and Statistics, Dalhousie University, Halifax, NS, Canada
†
1
models which overcomes this deficiency is the family of spatial (or geometric) models,
wherein the nodes are embedded in a metric space. A node’s position—especially in
relation to the others—has real-world meaning: the character of the node is encoded
in its location. Similar nodes are closer in the space than dissimilar nodes. This metric
space has many potential meanings: in communication networks, perhaps physical
distance; in a friendship graph, an interest space; in the World Wide Web, a topic
space. As an illustration, a node representing a webpage on pet food would be closer
in the metric space to one on general pet care than to one on travel.
The Spatial Preferrential Attachment Model [1], designed as a model for the World
Wide Web, is one such spatial model. Indeed, as its name suggests, the SPA Model
combines geometry and preferential attachment. Setting the SPA Model apart is the
incorporation of ‘spheres of influence’ to accomplish preferential attachment: the greater
the degree of the node, the larger its sphere of influence, and hence the higher the
likelihood of the node gaining more neighbours. The SPA model produces scale-free
networks, which exhibit many of the characteristics of real-life networks (see [1, 4]).
In [12], it was shown that the SPA model gave the best fit, in terms of graph structure,
for a series of social networks derived from Facebook.
As the motivation behind spatial models is the ‘second layer of meaning’—the character of the nodes as represented by their positions in the metric space—we hope to
uncover this layer through examination of the link structure. In particular, estimating
the distance between nodes in the metric space forms the basis for two important link
mining tasks: finding entities that are similar—represented by nodes that are close together in the metric space—and finding communities—represented by spatial clusters
of nodes in the metric space. We show how a theoretical analysis of a spatial model
can lead to reliable tools to extract the ‘second layer of meaning’.
The majority of the spatial models to this point have used uniform random distribution of nodes in the space. However, considering the real-world networks these models
represent, this concept does not capture the following essential aspect of real-life data.
Indeed, on a basic level, if the metric space represents actual physical space, and the
nodes people, then we note that people cluster in cities and towns, rather than being
uniformly spread across the land. More abstractly, there are more webpages on a popular topic, corresponding to a small area of our metric space, than for a more obscure
topic. The development of spatial network models naturally then begins to incorporate
varying densities of node distribution: both ‘clumps’ of higher/lower density, as well
as gradually changing densities, are possibilities. One of the more important goals is
that of community recognition: the discovery and quantification of characteristically
(semantically) similar nodes.
In this work we generalize the SPA model to an inhomogeneous distribution of
nodes within the space. We assume distinct regions of different densities, where the
dense regions are the ‘clusters’. We find that the local regions behave almost as if
generated by independent SPA models of parameters derived from the densities. Many
earlier results from the SPA Model then translate easily to this inhomogeneous version
and we begin the process of uncovering the geometry using link analysis.
2
In the remainder of this section, we first review related work, and then we give a
formal definition of the SPA model. In Section 2 we state our main theoretical results.
In particular, we give the typical behaviour of the in-degree of a node, and use this to
derive a relationship between spatial distance and number of common neighbours of a
pair of nodes. The proofs of the theorems are given in Section 4.
In Section 3 we verify the asymptotic results from Section 2 through a simulation
of the SPA model to generate large graphs. Specifically, we show how the relationship
between spatial distance and common neighbours can be used to devise a distance
estimator which gives precise results. We also use the theoretical results to estimate
the local density around a node. Our simulations show that these estimators give
reliable results on the simulated data.
1.1
Background and Related Work
Efforts to extract node information through link analysis began with a heuristic quantification of entity similarity: numerical values, obtained from the graph structure,
indicating the relatedness of two nodes. Early simple measures of entity similarity,
such as the Jaccard coefficient [19], gave way to iterative graph theoretic measures, in
which two objects are similar if they are related to similar objects, such as SimRank [15].
Many such measures also incorporate co-citation, the number of common neighbours
of two nodes, as proposed in the context of bibliographic research in an early paper by
Small [20]. In [8], the authors make inferences on the social space for nodes in a social
network, using Bayesian methods and maximum likelihood.
Generative spatial models were proposed in a more general setting, where the main
objective was to generate graphs with properties that correspond to those observed in
real-life networks. Different approaches were explored, for example in [2] using thresholds, or in [5, 6] using a geometric variant of the preferential attachment. Graph properties of this model were analyzed by Jordan in [16]; follow-up work on this model can
be found in [7, 18]. In [17], a non-uniform distribution of the points in space is considered. In [10], Jacob and Mörters propose a probabilistic spatial model where the link
probability is a function decreasing with distance. The setting is general, and includes
the SPA model as a special case. Follow-up work on this model can be found in [9].
The SPA model was first proposed in [1] as a model for the World Wide Web.
In [1] and [4], it was proved that the SPA model produces graphs with certain graph
properties that correspond to those observed in real-life networks. The authors’ previous
paper, [14], used common neighbours to explore the underlying geometry of the SPA
model and quantify node similarity based on distance in the space. However, the
distribution of nodes in space was assumed to be uniform. The approach used in this
paper is similar to that in [14], but we investigate the complications that arise when
the distribution is non-uniform, which is clearly a more realistic setting.a
An earlier version of this work, containing no proofs, was presented at the workshop
WAW 2013. An extended abstract can be found in [13].
3
1.2
The Inhomogeneous SPA Model
We begin with a brief description of our inhomogeneous SPA model. The model presented here is a generalization of the SPA model introduced in [1], the main difference
being that we allow for an inhomogeneous distribution of nodes in the space.
Let S be the unit hypercube in Rm , equipped with the torus metric derived from the
Euclidean norm, or any equivalent metric. The nodes {vt }nt=1 of the graphs produced
by the SPA model are points in S chosen via an m-dimensional point process. Most
generally, the process
R is given by a probability density function ρ; ρ is a measurable
function such that S ρdµ = 1, where µ is the Lebesque measure.R Precisely, for any
measurable set A ⊆ S and any t such that 1 ≤ t ≤ n, P(vt ∈ A) = A ρdµ.
In fact, we will restrict ourselves to probability functions that are locally constant.
Precisely, we assume that the space S = [0, 1)m is divided into Lm equal sized hypercubes, where L is a constant natural number. Each hypercube is of the form
Ij1 × Ij2 × · · · × Ijm (0 ≤ j1 , j2 , . . . , jm < L), where Ij = [j/L, (j + 1)/L). Note
that any density function ρ can be approximated by such a locally constant function,
so that this restriction is justified.
To keep notation as simple as possible, we assume that each hypercube is labelled
R` , 1 ≤ ` ≤ Lm . Let ρ` be the density of R` , so the density function has value ρ`
on R` . For any node v, let R(v) be the hypercube containing v, and let ρ(v) be the
density of R(v). Clearly, every hypercube has volume L−m . Then the probability that
a node vt , introduced at time t, falls in R` equals q` = ρ` L−m , P
and the expected number
−m
of points in R` equals q` n = ρ` L n. It is easy to see that ` q` = 1. Thus we may
model the point process as follows: at each time step t, one of the regions is chosen as
the destination of vt ; region R` is chosen with probability q` . Then, a location for vt is
chosen uniformly at random from the chosen region R` .
The SPA model generates stochastic sequences for graphs {Gt }t≥0 ; for each t ≥ 0,
Gt = (Vt , Et ), where Et is an edge set, and Vt ⊆ S is a node set. The in-degree of a
node v at time t is given by deg− (v, t). Likewise the out-degree is given by deg+ (v, t).
The sphere of influence S(v, t) of a node v at time t is defined as the ball, centred at v,
with total volume
A1 deg− (v, t) + A2
,
|S(v, t)| =
t
where A1 , A2 > 0 are given parameters. If (A1 deg− (v, t) + A2 )/t ≥ 1, then S(v, t) = S
and so |S(v, t)| = 1. The generation of a SPA model graph begins at time t = 0 with G0
being the null graph. At each time step t ≥ 1 (defined to be the transition from Gt−1 to
Gt ), a node vt is chosen from S according to the given spatial distribution, and added to
Vt−1 to form Vt . Next, independently, for each node u ∈ Vt−1 such that vt ∈ S(u, t − 1),
a directed link (vt , u) is created with probability p, p ∈ (0, 1) being another parameter
of the model. We impose the additional restriction that pA1 max` ρ` < 1; this avoids
regions becoming too dense. This property will be always assumed.
Let δ(v) be the distance from v to the boundary of R(v). Let r(v, t) be the radius of
the sphere of influence of node v at time t. So if r(v, t) ≤ δ(v), then S(v, t) is completely
4
contained in R(v) at time t. We see that
1/m
r(v, t) = (|S(v, t)|/cm )
=
A1 deg− (v, t) + A2
cm t
1/m
,
where cm is the volume of the unit ball; for example, in 2-dimensions with the Euclidean
metric, c2 = π.
As typical in random graph theory, we shall consider only asymptotic properties
of Gn as n → ∞. We say that an event in a probability space holds asymptotically
almost surely (a.a.s.) if its probability tends to one as n goes to infinity. We emphasize
that the notations o(·) and O(·) refer to functions of n, not necessarily positive, whose
growth is bounded. Since we aim for results that hold a.a.s., we will always assume
that n is large enough.
2
Graph properties of the SPA model
In this section we investigate typical properties of graphs produced by the inhomogeneous SPA model, aiming to use the results to infer the spatial distances between the
nodes. A central observation is that in the inhomogeneous SPA model with a locally
constant density function, the probability of an edge forming from a new node vt to an
existing node v at time t equals
Z
X
ρ` |S(v, t) ∩ R` |.
P (vt , v) ∈ E(Gn ) = p
ρdµ = p
S(v,t)
`
In the analysis of the original SPA model from [1], we find that spheres of influence
of nodes that are born early typically shrink rapidly, while nodes born late start with
small spheres of influence. A node would have to be quite close to the boundary of
its region with another one for the effect of any other region to be felt. With this
assumption, the expression for the link probability is very similar to that of the link
probability of the original SPA model. Therefore, it seems reasonable to expect that the
graph formed by nodes in a region R` with local density ρ` behaves like an independent
SPA model of density ρ` . Our results will show that this expectation is justified and
can be made rigorous.
To be specific, assume that nodes in the SPA model do not arrive at fixed, discrete,
time instances t, but instead arrive according to a homogeneous Poisson process with
rate 1. (This will not significantly change the analysis but is a convenient assumption.)
Then, the process inside a region R with density ρ will behave like the SPA model with
the same parameters A1 , A2 and p, but with points arriving according to a Poisson
process with rate ρ. This means that in each time interval we expect ρ points to arrive,
and the expected time interval between arrivals equals 1/ρ. If we use vt to denote the
t-th node arriving, then the arrival time a(t) of vt is approximately t/ρ, and thus the
5
volume of the sphere of influence of an existing node v at the time that vt is born equals
ρA1 deg− (v, a(t)) + ρA2
A1 deg− (v, a(t)) + A2
≈
.
a(t)
t
|S(v, a(t))| =
Thus, in the analysis of the degree of an individual node, we expect a node v in the
inhomogeneous SPA model to behave like a node in the original SPA model with parameters ρ(v)A1 , ρ(v)A2 instead of A1 , A2 , where the degree of node v at time t in
the inhomogeneous SPA model corresponds to the degree of a node at time a(t) in the
corresponding SPA model. The following theorems show that this is indeed the case.
Theorem 2.1. Let ω = ω(n) be any function tending to infinity together with n, and
let ε > 0. The following holds with probability 1 − o(n−1 ). For every node v for which
deg− (v, n) = k = k(n) ≥ ω 2 log n
and for which
δ(v) ≥ (1 + ε)
A1 k + A2
cm n
1/m
= (1 + ε)r(v, n),
(1)
it holds that for all values of t such that max{tv , Tv } ≤ t ≤ n,
pρ(v)A1
t
deg (v, t) = (1 + o(1))k
.
n
−
(2)
Times Tv and tv are defined as follows:
Tv = n
ω log n
k
1
pρ(v)A
1
,
tv = (1 + ε)
A1 k
m
δ(v) cm npρ(v)A1
1
1−pρ(v)A
1
.
(3)
Condition (1) on δ(v) ensures that at time n, S(v, n) is completely contained in
R(v) (deterministically). In fact, due to the additional multiplicative factor of (1 + ε),
S(v, n) is some distance removed from the boundary of R(v). The expression for Tv
is chosen so that at this time node v has a.a.s. at least ω log n neighbours. Likewise,
tv is chosen such that at this time a.a.s. the sphere of influence has shrunk so that
its radius is sufficiently smaller than δ(v), again with some extra room to spare. The
implication of this theorem is that once a node accumulates at least ω log n neighbours
and its sphere of influence has shrunk so that it does not intersect neighbouring regions,
its behaviour can be predicted with high probability until the end of the process, and
is completely governed by its region, and no others. In particular, it follows that from
time max{tv , Tv } onwards the sphere of influence is completely contained in R(v).
For most vertices, the moment when they first achieve ω log n neighbours (Tv ,) will
come before the moment that their sphere of influence has shrunk so that it is well
contained in the region (tv ). Indeed, consider a vertex v of degree at least ω log n for
which this is not the case. Let T be the moment when the vertex reaches in-degree
6
ω log n. By definition, the sphere of influence of v at this time T has a radius of influence
n 1/m
of order ω log
. If T ω log n, then the radius is o(1), and the probability that
T
v is this close to the border is also o(1). The only vertices for which potentially the
radius at time T could be fairly large are those vertices for which T = O(ω log n). Thus,
these are the oldest vertices. These vertices do have high degree, but their spheres of
influence still tend to shrink over time, so most of their edges will be acquired after
time tv , that is, when their sphere of influence has shrunk to be contained in the region.
We can use the results on the degree to show that each graph induced by one of
the regions R` has a power law degree distribution. Let N` (j, n) denote the number
of nodes of degree j at time n in the region R` . The proof of the following result is
a straightforward adaptation of the differential equations method used to prove the
counterpart result for the uniform model (see [1]). Since this theorem is not needed to
prove the main result of this paper, the proof is omitted here.
Theorem 2.2. A.a.s. the graph induced by the nodes in region R` has a power law
degree distribution with coefficient 1 + 1/(pρ` A1 ). Precisely, a.a.s. for any 1 ≤ ` ≤ k m
pρmax A1
there exists a constant c` such that for any 1 j ≤ jf = n/ log8 n 4pρmax A1 +2 ,
N` (j, n) = (1 + o(1))c` j
−(1+ pρ 1A )
`
1
q` n.
Moreover, a.a.s. the entire graph generated by the inhomogeneous SPA model has a
degree distribution whose tail follows a power law with coefficient 1 + 1/(pρmax A1 ).
The number of edges also validates our hypothesis that a region of a certain density
behaves almost as a uniform SPA model with adjusted parameters. In the original SPA
model with parameters A1 and A2 replaced by ρA1 , ρA2 and p, the average out-degree
pρA2
, as per [1, Theorem 1.3]. The following theorem shows that the
is approximately 1−pρA
1
subgraph induced by one of the regions has the equivalent expected number of edges.
This theorem also shows that a.a.s. the number of edges that cross the boundary of a
region is of smaller order than the number of edges completely contained in that region.
Thus, almost all edges have both endpoints in the same region.
Theorem 2.3. A.a.s., for all regions R` of density ρ` , |V (Gn ) ∩ R` | = (1 + o(1))q` n.
Moreover,
E({(u, v) ∈ E(Gn ) | u, v ∈ R` }|) = (1 + o(1))
pρ` A2
q` n.
1 − pρ` A1
Furthermore, a.a.s.
|{(u, v) ∈ E(Gn ) : R(u) 6= R(v)}| = o(n).
Here we see that we need the condition pρmax A1 < 1. If pρmax A1 ≥ 1, then the
number of edges would grow superlinearly.
7
Our ultimate goal is to derive the pairwise distances between the nodes in the metric
space through an analysis of the graph. The following theorem, obtained using the
approach of [14], provides an important tool. Namely, it links the number of common
in-neighbours of a pair of nodes to their (metric) distance. Using this theorem, we can
then infer the distance from the number of common in-neighbours.
The theorem distinguishes three cases. If u and v are relatively far from each other,
then they will have no common neighbours. If the nodes are very close, then the number
of common neighbours is approximately equal to a fraction p of the degree of the node
of smallest degree. The third case provides a ‘sweet spot’ where the number of common
neighbours is a direct function of the metric distance and the degrees of the nodes. For
any two nodes u and v, let cn(u, v, t) denote the number of common in-neighbours of u
and v at time t.
Theorem 2.4. Let ω = ω(n) be any function tending to infinity together with n, and let
ε > 0. The following holds a.a.s. Let u and v be nodes of final degrees deg(u, n) = k and
deg(v, n) = j such that R = R(u) = R(v), and k ≥ j ≥ ω 2 log n. Let ρ = ρ(v) = ρ(u)
1
n pρA1
, and assume that
and let Tv = n ω log
j
m
m
δ(v) ≥ cj and δ(u) ≥ ck,
where c = (1 + ε)
A1
pρA
cm n 1 Tv1−pρA1
.
Let d(u, v) be the distance between u and v in the metric space. Then, we have the
following result about the number of common in-neighbours of u and v:
Case 1. If
d(u, v) ≥ ε
(ω log n)(k/j)
Tv
1/m
then cn(u, v, n) = O(ω log n).
Case 2. If
d(u, v) ≤
A1 k + A2
cm n
1/m
−
A1 j + A2
cm n
1/m
then cn(u, v, n) = (1 + o(1))pj.
Case 3. If
A1 k + A2
cm n
1/m
−
A1 j + A2
cm n
1/m
< d(u, v) < ε
then
cn(u, v, n) = Cjn
−α α −mα
k d
8
(ω log n)(k/j)
Tv
1/m !!
j
1 + o(1) + O
,
k
1/m
,
(4)
where d = d(u, v) and
α=
pρA1
1 − pρA1
and
C = pAα1 c−α
m .
Note that, if j k, then we have a precise asymptotic formula for cn(u, v, n). If
j and k are approximately equal, then the formula only states that cn(u, v, n) =
Θ(jn−α k α d−mα ).
3
Reconstruction of Geometry
We set out to discover the character of nodes in a network purely through link structure,
and to quantify the similarities. Spatial models allow us a convenient definition of
similarity: distances between nodes. In examining the SPA model, the number of
common neighbours allows us to uncover a good approximation of pairwise distances,
a first step in the reconstruction of the geometry.
Description of Model Used: For simulations, we use an inhomogeneous SPA model
that we call a diagonal layout, which has 4 ‘clusters’ of identical high density, with
m = 2. In the diagonal layout, k = 4 and the 4 regions (x, x), 1 ≤ x ≤ 4, are dense,
with the others sparse. We will use ‘dense region’ and ‘sparse region’ to denote the
union of all regions with densities ρd and ρs , respectively. For ease of notation, we note
1
that 16
(4ρd + 12ρs ) = 1, so ρs = 4/3 − ρd /3. Thus it is enough to provide the value of
ρd only. In Figure 1 we see an example of the diagonal layout with nodes and edges,
and we also see evidence that the densest region does dominate the power law degree
distribution. The yellow line is the prediction for the degree distribution with the power
law exponent based on the maximum density, as in Theorem 2.2.
Our estimator for the distance is derived from Case 3 of Theorem 2.4, and in particular Equation (4), ignoring the error term. This leads to the following formula for the
ˆ For a pair of nodes u, v with deg− (u, n) = k and deg− (v, n) = j,
estimated distance d.
k ≥ j, whose distance is such that Case 3 applies, this estimate is given by:
ˆ v) =
d(u,
pγ A1 j γ k
ncm cn(u, v, n)γ
m1
,
(5)
1
(note that γ = α1 with α as in Equation (4)). If the density is
where γ = 1−pρ(u)A
pρ(u)A1
uniform, that is, if ρ(u) = 1 for all u, then the above estimator is the same as our
original estimator: Equation (7) from [14].
Since a relationship between the spatial distance only exists in Case 3 of Theorem 2.4, we try to eliminate pairs to which one of the other cases applies. Pairs which
are in Case 1 are very close, and for such pairs, the expected number of common neighbours is p deg(v, n) = pj. In an attempt to avoid this case, we filter out all pairs where
the number of common neighbours is greater than pj/2. Pairs that are in Case 2 are
9
16
1
14
Logarithm of Number of Vertices
0.9
0.8
0.7
0.6
0.5
0.4
0.3
0.2
Observed
Theoretical Slope
12
10
8
6
4
2
0.1
0
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
0
0
1
1
2
3
4
Logarithm of Degree
5
6
Figure 1: Left: diagonal layout, n = 1, 000, p = 0.6, ρd = 1.6, A1 = 0.7, A2 = 2.0;
Right: degree distribution n = 1, 000, 000, p = 0.7, ρd = 1.2, A1 = 0.7, A2 = 1.0.
so far apart that their spheres of influence have overlap for a very short time, if at all.
We try to avoid this case by eliminating pairs with 10 or fewer common neighbours.
To see the effect of the non-uniform density, we first apply the original estimator to
our diagonal layout. In other words, we define the estimated distance as in Equation (5),
but taking ρ(u) = 1 for all u, and we are applying this estimator to the points obtained
from a non-uniform distribution. The motivation of this experiment is that, when
applying our techniques to real-life data, we are not likely to know the local density
of a node. Figure 2 (left side) gives the estimated versus real distance for a graph
with n = 100, 000 nodes, generated via the SPA model from the diagonal layout with
parameters p = 0.7, ρd = 1.6, A1 = 0.7, A2 = 2.0. After filtering as described above,
2,270 pairs are left.
The figure shows that the approach of assuming uniform density leads to a consistent
overestimate of the distance for the nodes. This may seem counterintuitive. The trouble
lies with the estimator’s assumption about a node’s age, which is based on its final indegree. A node in Rd has more neighbours than is expected when one assumes uniform
density, and thus the node is thought to be much older than it actually is. This
confounds the distance estimator.
Using the same simulation results, we now apply the estimator from Equation (5),
and use our knowledge about ρ(u). The figure on the right in Figure 2 shows the
estimated distance using Equation (5) vs. actual node distance. The results indicate
that our new estimator is significantly more accurate in predicting distances for the
pairs of nodes in the dense region.
Let us mention that the estimation for pairs in the sparse region is still not accurate,
while the estimation for cross-border pairs appears to be even worse. This is likely
caused by the fact that nodes that are involved in cross-border pairs, and in sparse
region pairs, and that have enough common neighbours to qualify to be included, are
10
Estimated Distance
Estimated Distance
0.1
0.05
Cross Pairs
Sparse Region
0.1
0.05
Sparse Region
Dense Region
Dense Region
Perf. Est
0
0
0.05
0.1
Actual Distance
Cross Pairs
Perf. Est
0.15
0
0
0.05
0.1
Actual Distance
0.15
Figure 2: SPA model, n = 100, 000, diagonal layout, p = 0.7, ρd = 1.6, A1 = 0.7, A2 =
2.0, actual vs. estimated distances for pairs of nodes; Left: using original estimator;
Right: using new estimator, density known
likely the older nodes, i.e. they are born near the beginning of the process. For such
nodes, in the early stages there is likely some overlap between their sphere of influence
and the bordering, dense regions. Thus, the degree likely does not follow the prediction
from Theorem 2.1, which, in turn, affects the performance of the distance estimator for
those pairs.
Better performance for cross-border pairs could possibly be obtained by using a
linear combination of the densities in Equation (5). However, we will see in what
follows that better performance for all pairs occurs when we use the data itself to
estimate the density. Also, we point out that the pairs in the dense region constitute
the large majority of all pairs. Moreover, the dense regions are those that are most
likely to correspond to communities of interest. Therefore, accurate prediction for pairs
in these regions is most important.
Estimating the density: In real-world situations, we cannot assume to know the
density of the region containing a given node. In fact, the density of the region containing a node is an important part of the ‘second layer of meaning’ which we aim to
extract from the graph. Here we will show that our theoretical results give us a tool
for estimating the local density around a node, using only its neighbourhood. We also
apply our distance estimator once more, this time using the estimated density for our
formula.
Using the theoretical results obtained from the previous section, we can estimate the
density of the region R(v) containing a given node v from the average out-degree of the
in-neighbours of v. As per Theorem 2.3, the average out-degree in R` is approximately
pρ` A2
.
1 − pρ` A1
(6)
If we have a large enough set of nodes from the same region, then we can use the
formula above to estimate the density of the region. Consider a node v, and make
11
two assumptions: (i) almost all neighbours of v are contained in R(v), and (ii) the
neighbours of v form a representative sample of all nodes of R(v). Simulations show that
these assumptions are justified and allow us to make an estimate for ρ(v). Assumption
(i) is additionally justified by the second part of Theorem 2.3, which states that the
number of edges crossing the border is negligible compared to the total number of edges.
+
Set deg (v) to be the average out-degree of the in-neighbours of v. Specifically,
+
deg (v) =
1
deg− (v)
X
deg+ (u),
u∈N − (v)
where N − (v) is the set of in-neighbours of v. Given our assumptions, an estimator
for the density in R(v), denoted by ρ̂(v), can be derived from this average out-degree,
using Equation (6):
+
deg (v)
.
ρ̂(v) =
+
pA2 + pA1 deg (v)
+
The left side of Figure 3 shows a histogram of the values of deg (v) for our simulated
graph. Displayed are the results for nodes with deg− (v) ≥ 10. The graph is obtained
from the SPA model where points have the previously described diagonal layout, with
density ρd = 1.6 in the dense region, and consequently density ρs = 0.8 in the sparse
+
region. For these parameters, Equation (6) gives a theoretical value of 5.85 for deg (v)
if node v lies in the dense region, and a value of 1.45 if v lies in the sparse region.
+
We see in Figure 3 (left side) that the values of deg (v) in the dense region are
quite accurate, with peaks occurring around the calculated value of 5.85. For the sparse
region, the peaks occur around 2.5, giving an estimate for the average out-degree which
is higher than expected. Likely, this is caused by nodes in the sparse region that are
located close to the border, and thus are likely to have neighbours in the dense region.
Such nodes also tend to have high degree, and our condition on the minimum degree
favours the ‘rich’ sparse region nodes.
Figure 3 (right side) gives a histogram of the estimated densities of the nodes. For
nodes in the dense region, the true value is 1.6, and we see a good estimation of this
value for these nodes. For nodes in the sparse nodes, the true value is 0.8, while the
peak of the estimated densities occurs around 1.15, and almost all values are greater
than 0.8. Again, this is likely caused by nodes whose sphere of influence overlapped
with the dense region.
To obtain better performance for nodes in the sparse region, we propose to base our
estimated density for a node v in the sparse region only on the out-degree of neighbours
of v of low in-degree; such neighbours are young and so the sphere of influence of v
had shrunk, and thus was more likely to be fully inside the sparse region, when the
neighbours were born. To obtain density estimates for nodes with small in-degree, we
can take the second neighbourhood to compute the average out-degree. Nodes with
small in-degrees are young, so even second neighbours are likely to be close. We plan
to explore these possibilities in future work.
12
150
150
Dense Region
Sparse Region
100
Number of Vertices
Number of Vertices
Dense Region
50
0
0
5
10
Average Out−degree of In−Neighbours
50
0
15
Sparse Region
100
0.8
1
1.2
1.4
1.6
Calculated Density
1.8
2
Figure 3: Diagonal layout, n = 100, 000, ρd = 1.6, A1 = 0.7, A2 = 2.0, p = 0.6;
Left: average out-degree of the in-neighbours; Right: calculated density from average
out-degree.
Finally, we use ρ̂, and known values of all other parameters, to calculate the distance
between the nodes based on the number of common neighbours, Equation (5), using
the same simulation results as those we used earlier. Here we use the calculated density
of the node of higher degree in the distance formula. (Using the lower degree node gives
similar results.) The results are seen in Figure 4.
The figure shows that there is very good agreement between calculated and estimated densities. In fact, we see that the agreement is greatly improved for the crossborder pairs, and also better for the sparse region pairs. This can be understood as
follows. The distance estimator is derived indirectly from Theorem 4.1, which predicts
the approximate degree of a node throughout the process, based on its final degree and
the density of its region. For nodes in the sparse region which have a sizeable number
of neighbours in the sparse region, the degree will be larger than predicted using this
method but also, the density estimator will predict a higher density. So the estimated
density is a better indicator of the behaviour of the degree than the real density, and
thus the distance estimator gives better performance. This indicates that this last variation of the distance estimator is the most robust against local fluctuations in density.
Thus we have a good prognosis for the applicability of the estimator on real data, where
such fluctuations are to be expected.
4
Proofs
In this section, we give the proofs of the main theorems. Our results all refer to typical
behaviour of the random SPA model process, and are asymptotic in n, the number of
vertices. We will sometimes use the stronger notion of w.e.p. in favour of the more
commonly used a.a.s., since it simplifies some of our proofs. We say that an event holds
with extreme probability (w.e.p.), if it holds with probability at least 1−exp(−ω(n) log n)
13
Cross Pairs
Estimated Distance
Sparse Region
0.1
Dense Region
Perf. Est
0.05
0
0
0.05
0.1
Actual Distance
0.15
Figure 4: Diagonal layout, n = 100, 000, ρd = 1.6, A1 = 0.7, A2 = 2.0, p = 0.7:
Distance estimation using estimated density from the node of greater final degree, all
other parameters known
as n → ∞, where ω(n) is any function tending to infinity together with n. Thus, if we
consider a polynomial number of events that each holds w.e.p., then w.e.p. (and hence
also a.a.s.) all events hold.
First we state and prove a theorem that bounds the in-degree of any node, regardless
of its distance of the boundary.
Theorem 4.1. Let ω = ω(t) be any function tending to infinity together with t. The
expected in-degree at time t of a node vi born at time i ≥ ω satisfies
pρmax A1
A2
A2 t
−
−
.
E(deg (vi , t)) ≤ (1 + o(1))
A1 i
A1
p(2−m ρ(v)+(1−2−m )ρmin )A1
A2 t
A2
−
E(deg (vi , t)) ≥ (1 + o(1))
−
.
A1 i
A1
Moreover, for any node vi born at time i ≥ 1 we have
pρmax A1
A2
eA2 t
−
E(deg (vi , t)) ≤
−
.
A1 i
A1
Proof. In order to simplify calculations, we make the following substitution:
Y (vi , t) = deg− (vi , t) +
t
A2
=
|S(vi , t)|.
A1
A1
It follows immediately from the definition of the process that Y (vi , i) = A2 /A1 and for
t≥i
(
Y (vi , t) + 1, with probability at most pρmax At1 Y (vi , t),
Y (vi , t + 1) =
Y (vi , t),
otherwise.
14
We couple Y (vi , t) with another random variable X(vi , t) so that Y (vi , t) ≤ X(vi , t)
for t ≥ i. Random variable X(vi , t) is defined as follows: X(vi , i) = A2 /A1 and for t ≥ i
(
X(vi , t) + 1, with probability pρmax At1 X(vi , t)
X(vi , t + 1) =
X(vi , t),
otherwise.
Finding the conditional expectation,
pρmax A1
.
E(X(vi , t + 1) | X(vi , t)) = X(vi , t) 1 +
t
Taking expectations, we get
pρmax A1
E(X(vi , t + 1)) = E(X(vi , t)) 1 +
t
,
and, since X(vi , i) = A2 /A1 ,
t−1 pρmax A1
A2 Y
1+
E(X(vi , t)) =
A1 j=i
j
!
t−1
X
A2
pρmax A1
≤
exp
A1
j
j=i
t
A2
≤
exp pρmax A1 log
+ 1/i
A1
i
pρmax A1
eA2 t
.
≤
A1 i
If i ≥ ω, we have
t−1
X
A2
pρmax A1
E(X(vi , t)) = (1 + o(1)) exp
A1
j
j=i
!
A2
= (1 + o(1))
A1
pρmax A1
t
.
i
This shows the upper bound.
For the lower bound, we first observe that for all nodes v, |S(v, t) ∩ R(v)| ≥
(1/2)m |S(v, t)|. Thus the node vt links to vi with probability at least
p(2−m ρ(v) + (1 − 2−m )ρmin )|S(vi , t)|.
Using this, we can use the exact same approach to bound the expectation of Y (vi , t)
from below. This gives the lower bound.
15
Proof of Theorem 2.1
Here we show that, once a vertex has reached an in-degree of ω log n and its area of
influence is well contained within the region, its degree can be closely predicted with
high probability. We will be using the following version of the Chernoff bound, as seen
in e.g. [11, p. 27, Corollary 2.3].
Lemma 4.2. Let X be a random variable
that can be expressed as a sum of independent
Pn
random indicator variables, X = i=1 Xi , where Xi ∈ Be(pi ) with (possibly) different
pi = P(Xi = 1) = EXi . If ε ≤ 3/2, then
2
ε EX
.
(7)
P(|X − EX| ≥ εEX) ≤ 2 exp −
3
Let us start with the following key lemma.
Lemma 4.3. Let ω = ω(n) be any function tending to infinity together with n, and let
ε > 0. For a given node v, suppose that deg− (v, T ) = d ≥ ω log n and that
δ(v) ≥ (1 + ε)
A1 d + A2
cm T
1/m
= (1 + ε)r(v, T ).
Then, with probability 1 − O(n−3 ), for every value of t, T ≤ t ≤ 2T ,
pρ(v)A1 t
tp
3
−
·
d log n.
deg (v, t) − d ·
≤
pρ(v)A1 T
T
Proof. Let ρ = ρ(v). Our goal is to estimate deg− (v, t) − d · (t/T )pρA1 . We will show
that the upper bound holds; the lower bound can be obtained by using an analogous,
symmetric, argument. Note that the assumption on δ(v) implies that S(v, T ) ⊆ R(v).
We use the following stopping time
!
)
(
pρA1
p
3
t
t
+
·
d log n
∨ (t = 2T + 1)) .
T0 = min t ≥ T : deg− (v, t) > d ·
T
pρA1 T
Note that if T0 = 2T + 1, then the in-degree of v remained bounded as required during
the entire time interval T ≤ t ≤ 2T . Hence, in order to prove the bound, we need to
show that with probability 1 − O(n−3 ) we have T0 = 2T + 1.
Suppose that T0 ≤ 2T . Note that for t ≥ T up to and including time-step T0 − 1,
the random variable deg− (v, t) is (deterministically) bounded from above. Moreover, it
is straightforward to see that this upper bound, together with the assumption on δ(v)
(note the additional multiplicative (1 + ε) term), implies that S(v, t) ⊆ R(v) for all
T ≤ t ≤ T0 − 1. Hence, the number of new neighbours accumulated during this phase
16
of theP
process, deg− (v, T0 ) − d, can be (stochastically) bounded from above by the sum
T0 −1
X = t=T
Xt of independent indicator random variables Xt , where
pρA1
√
3
t
+ pρA
·
d
log
n
+ A2
A1 d Tt
T
1
.
P(Xt = 1) = pρ
t
Clearly, since pρA1 < 1,
EX =
TX
0 −1
EXt
t=T
= pρA1 dT −pρA1
TX
0 −1
!
tpρA1 −1
t=T
T0
T
pρA1
T0
T
pρA1
= d
= d
+
T0 − T p
3 d log n + O(1)
T
pρA1
T
T0 − T p
3 d log n + O(1)
−d
+
T
T
T0 − T p
3 d log n + O(1).
−d+
T
Since T0 ≤ 2T , the in-degree of v at time T0 failed the desired condition, which implies
that
X ≥ deg− (v, T0 ) − d
!
pρA1
p
T0
T0
3
·
d log n − d
≥
d·
+
T
pρA1 T
3
T0 p
T0 − T p
= EX +
·
d log n −
3 d log n + O(1)
pρA1 T
T
p
≥ EX + 3 d log n,
using again that it is assumed that pρA1 < 1. It follows from the Chernoff bound (7)
that
p
p
P(|X − EX| ≥ 3 d log n) ≤ 2 exp − ε d log n ,
√
where ε = 3 d log n/EX. The maximum value of EX corresponds to T0 = 2T and so
EX ≤ d
2T
T
pρA1
= d(2pρA1
≤ d.
2T − T p
2 d log n + O(1)
T
− 1)(1 + o(1))
−d+
p
So ε ≥ 3 d−1 log n. Therefore, the probability that T0 ≤ 2T is at most 2 exp(−3 log n)
and the proof is finished.
17
Now, with Lemma 4.3 in hand we can get Theorem 2.1.
Proof of Theorem 2.1. Let ω = ω(n) be a function going to infinity with n, and let
ε > 0. Let v be a vertex with final degree k ≥ ω log n, let ρ = ρ(v), and assume that δ =
δ(v) ≥ (1+ε)r(v, n). Let T̂v be the first time that the in-degree of v exceeds (ω/2) log n,
and t̂v be the first time t that the radius of influence r(v, t) ≤ δ(1 + ε/2)−(1−pρA1 )/m .
Moreover, let T = max{Tˆv , t̂v } be the first time that the two events hold. Finally, let
d = deg− (v, T ). We obtain from Lemma 4.3 that, with probability 1 − O(n−3 ),
pρA1 pρA1 3 p −1
t
3 p −1
t
−
1−
d log n ≤ deg (v, t) ≤ d
1+
d log n
d
T
pρA1
T
pρA1
for T ≤ t ≤ 2T . It follows that the degree tends to grow but the sphere of influence
tends to shrink between T and 2T , and thus that the conditions of Lemma 4.3 again hold
at time 2T . We can now keep applying the same lemma for times 2T , 4T , 8T , 16T, . . . ,
using the final value as the initial one for the next period, to get the statement for all
values of t from T up to and including time n. Precisely, for 1 ≤ i < imax = blog nc + 1,
pρA1
let di = deg− (v, 2q
(1+εi ),
i T ). Then by Lemma 4.3, we have for i > 1 that di ≤ di−1 2
3
d−1
where εi = pρA
i−1 log n. Since we apply the lemma O(log n) times (for a given
1
vertex v), the following statement holds with probability 1 − o(n−2 ) from time T on:
for any 2i−1 T ≤ t < 2i T , we have that
pρA1 Y
i
t
deg (v, t) ≤ d
(1 + εi ).
T
j=0
−
It remains to make sure that the accumulated multiplicative error term is still only
(1 + o(1)). For that, let us note that
i
i Y
Y
(1 + εi ) =
1+
j=0
j=1
3 p −1 −pρA1 j
d 2
log n
pρA1
i
X
3 p −1
= (1 + o(1)) exp
d log n
2−pρA1 j/2
pρA1
j=1
p
= (1 + o(1)) exp O( d−1 log n)
!
= 1 + o(1),
since d grows faster than log n. A symmetric argument can be used to show a lower
bound for the error term and so the result holds.
It follows that we have the desired behaviour from time T . Precisely, for times
T ≤ t ≤ n, we have that
pρA1
t
−
deg (v, t) = d
(1 + o(1)),
T
18
where d = deg− (v, T ) ≥ deg− (v, T̂v ) ≥ (ω/2) log n. As T = max{T̂v , t̂v }, we need to
consider two cases. Suppose first that T = T̂v . Setting t = n and deg− (v, n) = k, we
obtain that
1/pρA1
1/pρA1
1/pρA1
ω log n
1
d
n = (1 + o(1))
n = (1 + o(1))
Tv .
T = (1 + o(1))
k
2k
2
Therefore, for large enough n, we have that T < Tv ≤ max{tv , Tv }. Suppose then that
T = t̂v . By definition,
r(v, T ) = (1 + o(1))δ(1 + ε/2)−(1−pρA1 )/m
and, since d > ω log n,
r(v, T ) = (1 + o(1))
= (1 + o(1))
A1 d
cm T
1/m
1/m
A1 k
cm T 1−pρA1 npρA1
,
and so
T = (1 + o(1))(1 + ε/2)
A1 k
m
δ cm npρA1
1/(1−pρA1 )
= (1 + o(1))
1 + ε/2
tv .
1+ε
Again, for large enough n, we have that T < tv ≤ max{tv , Tv }. In either case, T <
max{tv , Tv }. As a result, we obtain that, for max{tv , Tv } ≤ t ≤ n,
pρA1
t
−
(1 + o(1)).
deg (v, t) = k
n
Finally, since the statement holds for any vertex v with probability 1 − o(n−2 ), with
probability 1 − o(n−1 ) the statement holds for all vertices. The proof of the theorem is
finished.
Let us note that Theorem 2.1 immediately implies the following two corollaries.
Corollary 4.4. Let ω = ω(n) be any function tending to infinity together with n, and
let ε > 0. The following holds with probability 1 − o(n−1 ). For every node v, and for
every time T so that deg− (v, T ) ≥ ω log n and (1 + ε)r(v, T ) ≤ δ(v), for all times t,
T ≤ t ≤ n,
pρ(v)A1
t
−
−
deg (v, t) = deg (v, T )
(1 + o(1)).
T
Corollary 4.5. Let ω = ω(n) be any function tending to infinity together with n. The
following holds with probability 1 − o(n−1 ). For any node vi born at time i ≥ 1, and
i ≤ t ≤ n we have that
pρmax A1
t
deg(vi , t) ≤ ω log n
.
(8)
i
19
Proof. The statement is trivially true if deg(vi , t) ≤ ω log n so we may assume that t is
such that deg(vi , t) > ω log n. Let T be the first time that the in-degree of v exceeds
(ω/2) log n; clearly, deg− (v, T ) = (1/2 + o(1))ω log n. First assume that (1 + ε)r(v, T ) <
δ(v) for some ε > 0. By Corollary 4.4,
pρmax A1
pρ(v)A1
t
t
−
−
(1 + o(1)) ≤ ω log n
,
deg (v, t) = deg (v, T )
T
i
where the last step follows since we may assume that n large enough so that the (1+o(1))
term is less than 2, and the fact that T > i and ρ ≤ ρmax .
However, even if v is close to the border of the region, that is, (1+o(1))r(v, T ) ≥ δ(v),
this argument applies. Namely, in this case the degree of v is stochastically bounded
above by the degree of a node with the same birth time, but born in a region with
density ρmax , and position far from the border. Therefore, the argument above still
applies.
Proof of Theorem 2.3
Let Zj be the indicator variable of the event {vj ∈ R` }. By definition of the process,
Zj is a Bernouilli variable with expectation q` , and
Pn the variables Zj are independent
for different values of j. Thus, |V (Gn ) ∩ R` | = j=1 Zj is the sum of n independent
binomial random variables. Thus E(|V (Gn )∩R` |) =√q` n, and it follows from Lemma 4.2
that for every `, a.a.s. ||V (Gn ) ∩ R` | − q` n| = O(ω n) = o(n), where ω = ω(n) is any
function tending to infinity together with n. Since the number of regions is assumed to
be a constant independent of n, the desired property holds a.a.s. for all regions.
For the second statement, we examine the number of edges whose endpoints are in
`
the same
region, that is, the number
of edges that do not cross a boundary. We set Mt
to be {(u, v) ∈ E(Gt ) | u, v ∈ R` }. We create the indicator variable Xi,j , such that
(
1, if vi , vj ∈ R` and (vi , vj ) ∈ E(Gn )
Xi,j =
0, otherwise.
Thus,
X
`
|Gt ) = Mt` +
E(Mt+1
E(Xt+1,j |Gt )
vj ∈R` ,j≤t
X
= Mt` +
pρ` |S(vj , t) ∩ R` |
vj ∈R` ,j≤t
X
= Mt` +
pρ` |S(vj , t)| − pρ` Yt` ,
vj ∈R` ,j≤t
where
Yt` =
X
|S(vj , t) \ R` |,
vj ∈R` ,j≤t
20
(9)
Using the definition of S(vj , t), we obtain
`
E(Mt+1
|Gt )
=
Mt`
+
vj ∈R`
= Mt` +
A1 deg− (vj , t) + A2
pρ`
− pρ` Yt`
t
,j≤t
X
pρ` A1 (Mt` + Zt` ) pρ` A2
+
|Vt ∩ R` | − pρ` Yt` ,
t
t
(10)
where
Zt` = |{(u, v) ∈ E(Gt ) | v ∈ R` , u 6∈ R` }|.
Let α = pρmax A1 . By Corollary 4.5, for any function ω = ω(n) tending to infinity
together with n, with probability 1−o(n−1 ), for all vertices vi and for all times i ≤ t ≤ n,
α
t
−
.
deg (vi , t) ≤ ω log n
i
Let G be the event that we have these upper bounds for all vertices vi and for all times
i ≤ t ≤ n.
Assume first that G holds. It follows (deterministically) that for all vertices vi and
for all times i ≤ t ≤ n,
A1 deg− (vi , t) + A2
≤ (ω 2 log n)tα−1 i−α ,
|S(vi , t)| =
t
and thus
r(vi , t) ≤ (c−1/m
ω 2/m log1/m n)t(α−1)/m i−α/m ≤ (ω 3 log1/m n)t(α−1)/m i−α/m ,
m
as usual, assuming that n is large enough. Let Xi be the indicator variable of the event
{S(vi , t) 6⊆ R` }. Using the bound on |S(vi , t)|, we obtain
X
X
Yt` ≤
Xi |S(vi , t)| ≤
Xi (ω 2 log n)tα−1 i−α .
i≤t
i≤t
If Xi = 1 then vi has distance at most (ω 3 log1/m n)t(α−1)/m i−α/m from the boundary of
R` . Since the length of the boundary of R` is at most 4, the boundary strip in which
vj must be located has area at most (4ω 3 log1/m n)t(α−1)/m i−α/m . Thus
E(Xi |G) ≤ (4ω 3 log1/m n)t(α−1)/m i−α/m .
Combining this with the bound on Yt` , we obtain
X
E(Yt` |G) ≤
(4ω 3 log1/m n)t(α−1)/m i−α/m (ω 2 log n)tα−1 i−α
i≤t
1
= (4ω 5 log1+1/m n)t−(1−α)(1+ m )
X
i≤t
5
≤ c(ω log
2+1/m
−β
n)t
21
,
1
i−α(1+ m )
where β = min{ m1 , (1 − α)(1 + m1 )} and c is an appropriate constant independent of t
and n. Note that β > 0, and thus, for t ≥ log4/β n, the expectation of Yt` is o(1) and so
negligible (as ω can be tending to infinity arbitrarily slowly).
Next, consider the number of cross-border edges, Zt` . At time t, an edge from vt
to vi , i < t, which contributes to Zt` can only be created if vt ∈ S(vi , t) \ R` . The
probability that such an edge is created is at most pρmax |S(vi , t) \ R` |. Thus we have
that
X
`
`
`
E(Zt` |Zt−1
) = Zt−1
+
pρmax |S(vi , t) \ R` | = Zt−1
+ pρmax Yt` .
vi ∈R`
Therefore,
E(Zt` |G)
=
t
X
pρmax E(Yτ` |G) ≤ c0 (log4 n)t1−β ,
τ =1
0
for some constant c independent of t and n.
By taking the expectation of (10) and setting m`t = E(Mt` |G), we obtain the following
recurrence for the (conditional) expected value m`t and t ≥ log1+4/β n,
pρ` A1
`
`
+ pρ` q` A2 + o(1).
(11)
mt+1 = mt 1 +
t
To solve this recurrence, we use the following lemma on real sequences, which is
Lemma 3.1 from [3].
Lemma 4.6. If (αt ), (βt ) and (γt ) are real sequences satisfying the relation
βt
αt + γt ,
αt+1 = 1 −
t
and limt→∞ βt = β > 0 and limt→∞ γt = γ, then limt→∞
Using this lemma, we see that limt→∞
solution of the recurrence (11) is:
m`t
t
m`n = (1 + o(1))
=
αt
t
exists and equals
pρ` A2
q,
1−pρ` A1 `
γ
.
1+β
and thus the (first-order)
pρ` A2
q` n.
1 − pρ` A1
(12)
Here we use that pρ` A1 < 1 as given. Note that the o(1) term in the recurrence and
the lower bound on t only affect the (1 + o(1)) term of the solution.
Finally, since P(Ḡ) = o(n−1 ) and (deterministically) Mn` ≤ n2 , we get
E(Mn` ) = P(G)m`n + P(Ḡ)E(Mn` |Ḡ) = (1 + o(1))m`n + o(n) = (1 + o(1))
This completes the proof.
22
pρ` A2
q` n.
1 − pρ` A1
Proof of Theorem 2.4
Fix ω and ε as in the statement of the theorem. Let u and v be nodes of final degrees
deg− (u, n) = k and deg− (v, n) = j such that k ≥ j ≥ ω 2 log n, where both u and v are
located in a region R with density ρ. Let tv and Tv be defined as in (3). Assume that
the conditions of the theorem hold, that is,
A1
m
m
.
(13)
δ(v) ≥ cj and δ(u) ≥ ck, where c = (1 + ε)
cm npρA1 Tv1−pρA1
Since Tv = n (ω log n/j)1/pρA1 ≤ nω −1/pρA1 = o(n), it follows that
1/m
1/m
A1 k + A2
A1 j + A2
and δ(u) ≥ (1 + ε)
.
δ(v) ≥ (1 + ε)
cm n
cm n
Therefore, the conditions of Theorem 2.1 are satisfied. Thus, from time max{Tv , tv }
until the end of the process we have concentration of the degree of node v, as given
1/(1−pρA1 )
A1 j
by (2). Recall that tv = (1 + ε) cm npρA1 δ(v)m
as in (3). Rewriting condition
(13), we obtain that
1/(1−pρA1 )
A1 j
1/(1−pρA1 )
Tv ≥ (1 + ε)
> tv .
cm npρA1 δ(v)m
Thus max{Tv , tv } = Tv and so, in fact, the degree of v is concentrated from time Tv
until time n.
Similarly, let tu and Tu be defined as in (3). The argument above, with j replaced
by k, shows that max{tu , Tu } = Tu . By definition, as k ≥ j, it follows that Tu ≤ Tv .
Hence, by Theorem 2.1, we have concentration of the degrees of both u and v from
time Tv on. Precisely, we have that, with probability (1 − o(n−1 )), for all pairs u and
v satisfying the conditions as stated above, and for all times t, Tv ≤ t ≤ n,
pρA1
pρA1
t
t
−
−
and deg (v, t) = (1 + o(1))j
. (14)
deg (u, t) = (1 + o(1))k
n
n
The rest of the proof proceeds as the proof of Theorem 3.1 in [14].
Case 1. By (14) above and the definition of Tv and Tu , we have that deg− (v, Tv ) =
(1 + o(1))ω log n, and deg− (u, Tv ) = (1 + o(1))(ω log n)(k/j). This implies that the
radius of influence of u satisfies:
1/m !
(ω log n)(k/j)
r(u, Tv ) = Θ
.
Tv
The condition assumed in this case is that
1/m
(ω log n)(k/j)
d(u, v) ≥ ε
.
Tv
23
Thus, at time Tv , both radii r(u, Tv ) and r(v, Tv ) are O(d(u, v)). Moreover, both radii
decrease (in order) from time Tv onwards, whereas d(u, v) is independent of time. Thus
there must exist a constant c (depending on ε but not on n) such that, for all times t,
cTv ≤ t ≤ n, the areas of influence of u and v are disjoint. Then the total number of common neighbours could only be, at most, min{deg− (u, cTv ), deg− (v, cTv )} = O(ω log n).
Case 2. Suppose d satisfies
d(u, v) ≤
A1 k + A2
cm n
1/m
−
A1 j + A2
cm n
1/m
= r(u, n) − r(v, n).
Note that this condition implies (deterministically) that at time n the sphere of influence
of v is contained in the sphere of influence of v. We now see that this situation occurs
in approximate form throughout the process. From (14), we see that, for Tv ≤ t ≤ n,
1/m
deg(v, t) = (1 + o(1)) kj deg(u, t), and thus r(v, t) = (1 + o(1)) kj
r(u, t).
If j ≤ αk for some constant α ∈ (0, 1), then we have that the difference between
1/m
r(v, t) and r(u, t) behaves as (1 + o(1))cr(u, t), where c = kj
≤ α1/m < 1. In
particular, this implies that this difference tends to shrink over time. Thus we have
that, for Tv ≤ t ≤ n,
d(u, v) ≤ r(u, n) − r(v, n) ≤ (1 + o(1))(r(u, t) − r(v, t)).
Therefore, all but a negligible fraction of the sphere of influence of v lies inside the
sphere of influence of u during the whole process.
If k = (1 + o(1))j, so ( kj )1/m = 1 + o(1), the difference between r(v, t) and r(u, t) is
o(r(u, t)) for Tv ≤ t ≤ n, and specifically at time t = n. From the condition of this case,
we know that at time n, S(v, n) is completely contained inside S(u, n). This, combined
with the fact that r(v, t) − r(u, t) = o(r(u, t)) implies that the spheres of influence of u
and v overlap in all but a negligible part during the entire process.
Any vertex w that links to v must lie inside the sphere of influence of v. Since most
of the sphere of influence of v is contained in that of u in both cases mentioned above,
this means that it is likely that w lies inside the sphere of influence of u as well, and
thus has a probability p of also linking to v. Accounting for the small variation in the
size of the spheres of influence, we have that the probability that a neighbour of u,
added between times Tv and n, is also a neighbour of v is (1 + o(1))p. The number
of common neighbours accumulated until time Tv is at most deg(u, Tv ) = O(ω log n),
which is of smaller order than j, the final degree of u.
Therefore, Ecn(u, v, n) = (1 + o(1))pj. Finally, note that the number of common
neighbours is a sum of independent random indicator variables with Bernouilli distribution; each variable corresponds to a situation when a neighbour of v falls into a sphere
of influence of u. It follows that cn(u, v, n) ∈ Bin((1 + o(1))j, p). The concentration
follows from the bound (4.2).
24
Case 3. Suppose that d = d(u, v) satisfies
1/m
(ω log n)(k/j)
−1/m
r(u, n) − r(v, n) < d(u, v) < ε
= (1 + o(1))εA1
r(u, Tv ).
Tv
The analysis for this case is based on the assumption that, at time Tv , the sphere of
influence of v is contained in that of u, and at time n the spheres are disjoint. Thus, in
some narrow time interval around a time t0 between times Tv and n, the spheres become
separated, and after this no more common neighbours can be formed. However, the
conditions of this case do not guarantee that we have this situation: it may be that the
conditions hold, but the sphere of influence of v is not completely contained in that of
u at time Tv , or that the spheres are not completely disjoint at time n. However, the
conditions guarantee that the asymptotic behaviour still applies in this case.
Let t− be the first time instance after Tv when S(u, t) is not completely contained
in S(v, t). Let t+ be the last time when the spheres overlap, or t+ = n if the spheres
overlap at time n. (So Tv ≤ t− ≤ t+ ≤ n). From time Tv until time t− , each neighbour
of u will be a common neighbour of v and u with probability p. From time t+ to n,
no common neighbours can be created. From time t− until time t+ , the probability
that a neighbour of u becomes a neighbour of v is at most p. Thus, p deg− (u, t− ) and
p deg− (u, t+ ) form a lower and an upper bound, respectively, on the expected number
of common neighbours of u and v.
Since the centres of S(u, t) and S(v, t) are at distance d from each other, the definition of t− and t+ translate into the following conditions on the radii of the spheres of
influence:
r(v, t− ) − r(u, t− ) = (1 + o(1))d,
r(v, t+ ) + r(u, t+ ) = (1 + o(1))d.
(The factor (1 + o(1)) is caused by the fact that the spheres of influence increase or
decrease in discrete amounts.)
Using Equation (14) for the degree of u and v from time Tv , and translating this
into conditions on the radius of the sphere of influence, we obtain
r(v, t− ) − r(u, t− )

= (1 + o(1))
 A 1 k


= (1 + o(1))
− pρA1
t
n
cm t−
1/m
+ A2 

A1 −pρA1 − pρA1 −1
kn
(t )
cm
1/m

 A1 j
−
− pρA1
t
n
cm t−
1/m 
+ A2  
 

1/m !
j
1−
.
k
A similar argument shows that
r(v, t+ ) + r(u, t+ ) = (1 + o(1))
A1 −pρA1 + pρA1 −1
kn
(t )
cm
25
1/m
1/m !
j
1+
.
k
Define t0 = (t+ + t− )/2. Then, by the above, we get that t− = t0 (1 − O((j/k)1/m )),
t+ = t0 (1 + O((j/k)1/m )), and
t0 = (1 + o(1))
A1 k −pρA1 −m
d
n
cm
1
1−pρA
1
.
It follows from the discussion from Case 2 that a.a.s. the number of common neighbours of u and v is bounded from below by (1 + o(1))p deg− (u, t− ), and from above by
(1 + o(1))p deg− (v, t+ ). Using our knowledge about the behaviour of the in-degree of u
given by (14), this leads to the following expression:
1/m !!
pρA1
j
t0
1 + o(1) + O
cn(u, v, n) = pj
n
k
pρA1
1−pρA
1/m !!
pρA1
1
A1 −m
j
−
= pjn 1−pρA1
1 + o(1) + O
kd
.
cm
k
This concludes the proof.
References
[1] W. Aiello, A. Bonato, C. Cooper, J. Janssen, and P. Pralat. A spatial web graph
model with local influence regions. Internet Mathematics, 5(1-2):175–196, 2008.
[2] M. Bradonjić, A. Hagberg, and A. Percus. The structure of geographical threshold
graphs. Internet Math., 5:113–140, 2008.
[3] F. Chung and L. Lu. Complex Graphs and Networks. American Mathematical
Society, August 2006.
[4] C. Cooper, A. Frieze, and P. Pralat. Some typical properties of the spatial preferred
attachment model. Internet Mathematics, 10:27–47, 2014.
[5] A. Flaxman, A. Frieze, and J. Vera. A geometric preferential attachment model of
networks. Internet Mathematics, 3(2):187–206, 2006.
[6] A. Flaxman, A. Frieze, and J. Vera. A Geometric Preferential Attachment Model
of Networks II. Internet Mathematics, 4(1):87–112, 2007.
[7] J. Hasslegrave and J. Jordan. Preferential attachment with choice. Random Structures & Algorithms, 2015. To appear.
[8] P. Hoff, A. Raftery, and M. Handcock. Latent space approaches to social network
analysis. Journal of the American Statistical Association, 97:1090–1098, 2001.
26
[9] E. Jacob and P. Mörters.
arXiv:1504.00618, 2015.
Robustness of scale-free spatial networks.
[10] E. Jacob and P. Mörters. Spatial preferential attachment networks: Power laws
and clustering coefficients. Annals of Applied Probability, 25(2):632–662, 2015.
[11] S. Janson, T. Luczak, and A. Ruciński. Random Graphs. Wiley, New York, USA,
2000.
[12] J. Janssen, M. Hurshman, and N. Kalyaniwalla. Model selection for social networks
using graphlets. Internet Mathematics, 8(4):338–363, 2012.
[13] J. Janssen, P. Pralat, and R. Wilson. Asymmetric distribution of nodes in the
spatial preferred attachment model. In Proceedings of the 10th Workshop on Algorithms and Models for the Web Graph (WAW 2013), volume 8305 of Lecture Notes
in Computer Science, pages 1–13. Springer, 2013.
[14] J. Janssen, P. Pralat, and R. Wilson. Geometric graph properties of the spatial
preferred attachment model. Advances in Applied Mathematics, 50(2):243–267,
2013.
[15] G. Jeh and J. Widom. SimRank: a measure of structural-context similarity. In
Knowledge Discovery and Data Mining, pages 538–543, 2002.
[16] J. Jordan. Degree sequences of geometric preferential attachment graphs. Advances
in Applied Probability, 42:319–330, 2010.
[17] J. Jordan. Geometric preferential attachment in non-uniform metric spaces. Electronic Journal of Probability, 18(8):1–15, 2013.
[18] J. Jordan and A. Wade. Phase transitions for random geometric preferential attachment graphs. Advances in Applied Probability, 47(2):565–588, 2015.
[19] M. Kobayakawa, S. Kinjo, M. Hoshi, T. Ohmori, and A. Yamamoto. Fast computation of similarity based on jaccard coefficient for composition-based image
retrieval. In Proceedings of the 10th Pacific Rim Conference on Multimedia: Advances in Multimedia Information Processing, PCM ’09, pages 949–955. SpringerVerlag, 2009.
[20] H. Small. Co-citation in the scientific literature: A new measure of the relationship
between two documents. Journal of the American Society for Information Science,
24(4):265–269, 1973.
27