A relational distance based approach to network evolution
Domenico De Stefano
Network Dynamics and Ties Formation
• Network dynamics consists in developing and validating simulations, formal models or statistical techniques to study network change, evolution, adaptation and decay.
• Network dynamics is the study of a SIMPLE MECHANISM: tie formation.
• Empirical research has identified a host of network patterns:
  - clustering in affective networks (friendship, trust, support)
  - brokerage in instrumental networks (professional advice)
  - core-periphery structures in communities (communication)
  - homophily, balance, etc.
Approaches in Network Dynamics
Two main approaches in network dynamics:
(1) Simulation studies
  Aim: generation of reasonably realistic networks.
  Examples: small-world and scale-free networks (Barabasi, 1999; Watts and Strogatz, 1998; Newman)
  Problems: the constructions are highly dependent on the specific network-generating algorithm employed.
(2) Empirical studies (fitting of data sets)
  Aim: testing of hypotheses on network tie formation.
  Example: actor-based model (Snijders, 1996)
  Problems: related to the estimation of the model parameters.
Aims
• The main goal is to develop a simple mechanism that governs tie formation and accounts for some portion of the evolution of a network G over different time occasions.
• We adopt an intermediate point of view between the simulation and the empirical approach: the proposed mechanism should be supported by empirical observations.
• We propose a baseline network evolution model (dyadic independence) that could explain tie formation in terms of structural properties of the network, in all situations in which actors have incomplete information on the others, except for general network information (e.g. being friends of friends). This could be the case of web social networking or of competitors in open markets…
The main research question
Given an observed set of relations (the current state of our system) and a cumulative distance defined so that it represents the actors' relational proximity, is it possible to infer which connections could appear or disappear at the next time observation?
Outline of the proposed approach
1) Definition of the relational distance used (the so-called Euclidean Commute-Time Distance, ECTD), based on the Laplacian of the network
2) Definition of the assumptions and steps of the proposed procedure, given the observed networks and the distances among actors
3) Simple application of the evolving algorithm
Basic Notations
V is a set of cardinality n whose elements are called actors: {v1, v2, …, vn} or {1, 2, …, i, …, n}.
Let t = 0, 1, 2, …, K index the “discrete” time occasions on which the network relations are measured (i.e. K panel networks) on the same actor set.
Et = {ei1, ei2, …, eim} is the set of the m unordered pairs of elements of V, eh = {vr, vs} (also denoted i→j), observed at time occasion t.
Gt(V, Et) is the network generated by V and by their ties Et at the t-th time occasion.
Let At be the n×n adjacency matrices associated with the networks (with elements aij) and Dt the n×n diagonal degree matrices (i.e. the matrices whose diagonal elements di are the degrees of the i-th actor).
In the following discussion we consider only networks represented by simple unweighted graphs.
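
As a minimal sketch of this notation, the adjacency matrix A and the degree matrix D of a simple unweighted graph can be built from a list of unordered ties. The helper name and the toy edge list are illustrative assumptions, not data from the study.

```python
def adjacency_and_degree(n, edges):
    """Return (A, D) for a simple undirected, unweighted graph on n actors.

    `edges` is an iterable of unordered pairs (i, j) with 0-based actor indices.
    """
    A = [[0] * n for _ in range(n)]
    for i, j in edges:
        if i == j:
            raise ValueError("simple graphs have no self-loops")
        A[i][j] = 1
        A[j][i] = 1  # ties are unordered, so A is symmetric
    D = [[0] * n for _ in range(n)]
    for i in range(n):
        D[i][i] = sum(A[i])  # degree of actor i on the diagonal
    return A, D


# Hypothetical toy network: a triangle (0, 1, 2) plus a pendant actor 3.
A, D = adjacency_and_degree(4, [(0, 1), (0, 2), (1, 2), (0, 3)])
```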
Definition of the relational distance (ECTD) (1)
The procedure to compute similarities (and distances) among actors is based on:
• a Markov chain S on the graph (Gobel & Jagers, 1974)
• the Laplacian (F. Chung, 1997)
We define a random walk on the network by assigning a transition probability to each link.
A random variable s(t) = i indicates that the current state of S is node i.
The random walk is defined by the single-step transition probability of being in state s(t+1) = j given s(t) = i:
\[ P(s(t+1) = j \mid s(t) = i) = \frac{a_{ij}}{a_{i.}} = p_{ij}, \qquad a_{i.} = \sum_{j} a_{ij} \]
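
The transition probabilities above amount to dividing each row of the adjacency matrix by its row sum. A minimal sketch, assuming a toy adjacency matrix that is not taken from the study:

```python
def transition_matrix(A):
    """Row-normalise an adjacency matrix: p_ij = a_ij / a_i. (a_i. is the row sum)."""
    P = []
    for row in A:
        row_sum = sum(row)
        if row_sum == 0:
            raise ValueError("isolated actor: the walk cannot leave this node")
        P.append([a / row_sum for a in row])
    return P


# Hypothetical toy network: a triangle (0, 1, 2) plus a pendant actor 3.
A = [
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
]
P = transition_matrix(A)
```

Each row of P sums to one, so the walk always moves to some neighbour of the current node.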
Definition of the relational distance (ECTD) (2)
Two basic quantities are defined for measuring the time (distance in terms of steps) that a random walk needs to reach some state s(.):
• Average First Passage Time, AFPT (Kemeny & Snell, 1976)
• Average Commute Time, ACT (Klein & Randic, 1993)
It is possible to demonstrate (Fouss et al., 2007) that the elements of the pseudo-inverse L+ of the Laplacian matrix L,
L = D − A → L+ = inv(I + L)
(where I is the identity matrix), are related to the AFPT and to the ACT.
Moreover, the square root of the ACT is a distance in the Euclidean space spanned by the nodes of the graph.
This is our relational distance, the Euclidean Commute-Time Distance (ECTD).
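
One way to compute such a distance is sketched below. It is not necessarily the computation used for the results in these slides: it assumes NumPy's Moore-Penrose pseudo-inverse of L = D − A and the commute-time identity n(i,j) = V_G (l+_ii + l+_jj − 2 l+_ij) given by Fouss et al. (2007), where V_G is the sum of the degrees. The function name and the two-node example are illustrative, and the graph is assumed to be connected.

```python
import numpy as np


def ectd_matrix(A):
    """Euclidean Commute-Time Distances for a connected, undirected graph."""
    A = np.asarray(A, dtype=float)
    degrees = A.sum(axis=1)
    L = np.diag(degrees) - A                 # Laplacian L = D - A
    L_plus = np.linalg.pinv(L)               # Moore-Penrose pseudo-inverse
    volume = degrees.sum()                   # V_G: sum of all degrees
    diag = np.diag(L_plus)
    # Average commute time: n(i, j) = V_G * (l+_ii + l+_jj - 2 * l+_ij)
    commute = volume * (diag[:, None] + diag[None, :] - 2.0 * L_plus)
    commute = np.maximum(commute, 0.0)       # clip tiny negative round-off
    return np.sqrt(commute)                  # ECTD = sqrt(ACT)


# Two actors joined by one tie: the walk needs one step each way, so ACT = 2.
ectd = ectd_matrix([[0, 1], [1, 0]])
```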
Properties of the ECTD distance (3)
• ECTD is a distance (the distance axioms hold) and is Euclidean.
• ECTD(i,j) has this desirable property: it decreases when the number of paths connecting two nodes increases and when the length of these paths decreases.
For example, in this network the blue node is at the same geodesic distance (geodesic = 1) from the green node and from the black node.
In terms of ECTD, the blue and the green are closer (ectd = 21.93) than the blue and the black (ectd = 18.85) because there are many short paths connecting them (i.e. they share more neighbors).
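
The property can be checked on a small hypothetical graph (not the network of the figure): actor 0 is adjacent both to actor 1, with whom it shares a neighbour, and to the pendant actor 3, with whom it shares none. The sketch assumes the Moore-Penrose pseudo-inverse of the Laplacian and the commute-time identity of Fouss et al. (2007).

```python
import numpy as np


def ectd_matrix(A):
    """ECTD via the pseudo-inverse of the Laplacian (connected graph assumed)."""
    A = np.asarray(A, dtype=float)
    degrees = A.sum(axis=1)
    L_plus = np.linalg.pinv(np.diag(degrees) - A)
    diag = np.diag(L_plus)
    commute = degrees.sum() * (diag[:, None] + diag[None, :] - 2.0 * L_plus)
    return np.sqrt(np.maximum(commute, 0.0))


# Hypothetical toy network: triangle (0, 1, 2) plus pendant actor 3 tied to 0.
A = [
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
]
dist = ectd_matrix(A)
d_shared = dist[0, 1]    # geodesic 1, plus a second path 0-2-1
d_pendant = dist[0, 3]   # geodesic 1, no alternative path
```

Both pairs are at geodesic distance 1, yet `d_shared` is smaller than `d_pendant` because of the extra path through actor 2.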
Basic assumptions of the procedure
• We define network evolution in terms of two possible events:
  1) Two unconnected actors involved in relationships with largely the same individuals are likely to activate a relationship.
  2) For two connected actors that share few links with the same individuals, the opposite is likely to happen (i.e. the link between them could cease to exist).
• Only one event happens at time t, which means that each generated network differs from the previous one by just one link, or micro step (Snijders et al., 2009).
Definition of the procedure (1)
• Let G0 be the network observed at time t=0 and suppose other network observations G1, G2, …, GK are available.
• The first step is to specify an inner model θ (a model on the existing nodes of the network) for G0.
• θ is a function that at every time occasion t maps the nodes i (or the ties) to a probability pt(i|θ).
• Under our model θ, the probability distribution can be specified for the connected ties, θC (the set C of pairs (i,j)+ that are connected at the current time, C = E), and for the disconnected ties, θU (the set U of pairs (i,j)− that are disconnected at the current time, U = ¬E):
\[ U \to [0,1]: \quad \Pr((i,j)^{-}) = \frac{\exp[-\mathrm{ectd}((i,j)^{-})]}{\sum_{(i,j)^{-} \in U} \exp[-\mathrm{ectd}((i,j)^{-})]} \]
\[ C \to [0,1]: \quad \Pr((i,j)^{+}) = \frac{\exp[-1/\mathrm{ectd}((i,j)^{+})]}{\sum_{(i,j)^{+} \in C} \exp[-1/\mathrm{ectd}((i,j)^{+})]} \]
• In general, the probability associated with a pair at the next time occasion depends on its observed ECTD: an unconnected pair is more likely to connect the smaller its ECTD, and a connected pair is more likely to disconnect the larger its ECTD.
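
The two probability maps can be sketched as normalised exponential weights over the pairs in U and in C. The function names and the distances below are hypothetical and serve only to show the direction of the effect.

```python
import math


def appearance_probabilities(ectd_unconnected):
    """Pr((i,j)-) proportional to exp(-ectd) over the unconnected pairs U."""
    weights = {pair: math.exp(-d) for pair, d in ectd_unconnected.items()}
    total = sum(weights.values())
    return {pair: w / total for pair, w in weights.items()}


def disappearance_probabilities(ectd_connected):
    """Pr((i,j)+) proportional to exp(-1/ectd) over the connected pairs C."""
    weights = {pair: math.exp(-1.0 / d) for pair, d in ectd_connected.items()}
    total = sum(weights.values())
    return {pair: w / total for pair, w in weights.items()}


# Hypothetical distances, chosen only for illustration.
U = {("a", "b"): 2.0, ("a", "c"): 3.0}
C = {("b", "c"): 1.0, ("c", "d"): 4.0}
p_U = appearance_probabilities(U)
p_C = disappearance_probabilities(C)
```

In this toy case the closer unconnected pair ("a", "b") receives the larger probability of appearing, and the more distant connected pair ("c", "d") the larger probability of disappearing.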
Definition of the procedure (2)
• Other probability distributions can be modeled (e.g. for preferential attachment we set the probability proportional to node degree).
• Let us specify a null model θ0. Generally θ0 is a model in which the probabilities over ties are constant (random graph).
• The evolution procedure will be tested following the FETA approach (Framework for Evolving Topology Analysis) developed by Clegg et al. (2009), which is based on the likelihood ratio of the proposed model against the null model:
\[ c_0 = \left[ \frac{L(C \mid \theta)}{L(C \mid \theta_0)} \right]^{1/t}, \qquad L(C \mid \theta) = \prod_{t=0}^{K} p_t(\mathrm{choice}_t \mid \theta) \]
The statistics of interest C are measured over the time occasions under the proposed model, and the likelihood ratio, normalized by the number of time occasions, is computed.
The statistic c0 measures how much “better” than random the model is (> 1 better than random, < 1 worse).
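
A minimal sketch of the statistic c0, computed in log space to avoid underflow in long products. It assumes that the normalising exponent is one over the number of observed choices; the function name and the probabilities are hypothetical.

```python
import math


def normalised_likelihood_ratio(p_model, p_null):
    """c0 = [L(C|theta) / L(C|theta0)]^(1/t), with t = number of choices (assumed)."""
    if not p_model or len(p_model) != len(p_null):
        raise ValueError("need two non-empty sequences of equal length")
    log_ratio = sum(math.log(p) - math.log(q) for p, q in zip(p_model, p_null))
    return math.exp(log_ratio / len(p_model))


# Hypothetical example: three observed choices; the null model spreads the
# probability evenly over ten candidate ties (0.1 each).
c0 = normalised_likelihood_ratio([0.2, 0.1, 0.4], [0.1, 0.1, 0.1])
```

Here the product of the three ratios is 2 × 1 × 4 = 8, so c0 is its cube root, 2: on average the model assigns twice the null probability to the observed choices.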
The steps of the algorithm (1)
The algorithm starts at time t=0, knowing at least one of the successive networks Gt>0.
1. Divide the pairs of actors of G0 into two subsets: the connected pairs C0 and the unconnected pairs U0. Compute the probabilities according to θC and θU.
2. Measure the ECTD: among the pairs in C0, the connected pair (i,j)(+) with the largest ECTD is the candidate link that could disappear in the following step of the chain, G0 → G1(−).
3. To decide whether this transition occurs, we compute a likelihood ratio between two models, θ and θ0.
θ expresses the influence that the ECTD observed at the current time (among the nodes in U or in C) has on tie formation (the tie observed in the next available network observation, Gt=1).
(H0 states that the distance at t=0 has no effect on tie formation.)
The steps of the algorithm (2)
4. Decide whether the proposed transition should be accepted by computing the likelihood ratio Λ (as in the model of Clegg et al., 2009):
Λ = ρ(G1(−)) / ρ(G0)
• If Λ ≥ 1 we accept the transition G0 → G1(−);
• If Λ < 1 we accept the transition G0 → G1(−) with probability Λ and reject it with probability (1 − Λ).
5. If the transition is accepted, the next candidates, the ECTD and the model must be specified on G1(−); otherwise again on G0.
6. Select all the unconnected pairs of nodes in G1(−) (or in G0) and measure their ECTD; the unconnected pair (i,j)− with the smallest ECTD is the candidate link e+(i,j) that could appear in the following state, Gt → Gt+1(+). Compute Λ and follow the same rules:
Λ = ρ(Gt(+)) / ρ(G0)
7. Iterate this procedure until time T or until the value of Λ for both types of transitions becomes very small.
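
The candidate selection and the acceptance rule of the steps above can be sketched as follows. The function names are illustrative assumptions, the computation of ρ and Λ is left outside the sketch, and the three connected pairs reuse ECTD values reported in the example tables of these slides.

```python
import random


def removal_candidate(ectd_connected):
    """Connected pair with the largest ECTD: the tie that could disappear."""
    return max(ectd_connected, key=ectd_connected.get)


def addition_candidate(ectd_unconnected):
    """Unconnected pair with the smallest ECTD: the tie that could appear."""
    return min(ectd_unconnected, key=ectd_unconnected.get)


def accept(lam, rng=random):
    """Accept if Lambda >= 1, otherwise accept with probability Lambda."""
    return lam >= 1.0 or rng.random() < lam


# ECTD of three connected pairs, as listed in the candidates table.
connected = {("v3", "v22"): 6.953, ("v6", "v15"): 6.947, ("v22", "v24"): 6.940}
candidate = removal_candidate(connected)

# Hypothetical unconnected pairs, for illustration only.
unconnected = {("x", "y"): 5.1, ("x", "z"): 4.3}
new_tie = addition_candidate(unconnected)
```

With these inputs the removal candidate is (v3, v22), the pair singled out in the worked example.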
Some preliminary results
Let us give an example of how the mechanism works.
• In this preliminary application, the kernel function used to obtain the likelihood and ρ() is a simple logistic regression in which the independent variable is the ECTD observed at the current time (starting from t=0) and the dependent variable is the presence of the ties in the next observed network.
• We start from two observed networks.
EXAMPLE DATA: networks among the students of the e-learning class of statistics at the Faculty of Sociology
  - 25 students
  - Ties are friendships in the “virtual room”
  - 2 measurement points
Network t=0: Density = 0.1367, Degree Centralization = 0.119, Transitivity = 0.5128
Network t=1: Density = 0.23, Degree Centralization = 0.297, Transitivity = 0.488
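
For readers who want to reproduce descriptive statistics of this kind, the sketch below computes density and transitivity for an undirected graph, using the usual definitions (ties over possible ties; closed triples over connected triples). The function names and the toy graph are assumptions; the only figures taken from the slides are the 25 students and the density 0.1367 at t=0.

```python
from itertools import combinations


def density(n, edges):
    """Share of the n(n-1)/2 possible undirected ties that are present."""
    return 2 * len(edges) / (n * (n - 1))


def transitivity(n, edges):
    """Closed connected triples divided by all connected triples."""
    neighbours = {i: set() for i in range(n)}
    for i, j in edges:
        neighbours[i].add(j)
        neighbours[j].add(i)
    triples = closed = 0
    for nbrs in neighbours.values():
        for a, b in combinations(sorted(nbrs), 2):
            triples += 1                 # a - centre - b is a connected triple
            if b in neighbours[a]:
                closed += 1              # the triple is closed by the tie a - b
    return closed / triples if triples else 0.0


# Hypothetical toy network: triangle (0, 1, 2) plus pendant actor 3.
toy_edges = [(0, 1), (0, 2), (1, 2), (0, 3)]
toy_density = density(4, toy_edges)
toy_transitivity = transitivity(4, toy_edges)

# With 25 students, a density of 0.1367 corresponds to about 41 ties.
ties_t0 = round(0.1367 * 25 * 24 / 2)
```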
The computed ECTD distance matrices
Excerpts (first rows and columns) of the ECTD matrices computed on the networks at t=0 and t=1:

      v1     v2     v3     v4     v5     v6
v1  0.000  9.644  9.392  8.924  8.111  8.939
v2  9.644  0.000  8.118  8.555  7.796  6.984
v3  9.392  8.118  0.000  8.295  7.500  6.468
v4  8.924  8.555  8.295  0.000  6.835  7.771
v5  8.111  7.796  7.500  6.835  0.000  6.872

      v1     v2     v3     v4     v5     v6
v1  0.000  7.206  8.731  8.093  5.406  8.134
v2  7.206  0.000  8.350  7.910  6.720  7.738
v3  8.731  8.093  0.000  9.309  8.336  9.256
v4  2.680  2.670  9.309  0.000  7.651  8.752
v5  5.406  6.720  8.336  7.651  0.000  7.709
Selection of the candidate nodes in set C
The procedure starts by selecting the candidate nodes whose links may disappear at the successive time occasion (set of connected actors).

Candidates   Obs. ECTD (t=0)   Obs. tie (t=1)
v3,v22       6.953             0
v6,v15       6.947             1
v22,v24      6.940             1
v6,v19       6.718             1
v23,v19      6.208             0
v4,v25       6.190             0
v21,v7       6.163             0

Excerpt of the ECTD matrix:

       v1     v2     v3    …    v25
v1   0.000  7.206  8.731   …  7.531
v2   7.206  0.000  8.350   …  7.374
…      …      …      …     …     …
v22  7.565  6.917  6.953   …  7.721

The value 6.953 (v3, v22) is the largest distance among the set of connected nodes at t0.
If the model relating the ECTD to the presence of disconnected ties at t1 fits then, according to our process, we generate a network Gt=1 in which the link between v3 and v22 does not exist.
Fit the model to accept transition
We fit a simple logistic regression model (assuming dyadic independence, baseline model) on the connected nodes in order to assess the influence of the ECTD on “ties formation”:

    glm(formula = obsTies ~ ectd1, family = binomial(logit))

    Deviance Residuals:
        Min       1Q   Median       3Q      Max
    -1.0651  -0.6987  -0.6576  -0.6062   1.9413

    Coefficients:
                Estimate Std. Error z value Pr(>|z|)
    (Intercept) -0.27000    0.34716  -0.778  0.43672
    ectd1       -0.14609    0.04895  -2.985  0.00284 **
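
To read the fitted coefficients, the sketch below plugs the reported estimates into the logistic function and evaluates the predicted probability of a tie at t=1 for a given ECTD. The function name is an assumption; the coefficients and the distances 6.953 and 6.163 are those reported in these slides.

```python
import math

INTERCEPT = -0.27000   # reported estimate of the intercept
BETA_ECTD = -0.14609   # reported estimate of the ectd1 coefficient


def tie_probability(ectd):
    """Predicted probability of observing the tie at the next time occasion."""
    eta = INTERCEPT + BETA_ECTD * ectd        # linear predictor
    return 1.0 / (1.0 + math.exp(-eta))       # inverse logit


p_v3_v22 = tie_probability(6.953)   # candidate pair with the largest ECTD
p_v21_v7 = tie_probability(6.163)   # candidate pair with the smallest ECTD
```

For the pair (v3, v22) the predicted probability is about 0.22, and it grows as the ECTD shrinks, as expected with a negative coefficient.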
In this simplified scheme we can see that the model fits: the beta is negative (which indicates a negative influence of the distance on the ties at the following time observation).
The log-likelihood ratio is greater than one (about 1.1): LL=8.513088, df=1, p-value=0.003526.
Therefore we can accept the transition for the candidate nodes (3 and 22).
It is also possible to observe that this link is absent in the observed network as well.
This is a change by one micro-step according to our proposed mechanism.
Thus the network G1 will differ from G0 by the absence of this tie.
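
The reported p-value can be recovered from the statistic LL=8.513088 with one degree of freedom, since the upper tail of a chi-square distribution with df=1 has a closed form in terms of the complementary error function. The function name is an assumption.

```python
import math


def chi2_upper_tail_df1(x):
    """P(X > x) for a chi-square variable with 1 degree of freedom.

    If Z is standard normal, X = Z^2, so P(X > x) = P(|Z| > sqrt(x))
    = erfc(sqrt(x / 2)).
    """
    return math.erfc(math.sqrt(x / 2.0))


p_value = chi2_upper_tail_df1(8.513088)
```

The result agrees with the p-value 0.003526 quoted above.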
Selection of the candidate nodes in set U
This process, repeated for each actor in the set of connected actors at the current time, leads to the construction of the following matrix.

Candidates   Obs. ECTD (t=0)   Obs. tie (t=1)
v6,v14       9.77              0
v6,v15       6.947             1
v22,v24      6.940             1
v6,v19       6.718             1
v23,v19      6.208             0
v4,v25       6.190             0
v21,v7       6.163             0

In this case the ECTD is measured on the new network G1:

      v1    v2    …    v6
v1   0.00  7.21   …   8.13
v2   7.21  0.00   …   7.74
…     …     …     …    …
v14  6.92  6.49   …   9.77

The value 9.77 (v6, v14) is the smallest distance among the set of unconnected nodes.
In this case the model does not fit the data: LL=8.513088, df=1, p-value=0.003526.
The likelihood ratio is equal to 0.78 (the following configuration at time t, in which the link between v6 and v14 appears, will be accepted with probability 0.78).
End of the procedure
• After k=43 iterations the model has converged, because none of the candidates can enter the following generated network.
• The threshold to stop the computation is fixed at Λ = 0.25 because of computing-time problems.
• A test is needed to define the distance between the generated graphs and the last observed network (perhaps on the adjacency matrices?).
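
One simple candidate for such a comparison, offered only as an illustration and not as the measure adopted here, is the Hamming distance between adjacency matrices, i.e. the number of dyads whose tie status differs. The function name and the toy matrices are assumptions.

```python
def hamming_distance(A, B):
    """Number of dyads whose tie status differs between two undirected graphs."""
    n = len(A)
    if n != len(B):
        raise ValueError("both graphs must be defined on the same actor set")
    # Count each unordered dyad once by scanning the upper triangle.
    return sum(
        1
        for i in range(n)
        for j in range(i + 1, n)
        if A[i][j] != B[i][j]
    )


# Hypothetical generated and observed networks on three actors.
generated = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
observed = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]
distance = hamming_distance(generated, observed)
```

Since each micro step changes exactly one tie, this count is also the minimum number of micro steps separating the two networks.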
Conclusions
• This is a very basic procedure consisting of repeated transitions via an ad hoc generation mechanism empirically guided by the data.
• The main purpose is to produce a family of generated networks that share with the observed ones the tie formation mechanism (in the case of this baseline process, at a very basic level).
• A problem arises with degeneracy: in some cases the full network is rapidly reached, after fewer than 100 iterations. Degeneracy to the empty graph does not happen too often.
• Another problem is the definition of the function that assesses the influence of the relational distance on tie formation. Here we adopt a simple logistic regression, but several other methods are possible.
Future works
1) Adopt a more realistic (and complex) model to support this mechanism: for example, use the actor-based model framework or an ERGM instead of logistic regression
2) Compare this mechanism to other “attachment rules” (e.g. small world, preferential attachment and so on) in the framework of the FETA model (Clegg et al., 2009)
3) Work on a version for directed networks and adopt a more efficient estimation procedure, such as MCMCMLE, for the estimation of the parameters
References
• Barabasi A.L., Albert R. (1999). Emergence of Scaling in Random Networks. Science, 286, 509–512.
• Bollobas, B. (2001). Modern Graph Theory, Springer-Verlag, New York, second edition.
• Chung, F. (1997). Spectral Graph Theory, AMS, New York.
• Clegg, R.G., Landa, R., Harder, U., Rio, M. (2009). Evaluating and optimising models of network growth, http://arxiv.org/abs/0904.0785.
• Doreian, P., Stokman, F.N. (Eds.) (1997). Evolution of Social Networks, Gordon and Breach Publishers, Amsterdam.
• Fouss, F., Pirotte, A., Renders, J., Saerens, M. (2007). Random-Walk Computation of Similarities between Nodes of a Graph with Application to Collaborative Recommendation, IEEE Transactions on Knowledge and Data Engineering 19.
• Haddadi, H., Iannaccone, G., Moore, A., Mortier, R., Rio, M. (2008). Network topologies: Inference, modelling and generation, IEEE Comm. Surveys and Tutorials, vol. 10, no. 2.
• Holland, P.W., Leinhardt, S. (1977). A dynamic model for social networks. Journal of Mathematical Sociology, 5, 5–20.
• Holland, P.W., Leinhardt, S. (1975). Local structure in social networks, in: Sociological Methodology 1976, D.R. Heise (Ed.), Jossey-Bass, San Francisco, 1–45.
• Kemeny, J.G., Snell, J.L. (1976). Finite Markov Chains. Springer-Verlag.
• Jagers, A.A., Gobel F. (1974). Random walks on graphs. Stochastic Processes and Their Applications, 2, 311–336.
• Snijders, T.A.B. (1996). Stochastic actor-oriented dynamic network analysis, Journal of Mathematical Sociology 21, 149–172.
• Snijders, T.A.B. (2005). Models for longitudinal network data, in: Models and Methods in Social Network Analysis, Carrington, P.J., Scott, J., Wasserman, S. (Eds.), Cambridge University Press, New York, 215–247.
• Snijders T.A.B., van de Bunt G.G., Steglich, C.E.G. (2009). Introduction to stochastic actor-based models for network dynamics, Social Networks, in press.
• Spencer J. (2000). The Strange Logic of Random Graphs. Springer, New York.
• Wasserman S. (1980). Analyzing Social Networks As Stochastic Processes, Journal of the American Statistical Association, 75.
• Wasserman, S., Faust, K. (1994). Social Network Analysis: Methods and Applications. Cambridge University Press, New York and Cambridge.
• Watts, D.J., Strogatz S.H. (1998). Collective Dynamics of ‘Small World’ Networks, Nature, 393(6684), 440–442.
Probability map
\[ C \to [0,1]: \quad \Pr((i,j)^{+}) = \frac{\exp[-1/\mathrm{ectd}((i,j)^{+})]}{\sum_{(i,j)^{+} \in C} \exp[-1/\mathrm{ectd}((i,j)^{+})]} \]
\[ U \to [0,1]: \quad \Pr((i,j)^{-}) = \frac{\exp[-\mathrm{ectd}((i,j)^{-})]}{\sum_{(i,j)^{-} \in U} \exp[-\mathrm{ectd}((i,j)^{-})]} \]
Pr(U) = {
So basically our network is composed of two disjoint sets of node pairs (dyads, i.e. ties): the set C of the pairs that are connected at the current time, (i,j)+, and the set U of the pairs that are disconnected at the current time, (i,j)−; C = E and U = ¬E.
Probability map
Statistical inference for logistic regression models typically
involves large sample approximations based on the
unconditional likelihood. Unfortunately, these asymptotic
approximations are unreliable when sample sizes are small or
the data are sparse or skewed. In these situations, exact
inference is reliable no matter how small or imbalanced the
data set. Exact inference is based on the conditional
distribution of the sufficient statistics for the parameters of
interest given the observed values for the remaining
sufficient statistics. Current implementations of exact logistic
regression have difficulty handling large data sets with
conditional distributions whose support is too large to be
represented in memory. ELRM extends an existing algorithm
for (approximate) exact inference to accommodate large data
sets.
Example data (Knecht, 2003/04)
Networks among first grade pupils at Dutch secondary schools:
• 125 school classes
• 4 measurement points
• various network & individual measures
For simplicity, here we consider the friendship network in one class (class03e) in just two waves (measurement points t=0 and t=1).