A relational distance based approach to network evolution
Domenico De Stefano

Network Dynamics and Ties Formation
• Network dynamics consists in developing and validating simulations, formal models or statistical techniques to study network change, evolution, adaptation and decay.
• Network dynamics is the study of a SIMPLE MECHANISM: tie formation.
• Empirical research has identified a host of network patterns:
  - clustering in affective networks (friendship, trust, support)
  - brokerage in instrumental networks (professional advice)
  - core-periphery structures in communities (communication)
  - homophily, balance, etc.

Approaches in Network Dynamics
Two main approaches in network dynamics:
(1) Simulation studies
• Aim: generation of reasonably realistic networks.
• Examples: small-world and scale-free networks (Barabasi, 1999; Watts and Strogatz, 1998; Newman).
• Problems: the constructions are highly dependent on the specific network-generating algorithm employed.
(2) Empirical studies (fitting of data sets)
• Aim: testing hypotheses on network tie formation.
• Example: actor-based model (Snijders, 1996).
• Problems: related to the estimation of the model parameters.

Aims
• The main goal is to develop a simple mechanism that governs tie formation and accounts for some portion of the evolution of a network G over different time occasions.
• We adopt an intermediate point of view between the simulation and the empirical approach: the proposed mechanism should be supported by empirical observations.
• We propose a baseline network evolution model (dyadic independence) that could explain tie formation in terms of structural properties of the network, in all situations in which actors have incomplete information on the others, except general network information (e.g. being friends of friends).
  This could be the case, for example, of web social networking or of competitors in open markets.

The main research question
Given an observed set of relations (the current state of our system) and a cumulative distance defined so that it represents the actors' relational proximity, is it possible to infer which connections could appear or disappear at the next time observation?

Outline of the proposed approach
1) Definition of the relational distance used (the so-called Euclidean Commute-Time Distance, ECTD), based on the Laplacian of the network
2) Definition of the assumptions and steps of the proposed procedure, given the observed networks and the distances among actors
3) Simple application of the evolving algorithm

Basic Notations
• V is a set of cardinality n whose elements are called actors: $\{v_1, v_2, \dots, v_n\}$ or $\{1, 2, \dots, i, \dots, n\}$.
• Let $t = 0, 1, 2, \dots, K$ index the discrete time occasions at which the network relations are measured on the same actor set (i.e. panel networks).
• $E_t$ is the set of the m observed unordered pairs of elements of V, $e_h = \{v_r, v_s\}$ (also denoted $i \to j$): $E_t = \{e_{i1}, e_{i2}, \dots, e_{im}\}$ at time occasion t.
• $G_t(V, E_t)$ is the network generated by V and by the ties $E_t$ at the t-th time occasion.
• Let $A_t$ be the $n \times n$ adjacency matrices associated with the networks (with elements $a_{ij}$) and $D_t$ the $n \times n$ diagonal degree matrices (i.e. the matrices whose diagonal elements $d_i$ are the degrees of the i-th actor).
• In the following discussion we consider only networks represented by simple unweighted graphs.

Definition of the relational distance (ECTD) (1)
The procedure to compute similarities (and distances) among actors is based on:
• a Markov chain S on the graph (Gobel & Jagers, 1974)
• the LAPLACIAN (F. Chung, 1997)
Let us define a random walk on the network by assigning a transition probability to each link. A random variable s(t) = i indicates that the current state of S is node i.
The random walk is defined by the single-step transition probability of being in state s(t+1) = j given s(t) = i:

$$P(s(t+1) = j \mid s(t) = i) = \frac{a_{ij}}{a_{i.}} = p_{ij}, \qquad a_{i.} = \sum_{j} a_{ij}$$

Definition of the relational distance (ECTD) (2)
Two basic quantities measure the time (distance in terms of steps) that a random walk needs to reach some state s(.):
• Average First Passage Time, AFPT (Kemeny & Snell, 1976)
• Average Commute Time, ACT (Klein & Randic, 1993)
It can be shown (Fouss et al., 2007) that the AFPT and the ACT are related to the elements of the pseudo-inverse $L^+$ of the Laplacian matrix $L = D - A$, which can be computed as

$$L^+ = \left(L - \frac{e e^T}{n}\right)^{-1} + \frac{e e^T}{n}$$

where e is the n-vector of ones.
Moreover, the square root of the ACT is a distance in the Euclidean space spanned by the nodes of the graph. This is our relational distance: the Euclidean Commute-Time Distance (ECTD).

Properties of the ECTD distance (3)
• ECTD is a distance (the distance axioms hold) and it is Euclidean.
• ECTD(i,j) has this desirable property: it decreases when the number of paths connecting two nodes increases and when the length of these paths decreases.
• For example, in this network the blue node is at the same geodesic distance from the green node and from the black node (geodesic = 1). In terms of ECTD, blue and green are closer (ECTD = 18.85) than blue and black (ECTD = 21.93) because many short paths connect them (i.e. they share more neighbors).

Basic assumptions of the procedure
• We define network evolution in terms of two possible events:
  1) Two unconnected actors involved in relationships with largely the same individuals are likely to activate a relationship.
  2) Two connected actors that share few links with the same individuals are likely to do the opposite (i.e. the link between them may cease to exist).
• Only one event happens at time t, which means that each generated network differs from the previous one by just one link, or micro step (Snijders et al., 2009).

Definition of the procedure (1)
• Let $G_0$ be the network observed at time t = 0 and suppose other network observations $G_1, G_2, \dots, G_K$ are available.
• The first step is to specify an inner model θ (a model on the existing nodes of the network) for $G_0$.
• θ is a function that at every time occasion t maps the nodes i (or the ties) to a probability $p_t(i \mid \theta)$.
• Under our model θ, the probability distribution can be specified separately for the connected pairs $(i,j)^+$ (model $\theta_C$ on the set C of pairs connected at the current time, C = E) and for the disconnected pairs $(i,j)^-$ (model $\theta_U$ on the set U of pairs disconnected at the current time, U = ¬E):

$$U \to [0,1]: \quad \Pr((i,j)^-) = \frac{\exp[-\mathrm{ectd}((i,j)^-)]}{\sum_{(i,j)^- \in U} \exp[-\mathrm{ectd}((i,j)^-)]}$$

$$C \to [0,1]: \quad \Pr((i,j)^+) = \frac{\exp[-1/\mathrm{ectd}((i,j)^+)]}{\sum_{(i,j)^+ \in C} \exp[-1/\mathrm{ectd}((i,j)^+)]}$$

• In general, the probability associated with a pair at the next time occasion is driven by the observed ECTD: it decreases with the distance for unconnected pairs (tie creation) and increases with the distance for connected pairs (tie dissolution).

Definition of the procedure (2)
• Other probability distributions can be modeled (e.g. for preferential attachment we set the probability proportional to node degree).
• Let us specify a null model $\theta_0$.
  Generally, $\theta_0$ is a model in which the probabilities over ties are constant (random graph).
• The evolution procedure will be tested following the FETA approach (Framework for Evolving Topology Analysis) developed by Clegg et al. (2009), which is based on the likelihood ratio of the proposed model against the null model:

$$c_0 = \left[\frac{L(C \mid \theta)}{L(C \mid \theta_0)}\right]^{1/t}, \qquad L(C \mid \theta) = \prod_{t=0}^{K} p_t(\mathrm{choice}_t \mid \theta)$$

• The statistics of interest C are measured over the time occasions under the proposed model, and the likelihood ratio, normalized by the number of time occasions, is computed.
• The statistic $c_0$ measures how much "better" than random the model is (> 1 better than random, < 1 worse).

The steps of the algorithm (1)
The algorithm starts at time t = 0, knowing at least one of the successive networks $G_{t>0}$.
1. Divide $G_0$ into two subsets: the connected pairs $C_0$ and the unconnected pairs $U_0$. Compute the probabilities according to $\theta_C$ and $\theta_U$.
2. Measure the ECTD; among the pairs in $C_0$, the connected pair $(i,j)^+$ with the largest ECTD is the candidate link that could disappear in the following step of the chain, $G_0 \to G_1^{(-)}$.
3. To decide whether this transition will occur, we compute a likelihood ratio between two models θ and $\theta_0$. θ expresses the influence that the ECTD observed at the current time (among the pairs in U or in C) has on tie formation (the tie observed in the next available network observation $G_{t=1}$); H0 states that the distance at t = 0 has no effect on tie formation.

The steps of the algorithm (2)
4. Decide whether the proposed transition should be accepted by computing the likelihood ratio Λ (as in the model of Clegg, 2009): $\Lambda = \rho(G_1^{(-)}) / \rho(G_0)$
   • If Λ ≥ 1, we accept the transition $G_0 \to G_1^{(-)}$.
   • If Λ < 1, we accept the transition $G_0 \to G_1^{(-)}$ with probability Λ and reject it with probability (1 − Λ).
5. If the transition is accepted, the next candidates, the ECTD and the model must be specified on $G_1^{(-)}$; otherwise again on $G_0$.
6.
   Select all the unconnected pairs in $G_1^{(-)}$ (or in $G_0$) and measure their ECTD; the unconnected pair $(i,j)^-$ with the smallest ECTD is the candidate link $e^+(i,j)$ that could appear in the following state, $G_t \to G_{t+1}^{(+)}$. Compute Λ and follow the same rules: $\Lambda = \rho(G_t^{(+)}) / \rho(G_0)$
7. Iterate this procedure until time T or until the value of Λ for both types of transitions becomes very small.

Some preliminary results
• Let us give an example of how the mechanism works.
• In this preliminary application, the kernel function used to obtain the likelihood and ρ() is a simple logistic regression in which the independent variable is the ECTD observed at the current time (starting from t = 0) and the dependent variable is the observed ties in the next observed network.
• We start from two observed networks.

EXAMPLE DATA: networks among the students of the e-learning class of statistics at the Faculty of Sociology
  - 25 students
  - Ties are friendships in the "virtual room"
  - 2 measurement points

                          Network t=0   Network t=1
  Density                 0.1367        0.23
  Degree centralization   0.119         0.297
  Transitivity            0.5128        0.488

The computed ECTD distance matrices

ECTD of net at t=0 (excerpt)
        v1      v2      v3      v4      v5      v6
  v1    0.000   7.206   8.731   8.093   5.406   8.134
  v2    7.206   0.000   8.350   7.910   6.720   7.738
  v3    8.731   8.093   0.000   9.309   8.336   9.256
  v4    2.680   2.670   9.309   0.000   7.651   8.752
  v5    5.406   6.720   8.336   7.651   0.000   7.709

ECTD of net at t=1 (excerpt)
        v1      v2      v3      v4      v5      v6
  v1    0.000   9.644   9.392   8.924   8.111   8.939
  v2    9.644   0.000   8.118   8.555   7.796   6.984
  v3    9.392   8.118   0.000   8.295   7.500   6.468
  v4    8.924   8.555   8.295   0.000   6.835   7.771
  v5    8.111   7.796   7.500   6.835   0.000   6.872

Selection of the Candidate nodes in set C
The procedure starts by selecting the candidate pairs whose links may disappear at the next time occasion (set of connected actors).

  Candidates   Obs. ECTD (t=0)   Obs. tie (t=1)
  v3,v22       6.953             0
  v6,v15       6.947             1
  v22,v24      6.940             1
  v6,v19       6.718             1
  v23,v19      6.208             0
  v4,v25       6.190             0
  v21,v7       6.163             0

         v1      v2      v3      …   v25
  v1     0.000   7.206   8.731   …   7.531
  v2     7.206   0.000   8.350   …   7.374
  …      …       …       …       …   …
  v22    7.565   6.917   6.953   …   7.721

The value 6.953 (v3, v22) is the largest distance among the connected pairs at t = 0. If the model relating the ECTD to the absence of ties at t = 1 fits, then according to our process we generate a network $G_{t=1}$ in which the link between v3 and v22 does not exist.

Fit the model to accept transition
We fit a simple logistic regression model (assuming dyadic independence, baseline model) on the connected pairs in order to assess the influence of the distance on "ties formation":

    glm(formula = obsTies ~ ectd1, family = binomial(logit))

    Deviance Residuals:
        Min       1Q   Median       3Q      Max
    -1.0651  -0.6987  -0.6576  -0.6062   1.9413

    Coefficients:
                Estimate Std. Error z value Pr(>|z|)
    (Intercept) -0.27000    0.34716  -0.778  0.43672
    ectd1       -0.14609    0.04895  -2.985  0.00284 **

• In this simplified scheme the model fits and the beta is negative (which indicates a negative influence of the distance on the ties at the following time observation).
• The log-likelihood ratio is greater than one (about 1.1): LL = 8.513088, df = 1, p-value = 0.003526.
• Therefore we accept the transition for the candidate pair (v3, v22). Note that this link is also absent in the observed network.
• This is a micro-step change according to our proposed mechanism: the network $G_1$ will differ from $G_0$ by the absence of this tie.

Selection of the Candidate nodes in set U
This process, repeated for each pair in the set of connected actors at the current time, leads to the construction of the following matrix.

  Candidates   Obs. ECTD (t=0)   Obs. tie (t=1)
  v6,v14       9.77              0
  v6,v15       6.947             1
  v22,v24      6.940             1
  v6,v19       6.718             1
  v23,v19      6.208             0
  v4,v25       6.190             0
  v21,v7       6.163             0

In this case the ECTD is measured on the new network $G_1$:

         v1     v2     …   v6
  v1     0.00   7.21   …   8.13
  v2     7.21   0.00   …   7.74
  …      …      …      …   …
  v14    6.92   6.49   …   9.77

The value 9.77 (v6, v14) is the smallest distance among the unconnected pairs. In this case the model does not fit the data (LL = 8.513088, df = 1, p-value = 0.003526). The likelihood ratio is equal to 0.78, so the following configuration at time t, in which the link between v6 and v14 appears, is accepted with probability 0.78.

End of the procedure
• After k = 43 iterations the procedure has converged because none of the candidates can enter the next generated network.
• The threshold to stop the computation is fixed at Λ = 0.25 because of computing-time problems.
• A test is needed to define the distance between the generated graphs and the last observed network (perhaps on the adjacency matrices?).

Conclusions
• This is a very basic procedure consisting of repeated transitions via an ad hoc generation mechanism empirically guided by the data.
• The main purpose is to produce a family of generated networks that share the tie formation mechanism with the observed ones (in the case of this baseline process, at a very basic level).
• A problem arises with degeneracy: in some cases the full network is reached rapidly, after fewer than 100 iterations. Degeneracy to the empty graph does not happen too often.
• Another problem is the definition of the function that assesses the influence of the relational distance on tie formation. Here we adopt a simple logistic regression, but several other methods are possible.

Future works
1) Adopt a more realistic (and complex) model to support this mechanism: use, for example, the actor-based model framework or an ERGM instead of logistic regression
2) Compare this mechanism to other "attachment rules" (e.g.
   small world, preferential attachment and so on) in the framework of the FETA model (Clegg et al., 2009)
3) Work on a version for directed networks and adopt a more efficient estimation procedure, such as MCMCMLE, for the estimation of the parameters

References
• Barabasi, A.L., Albert, R. (1999). Emergence of Scaling in Random Networks. Science, 286, 509–512.
• Bollobas, B. (2001). Modern Graph Theory, second edition. Springer-Verlag, New York.
• Chung, F. (1997). Spectral Graph Theory. AMS, New York.
• Clegg, R.G., Landa, R., Harder, U., Rio, M. (2009). Evaluating and optimising models of network growth. http://arxiv.org/abs/0904.0785.
• Doreian, P., Stokman, F.N. (Eds.) (1997). Evolution of Social Networks. Gordon and Breach Publishers, Amsterdam.
• Fouss, F., Pirotte, A., Renders, J., Saerens, M. (2007). Random-Walk Computation of Similarities between Nodes of a Graph with Application to Collaborative Recommendation. IEEE Transactions on Knowledge and Data Engineering, 19.
• Haddadi, H., Iannaccone, G., Moore, A., Mortier, R., Rio, M. (2008). Network topologies: Inference, modelling and generation. IEEE Comm. Surveys and Tutorials, 10(2).
• Holland, P.W., Leinhardt, S. (1977). A dynamic model for social networks. Journal of Mathematical Sociology, 5, 5–20.
• Holland, P.W., Leinhardt, S. (1975). Local structure in social networks, in: Sociological Methodology 1976, D.R. Heise (Ed.), Jossey-Bass, San Francisco, 1–45.
• Kemeny, J.G., Snell, J.L. (1976). Finite Markov Chains. Springer-Verlag.
• Jagers, A.A., Gobel, F. (1974). Random walks on graphs. Stochastic Processes and Their Applications, 2, 311–336.
• Snijders, T.A.B. (1996). Stochastic actor-oriented dynamic network analysis. Journal of Mathematical Sociology, 21, 149–172.
• Snijders, T.A.B. (2005). Models for longitudinal network data, in: Models and Methods in Social Network Analysis, Carrington, P.J., Scott, J., Wasserman, S. (Eds.), Cambridge University Press, New York, 215–247.
• Snijders, T.A.B., van de Bunt, G.G., Steglich, C.E.G. (2009). Introduction to stochastic actor-based models for network dynamics. Social Networks, in press.
• Spencer, J. (2000). The Strange Logic of Random Graphs. Springer, New York.
• Wasserman, S. (1980). Analyzing Social Networks As Stochastic Processes. Journal of the American Statistical Association, 75.
• Wasserman, S., Faust, K. (1994). Social Network Analysis: Methods and Applications. Cambridge University Press, New York and Cambridge.
• Watts, D.J., Strogatz, S.H. (1998). Collective Dynamics of 'Small World' Networks. Nature, 393(6684), 440–442.

Probability map

$$C \to [0,1]: \quad \Pr((i,j)^+) = \frac{\exp[-1/\mathrm{ectd}((i,j)^+)]}{\sum_{(i,j)^+ \in C} \exp[-1/\mathrm{ectd}((i,j)^+)]}$$

$$U \to [0,1]: \quad \Pr((i,j)^-) = \frac{\exp[-\mathrm{ectd}((i,j)^-)]}{\sum_{(i,j)^- \in U} \exp[-\mathrm{ectd}((i,j)^-)]}$$

Probability map
So basically our network is composed of two disjoint sets of node pairs (dyads):
• the set C of the pairs that are connected at the current time, $(i,j)^+$;
• the set U of the pairs that are disconnected at the current time, $(i,j)^-$;
with C = E and U = ¬E.

Probability map
Statistical inference for logistic regression models typically involves large-sample approximations based on the unconditional likelihood. Unfortunately, these asymptotic approximations are unreliable when sample sizes are small or the data are sparse or skewed. In these situations, exact inference is reliable no matter how small or imbalanced the data set. Exact inference is based on the conditional distribution of the sufficient statistics for the parameters of interest, given the observed values of the remaining sufficient statistics. Current implementations of exact logistic regression have difficulty handling large data sets with conditional distributions whose support is too large to be represented in memory. ELRM extends an existing algorithm for (approximate) exact inference to accommodate large data sets.
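The probability map over C and U defined in the slides above can be sketched in a few lines of Python. This is only an illustrative sketch, not the author's implementation: the function name `probability_map` is hypothetical, the three connected pairs and their ECTD values are taken from the candidate table of the example, and the unconnected pairs with their distances are invented for illustration.

```python
import math

def probability_map(ectd_by_pair, connected):
    """Normalise exp(-1/ectd) over C (connected) or exp(-ectd) over U (unconnected)."""
    if connected:
        # Connected pairs: the weight grows with the distance (dissolution candidates).
        weights = {pair: math.exp(-1.0 / d) for pair, d in ectd_by_pair.items()}
    else:
        # Unconnected pairs: the weight shrinks with the distance (creation candidates).
        weights = {pair: math.exp(-d) for pair, d in ectd_by_pair.items()}
    total = sum(weights.values())
    return {pair: w / total for pair, w in weights.items()}

# Connected candidates with the ECTD values reported in the example table.
connected_pairs = {("v3", "v22"): 6.953, ("v6", "v15"): 6.947, ("v22", "v24"): 6.940}
# Hypothetical unconnected pairs and distances, for illustration only.
unconnected_pairs = {("a", "b"): 5.0, ("a", "c"): 7.0, ("b", "d"): 9.0}

p_c = probability_map(connected_pairs, connected=True)
p_u = probability_map(unconnected_pairs, connected=False)

print(max(p_c, key=p_c.get))  # most distant connected pair
print(max(p_u, key=p_u.get))  # closest unconnected pair
```

With these maps, the most probable pair in C is the most distant connected pair and the most probable pair in U is the closest unconnected pair, which matches the candidate selection rule used in steps 2 and 6 of the algorithm.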
Example data (Knecht, 2003/04)
Networks among first grade pupils at Dutch secondary schools:
• 125 school classes
• 4 measurement points
• various network and individual measures
For simplicity, here we consider friendship in one class (class03e) in just two waves (measurement points t = 0 and t = 1).
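As a closing sketch of the relational distance used throughout the deck, the ECTD can be computed from an adjacency matrix via the pseudo-inverse of the Laplacian. The sketch below assumes the standard relations from Fouss et al. (2007): $L^+ = (L - ee^T/n)^{-1} + ee^T/n$ and average commute time $n(i,j) = V_G\,(l^+_{ii} + l^+_{jj} - 2\,l^+_{ij})$, with $V_G$ the sum of the degrees. The helper names (`invert`, `ectd_matrix`) and the five-node toy graph are hypothetical and are not the networks analysed in the slides.

```python
import math

def invert(matrix):
    """Gauss-Jordan inversion with partial pivoting (adequate for small dense matrices)."""
    n = len(matrix)
    aug = [list(map(float, row)) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular matrix: the graph must be connected")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [x / scale for x in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0.0:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def ectd_matrix(adj):
    """Euclidean Commute-Time Distance between all node pairs of a connected graph."""
    n = len(adj)
    degrees = [sum(row) for row in adj]
    volume = sum(degrees)  # V_G, the sum of all degrees
    # Build L - ee^T/n with L = D - A, invert it, then add ee^T/n back to get L+.
    shifted = [[(degrees[i] if i == j else 0.0) - adj[i][j] - 1.0 / n
                for j in range(n)] for i in range(n)]
    inverse = invert(shifted)
    l_plus = [[inverse[i][j] + 1.0 / n for j in range(n)] for i in range(n)]
    # ECTD is the square root of the average commute time.
    return [[math.sqrt(max(0.0, volume * (l_plus[i][i] + l_plus[j][j] - 2.0 * l_plus[i][j])))
             for j in range(n)] for i in range(n)]

# Toy graph (hypothetical): node 0 is adjacent to both node 1 and node 2,
# but 0 and 1 also share the neighbours 3 and 4, while 2 is a pendant node.
edges = [(0, 1), (0, 2), (0, 3), (0, 4), (1, 3), (1, 4)]
adj = [[0] * 5 for _ in range(5)]
for i, j in edges:
    adj[i][j] = adj[j][i] = 1

dist = ectd_matrix(adj)
print(round(dist[0][1], 3))  # about 2.449 (sqrt of 6)
print(round(dist[0][2], 3))  # about 3.464 (sqrt of 12)
```

Nodes 1 and 2 are both at geodesic distance 1 from node 0, yet the ECTD between 0 and 1 is smaller because several short paths connect them, which is the property stated on the "Properties of the ECTD distance" slide.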