Is a sampled network a good enough descriptor for epidemic predictions? Missing links and appropriate choice of representation

Jenny Lennartsson (a), Annie Jonsson (a), Nina Håkansson (a,b) and Uno Wennergren (b,*)

(a) Systems Biology Research Centre, Skövde University, Box 408, 541 28, Skövde, Sweden
(b) IFM Theory and Modeling, Linköping University, 581 83 Linköping, Sweden
(*) Corresponding author. IFM Theory and Modeling, Linköping University, 581 83 Linköping, Sweden. Phone: +46 13 281666. Fax: +46 13 281399
E-mail address: [email protected]

ABSTRACT

Lack of complete data sets can be a limitation in network analysis. Here, we studied how link density affects the properties of disease transmission networks. Networks with weighted links were used to run scenarios assuming distance dependent probabilities of disease transmission, which were subsequently compared with scenarios where probabilities of disease transmission were randomly drawn (i.e. non-distance dependence). In both types of scenarios, two link sampling methods were tested, one based on distance dependence and the other on a random approach. This allowed us to study how link density influences the spread of disease in networks generated using different link sampling methods and transmission scenarios. We conclude that, under the assumption of distance dependence of both link sampling and disease transmission, predictions about the extent of an epidemic can be drawn from a network, even at a link density that is low, albeit higher than in most empirical studies. In reality, neither sampling procedures nor disease transmissions fit distance dependence perfectly. Our results show how this enforces an even higher level of link density in sampled networks to achieve reasonable predictions for disease transmission.

Keywords: epidemic modelling, link density, link sampling, network analysis, disease transmission

1.
INTRODUCTION

During recent years, there has been growing interest in and use of network analysis in epidemiology. A network consists of interacting units, here denoted nodes, and these units are connected through relationships we call links. Examples of nodes are individual animals or animal holdings, and links can be visits or animal transports. The pattern of links between the nodes gives rise to networks with contact structures that differ depending on both the number of links and how the links are organized. Here, we classify networks into three categories: (i) the complete network (Wasserman and Faust, 1994), which includes all theoretically possible links (figure 1a); (ii) the real-world network, which comprises all realizations of links during a specified time period (figure 1b); (iii) the sampled network, which encompasses the links measured during the sample period (figure 1c). In addition, to estimate the link structure of networks, probabilities of occurrence or disease transmission can be measured per time unit of individual links, in which case the network is referred to as being weighted (Barrat et al., 2004). In contrast to classical epidemiological models such as SI, SIR, and SEIR, network models relax the assumption of homogeneous mixing (mass-action type of assumptions), because not all nodes are linked to all other nodes, or the links are weighted, for example depending on distance.

The sampled network can be estimated through sample surveys, literature studies, or contact tracing, or by use of databases (e.g. national databases for animal movements). The estimation is cumbersome, and it might be expected that estimated networks will lack some links and even some nodes (Clauset et al., 2008). It is the real-world network that is of interest to register, but we have to use the sampled network to represent it.
Hence, there is a need to evaluate the effect of missing links in order to assess or possibly reduce errors when networks are applied. The current study focused on how the number of links in sampled networks affects predictions of the size of epidemics. We simulated spread of disease in networks with different link densities and different scenarios that mimicked sampling procedures.

A real-world weighted contact network consists of all contacts that occur during a specific time period, where the link weights are estimated, as probabilities, from the contact frequencies. Another time period constitutes yet another event along with its specific contacts, which may very well lead to another network with a different set of links. Thus, the question arises as to whether the properties of the two real networks will differ. Furthermore, it must be asked whether the properties of the sampled networks for those two events will differ. Is it possible that a property of the first sampled network (e.g. the spread of disease) can be valid as an approximation of the spread of disease during a second event? Obviously, a time period that is too short will result in a poor approximation of any of the two events. On the other hand, a measured period with a very large time frame will yield a nearly complete network that is almost a perfect approximation. Beyond reality is the infinite sampling procedure that results in a complete network in which all links exist with specified probabilities. Somewhere in between the short time period and the excessive sampling, there is a sampled network that has enough links and sufficiently estimated probabilities to generate an adequate approximation.

In the present study, we concentrated on disease transmission networks with weighted links, because it might be expected that the probability of contact can be higher for some links than for others.
We ran scenarios with the assumption of distance dependent probabilities and compared the results with scenarios based on randomly drawn probabilities. The distance dependence was tested for disease transmission and link sampling, both separately and in combination. In a worst-case scenario, there would be a mismatch when using distance dependent transmission probabilities together with random link sampling. Hence, we also studied how the necessary number of measured links depended on the mismatch between a real-world network and the sampling procedure. We chose to focus our investigation on networks in veterinary medicine, specifically the spread of infectious diseases between animal holdings, because the use of network analysis is increasing in that field (Barthélemy et al., 2005; Ortiz-Pelaez et al., 2006).

In veterinary medicine, network analysis and modelling can be employed to predict the spread of disease and epidemic size, and also to examine the effects of various intervention methods, such as vaccination, standstill, and stamping out. An example of this is a study performed by Corner et al. (2003), which was aimed at examining a network of wild brushtail possums with regard to transmission of the pathogen Mycobacterium bovis and social contacts between the animals. In another investigation, Kiss et al. (2006) analysed networks of sheep movements within Great Britain and found that, during an epidemic, the most efficient strategy is to concentrate control interventions on highly connected nodes. Despite the increased use of network analysis in epidemiology, there are shortcomings related to missing links and how to represent a structure when only a single sample is available.
Collected network data are often incomplete (Christley et al., 2005; Clauset et al., 2008; Eames et al., 2009; Guimerà and Sales-Pardo, 2009; Heath et al., 2008; Ortiz-Pelaez et al., 2006); for instance, there can be missing animal movements or unknown locations of herds in databases. Accordingly, Perkins et al. (2009) demonstrated that network structures are only approximations of contacts and that it is almost impossible to identify all contacts when collecting data.

Properties such as spread of disease can vary depending on the structure of the network (Keeling, 2005; Kiss et al., 2006; Newman et al., 2001; Shirley and Rushton, 2005). Thus, due to the relationship between disease transmission and network structure, results based on networks with missing links may be misleading. In practice, this means that there will be problems with missing data, which will lead to lost links in the representation of a network. The links are lost due to errors that occur during the sampling period or as a consequence of the finite length of the sampling period. Guimerà and Sales-Pardo (2009) introduced a method to use a single measure of a network, called a sampled network, to generate a more correct representation, i.e. an approximation of the real-world network. Those researchers focused on networks that were reduced during sampling, and by measuring and classifying the structure of the sampled network, they were able to identify either missing or spurious links. By comparison, our study was more general in nature and handled the relationship between link density and estimates of properties such as spread of disease and specific network measures. Our results can be combined with the findings of Guimerà and Sales-Pardo (2009) by stressing when to expect missing links.

When conducting a survey to achieve a sampled network, it is important to consider the time window of the sampling period. For example, Kao et al. (2007) studied the relationship between the network of UK livestock movement and disease dynamics on different time scales. This was achieved by simulating transmission of two diseases, scrapie and foot-and-mouth disease, which differ greatly regarding the time scales of the incubation and infectious periods. Kao and colleagues concluded that, in order for network analysis to be a valuable tool in epidemiological modelling, it is important to consider the time scale as well as the potentially infectious contacts. In another study, Robinson et al. (2007) investigated animal movement networks evolving over time in Great Britain, and their findings point out the importance of temporal scale. With increasing length of the time period under consideration, the networks became progressively more connected, and in that way fuelled the spread of disease. Those authors also found a seasonal pattern with a peak in spring and August. Thus, depending on the question to be examined or when comparing different networks, it is important to choose the appropriate temporal scale (Vernon and Keeling, 2009). In the current study, we discuss this time window problem in relation to the difficulty of achieving a sufficient number of links during a selected period.

2. METHODS

2.1 The model

Figure 2 illustrates the process of network generation and simulation in our study. The first step involved placement of animal holdings in the landscape, and the second the link-forming procedures, which in this case were related to empirical sampling. The third step comprised simulation runs of disease transmission. The network-generating algorithm, simulations, and calculations were implemented and run in MATLAB (version R2009a).

2.1.1 Landscape of animal holdings

The number of animal holdings was set to 500, and these entities were randomly placed in a landscape of size 34 x 34 (see figure 2). The holding density was chosen according to actual farm density in southern Sweden. Each animal holding was considered to be a node, which implies that individual animals were not modelled.

2.1.2 Link sampling and link density

The animal holdings were connected either by distance dependence between those entities (Dl; eq. 1; Håkansson et al., 2010; Lindström et al., 2008) or completely at random (Rl).

$$P(l_{ij}) = K \exp\left(-\left(\frac{d_{ij}}{a}\right)^{b}\right) \qquad (1)$$

In the equation, P(l_ij) is the probability that a link is formed between nodes i and j, and d_ij is the Euclidean distance between holdings i and j. Parameters a and b are determined by the kurtosis, κ, and the standard deviation, σ (see Lindström et al., 2008). The constant K normalized the distribution such that the probabilities of all possible links summed to one. For distance dependent link sampling, Dl, we used a kurtosis value of 10/3, meaning an exponential distribution, and a standard deviation of one. The links were sampled randomly and successively from this probability distribution (eq. 1), until the desired link density was achieved. Since stochasticity was included in the method, it was also possible to sample links between holdings that were more distant from each other, even at a low link density. For random distribution of links, Rl, the links were sampled one at a time, with the same probability for all links. To avoid edge effects, periodic boundaries were used (Lindström et al., 2008) along the edges of the 34 x 34 landscape.

Link density, D, represents the actual connections, L, in a network as a proportion of all theoretically possible links in that network (Wasserman and Faust, 1994, eq. 2).

$$D = \frac{2L}{n(n-1)} \qquad (2)$$

In our study, n represented the number of animal holdings in the network. In the simulations, we varied link density between 0.001 and 1.0, and a density of 1.0 indicated a complete network (figure 1a) that included all theoretical connections. Inasmuch as the link density of the networks was set when generating the networks, the mean link degree was also given from the start (table 1).

2.1.3 Disease transmission

As in the link sampling process, we assumed two different processes for the transmission probabilities of a disease, one distance dependent, Dt, and the other completely random, Rt. The two processes could represent two diseases with different behaviours. Transmission rates were determined using the same processes as applied in the link sampling (see §2.1.2). Hence, Dt was set by equation 1 and the same parameter values as Dl, and the transmission probabilities of Rt were arbitrarily set to 0.01.

2.1.4 Model scenarios

Combining two link sampling processes and two disease transmission processes yielded four different scenarios that we designated DlDt, DlRt, RlDt and RlRt (figure 2), and these can be described as follows. The RlRt scenario is an example of a mass action mixing model (Keeling, 2005) that assumes that all links have the same probability of transmitting disease, combined with a matching random procedure for link sampling. Matching in this context is considered in the sense of process but not necessarily with respect to the occurrence of events, i.e. two different realizations of the randomization from the same process. The DlDt scenario, which comprises linking and transmission probabilities for each link, is a distance dependent scenario in which the process that gives the probability of measuring a link matches the process that gives the probability of transmission.
Considering the combinations in the remaining two scenarios, DlRt and RlDt, the link sampling procedure does not match the actual process that generates probability of transmission. For example, in RlDt, transmission is distance dependent and yet the link sampling procedure is random and hence expected to be ineffective. Because link sampling in this case is random and thus independent of distance, some of the first connections detected will have low probabilities of transmission, while some that have high probability of transmission will not be detected within the sampling time frame.

2.1.5 Simulation model

To simulate disease transmission in the sampled networks, we used a general and very simple epidemiological model, where the holdings could be in either of two phases: susceptible (S) or infectious (I) (eq. 3).

$$\frac{dS}{dt} = -\lambda S I, \qquad \frac{dI}{dt} = \lambda S I \qquad (3)$$

Parameter λ in the equation is the probability of disease transmission from an infective holding through a link to a susceptible holding, and the variables S(t) and I(t) are the number of holdings in the susceptible and the infected phase, respectively, at time t. We did not incorporate incubation time, and hence animal holdings in contact with an infected holding were already able to infect other holdings during the next time step. Furthermore, a recovery phase was not included in the model, and thus an infected holding remained in the infectious phase during the remaining simulation time. Undirected links were used, and the disease could thus be transmitted in both directions along the links. Disease transmission could occur only between animal holdings that were connected by a link. It should be noted that the probability of a link in the sampled network was according to Rl or Dl, whereas the probability of transmission was according to Rt or Dt.
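
The link sampling and transmission steps described in sections 2.1.1 to 2.1.5 can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' MATLAB implementation. The function names, the kernel scale `a` and shape `b`, and the way the per-link transmission probability is supplied (`trans_prob`) are assumptions made for the example; in particular, the text does not state how a and b follow from the kurtosis and standard deviation, nor how the Dt probabilities were scaled, so both are left as inputs.

```python
import math
import random


def periodic_distance(p, q, size):
    """Euclidean distance on a square landscape whose edges wrap around."""
    dx = abs(p[0] - q[0])
    dy = abs(p[1] - q[1])
    return math.hypot(min(dx, size - dx), min(dy, size - dy))


def links_for_density(n, density):
    """Number of undirected links L that gives D = 2L / (n(n-1))."""
    return round(density * n * (n - 1) / 2)


def sample_links(points, size, n_links, a=None, b=1.0, rng=random):
    """Draw n_links distinct undirected links between the given points.

    a=None gives uniform sampling (the Rl case). Otherwise each candidate
    link is weighted by exp(-(d/a)**b) (the Dl case); a and b are
    hypothetical inputs here, not the paper's fitted values.
    """
    n = len(points)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    if a is None:
        return rng.sample(pairs, n_links)
    # Exponential-clock trick: keeping the n_links smallest Exp(weight)
    # draws is equivalent to drawing links successively, without
    # replacement, with probability proportional to weight.
    keyed = []
    for i, j in pairs:
        d = periodic_distance(points[i], points[j], size)
        weight = max(math.exp(-((d / a) ** b)), 1e-300)  # guard against underflow
        keyed.append((rng.expovariate(weight), (i, j)))
    keyed.sort()
    return [pair for _, pair in keyed[:n_links]]


def simulate_si(n, links, trans_prob, seed_node, steps, rng=random):
    """Discrete-time stochastic SI spread along undirected links.

    trans_prob(i, j) is the per-step probability that infected node i
    infects susceptible neighbour j. Nodes infected in one step can infect
    others from the next step on; there is no recovery.
    Returns the number of infected nodes after each step.
    """
    neighbours = {i: [] for i in range(n)}
    for i, j in links:
        neighbours[i].append(j)
        neighbours[j].append(i)
    infected = {seed_node}
    counts = []
    for _ in range(steps):
        newly = set()
        for i in infected:
            for j in neighbours[i]:
                if j not in infected and rng.random() < trans_prob(i, j):
                    newly.add(j)
        infected |= newly
        counts.append(len(infected))
    return counts


if __name__ == "__main__":
    rng = random.Random(0)
    n, size = 100, 34.0  # smaller than the paper's 500 holdings, for speed
    holdings = [(rng.uniform(0, size), rng.uniform(0, size)) for _ in range(n)]
    links = sample_links(holdings, size, links_for_density(n, 0.04), a=1.0, rng=rng)
    curve = simulate_si(n, links, lambda i, j: 0.01, rng.randrange(n), 300, rng)
    print(curve[-1], "holdings infected after 300 steps")
```

Passing `a=None` to `sample_links` gives the random sampling case, so the four scenarios differ only in that argument and in the `trans_prob` function handed to `simulate_si`.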

2.2 Simulation runs

For each link density presented in table 1, simulations were run separately for all four scenarios illustrated in figure 2. One hundred different networks were generated for each of the two link sampling processes, Dl and Rl, and each link density (figure 2). Specifically, for each density and link sampling procedure, 10 replicates of randomly distributed holdings were created, and, for each of these landscapes of holdings, 10 replicates of networks were made by using one of the two sampling processes (see §2.1.2). For each of these sampled networks, 10 simulations were performed per transmission process, Dt and Rt, by initiating the spread from a randomly chosen animal holding. In all, 1000 simulations were run per scenario and link density. The simulation period was set to 300 time steps, and the number of infected animal holdings was calculated for each time step.

2.3 Analysis

To compare the different scenarios and prediction powers determined by link density, we analysed the extent of the spread of disease as the mean number of infected holdings per time step, and also the mean number of time steps elapsed until a specified proportion of holdings was infected (here 10%, 50% and 90%). To characterize the networks and to ascertain how a change in link density would affect the structure and function of the networks, we used the following network measures: degree assortativity, clustering coefficient and fragmentation index.

Degree assortativity (Newman, 2002) measures the extent to which nodes of equal or unequal degree are connected. Values range from minus one to one. A value near one indicates that a larger proportion of holdings with equal degree are linked to each other. Assortativity near minus one corresponds to a network where holdings with a different degree have a higher probability of being connected.
A value of zero implies that the connections between holdings are not dependent on node degree.

The clustering coefficient (Watts and Strogatz, 1998) for a holding is the number of links that exist between neighbours of that holding divided by all possible links between the neighbours. Here, we used the average clustering coefficient for the whole network; this measure ranges between zero and one, where one indicates that the network is highly clustered.

The fragmentation index (Borgatti, 2003; Webb, 2005) measures to what extent a network is disconnected. This index ranges from zero to one; a low value indicates that the network is highly connected, and a high value means that the network is highly fragmented.

3. RESULTS

The results show that, for the scenario with distance dependent link sampling and disease transmission (DlDt), a link density of around 0.04 gave the same number of infected animal holdings as it did for networks with a larger proportion of connections (figure 3). Under the assumptions of our model, these findings suggest that such low proportions of links in the network were sufficient to examine the extent of the disease transmission. The scenario comprising random link sampling and distance dependent disease transmission (RlDt) required a higher link density until a limit was reached where additional links had no influence. For the scenarios involving random transmission (DlRt and RlRt), the number of infected animal holdings increased with increasing link density, and no limit was reached.

Since the spread of disease is stochastic, we also studied the variation in different realizations. We found variation between realizations in both cases, that is, incomplete networks compared with complete networks with link density of 1.0. Hence, it is important to assess both the expected and the measured variation.
Figure 4 shows the median values of the 1000 replicates of the simulations of the DlDt scenario plotted with the first and third quartiles on each side of the median curve. For a link density as low as 0.001 (figure 4a), the median was one for the whole time period, because only in some cases was the disease transmitted to other holdings. When link density increased to 0.01 (figure 4b), the difference between the first and third quartiles also increased. The difference was small at the beginning of the simulation time when few holdings were infected but increased over the time period. Moreover, when link density increased further, up to 0.02 (figure 4c) and 0.03 (figure 4d), the difference between the first and third quartiles decreased. For link densities of 0.04 and higher (figures 4e and 4f), the shape of the curves and the distances between them were almost the same, which implies that measured and expected stochastic processes generate equal variation between realizations. Of course, in the last part of the simulation time, when almost all holdings are infected, the variation between realizations decreases towards zero.

The time until a given proportion of the holdings were infected differed depending on the link sampling scenario and the disease transmission scenario (figure 5). The random disease transmission scenarios (DlRt and RlRt) required almost the same length of time to reach a given proportion of infected animal holdings. In addition, spread in these scenarios occurred at a much faster rate than in the distance dependent disease transmission scenarios (DlDt and RlDt) (figures 3 and 5). Considering all the scenarios, the slowest transmission rate was found for RlDt, i.e. random link sampling and distance dependent transmission.

Figure 6 shows a comparison of the four scenarios with regard to the number of infected holdings at a given link density.
At low link densities, all methods gave different results. When link density increased, the two distance dependent disease transmission scenarios (DlDt and RlDt) approached each other, as did the two random disease transmission scenarios (DlRt and RlRt). It can be seen that the higher the link density, the greater the similarity between the results for the different distance dependent disease transmission scenarios. As mentioned above, disease spread was much faster with random transmission than with distance dependent transmission.

The average assortativity for the networks depended on the link creation method that was used (figure 7a). Distance dependent link creation led to higher values of assortativity compared to random link creation. As expected, the networks produced by random linking had assortativity close to zero at all link densities.

The average clustering coefficient for all networks increased with increasing link density (figure 7b). The clustering coefficients for the networks generated by distance dependent link sampling were higher than the values for the networks made by random link creation. When link density was increased, the values for random link sampling approached those for distance dependent link sampling. The networks generated by the random link sampling gave clustering coefficients that were equal to the link density in question. Of course, the clustering coefficient was one for all networks when the link density was one, and all animal holdings were connected to each other.

For both link sampling scenarios, the fragmentation index for the networks was close to one when link density was 0.001 (table 2). When we increased link density to 0.01, the fragmentation index decreased dramatically. In both link sampling scenarios, the index reached zero when link density was 0.03 or higher.

4.
DISCUSSION

Our aim was to study the effects of using a disease transmission network with missing links to predict the spread of disease. We investigated whether it is possible to predict anything about the size of such dissemination using only a proportion of all theoretically possible links. According to the results, a link density of 0.04 gave the same mean number of infected animal holdings as a higher link density when spread of disease was simulated in a scenario in which both the probability of identifying a link and disease transmission were distance dependent (the DlDt scenario). Also, the variation between different realizations of disease spread converged to expected variation at link density 0.04. When considering distance dependent disease transmission and random link sampling (RlDt), as expected, the numbers of infected animal holdings reached the same level as in the DlDt scenario. However, most of the links were needed to attain that level, since more of the less probable (longer distance) links are included when using random sampling than with distance dependent sampling. For random disease transmission (scenarios DlRt and RlRt), the number of infected holdings increased with increased link density, which implies that a much higher link density is required to reach relevant approximations of spread of disease. The discussion below addresses the implications of our results in relation to sampling procedures and the effects of using networks with missing links.

Studies using empirical data have shown that only a small fraction of all possible connections in a network actually occurs (Webb, 2006; Eames et al., 2009). When sampling data, it is almost impossible to trace all connections between nodes, even if the number is small, and this often leads to incomplete data sets.
Therefore, it is important to consider link density, or mean link degree, when modelling networks. If simulations in a scenario DlDt network with a link density of 0.04 or higher are compared with simulations in a complete network, both will result in the same mean number of infected holdings and the same variation in that mean. This implies that a link density of 0.04 is sufficient and further sampling is unnecessary. Another important issue to consider when using empirically sampled networks is the time window for the sampling period. Using an “incorrect” time window can lead to missing links or unnecessary sampling. The period chosen has an impact on how complete the network will be: a longer time window can result in a more connected network compared to one that is based on a very short time window. The lengths of the time windows used in different studies have varied. For example, Kiss et al. (2006) chose a four-week time scale in their investigation of sheep movements in Great Britain, and Robinson and Christley (2007) used periods of 10 weeks to analyse animal transports. Such studies may indeed provide a very good description of a network during the actual time period under consideration, but that information may not be suitable for making predictions.

Combining our results with the time frame of a study can help emphasize the problem and focus on the number of links that are actually measured. By definition, a shorter period will result in fewer measured links, but the question is whether such a sample can suffice to test for spread of disease over a period that is longer than the one that is actually measured. Obviously, in the DlDt scenario, a link density of 0.04 is a guarantee for correct measurement of disease transmission during any given time period.
However, at a link density below 0.02, our results indicate that measuring disease dissemination will be erroneous even during short periods. By comparison, a density of 0.03 may hold until a period comprising 50–100 time steps is reached, i.e. when the 0.03 curve diverges from the curves for higher densities (see figure 3) and there is a large overlap in variation between realizations (figure 4). These conclusions are true only when considering a perfect link sampling procedure such as in the DlDt scenario. On the other hand, if the sampling procedure is not that perfect, even more links must be included. Our RlDt scenario represents the opposite extreme, i.e. a completely random link sampling procedure that is in no way related to the probability of contacts. In such a case, almost all links have to be sampled, even to make estimations during short time periods. In real life, a link sampling procedure falls somewhere in between these two extremes. Considering any time period or sampling procedure, it is not advisable to base analysis on link densities below 0.02. Of course, this conclusion applies to our setup: how we modelled the spread of disease, what distance dependence we used (eq. 1), and the number and spatial configuration of the holdings. Still, our results show that to achieve reliable measurements, it may be necessary to include a higher link density than expected. Link density is a measure that very much depends on the node density; the more nodes, the lower the link density that will suffice. Yet, link density is a relevant measure when determining how large a proportion of all links is needed to get a fair estimate of disease spread. On the other hand, mean degree is a more general measure, and our study shows that at least a mean degree of 10 links is required.
Our method includes periodic boundaries, which eliminate boundary effects, and hence our results apply to larger, up to infinite, sets of nodes, provided that the configuration of nodes (animal holdings) is represented by our setup of 500 randomly distributed nodes. Hence, a mean degree of 10 links should hold for larger sets of randomly distributed animal holdings, although link density will consequently decrease. Furthermore, our methodology can be applied to any specific system, other spatial configurations, latency periods, etc., to assess the necessary level of link density. The link density can be achieved by making a single measurement over a sufficiently long period of time or by conducting repetitive sampling over shorter periods.

Empirical investigations of networks have shown that link densities are often very small, i.e. merely a few parts per thousand or a few per cent of the total number of theoretical connections in the networks. An example of this is a study by Ortiz-Pelaez et al. (2006), which was conducted to analyse animal movements during the initial phase of an epidemic of foot-and-mouth disease that occurred in Great Britain in 2001. The network in that investigation had a mean link degree of 1.22, which corresponds to a link density that is as low as about 0.0019. A low link density was also found in the Swedish animal transport network (Nöremark et al., submitted). It is important to remember that the measured contacts in an empirical network are simply subsets of realizations of all possible contacts, which means that the set of links in such a network is in fact a subset of the links that have been realized at the time of data collection. Actually, there are probabilities for a huge number of additional connections, but they are not even realized during the chosen time period of a network study.
For instance, when a link density of 0.01 was used in our investigation, all theoretical connections were possible but only 1% of them were realized, and those may have differed between the replicates. When modelling virtual networks, it is also important to consider the link density and mean degree. Kiss et al. (2005) performed epidemiological modelling using virtual networks with mean degrees varying between 5 and 20. In their results, there is an indication that a mean degree of 15-20 was enough to estimate final epidemic size. Besides the investigation by Kiss et al. (2005), few other network studies have focused on link densities and missing links. When comparing either different theoretical studies or theoretical results with empirical investigations, one has to use relevant measures. In general, our study indicates that a mean link degree of at least 10 links is required and that empirical studies do have too few links in their estimates. The results of Kiss et al. (2005) also support our findings.

Diseases will be spread faster by a network with randomly distributed links than by clustered networks (Kiss et al., 2005; Watts and Strogatz, 1998). We generated such random networks in our study when we applied random transmission probabilities (RlRt and DlRt), and RlRt represents the fully random scenario with rapid spread of disease. For a scenario such as DlRt, with random transmissions and distance dependent link sampling, the transmission rate is slightly slower at any given link density. The link sampling procedure of DlRt erroneously assumes distance-dependent contact, and yet the contact structure is random. In a case like this, the rate of the real network, i.e. with random transmission probabilities, is higher than the rate of the sampled network, since the link sampling procedure will miss some important long distance links.
Hence, in this mismatch between sampling and transmission probabilities, an even higher link density is necessary when sampling to reach the correct levels of spread of disease (compare RlRt and DlRt). Lindström et al. (2009) have shown that the spatial kernel explaining the distance dependence of contacts between holdings due to transport is a mix of distance independence, i.e. mass action mixing, and distance dependence. In our setup, the mass action mixing is represented by Rt, and, once again, the reality is somewhere in between these two extremes, the DlRt and the DlDt scenario. Consequently, our results for the RlRt and DlRt scenarios imply that the link density levels of 0.03 and 0.04 that were found can be expected to be too low, since the mass action component in contact structures creates even higher demands on link density.

It is recognized that random networks have a low level of clustering compared to other kinds of networks, such as small-world networks (Shirley and Rushton, 2005; Watts and Strogatz, 1998). We measured the clustering coefficient for each of our networks, and, as expected, found lower values for those generated by random sampling than for those generated by distance dependent link sampling. The degree of fragmentation of a network influenced the extent to which diseases could spread between the holdings. The fragmentation index is a measure of the extent of disconnection of networks, and, in our study, only the networks with link density below 0.03 resulted in disconnection. Link densities of 0.03 or higher gave rise to connected graphs, indicating that it is possible for a disease to spread between all animal holdings in these networks. Since we know that a link density of 0.03 corresponds to a mean link degree of almost 7.5, the values of the fragmentation index seem reasonable.
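As an illustration of how such a fragmentation check might be computed for a sampled network, the following sketch finds the connected components with a union-find structure and then evaluates a component-based fragmentation index. It assumes the index is the share of node pairs that cannot reach each other, F = 1 - Σ s_k(s_k - 1) / (N(N - 1)) with s_k the component sizes, in the spirit of Borgatti (2003). The exact definition behind table 2 is not restated here, and the function names and the edge-list input format are hypothetical.

```python
from collections import Counter


def component_sizes(n_nodes, edges):
    """Sizes of the connected components of an undirected graph (union-find)."""
    parent = list(range(n_nodes))

    def find(i):
        # Follow parent pointers to the root, halving the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b
    return list(Counter(find(i) for i in range(n_nodes)).values())


def fragmentation_index(n_nodes, edges):
    """Assumed definition: share of node pairs with no path between them.

    0.0 means a connected graph; 1.0 means every node is isolated.
    """
    if n_nodes < 2:
        return 0.0
    reachable = sum(s * (s - 1) for s in component_sizes(n_nodes, edges))
    return 1.0 - reachable / (n_nodes * (n_nodes - 1))


if __name__ == "__main__":
    # Four holdings split into two linked pairs: 8 of 12 ordered pairs are cut off.
    print(fragmentation_index(4, [(0, 1), (2, 3)]))
```

A value of 0 for one sampled network says only that this particular realization is connected; another realization at the same link density may give a different result.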
It is plausible that a disconnection in a network would reduce the spread of the disease immensely, and hence any disconnection that is apparent after a link sampling procedure should be scrutinized. If a disconnection is the result of a specific realization and thus is not necessarily the same in any other realization (i.e. a new time period), this will jeopardize any conclusions drawn from the study. This is evident considering the observed variation in our results in the rate of spread for different link densities (figure 4), which emphasizes the difference between a network that represents one specific time period with all its measures and a network that can be used to predict and estimate the rate for any given time.

We were interested in determining how many animal holdings could become infected and the rate of disease transmission, and thus incubation time was not included in our model. This is a simplification, because diseases differ with respect to incubation time, which can vary from only a few days to as long as a number of years. However, our model can easily be extended to encompass a more complex disease context by including a recovery phase and incubation time. We calculated the number of infected animal holdings as a measure of the spread of disease. In practice, this might not be particularly relevant, because it is not desirable to allow disease transmission to proceed for such a long time. Obviously, it would be preferable to adopt control strategies as soon as possible after identifying an infection. Notwithstanding, the findings of our study do have implications regarding what link density ought to be achieved when testing different strategies.

4.1 Conclusions
Our results indicate that to estimate network properties such as spread of disease, it might be necessary to construct link sampling procedures that yield high link densities.
More specifically, our scenarios based on Swedish farms show that, if the sampling procedure is ideal, a density of 0.02 (mean degree of 5) can suffice to estimate disease transmission over shorter time periods, whereas 0.04 (mean degree of 10) is required for longer periods. Nevertheless, in reality, link sampling procedures are not perfect, and some mass-action mixing component can be expected in the contacts between holdings. Our results demonstrate that these two components of reality enforce an even higher level of link density if the sampled network is to represent a relevant measure of spread of disease.


ACKNOWLEDGEMENTS

We would like to thank the Swedish Civil Contingencies Agency (MSB) for funding this project. We would also like to thank Patricia Ödman for revising the English.


REFERENCES

Barrat, A., Barthélemy, M., Pastor-Satorras, R., Vespignani, A., 2004. The architecture of complex weighted networks. PNAS 101, 3747-3752. (doi:10.1073/pnas.0400087101)

Barthélemy, M., Barrat, A., Pastor-Satorras, R., Vespignani, A., 2005. Dynamic patterns of epidemic outbreaks in complex heterogeneous networks. Journal of Theoretical Biology 235, 275-288. (doi:10.1016/j.jtbi.2005.01.011)

Borgatti, S., 2003. The Key Player Problem. In: Dynamic Social Network Modeling and Analysis: Workshop Summary and Papers, R. Breiger, K. Carley, P. Pattison (Eds). National Academy of Sciences Press.

Christley, R.M., Robinson, S.E., Lysons, R., French, N.P., 2005. Network analysis of cattle movement in Great Britain. Proceedings of the Society for Veterinary Epidemiology and Preventive Medicine (2005), 234-243.

Clauset, A., Moore, C., Newman, M.E.J., 2008. Hierarchical structure and the prediction of missing links in networks. Nature 453, 98-101. (doi:10.1038/nature06830)

Corner, L.A.L., Pfeiffer, D.U., Morris, R.S., 2003.
Social-network analysis of Mycobacterium bovis transmission among captive brushtail possums (Trichosurus vulpecula). Preventive Veterinary Medicine 59, 147-167. (doi:10.1016/S0167-5877(03)00075-8)

Eames, K.T.D., Read, J.M., Edmunds, W.J., 2009. Epidemic prediction and control in weighted networks. Epidemics 1, 70-76. (doi:10.1098/rspb.2003.2554)

Guimerà, R., Sales-Pardo, M., 2009. Missing and spurious interactions and the reconstruction of complex networks. PNAS 106, 22073-22078. (doi:10.1073/pnas.0908366106)

Heath, M.F., Vernon, M.C., Webb, C.R., 2008. Construction of networks with intrinsic temporal structure from UK cattle movement data. BMC Veterinary Research 4:11. (doi:10.1186/1746-6148-4-11)

Håkansson, N., Jonsson, A., Lennartsson, J., Lindström, T., Wennergren, U., 2010. Generating structure specific networks. Advances in Complex Systems 13:2, 239-250. (doi:10.1142/S0219525910002517)

Kao, R.R., Green, D.M., Johnson, J., Kiss, I.Z., 2007. Disease dynamics over very different time-scales: foot-and-mouth disease and scrapie on the network of livestock movements in the UK. J. R. Soc. Interface 4, 907-916. (doi:10.1098/rsif.2007.1129)

Keeling, M., 2005. The implication of network structure for epidemic dynamics. Theoretical Population Biology 67, 1-8. (doi:10.1016/j.tpb.2004.08.002)

Kiss, I.Z., Green, D.M., Kao, R.R., 2005. Disease contact tracing in random and clustered networks. Proc. R. Soc. B 272, 1407-1414. (doi:10.1098/rspb.2005.3092)

Kiss, I.Z., Green, D.M., Kao, R.R., 2006. The network of sheep movements within Great Britain: network properties and their implications for infectious disease spread. J. R. Soc. Interface 3, 669-677. (doi:10.1098/rsif.2006.0129)

Lindström, T., Håkansson, N., Westerberg, L., Wennergren, U., 2008. Splitting the tail of the displacement kernel shows the unimportance of kurtosis.
Ecology 89, 1784-1790. (doi:10.1890/07-1363.1)

Lindström, T., Sisson, S.A., Nöremark, M., Jonsson, A., Wennergren, U., 2009. Estimation of distance related probability of animal movements between holdings and implications for disease spread modeling. Preventive Veterinary Medicine 91, 85-94. (doi:10.1016/j.prevetmed.2009.05.022)

Nöremark, M., Håkansson, N., Sternberg Lewerin, S., Lindberg, A., Jonsson, A. Network analysis of cattle and pig movements in Sweden: measures relevant for disease control and risk based surveillance. Submitted to Preventive Veterinary Medicine.

Newman, M.E.J., Strogatz, S.H., Watts, D.J., 2001. Random graphs with arbitrary degree distributions and their applications. Phys. Rev. E 64, 026118. (doi:10.1103/PhysRevE.64.026118)

Newman, M.E.J., 2002. Assortative mixing in networks. Phys. Rev. Lett. 89 (20). (doi:10.1103/PhysRevLett.89.208701)

Ortiz-Pelaez, A., Pfeiffer, D.U., Soares-Magalhães, R.J., Guitian, F.J., 2006. Use of social network analysis to characterize the pattern of animal movements in the initial phases of the 2001 foot and mouth disease (FMD) epidemic in the UK. Prev. Vet. Med. 76, 40-55. (doi:10.1016/j.prevetmed.2006.04.007)

Perkins, S.E., Cagnacci, F., Straditto, A., Arnoldi, D., Hudson, P.J., 2009. Comparison of social networks derived from ecological data: implications for inferring infectious disease dynamics. Journal of Animal Ecology 78, 1015-1022. (doi:10.1111/j.1365-2656.2009.01557.x)

Robinson, S.E., Christley, R.M., 2007. Exploring the role of auction markets in cattle movements within Great Britain. Preventive Veterinary Medicine 81, 21-37. (doi:10.1016/j.prevetmed.2007.04.011)

Shirley, M.D.F., Rushton, S.P., 2005. The impacts of network topology on disease spread. Ecological Complexity 2, 287-299.
(doi:10.1016/j.ecocom.2005.04.005)

Vernon, M.C., Keeling, M.J., 2009. Representing the UK's cattle herd as static and dynamic networks. Proc. R. Soc. B 276, 469-476. (doi:10.1098/rspb.2008.1009)

Wasserman, S., Faust, K., 1994. Social Network Analysis: Methods and Applications. Cambridge University Press, Cambridge.

Watts, D.J., Strogatz, S.H., 1998. Collective dynamics of ‘small-world’ networks. Nature 393, 440-442. (doi:10.1038/30918)

Webb, C.R., 2005. Farm animal networks: unraveling the contact structure of the British sheep population. Preventive Veterinary Medicine 68, 3-17. (doi:10.1016/j.prevetmed.2005.01.003)

Webb, C.R., 2006. Investigating the potential spread of infectious diseases of sheep via agricultural shows in Great Britain. Epidemiology and Infection 134, 31-40. (doi:10.1017/S095026880500467X)


TABLE CAPTIONS

Table 1. Link densities used in simulations and the corresponding mean link degree for the networks.

Table 2. Fragmentation index according to link density and the link sampling method used.


FIGURE CAPTIONS

Figure 1. Network categories: (a) complete network, (b) real-world network, (c) sampled network.

Figure 2. Model flow chart. Flow chart showing relationships between the different components of the model.

Figure 3. Mean number of infected holdings per time step in the four linking and disease transmission scenarios. Disease transmission was distance dependent in scenarios DlDt (a) and RlDt (b) but random in DlRt (c) and RlRt (d). Also, distance dependent link creation was applied in DlDt (a) and DlRt (c), whereas links were generated randomly in RlDt (b) and RlRt (d). The link densities were as follows: 0.001 (---), 0.005 (…), 0.01 (--.--), 0.02 (__), 0.03 (-○-), 0.04 (-*-), 0.05 (-□-), 0.1 (-♦-), 0.25 (-◦-), 0.5 (-▼-), 0.75 (-x-) and 1.0 (-+-).
Corresponding mean link degrees can be found in table 1.

Figure 4. The median values of the 1000 replicates of the simulations of the DlDt scenario plotted with the first and third quartile on each side of the median curve. The solid line shows the median number of infected holdings per time step and the dashed lines represent the first and third quartiles of the replicates. Link densities: (a) 0.001, (b) 0.01, (c) 0.02, (d) 0.03, (e) 0.04, (f) 1.0. Note that the scales of the y-axes differ in (a) and (b). Corresponding mean link degrees can be found in table 1.

Figure 5. Number of time steps passed before 10% (a), 50% (b) and 90% (c) of all holdings in the network were infected. The time depended on which of the four scenarios was used. The scenarios are designated as follows: dashed line, DlDt; dotted line, RlDt; solid line, DlRt; dash-dot line, RlRt. For scenario RlDt, the number of infected holdings did not reach any of the given proportions during the simulation time.

Figure 6. Mean number of infected holdings per time step for a given link density and the four scenarios, designated as follows: dashed line, DlDt; dotted line, RlDt; solid line, DlRt; dash-dot line, RlRt. Link densities: (a) 0.001, (b) 0.01, (c) 0.03, (d) 0.05, (e) 0.07, (f) 0.1, (g) 0.5, (h) 1.0. Note that the scales of the y-axes differ in (a). Corresponding mean link degrees can be found in table 1.

Figure 7. Average assortativity (a) and clustering coefficient (b) illustrated for the networks according to the connections of the holdings. Distance dependent linking is indicated by a dashed line and random linking by a solid line.