National Health Research Institutes
Space-Time Modeling and Application to Emerging Infectious Diseases
李正宇
July 26th, 2005
Division of Biostatistics and Bioinformatics
2005/4/12

Outline
• Introduction
• STARMA Models
• Methods for STARMA Modeling and Software IEAST
• Modeling Emerging Infectious Diseases using STARMA and IEAST
• Conclusion

Introduction

Introduction
Tobler's First Law of Geography
"Everything is related to everything else, but near things are more related than distant things."

Introduction
• Biological and ecological processes are often organized and correlated in both space and time.
• Why use space-time data and space-time analyses?
• Various space-time models – STKF, KKF, VARMA, STARMA, etc.
• Why STARMA models?
• Are emerging infectious diseases the only application?

Scope of the Work
• An efficient and robust STARMA modeling method
  – Space-time extensions of optimization algorithm and model fitness measures
  – Refinement of the space-time modeling procedure
• Software development -- IEAST
  – The first general-purpose STARMA modeling and analysis software
  – Integrated Environment for Analyzing STARMA models
• Application to the spread of WNV in an epidemic in Detroit
  – Modeling and analysis of Dead Crow Data
  – Modeling and analysis of Human Case Data
  – Cross analysis of Human Case Data and Dead Crow Data
  – Statistical inferences from these space-time analyses

STARMA Models

Space-Time Variables Evolving over Time
• $z_{t,x}$: some ecological variable at spatial coordinates vector x at time t. $z_x$ forms a time series for location x.
[Figure: lattice with axes X, Y and time, showing cells $z_{t,(0,0)}$, $z_{t,(1,2)}$, $z_{t,(2,1)}$, $z_{t,(2,2)}$ at time = t, the previous layer at time = t-1, and random noise input]
• These time series are not independent, but influence each other via spatial proximity.
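The idea that each location carries its own time series, and that neighbouring series influence one another, can be illustrated with a small simulation. This is only a sketch under stated assumptions: the function names (`neighbour_average`, `simulate_lattice`), the 10x10 lattice, the four-nearest-neighbour rule, and the coefficients 0.5 and 0.3 are illustrative choices, not taken from IEAST.

```python
import numpy as np

def neighbour_average(field):
    """Sum of the four nearest neighbours of every cell, divided by four.

    Cells outside the lattice count as zero (assumption for this sketch).
    """
    padded = np.pad(field, 1)  # zero padding around the lattice
    total = (padded[:-2, 1:-1] + padded[2:, 1:-1]
             + padded[1:-1, :-2] + padded[1:-1, 2:])
    return total / 4.0

def simulate_lattice(n_steps, shape, phi_self, phi_neigh, seed=0):
    """Each cell depends on its own previous value, its neighbours'
    previous values, and fresh random noise (hypothetical update rule)."""
    rng = np.random.default_rng(seed)
    z = np.zeros((n_steps, *shape))
    for t in range(1, n_steps):
        noise = rng.normal(size=shape)
        z[t] = phi_self * z[t - 1] + phi_neigh * neighbour_average(z[t - 1]) + noise
    return z

# z[t, i, j] plays the role of z_{t,x} with x = (i, j).
z = simulate_lattice(n_steps=200, shape=(10, 10), phi_self=0.5, phi_neigh=0.3)
series_at_2_2 = z[:, 2, 2]  # the time series z_x for location x = (2, 2)
print(z.shape, series_at_2_2.shape)
```

Storing the data as a (time, row, column) array keeps the per-location time series and the per-time spatial snapshot both one slice away.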
General STARMA Models
• The general STARMA model has the stochastic equation:

  $$ z_{t,x} = \underbrace{\sum_{k}\sum_{b} \phi_{k,b}\, z_{t-k,\,x+b}}_{\text{AR terms}} \;-\; \underbrace{\sum_{k}\sum_{b} \theta_{k,b}\, e_{t-k,\,x+b}}_{\text{MA terms}} \;+\; e_{t,x} $$

  The strength of the autoregressive components is measured by $\phi_{k,b}$, and the strength of the shared moving average stochastic inputs by $\theta_{k,b}$.
• Model types:
  – STAR model (when $\theta_{k,b} = 0$)
  – STMA model (when $\phi_{k,b} = 0$)
  – Mixed model (when $\phi_{k,b} \neq 0$ and $\theta_{k,b} \neq 0$).

A Useful Form for STARMA Modeling
• By introducing the spatial weight matrices $W^{(l)}$, we can express the general STARMA model in the following form:

  $$ z_t = \sum_{k}\sum_{l} \phi_{kl}\, W^{(l)} z_{t-k} \;-\; \sum_{k}\sum_{l} \theta_{kl}\, W^{(l)} e_{t-k} \;+\; e_t $$

  where
  l: spatial lag, k: temporal lag;
  $z_t$ is the observation vector at time t;
  $W^{(l)}$ is the weight matrix for the l-th order;
  $\phi_{kl}$ are the parameters of the autoregressive terms;
  $\theta_{kl}$ are the parameters of the moving average terms;
  $e_t$ is the random noise vector at time t.
• This is the equation actually used for the implementation of IEAST and applications.

Spatial Correlation Structure and Weight Matrices
• Spatial weight matrices are used to construct the spatial correlation structure among locations.
• The following ordering is an example of the definition of spatial correlation structure (up to 4th order neighbors) in a 2D system.
[Figure: four lattice panels showing 1st order $W^{(1)}$, 2nd order $W^{(2)}$, 3rd order $W^{(3)}$, and 4th order $W^{(4)}$ neighbors]

Some Limitations of STARMA Modeling
• Raster based
• Requires a massive amount of space-time data
• Models generally may not be fully mechanistic
Assumptions:
• Stationarity
• "Spatial Regularity"
• Effects are "constant"
• Effects are "linearly" correlated

Methods for STARMA Modeling and Software IEAST

Box-Jenkins Modeling Method
[Flowchart: Data -> Model Identification -> Parameter Estimation -> Diagnostic Check -> Good? No: Modify Model and repeat. Yes: End.]

Model Identification
To determine the model type and orders.
• Conventionally, space-time autocorrelations (i.e. STACF/STPACF) are used (Pfeifer and Deutsch, 1980).
• In this research, space-time extensions of model fitness measures (i.e. AIC, BIC) are used to assist identification when the method above does not work. These measures are more objective and computationally efficient.

Model Identification -- using Space-Time Autocorrelation Functions
Example 1: STAR (MaxT=2, MaxS=1)
  – STACF tails off
  – STPACF cuts off at T-lag=2 & S-lag=1
Example 2: STMA (MaxT=1, MaxS=1)
  – STACF cuts off at T-lag=1 & S-lag=1
  – STPACF tails off
[Figure: STACF and STPACF plots for Example 1 and Example 2]

  STACF      STPACF     Suggested Model Type
  Tail-off   Cut-off    STAR model
  Cut-off    Tail-off   STMA model
  Tail-off   Tail-off   Mixed model

Model Identification -- using Space-Time Autocorrelation Functions
Simulation Data 1, based on a STAR process (STACF: tail-off; STPACF: cut-off):
  $z_t = 0.50\,z_{t-1} + 0.30\,W^{(1)} z_{t-1} + 0.10\,z_{t-2} + 0.05\,W^{(1)} z_{t-2} + e_t$
Simulation Data 2, based on a STMA process (STACF: cut-off; STPACF: tail-off):
  $z_t = e_t - (-0.6)\,e_{t-1} - (-0.4)\,W^{(1)} e_{t-1}$
[Figure: STACF and STPACF plots for both simulated datasets]

Model Identification -- using Model Fitness Measures
Accuracies (diagonal entries, shown in red on the slide) of model type selection using (1) variance of residuals, (2) AIC, (3) BIC, and (4) –AIC*BIC, based on 150 Monte Carlo simulated datasets. Rows: process the datasets are based on. Columns: model type identified.

Using Variance of Residuals
  Datasets based on   STAR    STMA    Mixed
  STAR                  4%      0%     96%
  STMA                  4%      6%     90%
  Mixed                 8%      6%     90%

Using AIC
  Datasets based on   STAR    STMA    Mixed
  STAR                 16%      0%     84%
  STMA                  4%      6%     90%
  Mixed                 8%      2%     90%

Using BIC
  Datasets based on   STAR    STMA    Mixed
  STAR                100%      0%      0%
  STMA                  4%     86%     10%
  Mixed                18%     16%     66%

Using –AIC*BIC
  Datasets based on   STAR    STMA    Mixed
  STAR                100%      0%      0%
  STMA                  4%     78%     18%
  Mixed                16%      4%     80%

Parameter Estimation
To calculate coefficients of a candidate model for a given model type and orders.
• Two methods are needed for two kinds of models:
  – Linear models (i.e. STAR): linear ML estimator.
  – Non-linear models (i.e. STMA and Mixed): multivariate nonlinear optimization.
• The multivariate and non-linear nature raises problems during optimization:
  – Convergence to local optima
  – Very time-consuming
• A good starting point is crucial for optimization
  – Extra step 'Pre-estimation'
  – A space-time extended Hannan-Rissanen algorithm is used.

Diagnostic Check
• To decide the adequacy of a candidate model for representing the given data.
• Methods:
  – Variance of residuals
  – Space-time autocorrelations of residuals
  – Significance testing of parameters
  – Space-time extension of AIC/BIC

Modeling Procedures
[Flowchart (Box-Jenkins method): Data -> Model Identification -> Parameter Estimation -> Diagnostic Check -> Good? No: Modify Model and repeat. Yes: End.]

Software for STARMA Modeling -- IEAST
• Developed using GNU Octave v2.1.40; runs under various popular operating systems, e.g. MS Windows, Mac OS, Unix.
• Two interfaces: menu-driven mode and programming mode.
• Features:
  – True spatio-temporal analysis software
  – Analyzing 2D lattice space-time datasets
  – Full configurability
  – Programming environment
  – Improved estimation algorithms
  – Improved diagnostic measures
  – Estimation of spatial correlation structure
  – Cross correlation analysis
  – 2D/3D plotting abilities

IEAST -- Menu-Driven Mode vs Programming Mode
In menu-driven mode, users can conduct the modeling procedure by selecting a series of commands/options from the menu hierarchy.
[Screenshot: overlapping IEAST menu windows, listed here one menu at a time]

  [IEAST v1.30.01 - STARMA Modeling & Analysis]
  =============== [ Main Menu ] ===============
  [ 1] Setup
  [ 2] Data Preprocessing
  [ 3] Correlation Analyses
  [ 4] Model Identification
  [ 5] Parameter Estimation
  [ 6] Diagnostic Analysis
  [ 7] -----
  [ 8] Preference
  [ 9] Interpreter
  [10] Exit
  =============================================

  ============== [ Setup ] ==============
  [ 1] > Space-time dataset
  [ 2] > Spatial correlation structure
  [ 3] > Information of datasets
  [ 4] > Return
  =======================================

  ========= [ Data Preprocessing ] =========
  [ 1] > Remove Mean
  [ 2] > De-seasonalize: (1-B^dd)Z(t)
  [ 3] > Diference by one: (1-B)Z(t)
  [ 4] > De-trend
  [ 5] > -----
  [ 6] > Subsequencing/Resampling
  [ 7] > Smoothing
  [ 8] > Missing Data
  [ 9] > Filter with a given STARMA model
  [10] > Undo previous action
  [11] > Return
  ==============================================

  ========== [ Correlation Analyses ] ==========
  [ 1] > AutoCorrelation (STACF)
  [ 2] > Partial AutoCorrelation (STPACF)
  [ 3] > Cross Correlation (STXCF)
  [ 4] > Partial Cross Correlation (STPXCF)
  [ 5] > Extended Cross Correlation (ExtSTXCF)
  [ 6] > Plot Correlations versus T-Lag/S-Lag
  [ 7] > Return
  ==============================================

  [ Model Identification ]
  [ 1] Automatic Identification (Type,Orders)
  [ 2] Artificial Identification (Type,Orders)
  [ 3] Parameter Masking
  [ 4] -----
  [ 5] Return
  ==========================================

  =================== [ Parameter Estimation ] ===================
  [ 1] > Pre-estimate Model Param -- Linear (STAR)
  [ 2] > Pre-estimate Model Param -- Non-linear (STMA,STARMA)
  [ 3] > Pre-estimate Model Param -- From STACF/STPACF
  [ 4] > Pre-estimate Model Param -- Specified by users
  [ 5] > Estimate Model Param -- Fixed SRM
  [ 6] > Estimate SRM -- Fixed Model Param
  [ 7] > Estimate SRM & Model Param -- Alternatively
  [ 8] > Return
  ================================================================

  ==== [ Diagnostic Analysis ] ====
  [ 1] > Statistical Significance
  [ 2] > AICC/BIC Analysis
  [ 3] > STACF of Residuals
  [ 4] > STPACF of Residuals
  [ 5] > -----
  [ 6] > Return
  =================================

  # list
  10 load data demo.dat
  20 load weight uniform.wet
  30 stacf ST_ACF Z 16 3
  40 plotacf ST_ACF 16 3 "ACF"
   :    :    :

IEAST -- Menu-Driven Mode vs Programming Mode
In programming mode, a set of sophisticated instructions can be used to compose programs to control the modeling flow and to conduct statistical analyses.

Space-time Dataset: 'demo.dat'
  # name: DatafileZ
  # type: matrix
  # rows: 100
  # columns: 100
  -0.0350001 0.00197952 -0.00635348....
  -0.0886448 0.0504684 -0.00369402....
  0.025101 0.00844576 -0.00743455....
  …………………..

Spatial Weighting Matrices: 'uniform.wet'
  # name: SOD
  # type: global matrix
  # rows: 21
  # columns: 21
  0 0 0 0 0 0 0 0….
  0 0 0 0 0 0 0 0….
  ……………….

IEAST Program 'demo.pgm'
  10 load data demo.dat
  20 load weight uniform.wet
  30 stacf STACF Z 16 3
  …….

Interpreter session:
  [IEAST v1.30.01 - STARMA Modeling & Analysis]
  =============== [ Main Menu ] =============
  [ 1] Setup
   :     :
  [ 8] Preference
  [ 9] Interpreter
  [10] Exit
  =============================================
  =============================================
  || Welcome to STARMA analyzing interpreter ||
  =============================================
  # load program demo.pgm
  # list
  10 load data demo.dat
  20 load weight uniform.wet
  30 stacf STACF Z 16 3
  ………
  100 end
  # run

Modeling Emerging Infectious Diseases using STARMA and IEAST

State of the Art for Statistical Analyses of Emerging Infectious Diseases
As far as we know, no true spatial-temporal statistical models and methods have been used.
• Space-time cluster analysis available (Theophilides et al, 2003; Mostashari et al, 2003; Hoebe et al, 2004).
• Spatial models available (Watson et al, 2004).
• Temporal models available.

Limitations of Simply Observing How a Spatial Distribution Changes over Time
• For example, expansion of the leading edge of a disease range.
• Is the disease spreading directly over long distances but infrequently, or over short distances frequently?
• This is important for projecting the future spread.

STARMA Has Potential for the Early Characterization of Infectious Diseases
• STARMA acts as a "prism": it can filter the spatial-temporal correlations into direct effects with known magnitude and spatial and temporal lags.
• Not generally a complete, mechanistic model, but it puts critical constraints on models.

West Nile Virus
The West Nile Virus (WNV) was first detected in a woman with a mild fever in the West Nile District of Uganda in 1937. Since then WNV has spread to North Africa, Europe, West and Central Asia, and the Middle East.

West Nile Virus in the United States
• Outbreak in NYC in Sep 1999. The vector is Culex mosquitoes.
• Wild birds (89% are American crows) are the principal hosts. Humans, horses, etc. are incidental hosts.
• The incidence rate among crows is high. Infected crows almost always die (68%).
• Surveillance of dead crows has been used as an indicator of WNV epidemics.
[Figure from the CDC web site]

Dead Crow Data (DCD) & Human Case Data (HCD) in 2002
Time: summer of 2002 (April to October)
Place: Detroit metro area (Oakland, Macomb, and Wayne counties)
• DCD were collected systematically before and during an outbreak among humans. Data mainly consisted of locations and dates of reported public sightings.
• HCD were obtained from clinicians in Michigan. Data on address of residence and date of onset of disease were obtained from the case-patient or attending physician through telephone interviews.
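Records of this kind (a location and a date per report) have to be aggregated onto a lattice before a raster-based space-time analysis can use them. The following is a minimal sketch of that aggregation, assuming projected x/y coordinates and integer week indices starting at 0; the function name `bin_reports`, the bounding box, and the sample coordinates are hypothetical and do not come from the study data.

```python
import numpy as np

def bin_reports(x, y, week, bounds, n_cells, n_weeks):
    """Count reports per (week, row, column) cell.

    bounds = (x_min, x_max, y_min, y_max); reports outside the bounds
    are dropped, which mimics truncating the study area to a rectangle.
    """
    x_min, x_max, y_min, y_max = bounds
    sample = np.column_stack([week, y, x]).astype(float)
    counts, _edges = np.histogramdd(
        sample,
        bins=(n_weeks, n_cells, n_cells),
        range=((0, n_weeks), (y_min, y_max), (x_min, x_max)),
    )
    return counts

# Illustrative, made-up reports: coordinates in arbitrary units, week index 0..27.
x = [1.2, 1.4, 25.0, 30.9, 12.0]
y = [3.3, 3.1, 20.5, 25.0, 9.9]
week = [0, 0, 5, 27, 13]
counts = bin_reports(x, y, week, bounds=(0.0, 31.6, 0.0, 25.8), n_cells=10, n_weeks=28)
print(counts.shape)      # one 10x10 count grid per week
print(counts[0].sum())   # number of reports falling in week 0
```

The resulting array has one count grid per week, so each cell carries its own weekly time series.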
Two Datasets Collected in 2002
[Diagram: Human Cases (collected by interview) and Dead Crows* (reported via toll-free number and WWW pages) pass through Data Cleaning & Geocoding to Longitude/Latitude and into GIS - ArcMap]
* From www.rci.rutgers.edu/~insects/crowid.htm

Space-Time Analysis for Dead Crow Data

The Dead Crow Data
• In total, 1817 dead crow sightings scattered within the three counties (red lines), spanning 28 weeks.
• Covered area (after truncation): a rectangular area of 31.6 x 25.8 mi.
• Divide the covered area into 10x10 cells. Cell size: 3.16 x 2.58 mi.

Spatial Correlation Structure and Trends
Spatial correlation structure (uniform weighting):

  * 6 5 5 5 6 *
  6 5 4 3 4 5 6
  5 4 2 1 2 4 5
  5 3 1 0 1 3 5
  5 4 2 1 2 4 5
  6 5 4 3 4 5 6
  * 6 5 5 5 6 *

Preprocessing
  – Remove spatio-temporal trend
    • Spatial trend: 4th order polynomial regression trend surface
    • Temporal trend: averaging over space.
  – Remove mean

Model Identification -- STACF
[Figure: STACF plot; the STACF tails off]

Model Identification -- STPACF
[Figure: STPACF plot, annotated "Spatially cut-off after this lag" and "Temporally cut-off after this lag"]
The STACF/STPACF suggest the model STAR(maxT=3, maxS=4).

Parameter Estimation
The parameters ($\phi_{ts}$) of this STAR model can be estimated in IEAST by the linear maximum likelihood estimator.

  φ_ts    s=0    s=1    s=2    s=3    s=4
  t=1    0.26   0.36   0.10   0.09   0.04
  t=2    0.04   0.18   0.07   0.04   0.11
  t=3    0.02   0.11   0.02   0.02   0.04

• Values in dark blue are nominally significant at the 0.001 level.
• Values in light blue are nominally significant at the 0.01 level.

Diagnostic Check
• Statistical significance of parameters
  – The probabilities P that the $\phi_{ts}$ are not significant are:

  P       s=0     s=1     s=2     s=3    s=4
  t=1    0.001   0.001   0.01    0.1    0.4
  t=2    0.04    0.001   0.04    0.3    0.01
  t=3    0.4     0.01    0.25    0.9    0.6

• Autocorrelations of the residuals
[Figure: STACF and STPACF of the residuals]

Interpretations for the DCD Analysis
• The STAR(3,4) model is the best-fitted one.
• The largest spatial and temporal lags that are important are smaller still: S=2 (or 6.4 km) and T=2 weeks.
• Compare S=1 to S=2. The value for S=1 is much larger (cell boundary length effects).
• The virus is not spreading very far very fast. Crows are not spreading the virus much spatially, though they are probably amplifying it locally.
• Negative autoregressive effect at S=1 and T=2,3.
  – Appears to be a real effect.
  – May be due to crow population depletion.
  – Suggests there is a mixture of two STAR processes, the dominant one reflecting the probability of infection, the other an echo effect from depletion.

  φ_ts    s=0    s=1    s=2    s=3    s=4
  t=1    0.26   0.36   0.10   0.09   0.04
  t=2    0.04   0.18   0.07   0.04   0.11
  t=3    0.02   0.11   0.02   0.02   0.04

Additional Analyses and Results
Additional analyses:
• Using 20x20 and other cell configurations
• Using different lag structures: "Pfeifer's" vs. "Ring structure"
• Using various polynomials for spatial de-trending
• Using a sub-sample of the data
Results:
• Consistent over various methods of spatial de-trending, except that high order polynomials resulted in smaller AR values.
• Consistent AR values using different lag structures and cell sizes.
• Consistent implied spatial and temporal scales over which there are significant or substantial AR effects.

Distances over Which There Is Significant Spatial Correlation
• Based on different cell configurations: 10x10, 16x16, and 20x20
  – The effective correlation distance in the modeling results is consistently about 10.75 km regardless of cell size.
  Configuration   Cell size        Max S order of the estimated model   Equivalent distance
  10 x 10         5.08 x 4.15 km   4                                    10.99 km
  16 x 16         3.19 x 2.59 km   6                                    10.88 km
  20 x 20         2.54 x 2.08 km   7                                    10.38 km

Alternative Spatial Correlation Structures
Pfeifer's:

  * 6 5 5 5 6 *
  6 5 4 3 4 5 6
  5 4 2 1 2 4 5
  5 3 1 0 1 3 5
  5 4 2 1 2 4 5
  6 5 4 3 4 5 6
  * 6 5 5 5 6 *

Ring structure:

  3 3 3 3 3 3 3
  3 2 2 2 2 2 3
  3 2 1 1 1 2 3
  3 2 1 0 1 2 3
  3 2 1 1 1 2 3
  3 2 2 2 2 2 3
  3 3 3 3 3 3 3

Space-Time Analysis for Human Case Data

Human Case Data
• Over 500 human cases spanning 13 weeks
• Date of onset: converted to week
• Home addresses (names stripped): converted to "cell", same as for DCD.
• Used the same arrays of cell sizes and spatial correlation structures as for DCD.
• Same spatial and temporal detrending method

Model Identification -- STACF
[Figure: STACF plot]

Model Identification -- STPACF
[Figure: STPACF plot]

Parameter Estimation
Rows: temporal lags (weeks). Columns: spatial lags.

          s=0     s=1     s=2     s=3     s=4     s=5     s=6
  t=1    0.26    0.06   -0.10   -0.29   -0.05   -0.30   -0.60
  t=2    0.12    0.27    0.13   -0.12   -0.11   -0.22   -0.11
  t=3    0.07    0.10   -0.15    0.05    0.00    0.06   -0.01
  t=4    0.04   -0.17   -0.07   -0.02    0.16    0.25    0.11
  t=5   -0.01   -0.10   -0.04    0.10   -0.06    0.11    0.06
  t=6   -0.04    0.03   -0.03   -0.19   -0.09    0.08    0.09

• Values in dark blue are nominally significant at the 0.001 level.
• Values in light blue are nominally significant at the 0.01 level.

Diagnostic Check
• STACF and STPACF of the residuals
[Figure: STACF and STPACF of the residuals]

Interpretations for the HCD Analysis
• Most people are getting infected at or near their homes.
• The incidences are highly autocorrelated in space and time.
• The distribution or probability of infection is highly "localized".
• The WNV "load" and the probability of human infection are "spreading" slowly, in the sense of not spreading very far very fast.
• Suggests localized spraying could reduce cases.
• Without a depletion effect, the human case data show parameter values that are positive and significantly above zero for T-lag=2 and S-lag>=1, especially at S-lag=1.

          s=0     s=1     s=2
  t=1    0.26    0.06   -0.10
  t=2    0.12    0.27    0.13
  t=3    0.07    0.10   -0.15

Space-Time Cross Analysis for HCD and DCD

Space-Time Data HCD and DCD
• The areas for cross analysis are the same for both datasets.
• The configuration is again 10x10, spanning 28 weeks.
• Cell size is 6.31 x 6.31 km.

Both Temporal Epidemic Curves
[Figure: temporal epidemic curves of dead crow reports and human cases]
Dead crow reports lead human cases in time.

Space-Time Cross Correlations
[Figure: space-time cross correlations, with the peak marked at temporal lag -3]

Interpretations for Space-Time Cross Correlations
• Drop smoothly to zero spatially and temporally.
• Very large (as high as 0.7).
• Across all spatial lags, the maximum cross correlations are aligned at –3 weeks.
• The cross correlations at spatial lag 1 are slightly greater than at spatial lag 0.
• When the temporal lag decreases to –8 or below, the correlations between these two datasets are negligible (<0.1).
• When the spatial lag increases up to 10, the cross correlations are reduced to as low as 0.2.

Are the Cross Correlations Spurious?
The autocorrelation of the DCD can spuriously contribute to cross correlations. To eliminate this effect, both datasets were pre-whitened before calculating cross correlations.
[Figure: cross correlation with pre-whitening]
• The result shows that the 'real' cross correlations are much larger than the 'spurious' components.

Summary for Modeling the Spread of WNV
• Crows are not spreading the disease spatially very far very fast.
• Spread is very localized; perhaps other animals or the mosquitoes themselves are spreading it spatially.
• Humans are being infected largely at or near their homes.
• Both crows and humans appear to be responding to local viral loads.
• Dead crow findings precede human cases by two to three weeks. Dead crows can be a good indicator of human epidemics.

Conclusion
• It appears that STARMA modeling could be an important tool for the early characterization of many emerging and reemerging infectious disease epidemics.
• During the course of an epidemic, it could be used (in principle) for forecasting, under existing conditions or under potential courses of action.
• While not generally a mechanistic model, STARMA does inform spatial and temporal scales of spread, and hence places constraints on mechanistic models (which otherwise may have too many parameters).

Funding Acknowledgements
• Michigan Agricultural Experiment Station, Michigan State University.
• Center for Emerging Infectious Diseases, Michigan State University.
• Centers for Disease Control and Prevention, USA.

Thanks for your attention! & Questions?

References
• C.J.P.A. Hoebe, H. de Melker, L. Spanjaard, J. Dankert, and N. Nagelkerke. Space-time cluster analysis of invasive meningococcal disease. Emerging Infectious Diseases, 10(9):1621-1626, 2004.
• C.N. Theophilides, S.C. Ahearn, S. Grady, and M. Merlino. Identifying West Nile virus risk areas: The dynamic continuous-area space-time system. American Journal of Epidemiology, 157:843-854, 2003.
• J. Watson, R. Jones, K. Gibbs, and W. Paul. Dead crow reports and location of human West Nile virus cases, Chicago, 2002. Emerging Infectious Diseases, 10(5):938-940, 2004.
• F. Mostashari, M. Kulldorff, J.J. Hartman, J.R. Miller, and V. Kulasekera. Dead bird clustering: A potential early warning system for West Nile virus activity. Emerging Infectious Diseases, 9:641-646, 2003.