National Health Research Institutes
Space-Time Modeling and
Application to Emerging
Infectious Diseases
李正宇
July 26th, 2005
Division of Biostatistics and Bioinformatics
2005/4/12
Outline
• Introduction
• STARMA Models
• Methods for STARMA Modeling and
Software IEAST
• Modeling Emerging Infectious Diseases
using STARMA and IEAST
• Conclusion
Introduction
Introduction
Tobler’s First Law of Geography
‘‘Everything is related to everything else,
but near things are more related than
distant things.’’
Introduction
• Biological and ecological processes are often organized
and correlated in both space and time.
• Why use space-time data and space-time analyses?
• Various space-time models
– STKF, KKF, VARMA, STARMA, etc.
• Why STARMA models?
• Are emerging infectious diseases the only application?
Scope of the Work
• An efficient and robust STARMA modeling method
– Space-time extensions of optimization algorithm and model fitness
measures
– Refinement of the space-time modeling procedure
• Software development -- IEAST
– The first general-purpose STARMA modeling and analysis software
– Integrated Environment for Analyzing STARMA models
• Application to the spread of WNV in an epidemic in Detroit
– Modeling and analysis of Dead Crow Data
– Modeling and analysis of Human Case Data
– Cross analysis of Human Case Data and Dead Crow Data
– Statistical inferences from these space-time analyses
STARMA Models
Space-Time Variables Evolving over Time
• $z_{t,x}$: some ecological variable at spatial coordinate vector $x$ at time $t$. For a fixed location $x$, the values $z_{t,x}$ form a time series.
• These time series are not independent, but influence each other via spatial proximity.
[Figure: lattice in the X-Y plane shown at time t-1 and time t, with values $z_{t,(2,2)}$, $z_{t,(1,2)}$, $z_{t,(0,0)}$, $z_{t,(2,1)}$ and a random noise input.]
General STARMA Models
• The general STARMA model has the stochastic equation:
$$z_{t,x} = \sum_k \sum_b \phi_{k,b}\, z_{t-k,\,x-b} - \sum_k \sum_b \theta_{k,b}\, e_{t-k,\,x-b} + e_{t,x}$$
The first double sum contains the AR terms and the second the MA terms. The strengths of the autoregressive components are measured by $\phi_{k,b}$, and the strengths of the shared moving average stochastic inputs by $\theta_{k,b}$.
• Model types:
– STAR model (when $\theta_{k,b} = 0$)
– STMA model (when $\phi_{k,b} = 0$)
– Mixed model (when $\phi_{k,b} \neq 0$ and $\theta_{k,b} \neq 0$).
A Useful Form for STARMA Modeling
• By introducing the spatial weight matrices $W^{(l)}$, we can express the general STARMA model in the following form:
$$z_t = \sum_k \sum_l \phi_{kl} W^{(l)} z_{t-k} - \sum_k \sum_l \theta_{kl} W^{(l)} e_{t-k} + e_t$$
where
$l$: spatial lag; $k$: temporal lag;
$z_t$ is the observation vector at time $t$;
$W^{(l)}$ is the weight matrix for the $l$-th order;
$\phi_{kl}$ are the parameters of the autoregressive terms;
$\theta_{kl}$ are the parameters of the moving average terms;
$e_t$ is the random noise vector at time $t$.
• This is the equation actually used for the implementation of IEAST and in the applications.
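The matrix form above translates fairly directly into a simulation loop. The following is a minimal Python sketch, not IEAST code: the function name `simulate_starma`, the ring-shaped set of 25 sites, the burn-in length and the standard normal noise are all assumptions made for illustration. The coefficients are those of the STAR process used for "Simulation Data 1" later in the talk.

```python
import numpy as np

def simulate_starma(phi, theta, weights, n_steps, rng, burn_in=100):
    """Simulate z_t = sum_kl phi[k,l] W(l) z_{t-k} - sum_kl theta[k,l] W(l) e_{t-k} + e_t.

    phi, theta: arrays whose row k-1 holds temporal lag k and whose column l
    holds spatial order l. weights: list of N x N matrices, weights[0] = identity.
    """
    phi, theta = np.atleast_2d(phi), np.atleast_2d(theta)
    n_sites = weights[0].shape[0]
    max_lag = max(phi.shape[0], theta.shape[0])
    total = n_steps + burn_in + max_lag
    e = rng.standard_normal((total, n_sites))   # assumed Gaussian noise
    z = np.zeros((total, n_sites))
    for t in range(max_lag, total):
        z[t] = e[t]
        for k in range(phi.shape[0]):            # AR terms
            for l, w in enumerate(weights[:phi.shape[1]]):
                z[t] += phi[k, l] * (w @ z[t - k - 1])
        for k in range(theta.shape[0]):          # MA terms (minus sign convention)
            for l, w in enumerate(weights[:theta.shape[1]]):
                z[t] -= theta[k, l] * (w @ e[t - k - 1])
    return z[-n_steps:]

# Hypothetical layout: 25 sites on a ring, each with two first-order neighbours.
n = 25
identity = np.eye(n)
w1 = 0.5 * (np.roll(identity, 1, axis=1) + np.roll(identity, -1, axis=1))

rng = np.random.default_rng(0)
z = simulate_starma(phi=[[0.50, 0.30], [0.10, 0.05]],
                    theta=np.zeros((1, 2)),
                    weights=[identity, w1], n_steps=200, rng=rng)
print(z.shape)
```

Setting `theta` to zero gives a pure STAR process; a non-zero `theta` with zero `phi` would give an STMA process.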
Spatial Correlation Structure and Weight
Matrices
• Spatial weight matrices are used to construct the spatial correlation structure among locations.
• The following ordering is an example of the definition of spatial correlation structure (up to 4th-order neighbors) in a 2D system.
[Figure panels: 1st order, W(1); 2nd order, W(2); 3rd order, W(3); 4th order, W(4)]
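One way to build such weight matrices on a regular lattice is sketched below in Python. It assumes that orders 1 to 4 correspond to squared Euclidean distances 1, 2, 4 and 5 between cell centers, that all neighbors of the same order get equal weight, and that each row is scaled to sum to one. The function name `lattice_weight_matrices` and the handling of edge cells are illustrative choices, not IEAST's implementation.

```python
import numpy as np

def lattice_weight_matrices(n_rows, n_cols, squared_distances=(1, 2, 4, 5)):
    """Return [W(0), W(1), ...] for an n_rows x n_cols lattice.

    Order l links each cell to the cells at squared Euclidean distance
    squared_distances[l-1]; every non-empty row is scaled to sum to 1.
    """
    rows, cols = np.divmod(np.arange(n_rows * n_cols), n_cols)
    d2 = (rows[:, None] - rows[None, :]) ** 2 + (cols[:, None] - cols[None, :]) ** 2
    weights = [np.eye(n_rows * n_cols)]          # order 0: the cell itself
    for target in squared_distances:
        w = (d2 == target).astype(float)
        row_sums = w.sum(axis=1, keepdims=True)
        weights.append(np.divide(w, row_sums, out=np.zeros_like(w), where=row_sums > 0))
    return weights

weights = lattice_weight_matrices(10, 10)
centre = 4 * 10 + 4                              # cell in row 4, column 4
print([int(np.count_nonzero(w[centre])) for w in weights])
```

Cells on the boundary have fewer neighbors of each order, so their non-zero weights are larger; other boundary treatments are possible.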
Some Limitations of STARMA Modeling
• Raster based
• Requires a massive amount of space-time data
• Models generally may not be fully mechanistic
Assumptions:
• Stationarity
• “Spatial Regularity”
• Effects are “constant”
• Effects are “linearly” correlated
Methods for STARMA Modeling
and Software IEAST
Box-Jenkins Modeling Method
[Flowchart: Data → Model Identification → Parameter Estimation → Diagnostic Check → Good? If No: Modify Model and repeat; if Yes: End.]
Model Identification
• To determine the model type and orders.
• Conventionally, space-time autocorrelations (i.e. STACF/STPACF) are used (Pfeifer and Deutsch, 1980).
• In this research, space-time extensions of model fitness measures (i.e. AIC, BIC) are used to assist identification when the method above does not work. These measures are more objective and computationally efficient.
Model Identification—
using Space-Time Autocorrelation Functions
Example 1: STAR (MaxT=2, MaxS=1)
– STACF tails off
– STPACF cuts off at T-lag=2 & S-lag=1
[Figure: STACF and STPACF plots for Example 1]
Example 2: STMA (MaxT=1, MaxS=1)
– STACF cuts off at T-lag=1 & S-lag=1
– STPACF tails off
[Figure: STACF and STPACF plots for Example 2]

STACF      STPACF     Suggested Model Type
Tail-off   Cut-off    STAR model
Cut-off    Tail-off   STMA model
Tail-off   Tail-off   Mixed model
Model Identification—
using Space-Time Autocorrelation Functions
Simulation Data 1, based on a STAR process:
$$z_t = 0.50\, z_{t-1} + 0.30\, W^{(1)} z_{t-1} + 0.10\, z_{t-2} + 0.05\, W^{(1)} z_{t-2} + e_t$$
STACF: tail-off. STPACF: cut-off.
Simulation Data 2, based on a STMA process:
$$z_t = e_t - (-0.6)\, e_{t-1} - (-0.4)\, W^{(1)} e_{t-1}$$
STACF: cut-off. STPACF: tail-off.
[Figures: STACF and STPACF plots for each simulated dataset]
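To see a tail-off pattern numerically, the sample STACF can be computed along the lines of the Pfeifer and Deutsch definition: the lagged covariance between $W^{(l)} z_t$ and $z_{t+s}$, normalized by the variances of the two series. The Python sketch below is an illustration under stated assumptions, not IEAST's routine: `sample_stacf` is a hypothetical name, the data are simulated on a ring of 30 sites, and the process is a simple STAR model with one temporal and one spatial lag.

```python
import numpy as np

def sample_stacf(z, weights, max_time_lag):
    """Sample space-time autocorrelation; result[l, s] is spatial order l, time lag s."""
    z = z - z.mean()
    n_times = z.shape[0]
    var0 = np.sum(z * z)
    out = np.zeros((len(weights), max_time_lag + 1))
    for l, w in enumerate(weights):
        lagged = z @ w.T                          # row t holds W(l) z_t
        var_l = np.sum(lagged * lagged)
        for s in range(max_time_lag + 1):
            cov = np.sum(lagged[: n_times - s] * z[s:])
            out[l, s] = cov / np.sqrt(var_l * var0)
    return out

# Simulated STAR data (assumed coefficients 0.5 and 0.3, ring of 30 sites).
rng = np.random.default_rng(1)
n = 30
eye = np.eye(n)
w1 = 0.5 * (np.roll(eye, 1, axis=1) + np.roll(eye, -1, axis=1))
z = np.zeros((400, n))
for t in range(1, 400):
    z[t] = 0.5 * z[t - 1] + 0.3 * (w1 @ z[t - 1]) + rng.standard_normal(n)

stacf = sample_stacf(z, [eye, w1], max_time_lag=5)
print(np.round(stacf, 2))
```

For a STAR process the printed rows decay gradually with the time lag rather than dropping to zero after a fixed lag, which is the tail-off behavior in the table above.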
Model Identification—
using Model Fitness Measures
Accuracies of model type selection using (1) variance of residuals, (2) AIC, (3) BIC, and (4) -AIC*BIC, based on 150 Monte Carlo simulated datasets. Rows give the process the datasets are based on; columns give the model type identified. The diagonal entries (in red on the slide) are the accuracies.

Using Variance of Residuals
Datasets based on   STAR   STMA   Mixed
STAR                  4%     0%    96%
STMA                  4%     6%    90%
Mixed                 8%     6%    90%

Using AIC
Datasets based on   STAR   STMA   Mixed
STAR                 16%     0%    84%
STMA                  4%     6%    90%
Mixed                 8%     2%    90%

Using BIC
Datasets based on   STAR   STMA   Mixed
STAR                100%     0%     0%
STMA                  4%    86%    10%
Mixed                18%    16%    66%

Using -AIC*BIC
Datasets based on   STAR   STMA   Mixed
STAR                100%     0%     0%
STMA                  4%    78%    18%
Mixed                16%     4%    80%
Parameter Estimation
• To calculate the coefficients of a candidate model for a given model type and orders.
• Two methods are needed for two kinds of models:
– Linear models (i.e. STAR): linear ML estimator.
– Non-linear models (i.e. STMA and Mixed): multivariate nonlinear optimization.
• The multivariate and non-linear nature raises problems during optimization:
– Convergence to local optima
– Very time-consuming
• A good starting point is crucial for optimization
– Extra step: ‘Pre-estimation’
– A space-time extended Hannan-Rissanen algorithm is used.
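For the linear (STAR) case, the estimator can be illustrated with ordinary least squares, which coincides with conditional maximum likelihood when the noise is assumed Gaussian with equal variance at every site. The Python sketch below is not IEAST's estimator: `fit_star` is a hypothetical helper, and the data are simulated on a ring of 30 sites with the coefficients of "Simulation Data 1".

```python
import numpy as np

def fit_star(z, weights, max_time_lag):
    """Least-squares estimate of phi[k-1, l] in a STAR model; returns (phi, residuals)."""
    n_times, n_sites = z.shape
    columns = []
    for k in range(1, max_time_lag + 1):
        past = z[max_time_lag - k : n_times - k]      # z_{t-k} for each usable t
        for w in weights:
            columns.append((past @ w.T).ravel())      # W(l) z_{t-k}, stacked over t
    design = np.column_stack(columns)
    target = z[max_time_lag:].ravel()
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    residuals = target - design @ coef
    return coef.reshape(max_time_lag, len(weights)), residuals.reshape(-1, n_sites)

# Simulated data with known coefficients (illustrative only).
rng = np.random.default_rng(0)
n_sites, n_times = 30, 600
eye = np.eye(n_sites)
w1 = 0.5 * (np.roll(eye, 1, axis=1) + np.roll(eye, -1, axis=1))
true_phi = np.array([[0.50, 0.30], [0.10, 0.05]])
z = np.zeros((n_times, n_sites))
for t in range(2, n_times):
    z[t] = (true_phi[0, 0] * z[t - 1] + true_phi[0, 1] * (w1 @ z[t - 1])
            + true_phi[1, 0] * z[t - 2] + true_phi[1, 1] * (w1 @ z[t - 2])
            + rng.standard_normal(n_sites))

phi_hat, residuals = fit_star(z, [eye, w1], max_time_lag=2)
print(np.round(phi_hat, 2))
```

Models with MA terms have no such closed form, because the noise terms are unobserved; that is where the pre-estimation step and nonlinear optimization come in.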
Diagnostic Check
• To decide the adequacy of a candidate model for
representing the given data.
• Methods:
– Variance of residuals
– Space-time autocorrelations of residuals
– Significance testing of parameters
– Space-time extension of AIC/BIC
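The talk does not spell out the space-time extension of AIC and BIC. As a rough illustration only, the Python sketch below uses the usual Gaussian forms and treats every cell at every time step as one observation; both the formulas and the name `fitness_measures` are assumptions and may differ from what IEAST computes.

```python
import numpy as np

def fitness_measures(residuals, n_params):
    """Residual variance, AIC and BIC, counting each space-time cell as one observation."""
    n_obs = residuals.size
    variance = float(np.mean(residuals ** 2))
    aic = n_obs * np.log(variance) + 2 * n_params
    bic = n_obs * np.log(variance) + n_params * np.log(n_obs)
    return variance, aic, bic

# Synthetic residuals on a 28-week, 100-cell layout (illustrative only).
rng = np.random.default_rng(0)
residuals = rng.standard_normal((28, 100))
small = fitness_measures(residuals, n_params=3)
large = fitness_measures(residuals, n_params=15)
print(small, large)
```

With identical residuals, the model with more parameters gets the worse (larger) AIC and BIC, and BIC penalizes it more heavily. This is one plausible reading of why variance of residuals alone tends to favor the largest model in the Monte Carlo comparison.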
Modeling Procedures
[Flowchart (Box-Jenkins method): Data → Model Identification → Parameter Estimation → Diagnostic Check → Good? If No: Modify Model and repeat; if Yes: End.]
Software for STARMA Modeling -- IEAST
• Developed using GNU Octave v2.1.40 and able to be used
under various popular OS, e.g. MS Windows, Mac OS, Unix.
• Two interfaces: menu-driven mode and programming mode.
• Features:
– True spatio-temporal analysis software
– Analyzing 2D lattice space-time datasets
– Full configurability
– Programming environment
– Improved estimation algorithms
– Improved diagnostic measures
– Estimation of spatial correlation structure
– Cross correlation analysis
– 2D/3D plotting abilities
IEAST —
Menu-Driven Mode vs Programming Mode
In menu-driven mode, users can conduct the modeling procedure by selecting a series of commands/options from the menu hierarchy.

[IEAST v1.30.01 - STARMA Modeling & Analysis]
=============== [ Main Menu ] ===============
[ 1] Setup
[ 2] Data Preprocessing
[ 3] Correlation Analyses
[ 4] Model Identification
[ 5] Parameter Estimation
[ 6] Diagnostic Analysis
[ 7] -----
[ 8] Preference
[ 9] Interpreter
[10] Exit
=============================================

============== [ Setup ] ==============
[ 1] > Space-time dataset
[ 2] > Spatial correlation structure
[ 3] > Information of datasets
[ 4] > Return
=======================================

========= [ Data Preprocessing ] =========
[ 1] > Remove Mean
[ 2] > De-seasonalize: (1-B^dd)Z(t)
[ 3] > Difference by one: (1-B)Z(t)
[ 4] > De-trend
[ 5] > -----
[ 6] > Subsequencing/Resampling
[ 7] > Smoothing
[ 8] > Missing Data
[ 9] > Filter with a given STARMA model
[10] > Undo previous action
[11] > Return
==========================================

========== [ Correlation Analyses ] ==========
[ 1] > AutoCorrelation (STACF)
[ 2] > Partial AutoCorrelation (STPACF)
[ 3] > Cross Correlation (STXCF)
[ 4] > Partial Cross Correlation (STPXCF)
[ 5] > Extended Cross Correlation (ExtSTXCF)
[ 6] > Plot Correlations versus T-Lag/S-Lag
[ 7] > Return
==============================================

[ Model Identification ]
[ 1] Automatic Identification (Type,Orders)
[ 2] Artificial Identification (Type,Orders)
[ 3] Parameter Masking
[ 4] -----
[ 5] Return
==============================================

=================== [ Parameter Estimation ] ===================
[ 1] > Pre-estimate Model Param -- Linear (STAR)
[ 2] > Pre-estimate Model Param -- Non-linear (STMA,STARMA)
[ 3] > Pre-estimate Model Param -- From STACF/STPACF
[ 4] > Pre-estimate Model Param -- Specified by users
[ 5] > Estimate Model Param -- Fixed SRM
[ 6] > Estimate SRM -- Fixed Model Param
[ 7] > Estimate SRM & Model Param -- Alternatively
[ 8] > Return
================================================================

==== [ Diagnostic Analysis ] ====
[ 1] > Statistical Significance
[ 2] > AICC/BIC Analysis
[ 3] > STACF of Residuals
[ 4] > STPACF of Residuals
[ 5] > -----
[ 6] > Return
=================================

# list
10 load data demo.dat
20 load weight uniform.wet
30 stacf ST_ACF Z 16 3
40 plotacf ST_ACF 16 3 "ACF"
:
:
:
IEAST —
Menu-Driven Mode vs Programming Mode
In programming mode, a set of sophisticated instructions can be
used to compose programs to control the modeling flow and to
conduct statistical analyses.
Space-time Dataset: ‘demo.dat’
# name: DatafileZ
# type: matrix
# rows: 100
# columns: 100
-0.0350001 0.00197952 -0.00635348....
-0.0886448 0.0504684 -0.00369402....
0.025101 0.00844576 -0.00743455....
…………………..
Spatial Weighting Matrices: ’uniform.wet’
# name: SOD
# type: global matrix
# rows: 21
# columns: 21
0 0 0 0 0 0 0 0….
0 0 0 0 0 0 0 0….
……………….
IEAST Program ‘demo.pgm’
10 load data demo.dat
20 load weight uniform.wet
30 stacf STACF Z 16 3
…….
[IEAST v1.30.01 - STARMA Modeling & Analysis]
=============== [ Main Menu ] =============
[ 1] Setup
:
:
[ 8] Preference
[ 9] Interpreter
[10] Exit
=============================================
=============================================
|| Welcome to STARMA analyzing interpreter ||
=============================================
# load program demo.pgm
# list
10 load data demo.dat
20 load weight uniform.wet
30 stacf STACF Z 16 3
………
100 end
# run
Modeling Emerging Infectious
Diseases using STARMA and IEAST
State of the Art for Statistical Analyses of
Emerging Infectious Diseases
• As far as we know, no true spatial-temporal statistical models and methods have been used.
• Space-time cluster analysis is available (Theophilides et al., 2003; Mostashari et al., 2003; Hoebe et al., 2004).
• Spatial models are available (Watson et al., 2004).
• Temporal models are available.
Limitations of Simply Observing How a
Spatial Distribution Changes over Time
• For example, expansion of the leading edge of a disease
range.
• Is the disease spreading directly over long distances but
infrequently, or over short distances frequently?
• This is important for projecting the future spread.
STARMA Has Potential for the Early
Characterization of Infectious Diseases.
• STARMA acts as a “prism”: it can filter the spatial-temporal correlations into direct effects with known magnitude and known spatial and temporal lags.
• Not generally a complete, mechanistic model, but it puts critical constraints on models.
West Nile Virus
The West Nile Virus (WNV) was first detected in a
woman with a mild fever in the West Nile District of
Uganda in 1937. Since then WNV has been spreading
to North Africa, Europe, West and Central Asia, and the
Middle East.
West Nile Virus in the United States
• Outbreak in NYC in Sep 1999. Vector is Culex mosquitoes.
• Wild birds (89% are American crows) are the principal
hosts. Humans, horses, etc. are incidental hosts.
• The incidence rate among crows is high. Infected crows almost always die (68%).
• Surveillance of dead crows has been used as an indicator of WNV epidemics.
(A figure from the CDC web site)
Dead Crow Data (DCD) & Human Case
Datasets (HCD) in 2002
Time: Summer in 2002 (April~October)
Place: Detroit metro area (Oakland, Macomb, and Wayne)
• DCD were collected systematically before and during an
outbreak among humans. Data mainly consisted of locations
and dates of reported public sightings.
• HCD were obtained from clinicians in Michigan. Data on
address of residence and date of onset of disease were
obtained from the case-patient or attending physician
through telephone interviews.
Two Datasets Collected in 2002
[Diagram of data flow. Sources: human cases (interview); dead crows* (toll-free number, WWW pages). Processing: data cleaning & geocoding, GIS (ArcMap), longitude/latitude.]
* From www.rci.rutgers.edu/~insects/crowid.htm
Space-Time Analysis for
Dead Crow Data
The Dead Crow Data
• In total, 1817 dead crow sightings scattered within the three counties (red lines), spanning 28 weeks.
• Covered area (after truncation): a rectangular area of 31.6x25.8 mi
• Divide the covered area into 10x10 cells. Cell size: 3.16x2.58 mi
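Turning point sightings into a lattice dataset amounts to counting events per cell and per week. The Python sketch below shows one way to do this with `numpy.histogramdd`. The coordinates and weeks are randomly generated stand-ins, not the Detroit data, and `grid_events` is a hypothetical helper; only the extent (31.6x25.8 mi), the 10x10 grid, the 28 weeks and the total of 1817 events are taken from the slide.

```python
import numpy as np

def grid_events(x, y, week, extent, n_cells=(10, 10), n_weeks=28):
    """Count point events per (week, row, column) cell."""
    x_min, x_max, y_min, y_max = extent
    counts, _ = np.histogramdd(
        np.column_stack([week, y, x]),
        bins=(n_weeks, n_cells[0], n_cells[1]),
        range=((0, n_weeks), (y_min, y_max), (x_min, x_max)),
    )
    return counts

# Synthetic stand-in for the sightings: uniform in space and time (assumption).
rng = np.random.default_rng(0)
n_events = 1817
x = rng.uniform(0, 31.6, n_events)       # miles east of the south-west corner
y = rng.uniform(0, 25.8, n_events)       # miles north of the south-west corner
week = rng.integers(0, 28, n_events)     # week index 0..27

counts = grid_events(x, y, week, extent=(0, 31.6, 0, 25.8))
print(counts.shape, int(counts.sum()))
```

Each time slice `counts[t]` can then be flattened to a vector of 100 cells to serve as the observation vector $z_t$.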
Spatial Correlation Structure and Trends
• Spatial correlation structure (uniform weighting): neighbor order of each cell relative to the center cell (0)
* 6 5 5 5 6 *
6 5 4 3 4 5 6
5 4 2 1 2 4 5
5 3 1 0 1 3 5
5 4 2 1 2 4 5
6 5 4 3 4 5 6
* 6 5 5 5 6 *
• Preprocessing
– Remove spatio-temporal trend
  • Spatial trend: 4th-order polynomial regression trend surface
  • Temporal trend: averaging over space
– Remove mean
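The preprocessing steps can be sketched in Python as follows. This is an illustration under assumptions, not the procedure used in the study: the trend surface is fitted by least squares to the time-averaged map, cell coordinates are rescaled to [-1, 1], and the input is synthetic Poisson counts. The name `detrend` is hypothetical.

```python
import numpy as np

def detrend(counts, degree=4):
    """Remove a polynomial trend surface, then the per-week spatial mean, then the overall mean."""
    n_weeks, n_rows, n_cols = counts.shape
    rows, cols = np.meshgrid(np.linspace(-1, 1, n_rows),
                             np.linspace(-1, 1, n_cols), indexing="ij")
    # All monomials rows^i * cols^j with i + j <= degree (15 terms when degree = 4).
    terms = [rows.ravel() ** i * cols.ravel() ** j
             for i in range(degree + 1) for j in range(degree + 1 - i)]
    design = np.column_stack(terms)
    time_mean_map = counts.mean(axis=0).ravel()
    coef, *_ = np.linalg.lstsq(design, time_mean_map, rcond=None)
    surface = (design @ coef).reshape(n_rows, n_cols)
    z = counts - surface                               # spatial trend
    z = z - z.mean(axis=(1, 2), keepdims=True)         # temporal trend: average over space
    return z - z.mean()                                # remove mean

rng = np.random.default_rng(0)
counts = rng.poisson(2.0, size=(28, 10, 10)).astype(float)   # synthetic counts
z = detrend(counts)
print(z.shape, round(float(z.mean()), 6))
```

After the per-week mean is removed, the overall mean is already zero, so the last step only guards against rounding error.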
Model Identification — STACF
[Figure: STACF of the dead crow data; the STACF tails off.]
Model Identification — STPACF
[Figure: STPACF of the dead crow data, annotated with the spatial lag and the temporal lag after which it cuts off.]
The STACF/STPACF suggest the model STAR(maxT=3, maxS=4).
Parameter Estimation
The parameters ($\phi_{ts}$) of this STAR model can be estimated in IEAST by the linear maximum likelihood estimator.

φ_ts    s=0     s=1     s=2     s=3     s=4
t=1     0.26    0.36    0.10   -0.09    0.04
t=2     0.04   -0.18   -0.07   -0.04   -0.11
t=3    -0.02   -0.11    0.02   -0.02   -0.04

• Values in dark blue are nominally significant at the 0.001 level.
• Values in light blue are nominally significant at the 0.01 level.
Diagnostic Check
• Statistical significance of parameters
– The probabilities P that the $\phi_{ts}$ are not significant are:

P      s=0     s=1     s=2    s=3    s=4
t=1   0.001   0.001   0.01   0.1    0.4
t=2   0.04    0.001   0.04   0.3    0.01
t=3   0.4     0.01    0.25   0.9    0.6

• Autocorrelations of the residuals
[Figure: STACF and STPACF of the residuals]
Interpretations for the DCD Analysis
• The STAR(3,4) model is the best-fitting one.
• The largest spatial and temporal lags that are important are smaller still: S=2 (or 6.4 km) and T=2 weeks.
• Compare S=1 to S=2: the value for S=1 is much larger, which reflects cell boundary length effects.
• The virus is not spreading very far very fast. Crows are not spreading the virus much spatially, though they are probably amplifying it locally.
• Negative autoregressive effect at S=1 and T=2,3.
– Appears to be a real effect.
– May be due to crow population depletion.
– Suggests there is a mixture of two STAR processes, the dominant one reflecting probability of infection, the other an echo effect from depletion.
Additional Analyses and Results
Additional Analyses:
• Using 20x20 and other cell configurations
• Using different lag structures: “Pfeifer’s” vs. “Ring structure”
• Using various polynomials for Spatial de-trending
• Using sub-sample of the data
Results:
• Consistent over various methods of spatial de-trending, except
high order polynomials resulted in smaller AR.
• Consistent AR values using different lag structures and cell sizes.
• Consistent implied spatial and temporal scales over which there
are significant or substantial AR effects
Distances for Which There Is Significant
Spatial Correlation
• Based on different cell configurations: 10x10, 16x16, and 20x20
– The effective correlated area in the modeling result is consistently about 10.75 km regardless of cell sizes.

Configuration   Cell size        Max S order of the estimated model   Equivalent distance
10 x 10         5.08 x 4.15 km   4                                    10.99 km
16 x 16         3.19 x 2.59 km   6                                    10.88 km
20 x 20         2.54 x 2.08 km   7                                    10.38 km
Alternative Spatial Correlation Structures
*
6

5

5
5

6
*

6 5 5 5 6 *
5 4 3 4 5 6
4 2 1 2 4 5

3 1 0 1 3 5
4 2 1 2 4 5

5 4 3 4 5 6
6 5 5 5 6 *
Pfeifer’s
3
3

3

3
3

3
3

3 3 3 3 3 3
2 2 2 2 2 3
2 1 1 1 2 3

2 1 0 1 2 3
2 1 1 1 2 3

2 2 2 2 2 3
3 3 3 3 3 3
Ring structure
43
2005/4/12
Space-Time Analysis for
Human Case Data
Human Case Data
• Over 500 human cases spanning 13 weeks
• Date of onset: converted to week
• Home addresses (names stripped): converted to “cell,” same as for DCD.
• Used the same arrays of cell sizes and spatial correlation structures as for DCD.
• Same spatial and temporal de-trending method
Model Identification — STACF
Model Identification — STPACF
Parameter Estimation
Parameter estimates by temporal lag in weeks (rows) and spatial lag (columns):

        s=0    s=1    s=2    s=3    s=4    s=5    s=6
t=1    0.26   0.06  -0.10  -0.29  -0.05  -0.30  -0.60
t=2    0.12   0.27   0.13  -0.12  -0.11  -0.22  -0.11
t=3    0.07   0.10  -0.15   0.05   0.00   0.06  -0.01
t=4    0.04  -0.17  -0.07  -0.02   0.16   0.25   0.11
t=5   -0.01  -0.10  -0.04   0.10  -0.06   0.11   0.06
t=6   -0.04   0.03  -0.03  -0.19  -0.09   0.08   0.09

• Values in dark blue are nominally significant at the 0.001 level.
• Values in light blue are nominally significant at the 0.01 level.
Diagnostic Check
• STACF and STPACF of the residuals
[Figure: STACF and STPACF of the residuals]
Interpretations for the HCD Analysis
• Most people are getting infected at or near their homes.
• The incidences are highly autocorrelated in space and time.
• The distribution or probability of infection is highly “localized”.
• The WNV “load” and probability of human infection are “spreading” slowly, in the sense of not spreading very far very fast.
• Suggests localized spraying could reduce cases.
• With no depletion effect, the human case data show positive parameters significantly above zero at T-lag=2 and S-lag>=1, especially at S-lag=1.

        s=0    s=1    s=2
t=1    0.26   0.06  -0.10
t=2    0.12   0.27   0.13
t=3    0.07   0.10  -0.15
Space-Time Cross Analysis for
HCD and DCD
Space-Time Data HCD and DCD
• The areas for cross analysis are the same for both datasets.
• The configuration is again 10x10, spanning 28 weeks.
• Cell size is 6.31x6.31 km.
Both Temporal Epidemic Curves
• Dead crow reports lead human cases in time.
Space-Time Cross Correlations
[Figure: space-time cross correlations, with an annotation at temporal lag -3.]
Interpretations for Space-Time Cross
Correlations
• Drop smoothly to zero spatially and temporally.
• Very large (as high as 0.7).
• Across all spatial lags, the maximum cross correlations are aligned at –3 weeks.
• The cross correlations at spatial lag 1 are slightly greater than at spatial lag 0.
• When the temporal lag decreases to –8 or below, the correlations between these two datasets are negligible (<0.1).
• When the spatial lag increases up to 10, the cross correlations are reduced to as low as 0.2.
Are the Cross Correlations Spurious?
The autocorrelation of the DCD can spuriously contribute to cross correlations. To eliminate this effect, both datasets were pre-whitened before calculating cross correlations.
[Figure: cross correlation with pre-whitening]
• The result shows that the ‘real’ cross correlations are much larger than the ‘spurious’ components.
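The idea of pre-whitening can be illustrated on purely temporal series. In the Python sketch below, each series is replaced by the residuals of its own autoregressive fit before the cross correlation is computed. Everything here is an assumption made for illustration: the two series are synthetic (the "human" series is built to follow the "crow" series by three steps), the AR order is 2, and the helpers `ar_residuals` and `cross_correlation` are hypothetical. The study itself works with space-time cross correlations, presumably pre-whitened with the fitted space-time models.

```python
import numpy as np

def ar_residuals(series, order):
    """Residuals of a least-squares AR(order) fit to a mean-removed series."""
    x = series - series.mean()
    design = np.column_stack([x[order - k : len(x) - k] for k in range(1, order + 1)])
    target = x[order:]
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return target - design @ coef

def cross_correlation(a, b, lag):
    """Correlation between a_t and b_{t+lag} (lag may be negative)."""
    if lag < 0:
        a, b = a[-lag:], b[:lag]
    elif lag > 0:
        a, b = a[:-lag], b[lag:]
    return float(np.corrcoef(a, b)[0, 1])

# Synthetic series: "humans" follow "crows" with a delay of 3 steps.
rng = np.random.default_rng(0)
n = 2000
crows = np.zeros(n)
for t in range(1, n):
    crows[t] = 0.7 * crows[t - 1] + rng.standard_normal()
humans = 0.8 * np.roll(crows, 3) + 0.5 * rng.standard_normal(n)
humans, crows = humans[3:], crows[3:]        # drop the wrapped-around start

crow_resid = ar_residuals(crows, order=2)
human_resid = ar_residuals(humans, order=2)
ccf = {lag: cross_correlation(human_resid, crow_resid, lag) for lag in range(-8, 9)}
best_lag = max(ccf, key=ccf.get)
print(best_lag, round(ccf[best_lag], 2))
```

With this sign convention, a peak at a negative lag means the second series leads the first, which matches the way the talk reports dead crows leading human cases at –3 weeks.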
Summary for Modeling the Spread of WNV
• Crows are not spreading the disease spatially very far very fast.
• Spread is very localized; perhaps other animals or the mosquitoes themselves are spreading it spatially.
• Humans are being infected largely at or near their homes.
• Both crows and humans appear to be responding to local viral loads.
• Dead crow findings precede human cases by two to three weeks. Dead crows can be a good indicator of human epidemics.
Conclusion
• It appears that STARMA modeling could be an important tool for the early characterization of many emerging and re-emerging infectious disease epidemics.
• During the course of an epidemic, it could be used (in principle) for forecasting, under existing conditions or under potential courses of action.
• While not generally a mechanistic model, STARMA does inform spatial and temporal scales of spread, and hence places constraints on mechanistic models (which otherwise may have too many parameters).
Funding Acknowledgements
• Michigan Agricultural Experiment Station,
Michigan State University.
• Center for Emerging Infectious Diseases,
Michigan State University.
• Centers for Disease Control and Prevention,
USA.
Thanks for your attention!
& Questions?
References
• C.J.P.A. Hoebe, H. de Melker, L. Spanjaard, J. Dankert, and N. Nagelkerke. Space-time cluster analysis of invasive meningococcal disease. Emerging Infectious Diseases, 10(9):1621-1626, 2004.
• C.N. Theophilides, S.C. Ahearn, S. Grady, and M. Merlino. Identifying West Nile virus risk areas: The dynamic continuous-area space-time system. American Journal of Epidemiology, 157:843-854, 2003.
• J. Watson, R. Jones, K. Gibbs, and W. Paul. Dead crow reports and location of human West Nile virus cases, Chicago, 2002. Emerging Infectious Diseases, 10(5):938-940, 2004.
• F. Mostashari, M. Kulldorff, J.J. Hartman, J.R. Miller, and V. Kulasekera. Dead bird clustering: A potential early warning system for West Nile virus activity. Emerging Infectious Diseases, 9:641-646, 2003.