Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
ABSTRACT Title of dissertation: Semiparametric Cluster Detection Shihua Wen Doctor of Philosophy, 2007 Dissertation directed by: Professor Benjamin Kedem Mathematical Statistics Program Department of Mathematics In this dissertation, a Semiparametric density ratio testing method which borrows strength from two or more samples is applied to moving windows of variable size in cluster detection. This Semiparametric cluster detection method requires neither the prior knowledge of the underlying distribution nor the number of cases before scanning. To take into account the multiple testing problem induced by numerous overlapping windows, Storey’s q-value method, a false discovery rate (FDR) methodology, is used in conjunction with the Semiparametric testing procedure. Monte Carlo power studies show that for binary data, the Semiparametric cluster detection method and its competitor, Kulldorff’s scan statistics method, both achieve similar high power in detecting unknown hot-spot clusters. When the data are not binary, the Semiparametric methodology is still applicable, but Kulldorff’s method may not be as it requires the choice of a correct probability model, namely the correct scan statistic, in order to achieve power comparable to that achieved by the Semiparametric method. Kulldorff’s method with an inappropriate probability model may lose power. Moreover, when the data are binary, the Semiparametric density ratio model reduces to the same scan statistic as Kulldorff’s Bernoulli model. If a cluster candidate is known, under certain conditions the Semiparametric method achieves a higher power than the power achieved by a certain focused test in testing the hypothesis of no cluster. The Semiparametric method potential in cluster detection is illustrated using a North Humberside childhood leukemia data set and a Maryland-DC-Virginia crime data set. Semiparametric Cluster Detection by Shihua Wen Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy 2007 Advisory Committee: Professor Benjamin Kedem, Chair/Advisor Professor Larry Davis Professor Laura Dugan Professor Galit Shmueli Professor Paul Smith c Copyright by ° Shihua Wen 2007 Dedication For my parents. ii Acknowledgments With sincere gratitude, I want to thank all the people who helped and encouraged me through my graduate study. I also want to thank God or Buddha, they always come at the last minute but still in time to give me inspiration for solving the difficulties I met. Special thanks to my adviser, Professor Benjamin Kedem. This research could not be done without his correct guidance, sincere encouragement, valuable hints, and tremendous help in my writing. I appreciate him for giving me such an interesting topic and for introducing me to a wonderful Semiparametric idea. I consider myself lucky working with him. I enjoy all the talking with him, and his humor impresses me a lot. I also want to thank all my committee members. Professor Paul Smith is always nice to everybody. I learned Linear Models from him, and thank him for giving me the opportunity to work in the STAT lab. It was a great experience. Professor Galit Shmueli gave me great help and encouragement in this research. I thank her for introducing me to other current methods in scan statistics and cluster detection. I also appreciate her sharing research papers and other research information with me. Professor Laura Dugan is an expert in criminology and gave me a lot of valuable information on crime research as well as useful data sets. I am grateful for her generous help and her willingness to serve on my committee. Professor Larry Davis is a well known scholar in computer science. It is my great honor to have Professor Davis as the Dean’s Representative in my dissertation committee. iii I would like to extend my thanks to Dr. Tiwari for serving on my candidacy committee, his comments on my research, and for arranging a wonderful opportunity to talk at National Institute of Cancer (NCI/NIH). I also wish to express my gratitude to Professor Martin Kulldorff at Harvard University and to Professor Reza Modarres at George Washington University for their important comments and suggestions. Finally, I want to thank my parents for their endless support and trust, my friends for their great help and encouragement. Thanks a lot for everybody again. This study would never have been accomplished without all of these people. I really appreciate it. iv Table of Contents List of Tables vii List of Figures viii 1 Introduction 1.1 Scan Statistics and Cluster Detection . . . . . . . . . . . . . . . . . . 1.2 Other Scan Statistics Methods . . . . . . . . . . . . . . . . . . . . . . 1.3 Dissertation Map . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 Kulldorff’s Scan Statistics Method 2.1 Overview . . . . . . . . . . . . 2.2 Bernoulli Model . . . . . . . . 2.3 Poisson Model . . . . . . . . . 2.4 Ordinal Model . . . . . . . . . 2.5 Other Models . . . . . . . . . 1 1 4 9 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 10 12 14 16 18 3 Semiparametric Scan Statistics Method 3.1 Overview . . . . . . . . . . . . . . . . . . . 3.2 Semiparametric Density Ratio Model . . . 3.2.1 The Model . . . . . . . . . . . . . . 3.2.2 Choice of the Tilt Function . . . . 3.2.3 Parameter Estimation of the Model 3.2.4 Hypothesis and Test Statistics . . . 3.3 Semiparametric Cluster Detection . . . . . 3.4 FDR Method and q-value . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 21 22 22 23 24 27 29 34 . . . . . . . . . . . 39 40 40 41 43 50 51 51 55 59 59 62 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 Power Study 4.1 Limited Power Study . . . . . . . . . . . . . . . . . 4.1.1 Focused Tests . . . . . . . . . . . . . . . . . 4.1.2 Data and Simulation plan . . . . . . . . . . 4.1.3 Results for Various Probability Distributions 4.2 Comprehensive Power Study: Overview . . . . . . . 4.3 Comprehensive Power Study: Binary Data . . . . . 4.3.1 Binary Data Set and Simulation Plan . . . . 4.3.2 Results for Binary Data . . . . . . . . . . . 4.4 Comprehensive Power Study: Non-binary Data . . 4.4.1 Non-binary Data Set and Simulation Plan . 4.4.2 Results for Ordinal Categorical Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 Data Analysis examples 68 5.1 North Humberside Childhood Leukemia Data . . . . . . . . . . . . . 68 5.2 Maryland-DC-Virginia Crime Data . . . . . . . . . . . . . . . . . . . 70 v 6 Summary and Discussion 82 A Appendix 89 A.1 Derivation of S, V . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 A.2 Simplified Semiparametric Test Statistics for Binary Data . . . . . . . 92 Bibliography 94 vi List of Tables 3.1 Classification of m hypothesis tests . . . . . . . . . . . . . . . . . . . 35 4.1 Parameters for Simulating the Ordinal Categorical Data from Quantized Normal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 5.1 Results of High and Low Crime Risk Cluster from Yr 2001∼2004 . . . 74 vii List of Figures 1.1 Illustration of Glaz’s scan statistic in one dimension. . . . . . . . . . 2.1 Notation for Kulldorff’s method . . . . . . . . . . . . . . . . . . . . . 11 3.1 (a) The whole study region. (b,c,d,e) Intermediate stages during the scan. (f) The red region is the true cluster. The true cluster was detected by both methods. . . . . . . . . . . . . . . . . . . . . . . . . 31 4.1 Power curves for one-sided tests in the Bernoulli case. Scalar β, h(x) = x. The focused test dominates the two other tests. . . . . . . 43 4.2 Power curves for one-sided tests in the Poisson case. Scalar β, h(x) = x. The power curves from the semiparametric and focused tests are fairly close. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 4.3 Power curves for one-sided tests in a clipped Poisson case. Scalar β, h(x) = x. The semiparametric method gives relatively higher power. . 45 4.4 Power curves for one-sided tests applied to Quantized normal samples with the same variance 16 but different means. Scalar β, h(x) = x. The focused test gives higher power. . . . . . . . . . . . . . . . . . . 47 4.5 Power curves for one-sided tests applied to Quantized normal samples with the same variance 16 but different relatively high means. Scalar β, h(x) = x. The semiparametric test clearly dominates the other two tests. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 4.6 Power curves for two-sided tests applied to quantized normal samples with the same mean but different variances. The semiparametric method uses h(x) = (x, x2 )0 , χ1 , χ2 , LR. The semiparametric tests markedly dominate the two other tests. . . . . . . . . . . . . . . . . . 48 4.7 Quantized normal case as in Figure 4.6 but with different means and different variances. The semiparametric tests clearly dominate the two other tests. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 4.8 sketch map of the US northeastern states. . . . . . . . . . . . . . . . 52 4.9 Power comparison between Kulldorff’s and the Semiparametric with likelihood ratio test methods for binary type data using the northeastern US benchmark data with 600 simulated cases. . . . . . . . . . 57 viii 5 4.10 Power comparison between Kulldorff’s and the Semiparametric with likelihood ratio test methods for binary type data using the northeastern US benchmark data with 6000 simulated cases. . . . . . . . . 58 4.11 Map showing the states included in the simulation denoted with color and the abbreviation of state names. The state Illinois with red color is illustrated as one possible cluster region in our simulated data. . . 60 4.12 Box Plots of the Simulated the Ordinal Categorical Data . . . . . . . 61 4.13 Power comparison between Kulldorff’s and the Semiparametric methods for ordinal categorical data generated from quantized normal II data, where the means are the same but the variances are different, between the cluster region and the rest of the area. . . . . . . . . . . 64 4.14 Power comparison between Kulldorff’s and the Semiparametric methods for ordinal categorical data generated from quantized normal III data, where both means and variances are different inside and outside the cluster region. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 4.15 Power comparison between Kulldorff’s and the Semiparametric methods for ordinal categorical data with small differences. The data are in quantized normal III small type. (a) Existence. (b) Accuracy. 67 5.1 Snapshot of part of the North Humberside childhood leukemia and lymphoma data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 5.2 (a.) The geographical map. (b). The detected cluster candidate in red. 71 5.3 Snapshot of part of the Maryland-DC-Virginia Crime data set . . . . 73 5.4 Maryland-DC-Virginia High and Low Risk Crime Cluster. Red means high crime risk cluster, Navy means low crime risk cluster. . . . . . . 75 5.5 Maryland-DC-Virginia Arrest Rate by County in Year 2001 . . . . . . 76 5.6 Maryland-DC-Virginia Arrest Rate by County in Year 2002 . . . . . . 77 5.7 Maryland-DC-Virginia Arrest Rate by County in Year 2003 . . . . . . 78 5.8 Maryland-DC-Virginia ArrestRate by County in Year 2004 . . . . . . 79 ix Chapter 1 Introduction In this dissertation, I develop a semiparametric scan statistics method for cluster detection. I refer to this method as semiparametric cluster detection method or semiparametric method in short. This method applies a semiparametric density ratio model to moving windows of variable size to scan the study region and detect potential clusters of events or cases. The simulation studies show that the statistical power of the semiparametric method is comparable to the current Kulldorff’s spatial scan statistics method [29, 30, 31], but the semiparametric method requires fewer distributional assumptions on the data. The semiparametric method works well in many cases adhering to a unified setting [26, 70], but Kulldorff’s method requires the choices of a correct probability model, namely the correct scan statistic, in order to achieve the power achieved by the semiparametric method. The semiparametric scan statistics methodology has also been successfully applied to real data which points to its potential in cluster detection. The first chapter gives a brief description of the purpose and the general frame of this dissertation. 1.1 Scan Statistics and Cluster Detection Scan statistics arise when scanning in time or space, or both, looking for unusual clusters of certain events or cases [18]. Here, an event can be the occurrence 1 of some type of disease, or some sort of physical or chemical measurements, etc. A cluster is defined as a certain spatial or temporal subregion where the probability distribution of an event is different from the event probability distribution in the rest of the region. More generally, a cluster is a subregion where the behavior of an observable is different from the behavior of the observable in the rest of the region. For instance, a city neighborhood where the crime rate is higher than in the rest of the city defines a cluster. Another example can be a subregion comprised of several counties with higher disease rate than all other counties in a region. If we can locate (detect) the clusters more accurately, we can make better decisions and more efficient policies. The modern literature about scan statistics can be traced back to the 1960’s. Since then, it has been applied in many fields, including epidemiology, criminology, economics, health management, brain imaging, genetics, mining, quality control, astronomy, syndromic surveillance, and so on. See Glaz et al. (2001), Glaz and Naus (1991), Kulldorff (1999), Naus (1965), Pickle et al. (2003), and Shmueli et al. (2006) [18, 43, 30, 49, 58]. Of particular importance is the so called Kulldorff’s spatial scan statistics method. Kulldorff’s method uses circular, elliptic, or cylindrical scan window to detect clusters in two or higher dimensions, their location and size, by making an assumption about the underlying distribution (typically Bernoulli or Poisson) of the scanned region. These distributional assumptions are used in computing Kulldorff’s spatial scan statistic, a likelihood ratio type test statistic, to determine the cluster candidate. A p-value for the significance of the cluster candi- 2 date is obtained by a Monte Carlo hypothesis testing procedure or by a permutation test [29, 30, 31]. A detailed review of Kulldorff’s method is given in Chapter 2. In this dissertation we propose a certain semiparametric generalization of Kulldorff’s method which requires much less than complete distributional assumptions and which does not require the number of cases prior to scanning for the time consuming Monte Carlo hypothesis testing. The semiparametric approach used in this research is the density ratio model as in Fokianos et al. (2001), Qin and Lawless (1994), and Qin and Zhang (1997) [13, 52, 53]. Given m = q + 1 samples, gj (x) = exp{αj + β 0j h(x)}, gm (x) j = 1, . . . , q, q =m−1 where gm (x) ≡ g(x) is the (reference) probability density function of the mth sample, gj (x) is the probability density function of the jth sample, (αj , β 0j ) are the parameters relating the jth and the reference densities, and β j = (βj1 , . . . , βjp )0 with dimension p depends on the choice of the known tilt vector-valued function h(x). By following this setup, testing for distributional homogeneity is equivalent to testing H0 : β=0, where β = (β 01 , . . . , β 0q )0 . Other than an assumption concerning the tilt function h(x), this method does not require prior knowledge of any distribution. Once h(x) is chosen, all the parameters and the reference distribution function G(x) are estimated from the combined data composed of all the samples. The semiparametric cluster detection method discussed in this dissertation merges the semiparametric density ratio model with Kulldorff’s scan procedure leading to a fairly general cluster detection procedure. The idea is quite natural. Since cluster detection amounts essentially to testing the homogeneity of the probability 3 distributions between the cluster region and the non-cluster region, the cluster detection problem is to test if β = 0 in the semiparametric density ratio model. A detailed study of this semiparametric method is described in Chapter 3. 1.2 Other Scan Statistics Methods Besides Kulldorff’s method and the semiparametric method proposed in this dissertation, there are other types of scan statistics methods. I list some of them in this section. Glaz et al. (2001) defines a scan statistic Sw based on point data where the occurrence of an event is represented by a point in the study interval. This interval could be a one dimensional line, such as time, or a higher dimensional set, such as a geographic map [18]. The point data are assumed to be distributed uniformly or to follow a Poisson process over the whole study interval. Figure 1.1 gives an example of this type of scan statistic. The solid line represents a time line scaled into [0, 1). Each small triangle (point) is denoted as an event occurring at that moment. A scan window slides along the line. Let Sw be the largest number of events occurred in a window of fixed size w. Then this Sw is called the scan statistic and the corresponding window is the cluster candidate. The problem of interest is the probability of k or more events occurred in the given window w. More precisely, if the total number of N events over the interval is given, the problem is to compute the retrospective probability P (Sw ≥ k|N ), which is a conditional probability. If N is viewed as a random variable, on the other hand, the prospective probability 4 P (Sw ≥ k) is unconditional. Figure 1.1: Illustration of Glaz’s scan statistic in one dimension. Given N points independently uniformly distributed on interval [0,1), Wallenstein and Neff (1987) [67] gives the following easy to compute approximation for P (Sw ≥ k|N ) as a simple sum of binomial and cumulative binomial probabilities. Let N k w (1 − w)N −k b(k; N, w) = k Gb (k; N, w) = N X b(i; N, w) i=k Then approximately, µ P (Sw ≥ k|N ) ≈ ¶ k − N − 1 · b(k; N, w) + 2Gb (k; N, w) w (1.1) When the data follows a Poisson Process on the interval [0,T) with rate λ, Newell (1963) [46] gives the following asymptotic formula of P (Sw ≥ k), P (Sw ≥ k) ≈ 1 − exp{−λk wk T /(k − 1)!} Glaz and Naus (1991) give tight bounds and approximations for scan statistic probabilities for independently and identically distributed (i.i.d.) discrete data for 5 fixed window size [17]. Naus and Wallenstein (2004) derive accurate approximations for the joint distributions of scan statistics for a range of values of w, or of k, that can be used to set an experiment-wide level of significance that takes into account the multiple comparisons involved. This makes it possible to determine the cluster sizes from various scanning window sizes [45]. Pozdnyakov et al. (2004) propose a martingale method for binary data to approximate the distributions of a wide variety of scan statistics, including some for which analytical results are computationally infeasible [50]. Glaz and Zhang (2004) derive multiple scan statistics of variable window sizes for i.i.d. Bernoulli trials (0/1)in one or two dimensional intervals. They also derive simple approximations for the significance level of the scan statistics [19]. Glaz and Zhang (2006) propose a maximum scan score-type statistic for testing the null hypothesis that the observations are i.i.d. according to a specified distribution, against an alternative that the observations cluster within a window of unknown length. This statistic is a variable window scan statistic, based on a finite number of standardized fixed window scan statistics. Approximations for the significance level of this statistic are derived for 0 − 1 i.i.d. Bernoulli trials uniformly distributed in the interval [0, 1). Kulldorff’s method and the semiparametric method we propose in this dissertation use circular or regular shape scanning windows. Patil and Taillie (2004) propose a upper level set (ULS) scan method [47, 48] which detects hot-spot clusters with irregular shape. The main idea of the ULS method is that the whole study region is composed of cells with rate (or intensity) Ga = Ya /Aa , where Ya is the raw 6 count and Aa is the “size” of cell a. A zone Z is a union of connected cells and Ω is a collection of all the possible Z’s. For a given g, define an upper level set as Ug = {a : Ga ≥ g}. A reduced space ΩU LS is a collection of all the possible unions of the connected cells in Ug . Thus, all the zones, Z ∈ ΩU LS , can possibly become scan windows and can be of any shape. Once the scan window is determined, similar to Kulldorff’s method, a probability assumption is imposed to the study region to get a likelihood ratio type statistic and the p-value of the cluster candidate is obtained by Monte Carlo simulation. Modarres and Patil (2006) extend this ULS methodology to bivariate data [42]. Tango and Takahashi (2005) propose a flexibly shaped scan method [65]. It imposes an irregularly shaped window Z on each cell (e.g. county) by connecting its adjacent cells and computes the likelihood ratio type statistic as in Kulldorff’s method. Most of the above mentioned methods apply to point data where the occurrence of an event is represented by a point and those points are assumed to be distributed uniformly or follow a Poisson process in the study interval. Kulldorff’s method and the ULS method can handle non-binary data, but still need to assume a certain probability model. The semiparametric cluster detection method proposed in this dissertation, however, does not require those specific assumptions and is applicable to many data types. See Chapter 3 for the details of the semiparametric method. The above scan statistics methods are mainly used to detect the location, the size and the significance of local clusters. If the hypothesis is that the risk in a 7 specified region is higher than in the rest of the region, for example, the risk of a type of disease is higher close to a nuclear power plant than in the rest of the area, then focused cluster tests are used. I will briefly describe the Lawson-Waller focused test [68] in Chapter 4 when we conduct the power study. Sometimes researchers are interested in evaluating the presence of clustering throughout the study region. For example, we might want to know if a particular disease is infectious or not, in which case we would expect cases to be found close to each other no matter where they occur. In this case, global clustering tests should be used, such as Cuzick-Edwards’ (1990) k nearest neighbor (k-NN) method [6], Tango’s (2000) maximized excess events test (MEET), Bonetti-Pagano’s (2005) M-statistic [4], and so on. Since the semiparametric method is a cluster detection method aiming to detect the local clusters, these global clustering tests are out of the focus of this dissertation. In surveillance or quality control fields, early detection of outbreaks is essential for successful operation and prevention of disasters. In the purely temporal setting, traditional control charts method, including Shewhart charts, moving average charts, Cumulative sum (CumSum) charts, etc., and time series methods are used. Recently wavelet-based methods were found to offer a more elegant and suitable solution for early detection. If multiple data sets or streams are present, the multivariate versions of the control charts, time series, and wavelet-based methods could be potentially implemented. Shmueli and Fienberg (2006) gives a nice review about the these options [58]. Naus and Wartenberg (1997) have developed purely temporal scan statistics for 8 two data types with the purpose of finding clusters with a minimum number of both types of events [44]. In addition, Kulldorff et al. (2007) develop a multivariate scan model which simultaneously incorporates multiple data sets into a single likelihood function to search for clusters. Chapter 2 gives a brief description of Kulldorff’s multivariate scan model. 1.3 Dissertation Map This dissertation is organized as follows. Chapter 2 describes Kulldorff’s scan statistics method, including the Bernoulli model, Poisson model, ordinal model, exponential model, and so on. Chapter 3 introduces the semiparametric density ratio model and our semiparametric cluster detection method. Chapter 4 presents power studies comparing Kulldorff’s method and our semiparametric method, including a limited power study given that the location and the size of a cluster candidate are known, and a complete power study which where the cluster candidate is unknown. Chapter 5 illustrates the cluster detection potential of the semiparametric method by analyzing a North Humberside childhood leukemia data set [26, 1] and a MarylandDC-Virginia crime data set. Chapter 6 summarizes the whole dissertation and discusses possible improvements of the semiparametric cluster detection method for future research. 9 Chapter 2 Kulldorff’s Scan Statistics Method 2.1 Overview A number of different tests for detecting spatial clusters, temporal and spatial, have been proposed in the last three decades. One of the most popular methods is Kulldorff’s scan statistics. Kulldorff’s scan statistics method can detect both the location and the size of a cluster simultaneously by using a large collection of overlapping scan windows [29, 30]. For spatial data, the method first imposes a circular scan window on a map and lets the circle centroid move across the study region. For any given centroid, the radius of the window varies continuously from zero to some upper limit. Usually this upper limit is set to be the radius which covers 50% of the whole study region or population. In this way, the method generates a large set of scan windows Z with different centroids and sizes. Under the null hypothesis of no cluster, the underlying behavior of the data throughout the whole study region is the same. Under the alternative hypothesis, there is at least one scan window for which the underlying behavior is different inside the window as compared with its complement, which means any scan window Z could be a potential cluster. In practice, some data are updated periodically, Kulldorff (2001) [31] suggested space-time scans for such cases. The scanning procedure of space-time scans is almost identical to the purely spatial scan, except that the scan window becomes 10 a three dimensional cylinder instead of two dimensional. See Figure 2.1. Since the statistical formulation of space-time scan is identical to the two dimensional case, we will only discuss the two dimensional purely spatial scan in this paper. Figure 2.1: Notation for Kulldorff’s method In Kulldorff’s scan statistics method, each scan window Z is associated with a likelihood ratio test statistic λ(Z) which can be computed based on the chosen underlying probability model and the observed data inside and outside the scanning window. The scan window associated with the maximum λ(Z) is defined as the primary cluster candidate occurring not by random chance. The maximum likelihood ratio itself is called the Kulldorff’s spatial scan statistic, and the null hypothesis is rejected for large value of the statistic. After the spatial scan statistic and the primary cluster candidate are determined, a Monte Carlo hypothesis procedure [10] or a permutation test procedure [33, 22] is executed to generate the probability distribution of Kulldorff’s scan statistic under the null hypothesis of no cluster in the study region, and a p-value is obtained. The detail steps of the Monte Carlo hypothesis testing procedure is as follows. 11 1. Obtain the value of Kulldorff’s scan statistic for the true data at hand. 2. Given the total number of the cases (events), create a large number of random data sets generated under H0 for the whole study region. 3. Calculate the value of Kulldorff’s scan statistic for each random replication. 4. Sort the values of the Kulldorff’s scan statistics from the true and the generated data sets, and note the rank of the one calculated from the true data set to obtain the p-value. The following sections describe Kulldorff’s scan statistics for the Bernoulli model, Poisson model, and ordinal model [29, 30, 23]. As for other types of scan statistics in Kulldorff’s scan statistics family, see Huang et al. (2007) for the exponential model [22], Kulldorff et al. (2007) for the multivariate scan model [36], Kulldorff et al. (2006a) for the normal model [35], and Kulldorff et al. (2006b) for elliptic window scans [34]. 2.2 Bernoulli Model Bernoulli-based scan statistics are used when individual entities have only two states such as an individual person having breast cancer or not. Figure 2.1(a) shows a typical setup of Kulldorff’s scan statistics method. The “purely spatial scan” means a two-dimensional scan on a geographical map. • G : the whole study region. 12 • Z : the scan window. • Z C : outside scan window. • µ(G) : the total number of individual. entities (e.g. people) in G. • µ(Z) : the number of individual entities in Z. • nG : the total number of events in G. • nZ : the number of events inside Z. • p : the rate of events that occurred inside the scan window Z. • q : the rate of events that occurred outside the scan window Z. Clearly, µ(Z c ) = µ(G) − µ(Z) and nZ C = nG − nZ . We test the null hypothesis H0 : p = q of no cluster. The following shows that for an alternative hypothesis that there is a hot spot cluster p > q, each scan window Z invokes a likelihood ratio test statistic as in equation (2.1). Consider binary 0 − 1 data. The likelihood for a fixed scan window Z is L(Z, p, q) = pnZ (1 − p)µ(Z)−nZ × q nG −nZ (1 − q)(µ(G)−µ(Z))−(nG −nZ ) . Under H0 : p = q, p̂0 = nG /µ(G), and the maximized likelihood becomes L0 which is independent of Z, L0 = sup L(Z, p, q) = pˆ0 nG (1 − pˆ0 )µ(G)−nG . H0 :p=q 13 Under HA : p > q, p̂ = nZ /µ(Z), q̂ = (nG − nZ )/ (µ(G) − µ(Z)), and the likelihood L(Z) is a function of Z, L(Z) = sup L(Z, p, q) HA :p>q = p̂nZ (1 − p̂)µ(Z)−nZ × q̂ nG −nZ (1 − q̂)(µ(G)−µ(Z))−(nG −nZ ) . Therefore, the likelihood ratio for a fixed scan window Z is sup L(Z, p, q) λ(Z) = HA :p>q sup L(Z, p, q) n µ(Z)−nZ ×q̂ nG −nZ (1−q̂)(µ(G)−µ(Z))−(nG −nZ ) p̂ Z (1−p̂) if p̂ > q̂ n p̂ G (1−p̂ )µ(G)−nG H0 :p=q = 0 0 1 . (2.1) otherwise If we were scanning for cold spots, then “>” would change to “<” above; if we were scanning for either hot or cold spots, then it would be “6=” [32]. After all the λ(Z) are computed, we determine the maximum of λ(Z)’s λ = sup λ(Z) ≡ λ(Ẑ) Z∈G as Kulldorff’s scan statistic and Ẑ to be the primary cluster candidate. We reject the null hypothesis for large values of λ, and a Monte Carlo based p-value can be obtained from randomization of the cases across the whole study region, given the total number of cases as described in the previous section. 2.3 Poisson Model Poisson-based scan statistics are used for the comparison of the number of cases inside and outside a scan window when searching for clusters. The notion and 14 setup of Kulldorff’s scan under the Poisson model follow the same scheme shown in Figure 2.1(a). Suppose a study region is composed of I sub-regions. Assume the number of events xi which occur in an “interval” µ(Ai ) is a Poisson process with intensity rate p inside the scan window and q outside, where i = 1, 2, . . . , I. For example, xi could be the number of people with certain type of cancer in county Ai whose population size is µ(Ai ). In addition, we have X nG = xi , µ(G) = Ai ,xi ∈G X µ(Ai ) Ai ∈G and X nZ = xi , µ(Z) = Ai ,xi ∈Z X µ(Ai ). Ai ∈Z The null hypothesis is still H0 : p = q and the method parallels the Bernoulli case. For a given fixed scan window Z, the likelihood is L(Z, p, q) = e−p·µ(Z) (p · µ(Z))nZ nZ ! e−q·(µ(G)−µ(Z)) (q · (µ(G) − µ(Z)))nG −nZ . × (nG − nZ )! Under H0 : p = q, p̂0 = nG /µ(G), we obtain L0 = sup L(Z, p, q) = nG nG e−nG · ( µ(G) ) · µ(Z)nZ · (µ(G) − µ(Z))nG −nZ nZ ! · (nG − nZ )! H0 :p=q . Under HA : p > q, p̂ = nZ /µ(Z) q̂ = (nG − nZ )/ (µ(G) − µ(Z)), we obtain L(Z) = sup L(Z, p, q) = HA :p>q e−nG · (nZ )nZ · (nG − nZ )nG −nZ . nZ ! · (nG − nZ )! Therefore, the likelihood ratio for the scan window Z is shown in equation (2.2): ( nZ )nZ · ( nG −nZ )(nG −nZ ) if nZ > eZ eZ nG −eZ λ(Z) = (2.2) 1 otherwise 15 where nZ is the observed number of cases inside the scan window Z and eZ is the expected number of cases inside the scan window Z under the null hypothesis of no cluster. As before, if we were scanning for a cluster other than hot spot, we simply change the inequality sign as needed. After all the λ(Z) are obtained, put λ = max λ(Z) ≡ λ(Ẑ), where Ẑ is the primary cluster candidate, and reject the Z null hypothesis for large λ. A Monte Carlo based p-value can be obtained from randomization of the cases across the study region given the total number of cases, as in the Bernoulli model. 2.4 Ordinal Model An ordinal model is used when individual entities have K ≥ 2 ordinal categories such as the different stages of prostate cancer (Klassen, 2005) [28]. A higher category may reflect a more serious cancer stage. With the ordinal model, each observation is a case, and each case belongs to one of several ordinal categories. Suppose the study region consists of I sub-regions and the variable of interest is recorded in K categories. Let cik be the number of individuals in location i who fall into category k, where i = 1, 2, . . . , I, and k = 1, 2, . . . , K. Let Ck = P of observations in category k across the study region, and C = i cik P k be the number Ck = P P k i cik be the total number of observations in the whole study region. The null hypothesis of no cluster in this model means p1 = q1 , . . . , pk = qk , where pk and qk are the unknown probabilities that an observation belongs to category k inside and outside the scanning window, respectively. To detect subregions with high rates of higher 16 stages as compared with the rest of the area, one possible alternative hypothesis could be p1 p2 pK ≤ ≤ q1 q2 qK searching for hot-spot clusters with an excess of cases in the high-valued categories. Obviously when K = 2, the ordinal model set up reduces to the Bernoulli model. Following similar scan procedures as in the Bernoulli and Poisson models, the likelihood ratio test statistic for each scan window is: " #Á K K ¡ ¢ Q Q Q Q Ck Ck pˆk cik · qˆk cik C k=1 i∈Z i6∈Z k=1 λ(Z) = 1 otherwise (2.3) pk = qk , k = 1, 2, . . . , K where p̂k and q̂k are the MLEs of pk and qk under the alternative hypothesis. A “Pool-Adjacent-Violators” algorithm can be applied to compute p̂k and q̂k [2, 11]. It is also possible to search for cold-spot clusters with an excess of cases in the lowvalued categories or simultaneously for both hot or cold spots by reversing the order of the categories. After all the λ(Z) are obtained, compute max λ(Z) ≡ λ(Ẑ), where Z Ẑ is the primary cluster candidate, and reject the null hypothesis for large λ(Ẑ). A Monte Carlo based p-value can be obtained by randomization of the observations across the study region given the total number of observations in each category (C1 , C2 , . . . , CK ). For more detail about the ordinal model, see Jung et al. (2007) [23]. 17 2.5 Other Models There are other types of scan statistics in Kulldorff’s scan statistics family, such as the exponential model mainly for survival time data, the normal model for continuous data that takes both positive and negative values, the multivariate scan model for analyzing multiple surveillance data sets simultaneously, and so on. Exponential model: The exponential model is mainly designed for survival time data, and the likelihood function for the scan statistic is based on the exponential distribution. However, it also could be used for other positive continuous type data as well, especially for data with a heavy right tail. In the exponential model, each observation is a case, and each case has one continuous variable attribute as well as a 0/1 censoring designation. For survival data, the continuous variable is the time between diagnosis and death or, depending on the application, between two other types of events. If some of the data are censored, due to loss of follow-up, then the continuous variable is the time between diagnosis and time of censoring. The 0/1 censoring variable is used to distinguish between censored and non-censored observations. For more details about the exponential model and its scan statistic, see Huang et al. (2007) [22]. Normal model: The normal model is designed for continuous data and the likelihood function for the scan statistic is based on the normal distribution. For each individual, called a case, there is a single continuous attribute that may be either negative or positive. For example, the data may consist of the birth weight and residential census tract for all newborns, with an interest in finding clusters with 18 lower birth weight. The model can also be used for ordinal data when there are very many categories. That is, ties are allowed. It is also noticed that the results from the normal model can be greatly influenced by extreme outliers, so it may be wise to truncate such observations before doing the analysis. For more detail about the normal model and its scan statistic, see Kulldorff et al. (2006a) [35]. Multivariate scan: Sometimes, especially in disease surveillance, the statistical power to detect an outbreak that is present in all data sets may suffer due to low numbers in each data set. Kulldorff’s Multivariate scan model can simultaneously incorporate multiple data sets into a single likelihood function searching for clusters and hence increasing the power. This could be done by defining the combined loglikelihood as the sum of the individual log-likelihoods for those data sets for which the observed case count is more than the expected, if hot-spot clusters are of interest. When searching for clusters with low rates, the same procedure is performed, except that we instead sum up the log-likelihood ratios of the data sets with fewer than the expected number of cases within the window in question. When searching for both high and low clusters, both sums are calculated, and the maximum of the two is used to represent the log likelihood ratio for that window. In multivariate scan, all data sets must use the same probability model and the same geographical coordinates file. For more detail, see Kulldorff et al. (2007) [36]. Elliptic window scans: The above Kulldorff’s scan statistics commonly use a circular scanning window. To have more flexibility, the elliptic version of the Kulldorff’s scan statistics uses a scanning window of variable location, shape (eccentric19 ity), angle and size, and with and without an eccentricity penalty. The mathematical principles behind the scan are identical for circular, elliptic or any other shape of the window, with the only difference being the collection of candidate cluster areas considered. In general, the elliptic scan statistic performs well for circular clusters, and equally important, the circular scan statistic performs well for elliptic clusters also. One possible advantage of the elliptic versus the circular scan statistic is that the former may give a better estimate of the true cluster area especially when the true cluster is an elongated one. But the circular scan statistic requires fewer computing resources. For more detail, see Kulldorff et al. (2006) [34]. 20 Chapter 3 Semiparametric Scan Statistics Method 3.1 Overview We have described Kulldorff’s scan statistics method in the previous Chapter. For different types of data, Kulldorff’s method needs to choose different probabilistic models. In this dissertation, we propose a semiparametric scan statistics method [26]. It uses the same scanning scheme as Kulldorff’s, but a semiparametric method to develop the scan statistics and test the significance of the cluster candidate. To take into account the multiple testing problem induced by numerous overlapping windows, Stoney’s q-value method [60, 61], a false discovery rate (FDR) methodology [3], is used in conjunction with the semiparametric testing procedure [70]. In the first section of this chapter, I first introduce the semiparametric density ratio model used in this research. Secondly, I discuss how the semiparametric density ratio model is applied to cluster detection and its advantages. In the last section of this chapter, I describe the concept of FDR methodology as well as the q-value used in this work. 21 3.2 Semiparametric Density Ratio Model 3.2.1 The Model The Semiparametric method we use here is based on a density ratio model studied by Fokianos et al. (2001), and Qin and Zhang (1997) [13, 53]. Consider m independent samples, x1 = (x11 , x12 , . . . , x1n1 )0 ∼ g1 (x) x2 = (x21 , x22 , . . . , x2n2 )0 ∼ g2 (x) .. . xm = (xm1 , xm2 , . . . , xmnm )0 ∼ gm (x) where gj (x) is the probability density function of xji , j = 1, . . . , m, i = 1, . . . , nj . Choosing the mth sample as the reference sample and gm (x) as the reference density, it is assumed that the density ratio between the jth density and the reference density has an exponential form as in (3.1), gj (x) = exp{(αj + β 0j h(x))}, gm (x) j = 1, . . . , q, q =m−1 (3.1) Notice that h(x) is a known function of x which may take on a scalar form such as x, x2 , or log x, or a vector-valued form such as (x, x2 )0 , or (x, log x)0 , and so on. See more in subsection 3.2.2. Here αj = αj (β j ) is a scalar, but β j could be a scalar or vector depending on h(x). Clearly, β j = 0 implies αj = 0, and the hypothesis H0 : β 1 = β 2 = . . . = β q = 0 implies all the m samples come from a common distribution with probability density gm (x) ≡ g(x). In this section we shall assume 22 that function h(x) is p-dimensional; notice that p and q here are different from p and q in the previous chapter. An example of the exponential density ratio model (3.1) is provided by multinomial logistic regression upon an appeal to Bayes theorem. Consider a categorical random variable y such that P (y = j) = πj , where f (x|y = j) = gj (x), 1, . . . , m, and Pn i=1 j = πj = 1. If exp{αj + β 0j h(x)} P P (y = j|x) = , 1 + qk=1 exp{αj + β 0j h(x)} j = 1, 2, ..., q, q =m−1 then by Bayes rule, model (3.1) holds with αj = αj∗ + log(πm /πj ), j = 1, 2, ..., q 3.2.2 Choice of the Tilt Function The semiparametric method requires choosing an appropriate tilt function h(x). A clue of how to choose a satisfactory h(x) for a given situation can be derived from common exponential families as we show in some examples with m = 2 below. More examples can be found in Kay and Little (1987) [24]. Bernoulli distribution: For Bernoulli(p), the density ratio is g1 (x) px1 (1 − p1 )1−x = x g2 (x) p2 (1 − p2 )1−x p1 1 − p1 1 − p1 + (log − log ) · x}. = exp{log 1 − p2 p2 1 − p2 So we obtain α = log 1 − p1 , 1 − p2 β = log p1 1 − p1 − log , p2 1 − p2 and p1 > p2 ⇔ β > 0. 23 h(x) = x, Poisson distribution: For Poisson(λ), similarly we have α = −(λ1 − λ2 ), β = log λ1 , λ2 h(x) = x and λ1 > λ2 ⇔ β > 0. Normal distribution: For N(µ, σ 2 ), unequal means and variances, σ2 µ2 µ2 α = ln( ) + 22 − 12 , σ 2σ 2σ1 1 2 β11 β = β12 . (µ1 σ22 − µ2 σ12 ) σ12 σ22 = , . (σ12 − σ22 ) 2σ12 σ22 h(x) = (x, x2 )0 If σ1 = σ2 , then µ1 > µ2 ⇔ β > 0, and h(x) = x. 3.2.3 Parameter Estimation of the Model Let t = (t1 , t2 , ..., tn )0 = (x01 , x02 , . . . , x0nm )0 denote the combined data from the m samples, and put pi = dG(ti ), i = 1, 2, ..., n where n = n1 + · · · + nm . Then the likelihood becomes L(α, β, G) = n Y i=1 pi n1 Y exp{α1 + β 01 h(x1j )} . . . j=1 nq Y exp{αq + β 0q h(xqj )}. (3.2) j=1 Following a profiling procedure discussed in Fokianos et al. (2001), Qin and Lawless (1994), and Qin and Zhang (1997) [13, 52, 53], first express each pi in terms of α, β and then substitute the pi back into the likelihood to produce a function of α, β only. When α, β are fixed, the likelihood (3.2) is maximized by maximizing 24 only the product term n Q pi , subject to the m constraints, i=1 n X pi = 1, i=1 n X pi [wj (ti ) − 1] = 0, j = 1, . . . , q i=1 where α = (α1 , α2 , ..., αq )0 , β = (β 01 , β 02 , ..., β 0q )0 , and ωj (t) = exp{αj + β 0j h(t)} [51]. The maximization employs the method of Lagrange multipliers, the first of which becomes λ0 = n, and the rest are expressed by construction as λj = νj n, j = 1, . . . , q, for some νj . It follows that pi = 1 1 · , n 1 + ν1 (ω1 (ti ) − 1) + · · · + νq (ωq (ti ) − 1) (3.3) which together with the constraints gives a set of equations n 1 X ωj (ti ) − 1 · = 0, n i=1 1 + ν1 (ω1 (ti ) − 1) + · · · + νq (ωq (ti ) − 1) j = 1, . . . , q. (3.4) Substitute pi in L(α, β, G), the log-likelihood becomes up to a constant, ` ≡ log L(α, β, G) = − n X log[1 + ν1 (ω1 (ti ) − 1) + . . . + νq (wq (ti ) − 1)] i=1 + q ni X X (αi + β 0i h(xij )). (3.5) i=1 j=1 To get expressions for νj , we set ∂l/∂αj = 0, j = 1, . . . , q, and using equation (3.4), we obtain νj = nj , n j = 1, . . . , q. Substituting these values of νj in equation (3.3), we have pi = 1 1 · , nm 1 + ρ1 ω1 (ti ) + · · · + ρq ωq (ti ) 25 (3.6) where ρj = nj /nm , j = 1, . . . , q, and the value of the profile log-likelihood up to a constant as a function of α, β only is `(α, β) = − n X log[1 + ρ1 w1 (ti ) + . . . + ρq wq (ti )] i=1 q ni X X + (αi + β 0i h(xij )) . (3.7) i=1 j=1 The term log[1 + · · ·] is due to the definition of the ρj and ωj (ti ). The score equation for j = 1, . . . , q are therefore, n P ρj ωj (ti ) ∂l = − ∂αj 1+ν1 ω1 (ti )+···+νq ωq (ti ) + nj = 0 ∂l ∂βj = − i=1 n P i=1 ρj h(ti )ωj (ti ) 1+ν1 ω1 (ti )+···+νq ωq (ti ) + nj P i=1 . nj = 0 Solving the above score equation, we obtain the maximum likelihood estimators of the parameters α and β, and consequently by substitution also p̂i = 1 1 · 0 nm 1 + ρ1 exp{α̂1 + β̂ h(ti )} + · · · + ρq exp{α̂q + β̂ 0 h(ti )} 1 q (3.8) therefore the maximum likelihood estimator of the reference distribution function G(x) is obtained by summing over p̂: n I(ti ≤ x) 1 X Ĝ(x) = · . 0 nm i=1 1 + ρ1 exp{α̂1 + β̂ h(ti )} + · · · + ρq exp{α̂q + β̂ 0 h(ti )} 1 q (3.9) It is argued in the Appendix that the estimators α̂, β̂ are asymptotically normal as n → ∞, √ α̂ − α0 ⇒ N(0, Σ). n β̂ − β 0 26 (3.10) The vectors α0 and β 0 denote the true parameters, and Σ = S −1 V S −1 , where the matrices S, V extend the results in Fokianos et al. (2001) [13] to a vector tilt and are also given in the Appendix. See Lu (2007) for a more detailed proof [40]. 3.2.4 Hypothesis and Test Statistics The null hypothesis H0 : β = 0 implies distributional homogeneity: g1 (x) = g2 (x) = ... = gm (x) ≡ g(x). We can use several test statistics to test this hypothesis. See Fokianos et al. (2001), Keziou and Leoni-Aubin (2005) and Fokianos (2006) for details [13, 27, 14]. χ1 test statistic: Define a symmetric matrix A11 in terms of the relative sample sizes ρj , A11 = P ρj [1+ qk6=j ρk ] Pq , (1+ k=1 ρk )2 −ρj ρ 0 P j , (1+ qk=1 ρk )2 if j = j 0 if j 6= j j = 1, 2, . . . , q (3.11) 0 Then A11 is nonsingular. Under H0 , we deduce from the Appendix A11 ⊗ E[h0 (t)] A11 S= A11 ⊗ E[h(t)] A11 ⊗ E[h(t)h0 (t)] and 0 0 V = 0 A11 ⊗ V ar[h(t)] where V ar[h(t)] is the covariance matrix of h(t) with respect to the reference distribution and all moments (E[h0 (t)], E[h(t)h0 (t)]) are evaluated with respect to the reference distribution. See Appendix for the details. The sub-matrices defining S 27 and V have dimensions q × q, q × qp, qp × q, qp × qp, respectively. Consider the Wald-type statistic, 0 χ1 = nβ̂ (A11 ⊗ V ar[h(t)])β̂. (3.12) This is an extension to a vector-valued h(t) of the χ1 test statistic reported in Fokianos et al. (2001). It follows under H0 that χ1 is approximately distributed as χ2 with qp degrees of freedom, and H0 can be rejected for large values. Here q = m − 1 and p is the length of β j which depends on the choice of the tilt function h. For example, if h(x) = x, then p = 1, and if h(x) = (x, x2 )0 , then p = 2, and so on. The particular form of the χ1 statistic (3.12) is due to the great simplification of V , S under the hypothesis. χ2 test statistic: A general linear hypothesis Hθ = c can be tested by means of χ2 = n(H θ̂ − c)0 (HΣH 0 )−1 (H θ̂ − c) (3.13) where θ = (α1 , . . . , αq , β 01 , . . . , β 0q )0 , H is p0 × [(1 + p)q)] predetermined matrix of 0 rank p0 , p0 < (1 + p)q, c is a vector in <p , and the variance-covariance matrix Σ = S −1 V S −1 . It follows under H0 that χ2 is asymptotically distributed as χ2 with (p0 ) degrees of freedom provided the inverse exists (Sen and Singer, 1993, page 239) [56], and H0 is rejected for large values. Basically, the χ1 test and the χ2 test are both Wald type tests. The simulation results show that the χ2 test is slightly more powerful than the χ1 test. But the χ1 test is easy to apply without inverting the S matrix, while the χ2 test can be easily generalized to test any linear function of the parameter β. 28 Likelihood ratio test statistic: A third possibility is to use the likelihood ratio test (LR-test), LR = −2[`(0, 0) − `(α̂, β̂)] = −2 n X log[1 + ρ1 ŵ1 (ti ) + . . . + ρq ŵq (ti )] i=1 +2 q ni X X " [α̂i + 0 β̂ i h(xij )] + 2n log 1 + q X # ρi (3.14) i=1 i=1 j=1 Under H0 , LR is asymptotically approximately distributed as χ2 with qp degrees of freedom, and H0 is rejected for large values. In a few certain circumstances, this test is somewhat problematic since (α, β) = (0, 0) is a boundary point, an issue discussed rigorously in Keziou and Leoni-Aubin (2005) [27]. However, our experience indicates that in testing β = 0, the LR-test works very well. 3.3 Semiparametric Cluster Detection Since the null hypothesis H0 : β = 0 means equal distributions, and homogenous distributions means no cluster in the cluster detection problem, the cluster detection problem becomes a special case for semiparametric density model with m = 2. Similar to Kulldorff’s scanning procedure, the semiparametric cluster detection method applies the density ratio model to movable variable-size scanning window to scan the whole study region and performs for each window a two-sample test without assuming a specific probability model. The data can be either continuous or discrete. Since the significance comes from the χ2 -test, there is no need to do the time consuming Monte Carlo hypothesis testing procedure, hence it is not nec29 essary to know a priori the number of cases in the region. We use the same scanning procedure as Kulldorff’s and select the primary cluster candidate corresponding to the largest test statistic or the smallest p-value (or q-value in the case of multiple testing) as the true cluster. The following is an illustration of semiparametric cluster detection. Consider the 5 × 5 region consisting of 25 cells shown in Figure 3.1(a). Within each cell, numbered from 1 to 25, there are hundreds of binary observations generated randomly. The rate in one of the cells is higher than the rate in the rest of the region. This is the true cluster to be detected. We applied to these simulated data both Kulldorff’s scan statistic method with the Bernoulli model and the semiparametric density ratio method with scalar h(x) = x. Starting with the first cell, the window size varied from a size roughly as large as a cell size to no more than 50% of the whole study region. This was repeated for each cell. Figure 3.1 (b)-(e) shows some snapshots of the intermediate stages during scanning of the whole study region. Both methods detected correctly the true cluster shown in Figure 3.1(f). In addition, for the case of m = 2, the semiparametric χ1 test and likelihood ratio test can be simplified to a simpler form as follows. A one-sided test can also be obtained from χ1 test when h(x) is scalar function. χ1 test statistic: χ1 ≡ 0 n β̂ 1 µ ¶ ρ1 V ar[h(t)] β̂ 1 (1 + ρ1 )2 (3.15) where ρ1 = n1 /n2 and V ar[h(t)] is the covariance matrix of h(t) with respect to the reference distribution. It follows under H0 that χ1 is approximately distributed 30 Figure 3.1: (a) The whole study region. (b,c,d,e) Intermediate stages during the scan. (f) The red region is the true cluster. The true cluster was detected by both methods. 31 as χ2 with p degrees of freedom, where p is the dimension of β 1 , which is the same as that of the function h(x). The null hypothesis H0 : β 1 = 0 is rejected for large values of χ1 . In practice, V ar[h(t)] is replaced by its estimator [13, 26]. Moreover, if the data are 0-1 binary data, the χ1 statistic takes the direct form of equation (A.21) in the Appendix. Likelihood ratio test statistic: LR ≡ −2 n X log[1 + ρ1 · exp{α1 + β 01 h(ti )}] + 2 i=1 n1 X 0 [α̂1 + β̂ 1 h(x1j )] j=1 + 2n log(1 + ρ1 ) (3.16) Under H0 : β 1 = 0, LR is asymptotically approximately distributed as χ2 with p degrees of freedom, and H0 is rejected for large values. Recall that p is the dimension of β 1 and it depends on the choice of the function h(x). Similarly, when the data are 0-1 binary, the likelihood ratio statistic reduces to equation (A.22) in the Appendix. Interestingly, this semiparametric likelihood ratio statistic is equivalent to Kulldorff’s scan statistic under the Bernoulli model [70]. One-sided test statistic: The χ1 , χ2 and LR test statistics are all two-sided tests of H0 : β = 0. When m = 2 and h(x) takes a scalar form, such as h(x) = x or h(x) = log x, the parameter β becomes a scalar, so it is possible to derive a one-sided test as in equation (3.17): Z1−sided = β̂ − β0 → N (0, 1) σβ (3.17) where β0 = 0 and σβ2 is the variance of β̂ obtained from the covariance matrix Σ. In practice, under H0 , this one-sided test statistic can be obtained by the square root 32 of the χ1 test statistic, keeping the same sign as β̂ − β0 . The semiparametric approach has several advantages as follows: À The reference (or background) distribution, G(x), and all the parameters such as β 1 are estimated from the combined data t, not just from a single sample either inside the window or outside the window. Á For a properly chosen h(x), the above tests are quite powerful. Gagnon (2005) shows that for m = 2 the χ1 -test can be more powerful than the common ttest for a known h(x) but unspecified distributions [16]. Moreover, simulation results indicate that the χ1 -test competes well with the corresponding F -test (Fokianos et al., 2001) [13].  In testing equidistribution within exponential families, other than an assumption regarding the tilt function h(t), the semiparametric density ratio method does not require specific distributional assumptions. à The semiparametric method can be applied to either continuous or discrete distributions. Ä Assuming sufficient large samples, since the asymptotic distributions of the above mentioned test statistics are known, in principle there is no need for the time consuming Monte-Carlo methods to compute the p-values. In case of small sample size, then we can still bear with Monte-Carlo methods to get p-values. 33 Since each scan window is associated with a semiparametric statistic during scanning, the method results in a large number of tests and test statistics. To alleviate this multiple testing problem and reduce the possibility of false significance, a control of false discovery rate (FDR) procedure is employed. The following section gives a brief description of the FDR methodology and Storey’s q-value method we used in this paper. 3.4 FDR Method and q-value To account for the multiple-testing problem induced by the large set of overlapped scanning windows, we use Storey’s false discovery rate (FDR) method to derive the significance of the detected cluster candidate. This FDR method replaces the original p-value of each scan window by a q-value. We briefly describe in this section the FDR methodology as well as the q-value method used in this research. Controlling the false discovery rate (FDR) is a less conservative way to handle multiple testing problems. It was first proposed by Benjamini and Hochberg (1995) [3]. Since then, the FDR methodology has been further developed and applied in many fields, especially in genomic research [66, 54]. FDR is defined to be the expected proportion of falsely rejected hypotheses (false positives) as in equation (3.18): ¸ ¸ · V V |R > 0 · P r(R > 0). F DR = E =E max(1, R) R · (3.18) where V and R are defined in Table 3.1. From the table it is clear that if m = m0 , then all the null hypotheses are true, and FDR is equivalent to the family-wise 34 error rate (FWER). To see that, recalling that FWER is defined as P (V ≥ 1), m = m0 makes S = 0, and hence E[V /R|R > 0] = 1 for all R > 0, and therefore F DR = 1 ∗ P (R > 0) = P (V ≥ 1) = F W ER. If m0 < m, then F DR ≤ F W ER, which means a potential gain in power at the cost of increasing the likelihood of making type I errors [3]. Table 3.1: Classification of m hypothesis tests Hypothesis # Accepted # Rejected Total # of true null Hypotheses U V m0 # of true alternatives T S m1 Total W R m Storey (2002) and Storey et al. (2004) improved the original Benjamini and Hochberg FDR methodology by estimating π0 = m0 /m, the proportion of true null hypotheses [60, 63]. In addition, Storey used pFDR as in equation (3.19) instead of FDR (3.18): · ¸ V pF DR = E |R > 0 R (3.19) In many cases, when m, the total number of hypotheses, is large, there always are significant ones, which makes P r(R > 0) ≈ 1. Thus, pFDR (eq. 3.19) is close to FDR (eq. 3.18) in numerical value, but pFDR has some conceptual advantages. For instance, when the rejection region is smaller, namely α-level goes to 0, the quantity of FDR goes to 0 as P r(R > 0) goes to 0. It doesn’t mean the actual chance of false positive decreases to 0. However the quantity of pFDR goes to the 35 π0 , which is what we would expected. Since when the rejection region is smaller until only one p-value falls into the region, without any information about the null of alternative hypotheses, it makes sense to use π0 , the proportion of the nulls among all hypotheses, as the estimation of the chance of the false positive. A thorough motivation of using pFDR rather than FDR can be found in Storey (2003) [62]. Recall that the p-value gives an error measurement of an observed statistic with respect to type-I error. In a general setting, the p-value of an observed statistic T = t is defined to be p-value(t) = min {P r(T ∈ Γα |H = 0)} Γα : t∈Γα where H = 0 means under the null hypothesis and Γα is a nested rejection region parameterized with α and, for α ≤ α0 , Γα ⊆ Γα0 holds. The q-value is defined to be the pFDR analogue of the p-value. It gives the error measurement with respect to pFDR for each observed test statistic of each particular hypothesis. More precisely, the q-value of one particular observed test statistic T = t from a set of tests can be defined to be q-value(t) = inf {pF DR(Γα )}. {Γα : t∈Γα } (3.20) In this way, the q-value is the minimum pFDR that can occur when rejecting a statistic with value t for the set of nested rejection regions. In addition, for a set of hypothesis tests conducted with independent p-values, the q-value of the corresponding observed p-value can be simplified to q-value(t) = inf {pF DR(γ)} = inf { γ≥p γ≥p 36 π0 γ } P r(P ≤ γ) (3.21) where the rejection region is denoted by the p-value interval [0, γ] for some γ ≥ 0 instead of the more abstract rejection regions Γ. Storey J.D. and Tibshirani R. (2001, 2003) shows that the q-value method holds similar properties under either independence or dependence cases [59, 61]. In our semiparametric scan situation, we generate a lot of overlapping scan windows, and each window associates to a hypothesis test and a test statistic, so m here is usually large. Because of the large m, we adopted the algorithm in Storey and Tibshirani (2003) [61] to estimated the q-value for each scan window as follows: 1. Obtain the p-value for each scan window, and sort: p(1) ≤ p(2) ≤ . . . ≤ p(m) . 2. Estimate π0 using a cubic spline function. First, for a range of λ, say λ = 0, 0.01, . . . , 0.95, calculate π̂0 (λ) = #{p(j) ≥ λ} for each λ. m(1 − λ) Then let fˆ(λ) be the natural cubic spline fit to π̂0 (λ). Finally, set the estimate of π0 to be π̂0 = fˆ(1). 3. Calculate µ q̂(p(m) ) = min t≥p(m) π̂0 m t #{p(j) ≤ t)} ¶ = π̂0 p(m) , where 0 < t < 1. 4. For i = m − 1, m − 2, . . . , 1, calculate the estimated q-value for the ith most significant one as µ q̂(p(i) ) = min t≥p(i) π̂0 m t #{pj ≤ t)} ¶ µ = min 37 ¶ π̂0 m p(i) , q̂(p(i+1) ) . i 5. Choose the region with the largest test statistic, for example, the largest likelihood ratio test statistic, as the primary cluster candidate, and its q-value is q(p(1) ), the smallest q-value among all the tests. If q(p(1) ) is less than a predecided false discovery rate, say q(p(1) ) < 0.05, we claim there is clustering and the located primary candidate is a true cluster region. 38 Chapter 4 Power Study In this chapter, we study the power of the semiparametric cluster detection method compared with Kulldorff’s method through Monte Carlo simulations. One type of study is a limited power study where the location and the size of the clusters are already known without scanning. This limited power study generated power curves for semiparametric method, Kulldorff’s method, and the focused test for data under different probability distributions. Another type of study is a comprehensive power study comparing the semiparametric method and Kulldorff’s method without knowing any information about the clusters. In this study one must scan the whole study region to locate the cluster candidate. Since both Kulldorff’s and the semiparametric scan statistics methods are suitable for both binary and non-binary data, we use both types of data to perform the comprehensive power study. Both the limited and comprehensive power studies show that for binary data, the semiparametric cluster detection method and its competitor, Kulldorff’s celebrated scan statistics method, both achieve similar high power in detecting unknown hot-spot clusters. When the data are not binary, the semiparametric methodology is still applicable, but Kulldorff’s method may not be as it requires the choice of a correct probability model, namely the correct scan statistic, in order to achieve power comparable to that achieved by the semiparametric method. Kulldorff’s method 39 with an inappropriate probability model may lose power. 4.1 Limited Power Study The main purpose of the limited power study is to compare the performance of the semiparametric and Kulldorff’s method in determining the significance of a known cluster candidate. This can be criticized on the grounds that we do not scan the area for clusters without the prior knowledge of where those clusters are located. When the cluster location is known, it is more appropriate to compare the semiparametric method with what is known as focused tests described in Waller and Lawson (1995) and Lawson et al. (1999) [68, 38]. So we first briefly introduce the Lawson-Waller focused test in the following subsection. 4.1.1 Focused Tests Focused tests detect clusters with increased risk of disease relative to a source of exposure or focus. Since the problem of multiple testing in cluster detection is avoided, focused tests tend to have higher power than cluster detection tests. A well known focused test is the Lawson-Waller score test applied in disease surveillance [68]. Accordingly, the study area is divided into I subregions where the population size in subregion i is ni , i = 1, . . . , I. Denote the number of cases in region i by Ci . The null hypothesis is that the Ci are independent Poisson with mean E(Ci ) = λni , i = 1, . . . , I, against the (hot spot) alternative H1 : E(Ci ) = λni (1 + gi ²), 40 i = 1, . . . , I (4.1) where gi is a measure of exposure to a focus, and ² > 0 controls the increase in risk. A reasonable test statistic is the Lawson-Waller score statistic U= I X gi · (Ci − E(Ci )) (4.2) i=1 which under the null hypothesis has mean 0 and variance PI i=1 gi2 λni , and U/ p V ar(U ) is asymptotically standard normal. This statistic can be used in testing for trend in Poisson random variables. 4.1.2 Data and Simulation plan We consider two regions, A and B, where A consists of 100 subregions or cells, and B of 1000 cells (except that in the Bernoulli case below the number of cells were 200 and 5000, respectively). The population size in every cell is identical. The smaller region can be thought of as a cluster candidate, whereas the larger region could represent the rest of the area, or some reference or baseline region. The incidence rates in every cell in A, represented by either a case probability, mean, or occurrence rate, are identical. The same holds for B. Thus, in terms of occurrence rate, it is λA in every cell in A, and it is λB in every cell in B. Independent count data were generated in an identical manner in every cell in A, one count observation per cell, and likewise, independent count data were generated in the same way in every cell in B, a single count observation per cell. In the Bernoulli case every cell contains either 0 or 1. The parameters for B never change, but those for A change relative to B. 41 In this way two samples were generated repeatedly from “within the window” (from A) and “outside the window” (from B), respectively, with the same or different parameters, as needed. In our study, the sample size “within the window” is smaller than the sample size “outside the window” since in practice the size of a true cluster tends to be small relative to the whole study region. In this setup, the Lawson-Waller focused test assumes independent Poisson cell counts with λA = λB under H0 , versus λA = λB (1 + gi ε) under H1 , and we let 1 if cell i is in A gi = 0 if cell i is in B When β is a scalar, we use the Z test statistic of equation (3.17) to compare the detecting power with Kulldorff’s scan statistic. When β is not a scalar, we use the χ1 , χ2 and LR test statistics in two-sided tests, and adjust the original Kulldorff test into a two-sided test. Similar remarks hold for focused tests. In this way we compare one-sided with one-sided and two-sided with two-sided tests. The following series of figures shows the results of the power simulation. Each power curve was obtained from 300 runs, and the size of all the tests was controlled at the same level of 5%. For Kulldorff’s Monte-Carlo hypothesis test we used 10000 replications. 42 4.1.3 Results for Various Probability Distributions Bernoulli: Figure 4.1 shows the estimated power curves in the Bernoulli case for one-sided tests. The null probability is p0 = 0.03 and p ranges in the interval [0.03, 0.13]. Kulldorff’s method is applied under the assumption the data are Bernoulli, whereas the semiparametric method is applied with h(x) = x. Evidently, the Lawson-Waller score test, designed for count data, dominates both Kulldorff’s and the semiparametric tests, and the latter two exhibit very close power curves. 0.4 Power 0.6 0.8 1.0 Bernoulli Case 0.0 0.2 Kulldorff Semi-1sd LWfocus1sd 0.04 0.06 0.08 0.10 0.12 p Figure 4.1: Power curves for one-sided tests in the Bernoulli case. Scalar β, h(x) = x. The focused test dominates the two other tests. 43 Poisson: Figure 4.2 shows the power curves for one-sided tests and Poisson data. The Poisson parameter ranges from the null intensity of 5.0 to 6.5. Kulldorff’s method is applied under a Poisson model, and the semiparametric method is applied with h(x) = x. The power curves of the semiparametric and focused tests are fairly close, and both dominate Kulldorff’s. 0.4 Power 0.6 0.8 1.0 Poisson Case 0.0 0.2 Kulldorff Semi-1sd LWfocus1sd 5.0 5.5 6.0 6.5 lambda Figure 4.2: Power curves for one-sided tests in the Poisson case. Scalar β, h(x) = x. The power curves from the semiparametric and focused tests are fairly close. 44 Clipped Poisson: Figure 4.3 shows the power curves for one-sided tests and data generated from clipped Poisson observations. The parameters are the same as in the previous Poisson case. Equation 2 z= x 10 4.3 describes the clipping operation, (x <= 2) (4.3) (2 < x <= 10) (x > 10) Kulldorff’s method is applied under the Poisson model, and the semiparametric method still uses h(x) = x. The semiparametric method gives relatively higher power, apparently due to hard limiting. 0.4 Power 0.6 0.8 1.0 Truncated Poisson 0.0 0.2 Kulldorff Semi-1sd LWfocus1sd 5.0 5.5 6.0 6.5 lambda Figure 4.3: Power curves for one-sided tests in a clipped Poisson case. Scalar β, h(x) = x. The semiparametric method gives relatively higher power. 45 Quantized Normal I : In this and the next two examples we turn to count data generated by quantizing normal observations. This is motivated by real situations when the data are non-Poisson count data, but not knowing the true distribution the Poisson assumption is made nonetheless. In the present case the Poisson assumption is sensible up to a point as our simulation shows. The semiparametric method obviates this assumption. The quantized data were obtained from the integer part of the original normal data. The original normal samples share the same variance (σ 2 = 16), but the mean µ of the A samples ranges from the null µ0 = 9 to µ = 12. Kulldorff’s method is applied under the Poisson model, the semiparametric method uses h(x) = x, and the tests are one-sided. Figure 4.4 shows the resulting power curves. The focused test dominates both Kulldorff’s and the semiparametric tests, and the last two perform very similarly. However, from Figure 4.5, the situation changes dramatically for the same variance 16 but much higher means ranging from 50 to 53. This time the semiparametric test clearly dominates the two other tests. The situation here resembles that of the doubly truncated Poisson case depicted in Figure 4.3 since the quantized data stay away from very small and very large values with a high probability. 46 0.4 Power 0.6 0.8 1.0 Quantized Normal I (Low) 0.0 0.2 Kulldorff Semi-1sd LWfocus1sd 9.0 9.5 10.0 10.5 11.0 11.5 mean Figure 4.4: Power curves for one-sided tests applied to Quantized normal samples with the same variance 16 but different means. Scalar β, h(x) = x. The focused test gives higher power. 0.4 Power 0.6 0.8 1.0 Quantized Normal I (High) 0.0 0.2 Kulldorff Semi-1sd LWfocus1sd 50.0 50.5 51.0 51.5 52.0 52.5 mean Figure 4.5: Power curves for one-sided tests applied to Quantized normal samples with the same variance 16 but different relatively high means. Scalar β, h(x) = x. The semiparametric test clearly dominates the other two tests. 47 Quantized Normal II : Figure 4.6 shows the power resulting from two-sided tests applied to integer quantized normal data as in the previous example. The quantized samples are derived from normal data with the same mean µ = 13 but different variances, respectively, where the variance ranges from 4 (null) to 10. Kulldorff’s method is applied under the Poisson model, and the semiparametric method uses h(x) = (x, x2 )0 , a model suggested by the normal distribution. In this case the three semiparametric tests are much more powerful than the other two tests whose power is almost identical. Kulldorff’s method with the Poisson model and the focused count model seem not appropriate for the present case. 0.4 Power 0.6 0.8 1.0 Quantized Normal II 0.0 0.2 Kuldof2sd LR Chi2 Chi1 LWfocus2sd 4 5 6 7 8 9 10 variance Figure 4.6: Power curves for two-sided tests applied to quantized normal samples with the same mean but different variances. The semiparametric method uses h(x) = (x, x2 )0 , χ1 , χ2 , LR. The semiparametric tests markedly dominate the two other tests. 48 Quantized Normal III : This case is the same as the previous one except that both the means and the variances are different. Recall that here and elsewhere, the null hypothesis is equidistribution. The mean ranges from the null of µ = 20 to µ = 21, and the corresponding variance is σ 2 = 4 when µ = 20, and is σ 2 = 7 otherwise. From Figure 4.7 we see again that the three semiparametric tests are much more powerful than Kulldorff’s and the focused test. This and the previous examples give an indication that Kulldorff’s test and the Lawson-Waller focused test may not be suitable for integer samples with substantially different variances. 0.4 Power 0.6 0.8 1.0 Quantized Normal III 0.0 0.2 Kuldof2sd LR Chi2 Chi1 LWfocus2sd 20.0 20.2 20.4 20.6 20.8 mean Figure 4.7: Quantized normal case as in Figure 4.6 but with different means and different variances. The semiparametric tests clearly dominate the two other tests. From the above power results we find that Kulldorff’s method and the LawsonWaller focused test perform well for some types of count data but may lose power for count data with non-homogeneous variance, as well as count data which are far from 49 being Poisson. In contrast, the semiparametric method seems to perform relatively well under very different settings, without specifying any distribution except for the choice of the tilt function. In particular, our simulation results indicate that this semiparametric method is potentially useful across different types of data with changing regional means and/or variances. With a properly chosen tilt function h(x), the method can detect changes in both the mean and the variance. 4.2 Comprehensive Power Study: Overview We compared the power of Kulldorff’s and the semiparametric scan statistics methods in detecting potential clusters. Because there are various cluster patterns, with a single or multiple clusters, each cluster region may contain one or more counties or states. For simplicity, in our power comparison, exact accuracy is not required, we focus more on the existence instead of precise delineation of the cluster region . For instance, for a data set with a pattern of multiple cluster regions and multiple counties in each cluster region, we deem the detection successful whenever a significant q-value is obtained. We do not strictly require the detected cluster region to be exactly the same as originally simulated. The detected cluster region could fully or only partially cover the desired area. 50 4.3 Comprehensive Power Study: Binary Data 4.3.1 Binary Data Set and Simulation Plan For binary data scans, we use the Northeastern U.S.A. purely spatial benchmark data consisting of 245 counties in northeastern United States, from Maine, New York, Rhode Island, Pennsylvania, Maryland, Washington DC, among others (Kulldorff et al., 2003) [32]. Each county is graphically represented by its centroid coordinates. The case data, the numbers of people who have breast cancer, are aggregated to county level with the total number of cases in northeastern states being fixed. The population data of each county are based on the female population of the 1990 census. The benchmark data set contains two types of data, hot-spot clusters and global clustering data. Figure 4.8 is a map of the northeastern U.S. states. The following briefly describes the simulated data and data sets. See Kulldorff et al. (2003) for details [32]. Hot-spot clusters: Data are generated by a first-order clustering model, where cases are located independently of each other and the relative risk is different in different geographical areas. In this US northeastern states benchmark data set, the cluster region can be either a single region containing one or more counties, or a collection of multiple regions, where the risk of breast cancer is much higher than in the rest of the area. Three types of cluster, rural, urban, and mixed, are generated depending on the location of the cluster. A rural cluster is a region which has a small population relative to a large graphical area, such as Grand Isle County 51 Figure 4.8: sketch map of the US northeastern states. in northern Vermont close to the Canadian border. An urban cluster is a region which has a large population relative to a small graphical area, such as New York County which includes Manhattan. A mixed cluster is a region where a big city is surrounded by rural areas, such as Allegheny county in western Pennsylvania where Pittsburgh is located. Global clustering: Data are generated by purely second-order clustering model, where any one particular case is randomly located, so that the relative risk is constant throughout the whole study region, but the location of cases are dependent on each other. Thus, under the alternative hypothesis of global clustering, cases are clustered wherever they occur in the region. In this benchmark data set, a certain number of cases are first generated to be randomly located throughout the whole northeastern states. These original cases then generate other new cases close 52 by. If each original case generates one additional case, we call them twins; if two additional cases are generated, we call them triplets. The case generation is based on a global chain rN -nearest neighbor rule. The global chain is constructed by a Hamiltonian cycle chain which passes through as many counties as possible exactly once, and any two counties next to each other on the chain always border each other graphically. For twins, the additional case is assigned to county j if P k I(dik < dij )nk < rN ≤ county k, N = P k P k I(dik ≤ dij )nk , where nk is the population size of nk is the total population size, r is some constant in the interval (0, 0.5), and dij is the distance in one particular direction along the chain connecting county i and county j. For triplets, the two new cases are assigned in opposite directions along the chain. Data sets corresponding to different r were generated, where r is either deterministic or randomly selected from a probability distribution. Notice that although the first- and the second-order clustering models are very different in generating the cases, the resulting point patterns may look quite similar, and hence indistinguishable. This benchmark data set includes two groups of data with a total of 600 and 6000 simulated cases, respectively, for both hot-spot and global clustering data sets. The same null hypothesis of no cluster is used throughout where the relative risk for each county is equal, and the cases as well as their locations are independent of each other. In order to perform power comparison, 100000 random data sets with a total of 600 and 6000 cases were generated under the null hypothesis, respectively. These are used to estimate the critical cut-off point of significance. For each alternative 53 hypothesis of clusters which are called scenarios in this paper, 10000 random data sets were generated to estimate the power using the previous determined cut-off points. For each group of fixed total cases, Kulldorff generated 35 hot-spot clustering scenarios and 26 global clustering scenarios for his power comparison. For instance, a scenario of “rural and urban 600, size 4” means a total of 600 cases were generated under the alternative hypothesis that the study region has two hot-spot clusters. One cluster is in a rural region including four counties, and the other one is in an urban region including four counties as well. A scenario of “global clustering twin 6000, exponential 0.02” means a total of 6000 cases were generated under the alternative hypothesis of global clustering. The value r is randomly generated from an exponential distribution with parameter 0.02. In this paper, we did not use all the scenarios in the Northeastern US benchmark data. Instead, we randomly chose one or two scenarios from each clustering pattern. Finally, 9 hot-spot clustering scenarios and 6 global clustering scenarios are used in our power study for the binary data. After selecting these scenarios, the same data sets in each scenario were used for Kulldorff’s method with the Poisson model (the Bernoulli model is also appropriate) and the semiparametric method was applied with the tilt function h(x) = x. In addition, the simplified semiparametric likelihood ratio test statistic (eq. 3.16) is used to accelerate the computation. All the tests are two-sided to detect for either high or low valued clusters. 54 4.3.2 Results for Binary Data The results of power comparison for the binary case-population data using the northeastern US benchmark data set are shown in Figures 4.9 and 4.10 for a total number of 600 and 6000 cases, respectively. All tests are two-sided tests. In each figure, scenario 0 is under the null hypothesis of no cluster, scenarios 1 to 9 are hot-spot clustering scenarios, and scenarios 10 to 15 are global clustering scenarios. Each scenario contains five quantities. They are “Kull.paper”, “Kull.me”, “SemiwFDR=0.1”, “SemiwFDR=0.05”, and “Bonferroni”. “Kull.paper” is the corresponding power copied from Kulldorff et al. (2003) paper. “Kull.me” is the corresponding power computed by us based on Kulldorff Poisson model. The purpose of including “Kull.paper” here is to make sure our programming and computation are correct. If we are correct, the results of ”Kull.me” should be similar to those in “Kull.paper”. From Figures 4.9 and 4.10, we can see that they are almost equal, which confirm the validity of our computation. So in the latter of this section, we will use “Kulldorff’s method” without distinguishing these two power results. “SemiwFDR=0.1” is the corresponding power computed based on a q-value significance level 0.1. “SemiwFDR=0.0.5” is the corresponding power computed based on a q-value significance level 0.05. The smaller the q-value significance level is, the harder it is to reject the null hypothesis, hence the lower the power to detect the cluster. “Bonferroni” is the corresponding power computed based on Bonferroni correction with the family-wise error rate 5%. Because Bonferroni correction is a popular but a conservative approach to handle multiple testing problems, we also 55 included it here to compare with the FDR methodology. We expect “Bonferroni” to have the lowest power among the five. For scenario 0, both Figures 4.9 and 4.10 show that the type I errors of Kulldorff’s method are all exactly 0.05 for both 600 cases and 6000 cases. This is expected since Kulldorff’s method uses a Monte Carlo procedure to derive the cutoff point for the corresponding significance level. Since the significance level for Kulldorff’s method in our power study is 0.05, the power computed from Kulldorff’s method under the null hypothesis, which is the type I error, must be 0.05. The power from semiparametric method under the q-value significance level of 0.1 is 0.052 for the 600 cases, and 0.054 for the 6000 cases. It means that, if we allow a higher false discovery rate such as 10%, which is in favor of the alternative, the type I error of semiparametric method is slightly higher than Kulldorff’s. If the q-value significance level is chosen as 0.05, the type I error of the semiparametric method reduced to 0.027 for both cases. This is also expected since lower q-value significance level works in favor of the null hypothesis. It shows that if one wants to make the power comparison under exactly the same type I error level, say 0.05, the q-value significance level must be set in the interval between 0.05 to 0.1. In addition, the type I error of the semiparametric method with Bonferroni correction is the lowest, which is not a surprise since Bonferroni correction is the most conservative. For scenarios No. 1 to 9, both figures show that the two methods work very well in detecting hot-spot clusters, but Kulldorff’s method seems to be slightly more powerful. When the study area has a stronger pattern of clustering, such 56 Figure 4.9: Power comparison between Kulldorff’s and the Semiparametric with likelihood ratio test methods for binary type data using the northeastern US benchmark data with 600 simulated cases. 57 Figure 4.10: Power comparison between Kulldorff’s and the Semiparametric with likelihood ratio test methods for binary type data using the northeastern US benchmark data with 6000 simulated cases. 58 as containing more cluster regions, the power of both methods increases, which demonstrates the validity of the two methods. For instance, scenario No. 6 in figure 4.9 has a total of 600 cases and two cluster regions where each cluster region contains 16 counties. The power of Kulldorff’s method is 0.996, whereas the power of semiparametric method with q-value significance level of 0.1 and 0.05 obtain the power 0.996 and 0.993, respectively, which is quite close to the power of Kulldorff’s. The semiparametric method with Bonferroni correction is 0.968, which is expected to be the lowest, but still is quite reasonable. For scenarios No. 10 to 15, both figures show that the two methods do not do very well compared with the results in hot-spot detection. This is so because both Kulldorff’s and the semiparametric methods are not designed to detect global clustering pattern. The figures show that Kulldorff’s method is slightly more powerful than semiparametric method with an exception of scenario No. 14. It is also shown that for both methods, the larger the r is, the lower is the detection power. 4.4 Comprehensive Power Study: Non-binary Data 4.4.1 Non-binary Data Set and Simulation Plan For non-binary data scan, we use the simulated ordinal categorical data with one data point corresponding to one observation. The data are aggregated to state levels distributed in 18 states. Most of them are middle south states, including Alabama (AL), Arkansas (AR), Texas (TX), Virginia (VA), etc. Each state is 59 graphically represented by its centroid coordinate. Figure 4.11 shows the map of the states included in our simulated study. Figure 4.11: Map showing the states included in the simulation denoted with color and the abbreviation of state names. The state Illinois with red color is illustrated as one possible cluster region in our simulated data. The ordinal categorical data are integer data generated from quantized normal data obtained from the integer part of the original normal data. We refer to quantized normal II and quantized normal III data as described in the limited power study section. Also see Kedem and Wen (2007) [26]. The quantized normal II data are derived from normal data with the same mean but different variance inside and outside the cluster region, whereas the quantized normal III data are derived from normal data with both different mean and variance inside and outside the cluster region. To see how Kulldorff’s and the semiparametric methods perform when the 60 difference between the cluster and the non-cluster region is small, namely a more difficult cluster detection problem, the data set “Quantized Normal III (small)” was generated. Table 4.1 lists the mean and variance parameters used to generate the quantized normal data. Table 4.1: Parameters for Simulating the Ordinal Categorical Data from Quantized Normal Data Type Inside the cluster Outside the cluster Quantized Normal II µ = 13, σ 2 = 8 µ = 13, σ 2 = 4 Quantized Normal III µ = 7.2, σ 2 = 13 µ = 6, σ 2 = 9 Quantized Normal III (small) µ = 6.5, σ 2 = 13 µ = 6, σ 2 = 9 Figure 4.12: Box Plots of the Simulated the Ordinal Categorical Data 61 For each type of data, the average sample size within each state is around 130, hence there is a total of 130 × 18 = 2340 observations in each generated data set. To perform power comparison within a reasonable time scale, we generated 100 random data sets with a single cluster and also multiple clusters for both Quantized Normal II and Quantized Normal III data, respectively. The single clusters means only one state was randomly chosen as the cluster region. By multiple clusters we mean four states were randomly chosen as the cluster region, where the four states are not necessarily contiguous. Notice that multiple clusters constitute a stronger clustering pattern which in general is easier to detect. In our power comparison using these non-binary ordinal categorical data, the same data sets were used for both methods. Kulldorff’s method was applied with the Poisson model (inappropriate model) and with the ordinal model (correct model), while the semiparametric method was applied with the vector tilt function h(x) = (x, x2 ). We used Kulldorff’s SaTScan v7.0.1 software to conduct the cluster detection by the ordinal model. The results of the power comparison for the ordinal categorical data are shown is Figures 4.13 and 4.14. Observe that it is appropriate to choose h(x) = (x, x2 ) for binary data as well, because in that case the coefficient of x2 term is 0. See more details in the discussion section. 4.4.2 Results for Ordinal Categorical Data The results of power comparison for non-binary ordinal categorical data using the generated middle south US data set are shown in Figures 4.13, 4.14 and 4.15. 62 All the tests are two-sided tests. The significance level for Kulldorff’s method is still 0.05, and the q-value significance level for the tests in semiparametric method is also set at 0.05. This means the true type I error level of the semiparametric method is lower than Kulldorff’s as explained above. For ordinal categorical data which are generated from the quantized normal II type, figure 4.13 shows that semiparametric method with the likelihood ratio test has the highest power of detecting potential clusters among all the tests. The semiparametric method with the χ1 test works well but not as well as the likelihood ratio test. This is in line with the limited power study in Kedem and Wen (2007) [26] who showed the likelihood ratio test was the most powerful tests among three tests from the semiparametric density ratio model. Moreover, when the clustering pattern is stronger, in this case, when the study region contains multiple clusters, the power of χ1 test increase to 0.94, which is almost close to the power of the likelihood ratio test. The power of Kulldorff’s method with Poisson model is very low, because the Poisson model is inappropriate when the variance is not constant Kulldorff’s method with ordinal model is comparable to the semiparametric method with the χ1 test in detecting potential clusters. 63 Figure 4.13: Power comparison between Kulldorff’s and the Semiparametric methods for ordinal categorical data generated from quantized normal II data, where the means are the same but the variances are different, between the cluster region and the rest of the area. 64 For ordinal categorical data which are generated from the quantized normal III type, figure 4.14 shows that semiparametric method performs in the same way as in the quantized normal II case. The likelihood ratio test still has the highest power in both single and multiple clusters situation. The power of all tests increases as the clustering pattern becomes stronger. Kulldorff’s method with the ordinal model works as well as the semiparametric method. Kulldorff’s method with the Poisson model works but less powerful as compared with the other tests. This is because for the quantized normal III data, Kulldorff’s Poisson model can detect changes in the mean while ignoring changes in variances. Figure 4.14: Power comparison between Kulldorff’s and the Semiparametric methods for ordinal categorical data generated from quantized normal III data, where both means and variances are different inside and outside the cluster region. 65 Figure 4.15(a) shows the power results when the difference between the cluster and non-cluster regions is small. The data are still ordinal categorical data generated from the quantized normal III data. However, this time the difference of the mean in the cluster region is set to be very close to the the non-cluster region, while the variance inside and outside the cluster region is kept the same as in the previous case (refer to Table 4.1 for the simulation parameters). In this way, the cluster is more difficult to detect. Not surprisingly, for the single cluster case, the detection power of both Kulldorff’s and the semiparametric method significantly decreases due to the weaker cluster pattern. The power for the likelihood ratio test from the semiparametric method and Kulldorff’s method with the ordinal model continue achieving the highest power, although it is much lower than in the previous case where the differences inside and outside the cluster region are substantial. The power of the χ1 test of the semiparametric method also decreases, and Kulldorff’s method with the Poisson model continues yielding the lowest power which is almost 0. The multiple cluster case is similar to the single cluster case but with a higher power. Interestingly, if we look at the accuracy, which means detecting the true cluster region correctly with its exact size, it is shown as in figure 4.15(b) that the most accurate method is the semiparametric method with the likelihood ratio test statistic, which has a more than 50% higher accuracy rate than Kulldorff’s method with the ordinal model. 66 Figure 4.15: Power comparison between Kulldorff’s and the Semiparametric methods for ordinal categorical data with small differences. The data are in quantized normal III small type. (a) Existence. (b) Accuracy. 67 Chapter 5 Data Analysis examples Chapter 3 illustrates the Semiparametric cluster detection method using simulated data. In Chapter 4, it is shown that the Semiparametric method achieves comparable power to that of the celebrated Kulldorff’s method. In some cases, the Semiparametric method has a higher power. In this chapter, I apply the Semiparametric cluster detection method to real data. 5.1 North Humberside Childhood Leukemia Data Both the Semiparametric method with the likelihood ratio test using tilt function h(x) = x and Kulldorff’s method with the Bernoulli model are applied to a real data set from Kulldorff’s satscan website http : //www.satscan.org/. It gives the spatial location of 62 cases of childhood leukemia and lymphoma in North Humberside, England, between 1974 and 1986, as well as 141 controls (Alexander et al., 1990) [1]. The scientific question is to see if there is some region with a higher disease rate. A snapshot of part of the data set is shown in Figure 5.1, and the spatial locations of the region’s postal zones are shown in Figure 5.2(a). A circular scanning window was used and moved across all postal zones with a variable size ranging from roughly the size of a postal zone to no more than 20% of the study region. Both Kulldorff’s scan statistics and the Semiparametric density 68 Figure 5.1: Snapshot of part of the North Humberside childhood leukemia and lymphoma data set 69 ratio method point to the same cluster shown in Figure 5.2(b), consisting of postal zones (14, 18, 19, 26), as the primary cluster candidate. However, from the software SatScan version 7.0.1 described at http : //www.satscan.org/, Kulldorff’s method gives a p-value of 0.674, nonsignificant, whereas the Semiparametric p-value without adjusting for multiple testing is 0.002. The Semiparametric method coupled with the FDR control gives a q-value of 0.073, which is on the boundary, suggesting that the detected cluster could be a true cluster. Thus, the two approaches lead to very different conclusions as expressed by very different p-values. Which one is correct? Working with the same data, Cuzick et al. (1990) found that the true cluster is likely to consist of 4 postal zones [1]. Moreover, environmental studies by Colt and Blair (1998) and Mckinney et. al. (1991) reported that the association between childhood leukemia and paternal exposure to solvents was quite strong, and that a global cluster was located in North Humberside [5, 41]. Thus, the results from the Semiparametric method are more in line with the medical and environmental studies, and the located cluster candidate as well as its significance given by the Semiparametric method seems more credible. More conclusive results may be obtained by increasing the sample size. 5.2 Maryland-DC-Virginia Crime Data Besides cancer research and epidemiology studies, another important application area of cluster detection is in crime mapping [12, 37]. Actually criminology has a long history of using mapping techniques, such as “colored pin maps”, to help 70 Figure 5.2: (a.) The geographical map. (b). The detected cluster candidate in red. 71 police officers improve public safety. For instance, through mapping crime occurrences, police officers or investigators can determine regions with high crime rate, or figure out the route of drug flow, and so on. This helps the criminal justice or law enforcement specialists to optimize the allocation of resources. In recently years, the advance of computer technology as well as geographic information systems (GIS) make crime mapping widely available [55, 39, 69]. In this section, we apply our Semiparametric cluster detection method with both the χ1 test and likelihood ratio test to detect crime clusters (both hot-spot and cold-spot), if any, using the the 2001 - 2004 data set of the annual number of arrests in Maryland-DC-Virginia since it is believed that high arrest rates indicates high crime rates. The data are from National Consortium on Violence Research (NCOVR) website, which is a research, training, and data resource specializing in violence research. NCOVR was funded in 1995 by a grant from the National Science Foundation in cooperation with the National Institute of Justice. For more information of NCOVR, see its website at (http://www.ncovr.heinz.cmu.edu/). The annual number of arrests data I used in this case study are aggregated to county level (159 counties in total). To adjust the population, I used the 2002 Population data and assume it is fixed throughout the period 2001 to 2004. The spatial coordinates are the longitude and latitude of the center of each county. In reality, there were minor changes in the population during four years, but the changes were small relative to the total population, and it doesn’t hurt to regard the population as constant in a short period. Figure 5.3 gives a snapshot of the crime 72 data in year 2001. Figure 5.3: Snapshot of part of the Maryland-DC-Virginia Crime data set A circular window is used to scan the whole region aiming to detect both high and low risk clusters. The window size is varied ranging from roughly the size of including one county to a maximum of 50% of the study region. The results from the Semiparametric cluster detection method are listed in Table 5.1. It shows that from year 2001 to 2004, although the arrest rate (number of arrests divided by the population) is changing, the primary high crime risk cluster always includes Baltimore county and Baltimore city centered at Baltimore county. The average arrest rate in this Baltimore cluster is four to five times higher than the rest of region. The p-value and q-value are both less than 0.001 showing significant high 73 crime risk. The primary low risk cluster includes Montgomery county, Howard county, Fairfax county, District of Columbia, Falls Church (City), Arlington county, Fairfax City, Alexandria City, Prince George’s county, Loudoun county, centered at Montgomery county. The p-value and q-value are also both less than 0.001 showing significant low crime risk. Figure 5.4 shows the primary high and low crime risk clusters. Table 5.1: Results of High and Low Crime Risk Cluster from Yr 2001∼2004 Primary Cluster High Risk Low Risk Baltimore County Montgomery county 2 counties and cities 12 ∼ 15 counties and cities Relative arrest rate Ratio 4.31 ∼ 5.05 0.31 ∼ 0.36 p-value (from χ1 and LR) < 0.001 < 0.001 q-value (from χ1 and LR) < 0.001 < 0.001 Cluster center Cluster size Figures 5.5 to 5.8 show the arrest rates of each county for each year. The figures also demonstrate the constancy of the primary high and low risk clusters. The results are not surprising. They are all consistent with the economic and demographic factors. Throughout 2001 to 2005, on average, the arrest rate of Baltimore is four to five times higher than the average of the rest of the three states. Moreover, this ratio appears to be increasing. According to crime statistics there were 269 homicides in Baltimore in 2005, giving it the highest homicide rate per 100,000 of all U.S. cities of 250,000 or bigger population [7]. The homicide 74 Figure 5.4: Maryland-DC-Virginia High and Low Risk Crime Cluster. Red means high crime risk cluster, Navy means low crime risk cluster. rate in Baltimore is nearly seven times the national rate, six times the rate of New York City, and three times the rate of Los Angeles. In 2007, the CNN/Morgan Quitno “Most Dangerous City” Rankings (2007) ranks Baltimore as the 12th most dangerous American city. Baltimore is second only to Detroit among cities with a population over 500,000 [8, 9]. The high risk of crime has troubled Baltimore for years. The main reasons for the high crime rate in the Baltimore area are illegal drug trade, dreadful public schools, and lack of jobs. Besides strengthening the police force, it is essential to improve the education level of the school system, cut off drug trade lines, and provide more jobs. On the other hand, the Montgomery County cluster as well as its nearby counties including Fairfax County, etc. have a national reputation for their public educa75 Figure 5.5: Maryland-DC-Virginia Arrest Rate by County in Year 2001 76 Figure 5.6: Maryland-DC-Virginia Arrest Rate by County in Year 2002 77 Figure 5.7: Maryland-DC-Virginia Arrest Rate by County in Year 2003 78 Figure 5.8: Maryland-DC-Virginia ArrestRate by County in Year 2004 79 tion system. Students in Montgomery county public schools score among the top in the United States on Advanced Placement Examinations [71]. The Fairfax County government spends more than half of its fiscal budget on its education system. Its county public school system contains the Thomas Jefferson High School for Science and Technology (TJHSST), a Virginia Governor’s School. TJHSST consistently ranks at or near the top of all United States high schools due to the extraordinary number of National Merit Semi-Finalists and Finalists, the high average SAT scores of its students, and the number of students who annually perform nationally recognized research in the sciences and engineering [72]. In addition, Montgomery County has a very lucrative business climate. It is the epicenter for biotechnology in the Mid-Atlantic region, the third largest biotechnology cluster in the nation. There are many large firms and federal agencies located in Montgomery County, including Lockheed Martin, Marriott International, BAE Systems Inc, Genentech, National Institutes of Health (NIH), Food and Drug Administration (FDA), National Institute of Standards and Technology (NIST), and so on. Those companies and organizations attract a lot of highly educated work force and provide tremendous work opportunities. Thus, it makes the crime rate here remains consistently low. Interestingly, DC and Prince George’s county are also included in this low crime risk cluster. It seems a little different than what you expect because these two regions are known as violent areas. But the data show that the arrest rates in DC and PG county are 0.9% and 3%, respectively. Both of them are less than 4.7%, 80 the overall arrest rate of the three states. This may be due to a large population, although the absolute number of arrests are high. Another reason could be that it is hard for the circular scanning window also to delineate the exact boundary of the cluster. It is also noticed that Worcester County of Maryland where Ocean City is located is a region with a second highest arrest rate, which is almost three times than the overall average. However, the results from the Semiparametric cluster detection method doesn’t show that Worcester County is a secondary cluster candidate. The high arrest rate in Worcester County may due to its small population. 81 Chapter 6 Summary and Discussion In this dissertation, I develop a cluster detection approach by using a Semiparametric method applied to moving windows of variable size as suggested by Kulldorff’s method. The only assumption needed regards the exponential tilt function h(x), but unlike Kulldorff’s method, no specific distributional assumptions are necessary, and the testing procedure with m = 2 is quite simple. Likewise, there is no need to know the number of cases a priori. The practical potential of the method was demonstrated with real and artificial data. It successfully detects potentially high risk clusters (hot-spot) as well as low risk clusters (cold-spot). In addition, the significant tests of the Semiparametric method use χ2 test to obtain the p-value directly, so there is no need to run the time consuming Monte Carlo testing procedure. As an example, for the Maryland-DC-Virginia crime data set in Chapter 5, using my acer laptop, Intel Centrino 1.6GHz CPU, 512M memory, the Semiparametric method in the Splus environment costed around 8 seconds to get the final results. But Kulldorff’s SatScan v7.0.1 software costed around 50 seconds to get the results. The results of the power study show that when detecting localized clusters, both the Semiparametric and Kulldorff’s method achieve comparably good power. For binary population-case data, Kulldorff’s method with the Poisson model may have a slightly higher power than the Semiparametric method with tilt function 82 h(x) = x. For non-binary data, such as ordinal categorical data, the Semiparametric method with tilt function h(x) = (x, x2 )0 is slightly more powerful than Kulldorff’s method with an ordinal model. If Kulldorff’s method is applied with an inappropriate model, that is using the Poisson model to analyze ordinal categorical data, it may fail to detect any potential clusters. For instance, Kulldorff’s method with the Poisson model obtains a very low power for quantized normal II data, while it still works for quantized normal III data, but with a relatively lower power compared with the ordinal model. We also find that in our Semiparametric method the likelihood ratio test seems to have a higher power than the χ1 test in detecting potential clusters. When the localized clustering pattern is strong, for instance, multiple cluster regions or the difference inside and outside the cluster region is large, both tests obtain good power. When the clustering pattern is weak, that is, the difference is not that large, the likelihood ratio test seems to be more acuminous, while the χ1 test could be insensitive to undesired fluctuations. On the other hand, the Likelihood ratio test only can test ”either high or low values”, but χ1 test can be easily transformed into a one-sided test if the tilt function has a scalar form. In practice, it is prudent to use both tests whenever possible for potential clusters. In the power study for non-binary data, we only used the ordinal categorical data, which are integer data, to compare the power of the two methods. However, Kulldorff’s method also offers the exponential model to analyze survival time data and a normal model to analyze continuous data. A future study could compare the power of the Semiparametric method and Kulldorff’s method for continuous data. 83 We expect the Semiparametric method to work well, since the Semiparametric density ratio model was originated designed for continuous data. In addition, Semiparametric model provides a more consistent setup than Kulldorff’s method. Kullorff’s method requires different models, namely different scan statistics, for different types of data, whereas the Semiparametric method requires no specific distributional assumptions except for the exponential tilt function h(x). In practice, the choice of h(x) = (x, x2 )0 is appropriate for many types of continuous and discrete data. If the underlying distribution is known exactly, we choose the true tilt function h(x) to get the best performance. For instance, if the data are from Bernoulli or Poisson distribution, we use the tilt function h(x) = x; if the data are from normal, we use h(x) = (x, x2 )0 ; if the data are from Gamma distribution, we can use h(x) = (x, log x)0 . We may choose the tilt function as h(x) = (x, x2 )0 for binary data as well although the x2 term is not necessary. Appendix A.2 shows that the power of the Semiparametric method does not change since the parameter associated with x2 is 0. Simulation studies have demonstrated that if a term in the tilt function is not necessary, the parameter associated with that term is close to 0 also. A clue of how to choose a satisfactory h(x) for a given situation can be derived from common exponential families (recall Chapter 3). If the underlying distribution is not known, there could be a problem of a misspecified tilt function. Fokianos and Kaimi (2006) demonstrates that a misspecified h(x) could decrease the power of the corresponding tests [14, 15]. Yet, there are examples where very different choices of h could lead to similar test results. For instance, in an application to meteorological data in Kedem 84 et al. (2004), the choice of h(x) as x or log x led to very similar test results [25]. To check whether the assumption of the exponential tilt density ratio model holds, Qin and Zhang (1997) [53] propose a Kolmogorov-Smirnov type statistic to test the goodness of fit of the density ratio model for two-sample case, which also gives a guidance of an appropriate choice of h(x). Lu (2007) extend this goodness of fit test to m-sample case [40]. Interestingly, Kulldorff’s method and the Lawson-Waller focused score test may still perform well even for non-Poisson count data as long as the variance of the observations over the study region does not change much. However, these two methods seem to lose power when the variance changes appreciably over the region. As for the Semiparametric method, it seems that for a non-homogeneous regional variance the choice of h(x) = (x, x2 )0 suggested by the normal distribution is sensible. Shmueli et al. (2006) [57] revive the Conway-Maxwell-Poisson (CMP) distribution to fit discrete data. The CMP distribution is a two-parameter extension of the Poisson distribution that generalizes some well-known discrete distributions (Poisson, Bernoulli and geometric). In this sense, h(x) = (x, logx!) derived from the CMP distribution could also be used as an alternative to h(x) = (x, x2 ) for discrete data. The Semiparametric density ratio model essentially tests the homogeneity or equidistribution of two or more samples, therefore, besides Kulldorff’s circular scan window, the Semiparametric method may also adapt to other shapes of the scanning window or scanning schemes, such as the elliptic window scan, Patil and Taillie’s up- 85 per level set scan and Tango’s flexible scan mentioned in Chapter 1. More precisely, in scanning for clusters, and regardless of regular or irregular shapes of the scanning window, as long as the window separates the whole study region into two samples, one inside the window and one outside the window, the Semiparametric method can be applied. However, the Semiparametric method ignores the information about the location of a cases except whether a case is inside or outside the current window. Thus the Semiparametric method may not have good power for global type clustering clusters as shown in scenarios 10 to 15 in Figure 4.9 and Figure 4.10 where clustering occurs throughout the study region. It is also important to keep in mind that whatever the shape of the most likely cluster, it only indicates the general area of the true underlying cluster, and that the exact boundary of the detected clusters is uncertain. This is sufficient for most practical purposes, as the Semiparametric cluster detection method’s main purpose is to generate a signal with a general idea of where an outbreak or higher than normal activity has occurred. More detailed information about the outbreak, its cause, nature and extent, can only be obtained through detailed investigations by specialists in corresponding areas, who should not only focus on the area within the most likely cluster, but also on neighboring localities. The exact choice of shapes is not of critical importance. If computing resources allow, better results may be obtained using an irregular rather than a circular scan window, depending on the shape of the true underlying cluster. For valid statistical inference, it is important that the choice of the scan window is made a priori though, before analyzing the 86 data, in order to avoid pre-selection bias. A limiting factor of the complete power study in this dissertation is that for non-binary type data we only used 100 runs at one point for each power comparison. That is because it is tedious and time consuming to run SatScan software and document the results manually. In addition, the Semiparametric method also takes longer time to numerically estimate the parameters α and β especially when the combined sample size is large, while for binary scan the MLEs of α and β are available in a closed form. However, although 100 doesn’t sound like a large number in simulation, the results are reasonable. From Figures 4.13 and 4.14, it is already clear that the Semiparametric method achieves comparable good power as Kulldorff’s method with the correct model. If Kulldorff’s method is applied with an inappropriate model, the power may decrease a lot. Figure 4.15 demonstrates that when the difference between the cluster and non-cluster region is small, the detecting power for both methods decreases, but the likelihood ratio test of the Semiparametric method seems better in term of accuracy. A last note is about the π0 in Storey’s q-value method which is used to take into account the multiple testing problem. The critical part of q-value method of controlling the false discovery rate is to give a good estimate of π0 , the proportion of the true null hypotheses among all the tests. The current method we used in this study is based on the algorithm suggested by Storey et al. (2003) which assumes the distribution of p-value from each test is uniform over the (0,1) interval. However, Yang (2004) pointed out that if the p-values were not uniformly distributed, the 87 power of the q-value method may decrease [73]. He suggested to compute a weighted average of π0 from the distribution of the raw p-values which are greater than a threshold (say 0.4). Thus, it gives a better control and more robust estimate of π0 . It is also worthwhile to try some other good multiple-testing methods as well. 88 Chapter A Appendix A.1 Derivation of S, V The entries of the matrices S, V are derived by repeated differentiation of the equation (3.7) based on the fact that R dG(t) = 1 and R ωj (t)dG(t) = 1, j = 1, . . . , q. This is an extension of Fokianos et al. (2001) [13] to the vector case of the tilt function h(x). First define µ ∇≡ ∂ ∂ ∂ ∂ , ..., , ..., ∂α1 ∂αq ∂β 1 ∂β q ¶0 (A.1) Then E[∇l(α1 , ..., αq , β 1 , .., β q )] = 0. To obtain the score second moments it is convenient to define ρm ≡ 1, wm (t) ≡ 1, Z Ej [h(t)] ≡ h(t)wj (t)dG(t) (A.2) and, Z wj (t)wj 0 (t)dG(t) P 1 + qk=1 ρk wk (t) Z h(t)wj (t)wj 0 (t)dG(t) 0 P A1 (j, j ) ≡ 1 + qk=1 ρk wk (t) Z h(t)h0 (t)wj (t)wj 0 (t)dG(t) 0 P A2 (j, j ) ≡ 1 + qk=1 ρk wk (t) 0 A0 (j, j ) ≡ (A.3) (A.4) (A.5) for j, j 0 = 1, ..., q. Then, the entries in · ¸ 1 V ≡ V ar √ ∇l(α1 , ..., αq , β 1 , ..., β q ) n 89 (A.6) are, ¶ µ m X ρ2j 1 ∂l P V ar = {A0 (j, j) − ρr A20 (j, r)} n ∂αj 1 + qk=1 ρk r=1 µ ¶ 1 ∂l ∂l ρj ρj 0 P Cov , = {A0 (j, j 0 ) n ∂αj ∂αj 0 1 + qk=1 ρk m X − ρr A0 (j, r)A0 (j 0 , r)} 1 Cov n µ ∂l ∂l , ∂αj ∂β j µ ∂l ∂l , ∂αj ∂β j 0 ¶ = µ ∂l ∂l , ∂β j ∂β j 0 1+ m X ρ2 Pjq k=1 ρk {A0 (j, j)Ej [h0 (t)] ρr A0 (j, r)A01 (j, r)} (A.9) r=1 ¶ = − 1 Cov n (A.8) r=1 − 1 Cov n (A.7) 1+ m X ρj ρj 0 Pq k=1 ρk {A0 (j, j 0 )Ej 0 [h0 (t)] ρr A0 (j, r)A01 (j 0 , r)} (A.10) r=1 ¶ = 1+ ρj ρj 0 Pq k=1 ρk {−A2 (j, j 0 ) + Ej [h(t)]A01 (j, j 0 ) + A1 (j, j 0 )Ej 0 [h0 (t)] − m X ρr A1 (j, r)A01 (j 0 , r)} r=1 nj nj 0 1 XX Cov[h(²ji ), h(²j 0 k )] + n i=1 k=1 (A.11) The last term is 0 for j 6= j 0 and (nj /n)V ar[h(²j1 )] for j = j 0 . Next, as n → ∞, 1 − ∇∇0 l(α1 , ..., αq , β 1 , ..., β q ) → S n (A.12) where S is a q(1 + p) × q(1 + p) matrix with entries corresponding to j, j 0 = 1, ..., q, ρ 1 ∂ 2l Pjq → − 2 n ∂αj 1 + k=1 ρk 1 ∂ 2l − n ∂αj αj 0 −ρj ρj 0 P → 1 + qk=1 ρk Z Z P [1 + qk6=j ρk wk (t)]wj (t) P dG(t) 1 + qk=1 ρk wk (t) (A.13) wj (t)wj 0 (t) P dG(t) 1 + qk=1 ρk wk (t) (A.14) 90 Z 1 ∂ 2l − n ∂αj ∂β 0j → 1 ∂2l − n ∂αj ∂β 0j 0 −ρj ρj 0 P → 1 + qk=1 ρk − 1 ∂ 2l n ∂β j ∂β 0j 1+ ρ Pjq k=1 ρk [1 + Z Pq ρk wk (t)]wj (t)h0 (t) Pq dG(t) 1 + k=1 ρk wk (t) k6=j (A.15) wj (t)wj 0 (t)h0 (t) P dG(t) (A.16) 1 + qk=1 ρk wk (t) P Z [1 + qk6=j ρk wk (t)]wj (t)h(t)h0 (t) ρj P P → dG(t) 1 + qk=1 ρk 1 + qk=1 ρk wk (t) (A.17) 1 ∂ 2l − n ∂β j ∂β 0j 0 −ρj ρj 0 P → 1 + qk=1 ρk Z wj (t)wj 0 (t)h(t)h0 (t) P dG(t) 1 + qk=1 ρk wk (t) (A.18) It should be noted that, due to profiling, the matrix S is not the usual information matrix although it plays a similar rote. Thus when the density ratio model (3.1) holds for the true parameters α0 and β 0 , it follows under the regularity condition that α̂ and β̂ are both consistent and asymptotically normal as in (3.10) (see Sen and Singer 1993, chapter 5) [56], √ α̂ − α0 ⇒ N(0, Σ) n β̂ − β 0 where Σ = S −1 V S −1 91 A.2 Simplified Semiparametric Test Statistics for Binary Data If the data are 0-1 binary data, such as cancer or no cancer, we can simplify the likelihood and obtain closed forms for the parameter estimates. Recall: NG : The combined sample size for the whole study region. nG : The number of cases in the whole study region. NZ : The sample size within the scan window. nG : The number of cases within the scan window. ti : The ith observation from the combined sample in the whole study region, i = 1, . . . , NG . ρ1 : The relative sample size, which is equal to NZ /(NG − NZ ). xZj : The jth observation within the scan window Z, j = 1, . . . , NZ . For binary data, choose the tilt function h(x) = x. Then the profile loglikelihood with parameters α1 and β1 is `(α1 , β 1 ) = − NG X α1 +β1 ti log[1 + ρ1 e i=1 ]+ NZ X (α1 + β1 xZj ) j=1 · ¸ NZ α1 = −(NG − nG ) · log 1 + e NG − NZ ¸ · NZ α1 +β1 e − nG · log 1 + + NZ · α1 + nZ · β1 (A.19) NG − NZ The resulting maximum likelihood estimators are, ´ ³ ´ ³ (NG −NZ )−(nG −nZ ) NZ −nZ − log α̂ = log 1 NZ NG −NZ ³ ´ ³ ´ nZ nG −nZ β̂1 = log NZ −nZ − log (NG −NZ )−(nG −nZ ) nZ /NZ eα̂1 +β̂1 = (nG −nZ )/(NG −NZ ) 92 (A.20) Apparently, β1 = 0 implies α1 = 0 and eα1 +β1 = 1, which means the relative rates inside and outside the scan window are equal. By equation (3.15), the χ1 test statistic can be simplified to ½ χ1 = NG · β̂1 · where ρ1 = NZ NG −NZ · ¸¾ ρ1 nG − nZ nG − nZ · · (1 − ) · β̂1 (1 + ρ1 )2 NG − NZ NG − NZ (A.21) and β̂1 is as in equation (A.20). By equation (3.16), the likelihood ratio test statistic can be simplified to · nZ nZ nG − nZ nG −nZ NZ − nZ NZ −nZ LR = 2 log ( ) ·( ) ·( ) NZ NG − NZ NZ ¸ · ¸ (NG − NZ ) − (nG − nZ ) (NG −NZ )−(nG −nZ ) + 2 log ( ) NG − NZ · ¸ nG nG NG − nG NG −nG − 2 log ( ) ·( ) NG NG = 2 log [Kulldoff’s Bernoulli scan stat. as in equation (2.1)] (A.22) If the tilt function h(x) is chosen as (x, x2 ), the parameter β12 , which is associated with the x2 term, is actually 0. Notice that the normal equation of the likelihood for β11 , which is corresponding to the x term, is identical with the normal equation for β12 . Thus the x2 term is confounded with the x, and β12 is not estimable. Therefore, only α1 and β11 are estimated, and the final results are still the same as in the situation where h(x) = x. 93 Bibliography [1] Alexander, F., Cartwright, R., McKinney, P.A., and Ricketts T.J. (1990). Investigation of spatial clustering of rare diseases: childhood malignancies in North Humberside. Journal of Epidemiology and Community Health, 44, 39-46. [2] Barlow R.E., Bartholomew D. J., Bremner J. M., Brunk H. D. (1974). Statistical Inference under Order Restrictions. Wiley, New York. [3] Benjamini Y., Hochberg Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B, Vol. 57, No. 1, 289-300. [4] Bonetti, M. and Pagano, M. (2005). The interpoint distance distribution as a descriptor of point patterns: An application to cluster detection. Statistics in Medicine (in press). [5] Colt, J.S. and Blair, A. (1998). Parental Occupational Exposures and Risk of Childhood Cancer, Environmental Health Perspectives Supplements, 106, No. S3, 909-925. [6] Cuzick J. and Edwards R. (1990). Spatial clustering for inhomogeneous populations. Journal of the Royal Statistical Society, series B, 52: 73-104. [7] Anna Ditkoff, “Murder Ink”., Baltimore City Paper (January 11, 2006) [8] Morgan Quitno’s America’s Safest Cities, “City Crime Rankings by Population Group”. http://www.morganquitno.com/cit06pop.htm#500,000+ [9] CNN/Morgan Quitno’s America’s Safest Cities, “Top 25: Most dangerous and safest cites”. http://money.cnn.com/2006/10/30/real estate/ Most dangerous cities/index.htm [10] Dwass M. (1957). Modified Randomization tests for Nonparamertic hypotheses. The Annuals of Mathematical Statistics, 28, 181-187. [11] Dykstra R., Kochar S., Robertson T. (1995). Inference for Likelihood Ratio Ordering in the Two-Sample Problem. Journal of the American Statistical Association, Vol. 90, No. 431, 1034-1040. [12] Eck J.E., Chainey S., Cameron J.G., Leitner M., Wilson R.E. (2005). Mapping Crime: Understanding Hot Spots. NIJ special report, U.S. Department of Justice, Office of Justice Programs, National Institute of Justice, Aug, 2005. 94 [13] Fokianos, K., Kedem, B., Qin, J., and Short, D. A. (2001). A semiparametric approach to the one-way layout. Technometrics, 43, 56-65. [14] Fokianos K. (2006). Density Ratio Model Selection. Journal of Statistical Computations and Simulation. To appear. [15] Fokianos K. and Kaimi I., (2006). On the effect of misspecifying the density of ratio model. Annals of the Institute of Statistical Mathematics, Vol. 58, No. 3, 475-497. [16] Gagnon R. (2005), Certain Computational Aspects of Power Efficiency and of State Space Models. Ph.D Dissertation, Department of Mathematics, University of Maryland, College Park, Maryland. [17] Glaz J. and Naus J. (1991). Tight bounds for scan statistics probabilities for discrete data. The Annals of Applied Probability, Vol. 1, No., 306-318. [18] Glaz J., Naus J., and Wallenstein S. (2001). Scan statistics. Springer, New York. [19] Glaz J. and Zhang Z. (2004). Multiple Window Discrete Scan Statistics. Journal of Applied Statistics, Vol. 31, Issue 8, 967-980. [20] Glaz J. and Zhang Z. (2006). Maximum scan score-type statistics Statistics and Probability Letters, Volume 76, Issue 13, 1316-1322. [21] Hjalmars U., Kulldorff M., Gustafsson G., and Nagarwalla N. (1996). Childhood leukemia in Sweeden using GIS and a spatial scan statistics for cluster detection. Statistics in Medicine, 15, 707-715. [22] Huang L., Kulldorff M., Gregorio D. (2007). A Spatial Scan Statistic for Survival Data. Biometrics 63 (1), 109-118. [23] Jung I., Kulldorff M., Klassen A.C. (2007). A spatial scan statistic for ordinal data. Statistics in Medicine, Volume 26, Issue 7, 1594-1607. [24] Kay, R., and Little, S. (1987). Transformations of the explanatory variables in the logistic regression model for binary data. Biometrika, 74, 495-501. [25] Kedem, B., Wolff, D.B., and Fokianos, K. (2004). Statistical Comparison of Algorithms. IEEE Transactions on Instrumentation and Measurement, 53, 770776. 95 [26] Kedem B., Wen S. (2007). Semi-parametric Cluster Detection. Journal of Statistical Theory and Practice, inaugural issue, 1, 49-72. [27] Keziou, A. and Leoni-Aubin, S. (2005). Test of homogeneity in semiparametric two-sample density ratio models. Comptes Rendus de l’Academie des Sciences, Paris, Ser. I 340, 905-910. [28] Klassen A.C., Kulldorff M., Curriero F.C. (2005). Geographical clustering of prostate cancer grade and stage at diagnosis, before and after adjustment for risk factors. International Journal of Health Geographics 2005, 4:1. [29] Kulldorff, M. (1997). A spatial scan statistic. Communications in Statistics– Theory and Methods, 26, 1481-1496. [30] Kulldorff M. (1999). Spatial Scan Statistics: Models, Calculations, and Applications. Scan Statistics and Applications, edited by Joseph Glaz and N. Balakrishnan, Birkhäuser, Boston. [31] Kulldorff, M. (2001). Prospective time periodic geographical disease surveillance using a scan statistic, Journal of the Royal Statistical Society A, 164, part 1, 61-72. [32] Kulldorff, M., Tango T. Park P.J. (2003), Power comparison for disease clustering tests, Computational Statistics & Data Analysis, 42, 665-684. [33] Kulldorff M., Heffernan R., Hartman J., Assunção R.M., Mostashari F. (2005). A space-time permutation scan statistic for the early detection of disease outbreaks. PLoS Medicine, 2: 216-224. [34] Kulldorff M., Huang L., Pickle L., Duczmal L. (2006). An elliptic spatial scan statistic. Statistics in Medicine, Volume 25, Issue 22, 3929-3943. [35] Normal Model: Kulldorff M, et al., (2006), manuscript in preparation. [36] Kulldorff M., Mostashari F., Duczmal L., Yih W.K., Kleinman K., Platt R. (2007). Multivariate scan statistics for disease surveillance. Statistics in Medicine, Volume 26, Issue 8, 1824-1833 [37] LaFree, G., Morris, N., Dugan, L. and Fahey, S. (2006). Identifying Global Terrorist Hot Spots. The Psychology of Terrorism, IOS Press. 96 [38] Lawson A.B. (Editor), Biggeri A. (Editor), Böhning D. (Editor), Lesaffre E. (Editor), Jean-Fran Viel J. (Editor), Bertollini R. (Editor). (1999). Disease Mapping and Risk Assessment for Public Health, WILEY. [39] LeBeau J.L. (1999). Demonstrating The Analytical Utility of GIS for Police Operations A Final report to Grant: NIJ 97-LB-VX-K010. [40] Lu, G. (2007). Asymptotic Theory for Multiple-Sample Semiparametric Density Ratio Models and Its Application To Mortality Forecasting. Ph.D. dissertation, University of Maryland. [41] McKinney P.A., Alexander F.E., Cartwright R.A., Parker L. (1991). Parental occupations of children with leukemia in west Cumbria, North Humberside, and Gateshead. British Medical Journal, 302, 681-687. [42] Modarres R. and Patil G.P. (2006). Hotspot Detection with Bivariate Data. Journal of Statistical Planning and Inference , (S.N. Roy Centennial Volume) (in press). [43] Naus, J.I. (1965). The distribution of the size of the maximum cluster of points on a line. Journal of the American Statistical Association, 60, 532-538. [44] Naus J.I. and Wartenberg D. (1997). A double-scan statistic for clusters of two types of events. Journal of the American Statistical Association, 92: 1105-1113. [45] Naus J.I. and Wallenstein S. (2004). Multiple Window and Cluster Size Scan Procedures. Journal Methodology and Computing in Applied Probability, Vol. 6, No 4, 389-400. [46] Newell G.G. (1963). Distribution for the samllest distance between any pair of Kth nearest-neighbor random points on a line. In Rosenblatt, M. ed., Time series Analysis, Processdings of a Conference Held at Brown University. Wiley, New York, pp. 89-103. [47] Patil G.P. Taillie C. (2004). Upper level set scan statistics for detecting arbitrarily shaped hotspots.Environmental and Ecological Statistics 11, 183-197. [48] Patil G.P., Rathbun S.L., Acharya R., Patankar P., Modarres R. (2005). Upper Level Set Scan Statistic System for Detecting Arbitrarily Shaped Hotspots for Digital Governance. Proceedings of the 2005 national conference on Digital government research; Vol. 89, 281-282. [49] Pickle, L.W., Feuer, E.J., and Edwards, B.K. (2003). U.S. Predicted Cancer Incidence, 1999: Complete Maps by County and State From Spatial Projection 97 Models, National Cancer Institute, Cancer Surveillance Monograph No. 5, NIH Publication No. 03-5435. [50] Pozdnyakov V., Glaz J., Kulldorff M., and Steele M. (2004). A martingale approach to scan statistics. Annals of the Institute of Statistical Mathematics, Volume 57, Number 1, 21-37. [51] Qin, J. (1993). Empirical likelihood in biased sampling problems. The Annals of Statistics, 21, 1182-1186. [52] Qin, J., and Lawless, J.F. (1994). Empirical likelihood and general estimating equations. The Annals of Statistics, 22, 300-325. [53] Qin, J., and Zhang, B. (1997). A goodness of fit test for logistic regression models based on case-control data. Biometrika, 84, 609-618. [54] Reiner A., Yekutieli D., Yoav Benjamini Y. (2003). Identifying differentially expressed genes using false discovery rate controlling procedures. Bioinformatics, Vol. 19, No. 3, 368-375. [55] Thomas Rich (1999). Mapping the Path to Problem Solving. National Institute of Justice Journal, October 1999. [56] Sen P.K. and Singer J.M. (1993). Large sample method in statistics, An introduction with Applications. Chapman and Hall/CRC, London. [57] Shmueli G., Thomas M.P., Kadane J.B., Borle S., Boatwright P. (2005). A useful distribution for fitting discrete data: revival of the Conway-MaxwellPoisson distribution. Journal of the Royal Statistical Society, Series C (Applied Statistics), 54 (1), 127-142. [58] Shmueli, G., and Fienberg, S. E. (2006). Current and Potential Statistical Methods for Monitoring Multiple Data Streams for Bio-Surveillance. Statistical Methods in Counter-Terrorism: Game Theory, Modeling, Syndromic Surveillance, and Biometric Authentication, Eds: A Wilson, G Wilson, and D H Olwell, Springer, New York. [59] Storey J.D. and Tibshirani R. (2001). Estimating false discovery rates under dependence, with applications to DNA microarrays. Technical Report 2001-28, Department of Statistics, Stanford University. (NOTE: This paper was reconceived and rewritten into Storey and Tibshirani (2003) PNAS, which is listed below.) 98 [60] Storey J.D. (2002). A direct approach to false discovery rates. Journal of the Royal Statistical Society, Series B, 64: 479-498. [61] Storey J.D., Tibshirani R. (2003). Statistical significance for genome-wide studies, Proceedings of the National Academy of Sciences, 100, 9440-9445. [62] Storey J.D., (2003). The positive false discovery rate: A Bayesian interpretation and the q-value. Annals of Statistics, 31, 2013-2035. [63] Storey J.D., Taylor J.E., Siegmund D. (2004). Strong control, conservative point estimation, and simultaneous conservative consistency of false discovery rates: A unified approach. Journal of the Royal Statistical Society, Series B, 66: 187205. [64] Tango T. (2000). A test for spatial disease clustering adjusted for multiple testing. Statistics in Medicine, 19: 191-204. [65] Tango T. and Takahashi K. (2005). A flexibly shaped spatial scan statistic for detecting clusters. International Journal of Health Geographics, 4:11. [66] Tsai C., Hsueh M., Chen J., (2003). Estimation of False Discovery Rates in Multiple Testing: Application to Gene Microarray Data. Biometrics, Vol. 59, No. 4, 1071-1081. [67] Wallenstein S. and Neff N. (1987). An approximation for the distribution of the scan statistic. Statistics in Medice, 6(2): 197-207. [68] Waller, L.A. and Lawson, A.B. (1995). The power of focused tests to detect disease clustering. Statistics in Medicine, 14, 2291-2308. [69] Weisburd D. and Braga A.A. (2006). Police Innovation: Contrasting Perspectives (Cambridge Studies in Criminology). Cambridge University Press, UK. [70] Wen S. and Kedem B. (2007). Power Study of a Semi-parametric Cluster Detection Method. Submitted to Journal of Computational Statistics and Data Analysis. [71] Wikipedia on “Montgomery County, Maryland”, http://en.wikipedia.org/wiki/Montgomery County/ [72] Wikipedia on “Fairfax County, Virginia”, http://en.wikipedia.org/wiki/Fairfax County/ 99 [73] Yang X. (2004). Qvalue Methods May not Always Control False Discovery Rate in Genomic Applications, Proceedings of the 2004 IEEE Computational Systems Bioinformatics Conference, 556- 557. 100