Statistical Analysis in GIS
Dr. M. Gavrilova

• Importance of correct data representation
• Variance and covariance
• Autocorrelation
• Applications to pattern analysis and geometric modeling

Four colors, three dimensions, and two plots to visualize five data points.
http://www.math.yorku.ca/SCS/Gallery/
Steven Skiena, Stony Brook, NY, http://www.cs.sunysb.edu/skiena

http://www.math.yorku.ca/SCS/Gallery/
Results of a poll of happiness from the World Values Survey project, covering people throughout the world, in relation to the economy (GNP per capita). Many countries, particularly those in Latin America, had higher marks for happiness than their economic situation would predict. This conclusion rests on the assumption that happiness should be linearly related to GNP.

A GIS is an organized collection of computer hardware, software, geographic data, and personnel designed to efficiently capture, store, update, manipulate, analyze, and display all forms of geographically referenced data.

• Provides an efficient and generally reliable means of obtaining knowledge about spatial processes
  ◦ a way of maximizing our knowledge of spatial processes with the minimum of error

• Spatial data: location and attribute, P_i (x, y, z)
• Spatial stochastic processes: statistics and inference
• Spatial is special: spatial autocorrelation, spatial non-stationarity, proximity

The Space Shuttle Challenger exploded shortly after take-off in January 1986.
Cause: failure of the O-ring seals used to isolate the fuel supply from burning gases.
Graph from the Report of the Presidential Commission on the Space Shuttle Challenger Accident, 1986.

NASA staff had analysed the data on the relation between temperature and the number of O-ring failures (out of 6), but they had excluded observations where no O-rings failed, believing that they were uninformative. These were the main observations showing no failure at warm temperatures (65-80 °F).

Apart from the disastrous omission of the observations with 0 failures, two changes to the graph:
1. drawing a smoothed curve to fit the points
2. removing the background grid, which obscures the data
give a graph that shows excessive risks associated with both high and low temperatures.

Reanalysis of the O-ring data involved fitting a logistic regression model. This provides a predicted extrapolation (black curve) of the probability of failure to the low temperature (31 °F) at the time of the launch, and confidence bands on that extrapolation (red curves).
See also Tappin, L. (1994). "Analyzing data relating to the Challenger disaster". Mathematics Teacher, 87, 423-426.

There is not much data at low temperatures (the confidence band is quite wide), but the predicted probability of failure is uncomfortably high. Would you take a ride on Challenger when the weather is cold?

The French engineer Charles Minard (1781-1870) illustrated the disastrous result of Napoleon's failed Russian campaign of 1812. The graph shows the size of the army by the width of the band across the map of the campaign on its outward and return legs, with the temperature on the retreat shown on the line graph at the bottom. Many consider Minard's original the best statistical graphic ever drawn.

• Samples and populations consist of individuals.
• Values of certain attributes are called observations (e.g. age, income).
• Attributes vary across individuals, and they are called variables.
• Variables are described by distributions and their parameters (e.g. Normal, Poisson).
• A random variable X assumes its value according to the outcome of a chance experiment (coin, dice).

Variance is the sum of squared deviations from the mean divided by the number of observations n (population variance) or by n-1 (sample variance).

Sample variance:
$$ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 $$

Population variance:
$$ \sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2 $$

Spatial autocorrelation is a measure of the similarity of objects within an area.
Jay Lee and Louis K. Marion, 2001

The formula to compute Moran's index is the following:

$$ M = \frac{n}{A} \cdot \frac{\sum_{i,j} w_{ij} z_i z_j}{\sum_i z_i^2} $$

where
• $n$ is the number of individual points
• $A$ is the area of the bounding polygon, i.e. the total area of the map including all points
• $z_i$ is the value of the parameter measured for point $i$ (the attribute)
• $w_{ij}$ is computed according to the following rule, where $\min(d_{ij})$ is the smallest of all distances computed between all pairs of points:

$$ w_{ij} = \frac{\min_{i,j}(d_{ij})}{d_{ij}} $$

In this formula, the distance $d_{ij}$ is computed according to the formulas for the Euclidean, supremum or Manhattan metrics. Since $d_{ii}$ is equal to 0, $w_{ii}$ would become infinite, so the cases where $i = j$ must be excluded. This leaves $n^2 - n$ pairs of points.
  ◦ The sum over all $i, j$ means that ALL ORDERED PAIRS of points (i.e. the order in which the pair $ij$ is considered matters) should be included in the formula.
Sometimes only pairs of sample points within a specific distance of each other are considered.

Example: autocorrelation on a grid.
Sample points are combined into one cell. The size and location of the cell define the autocorrelation parameters. Consider all pairs of GRID CELLS, where XC and YC now denote the coordinates of the center of each grid cell, and the attribute z of each cell is the sum of the attributes of all points that belong to that cell.
Result: insight into pattern analysis and correlation can be obtained.

Analysis of instances of patients undergoing cardiac catheterization, and the location of those instances, i.e. city blocks.
Primary question: spatial variation of heart disease: random or non-random pattern?
Secondary question: relationship between disease occurrence and social and demographic factors (Spatial Regression).

Analysis results are affected by grid size:
• prone to subjective choices
• constrained by the spatial resolution of the data

Solving the problem by:
• using non-arbitrary grid(s)
• implementing a "guided" selection of the square unit area or grid size

• Definition of a city-block grid based on the main division in the city, i.e. using the square grid centered on the intersection between Center Street and Center Avenue as the main axes of the geometric plan thus created.
• Grid regularity decreases as distance from its center increases.
• L_p norms provide the flexibility to adjust the grid's size and shape accordingly.

• Application of varying L_p norms
• Varying spatial weights for spatial autocorrelation
• Autocorrelation analysis at varying scales (CDA, community)
• Data: 2001/1996 census

Spatial Correlation Estimate
Statistic = "moran"   Sampling = "free"
Correlation = 0.1429
Variance = 0.001341
Std. Error = 0.03662
Normal statistic = 3.921
Normal p-value (2-sided) = 8.802e-5
Null Hypothesis: No spatial autocorrelation

Sensitivity of Spatial Autocorrelation to the L_p norm spatial weight
The proposed method is useful in determining:
• the best distance
• the best spatial weight
In the context of multivariate spatial regression, "best" means lowest variance.

The Calgary Journal, Regional publication, "Researchers link heart disease to urban lifestyles", on SPARCS activity profile, Oct. 26 – Nov. 8, 2005
High risk of heart attack: male, high education, married

                  # cells*       Min.  Max.  Mean   St. dev.  Sum     Skew  Kurt.
Oil spill counts  44 (2,741)     0     3     0.02   0.162     53      9.85  113.6
Flight counts     2151 (2,741)   0     309   13.75  27.12     37,681  4.21  25.6

The mean and the standard deviation summarize the center and the statistical dispersion of the data; skewness (asymmetry) and kurtosis (from the Greek for "bulging") indicate highly skewed distributions or a lack of normality in the data.
An initial overview of the statistical distribution of each of the variables selected for this study indicated an absence of normality in the data.
Our exploratory analyses indicate that there is positive spatial autocorrelation within the datasets for all variables.

Exploratory Spatial Analysis of Illegal Oil Discharges Detected off Canada's Pacific Coast.
Norma Serra-Sogas, Patrick O'Hara, Rosaline Canessa, Stefania Bertazzon and Marina Gavrilova

• Proper statistical analysis is important.
• Variance and autocorrelation are two important vehicles for data analysis.
• Combining these measures with various metrics, hierarchical structures, grids, attributes, and data filtering/visualization methods is a direction of current research.
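
The Moran's index computation described in these slides can be illustrated with a short sketch. It is not the authors' implementation, only a minimal reading of the slide's definitions: the weight of a pair is the smallest pairwise distance divided by that pair's distance, all ordered pairs with i ≠ j are summed, and the distance is taken under an L_p norm (1 for Manhattan, 2 for Euclidean, infinity for supremum). The function names `lp_distance` and `morans_index` are hypothetical, the points are assumed to be distinct 2-D coordinates, and the attribute values z are used exactly as given (centering them on their mean beforehand is left to the caller).

```python
from itertools import permutations


def lp_distance(p, q, norm=2.0):
    """Distance between two 2-D points under an L_p norm.

    norm=1 is Manhattan, norm=2 is Euclidean, norm=float("inf") is supremum.
    """
    dx, dy = abs(p[0] - q[0]), abs(p[1] - q[1])
    if norm == float("inf"):
        return max(dx, dy)
    return (dx ** norm + dy ** norm) ** (1.0 / norm)


def morans_index(points, z, area, norm=2.0):
    """Moran's index M = (n / A) * sum_ij(w_ij * z_i * z_j) / sum_i(z_i^2).

    Assumptions (illustrative): points are distinct, so every d_ij > 0,
    and w_ij = min(d_ij) / d_ij as stated on the slide.
    """
    n = len(points)
    # All ordered pairs with i != j: n^2 - n of them.
    pairs = list(permutations(range(n), 2))
    d = {(i, j): lp_distance(points[i], points[j], norm) for i, j in pairs}
    d_min = min(d.values())
    weighted = sum((d_min / d[i, j]) * z[i] * z[j] for i, j in pairs)
    return (n / area) * weighted / sum(v * v for v in z)


if __name__ == "__main__":
    # Four sample points on the corners of a unit square (made-up data).
    pts = [(0, 0), (1, 0), (0, 1), (1, 1)]
    vals = [1, 2, 3, 4]
    for p in (1, 2, float("inf")):
        print(p, morans_index(pts, vals, area=1.0, norm=p))
```

Running the sketch with different values of `norm` on the same points shows how the choice of metric changes the weights, and therefore the index, which is the sensitivity the slides discuss. For the grid variant, the same function could be applied to cell centers with z set to the summed attribute of each cell.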