Descriptive Statistics for Spatial Distributions

Review: Standard Descriptive Statistics
Centrographic Statistics for Spatial Data: Mean Center, Centroid, Standard Distance Deviation, Standard Distance Ellipse
Density Kernel Estimation, Mapping

Briggs Henan University 2010

Spatial Analysis: successive levels of sophistication
1. Spatial data description: classic GIS capabilities
   – Spatial queries & measurement, buffering, map layer overlay
2. Exploratory Spatial Data Analysis (ESDA):
   – searching for patterns and possible explanations
   – GeoVisualization through data graphing and mapping
   – Descriptive spatial statistics: Centrographic statistics
3. Spatial statistical analysis and hypothesis testing
   – Are data “to be expected” or are they “unexpected” relative to some statistical model, usually of a random process
4. Spatial modeling or prediction
   – Constructing models (of processes) to predict spatial outcomes (patterns)

Standard Statistical Analysis
Two parts:
1. Descriptive statistics
   Concerned with obtaining summary measures to describe a set of data
   For example, the mean and the standard deviation
2. Inferential statistics
   Concerned with making inferences from samples about a population
Similarly, we have Descriptive and Inferential Spatial Statistics

Spatial Statistics
Descriptive Spatial Statistics: Centrographic Statistics (This time)
– single, summary measures of a spatial distribution
– Spatial equivalents of mean, standard deviation, etc.
Inferential Spatial Statistics: Point Pattern Analysis (Next time)
Analysis of point location only: no quantity or magnitude (no attribute variable)
– Quadrat Analysis
– Nearest Neighbor Analysis, Ripley’s K function
Spatial Autocorrelation (Weeks 5 and 6)
– One attribute variable with different magnitudes at each location
  The Weights Matrix
  Global Measures of Spatial Autocorrelation (Moran’s I, Geary’s C, Getis/Ord Global G)
  Local Measures of Spatial Autocorrelation (LISA and others)
Prediction with Correlation and Regression (Week 7)
– Two or more attribute variables
  Standard statistical models
  Spatial statistical models

Standard Statistical Analysis: A Quick Review
1. Descriptive statistics
– Concerned with obtaining summary measures to describe a set of data
– Calculate a few numbers to represent all the data
– We begin by looking at one variable (“univariate”)
  • Later, we will look at two variables (bivariate)
Three types:
– Measures of Central Tendency
– Measures of Dispersion or Variability
– Frequency distributions
I hope you are already familiar with these. I will quickly review the main ideas.

Standard Descriptive Statistics: Central Tendency
Formula for mean:
$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{N}$$
• Central Tendency: single summary measure for one variable:
1. mean (average)
2. median (middle value)
   – 50% larger and 50% smaller
   – rank order data and select middle number
3.
mode (most frequently occurring)
These may be obtained in ArcGIS by:
– opening a table, right clicking on column heading, and selecting Statistics
– going to ArcToolbox>Analysis>Statistics>Summary Statistics

Calculation of mean and median

ADMIN_NAME      Illiteracy-Prcnt   Rank order
Beijing               3.11              1
Liaoning              3.48              2
Tianjin               3.52              3
Taiwan                3.90              4
Shanghai              3.97              5
Guangdong             4.02              6
Heilongjiang          4.16              7
Shanxi                4.42              8
Jilin                 4.44              9
Xinjiang              4.64             10
Hebei                 4.83             11
Guangxi               5.61             12
Hunan                 5.87             13
Jiangxi               6.49             14
Hong Kong             6.50             15
Henan                 7.36             16
Hubei                 7.69             17
Chongqing             7.80             18
Shandong              7.96             19
Jiangsu               8.05             20
Nei Mongol            8.14             21
Shaanxi               8.19             22
Hainan                8.65             23
Macao                 8.70             24
Zhejiang              9.36             25
Ningxia              10.09             26
Sichuan              10.24             27
Fujian               10.38             28
Yunnan               13.29             29
Anhui                14.49             30
Guizhou              14.58             31
Qinghai              16.68             32
Gansu                17.77             33
Xizang               37.77             34
Sum                 296.15

Mean: 296.15 / 34 = 8.71
Median: (7.69 + 7.8)/2 = 7.75 (there are 2 “middle values”)
Note: data for Taiwan is included

Standard Descriptive Statistics: Variability or Dispersion
• Dispersion: measures of spread or variability
  – Variance
    • average squared distance of observations from mean
  – Standard Deviation (square root of variance)
    • “average” distance of observations from the mean

Formulae for variance (Definition Formula on the left, Computation Formula on the right):
$$\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{N} = \frac{\sum_{i=1}^{n} X_i^2 - \left[ \left( \sum X \right)^2 / N \right]}{N}$$

These may be obtained in ArcGIS by:
– opening a table, right clicking on column heading, and selecting Statistics
– going to ArcToolbox>Analysis>Statistics>Summary Statistics

Calculation of Variance and Standard Deviation

ADMIN_NAME      Illiteracy-Prcnt   (X - Xmean)   (X - Xmean) squared
Anhui                14.49            5.780         33.40500009
Beijing               3.11           -5.600         31.3632942
Fujian               10.38            1.670          2.787917734
Gansu                17.77            9.060         82.07827067
Guangdong             4.02           -4.690         21.99885891
Guangxi               5.61           -3.100          9.611823616
Guizhou              14.58            5.870         34.45344715
Hainan                8.65           -0.060          0.003635381
Hebei                 4.83           -3.880         15.05668244
Heilongjiang          4.16           -4.550         20.70517656
Henan                 7.36           -1.350          1.823294204
Hubei                 7.69           -1.020          1.041000087
Hunan                 5.87           -2.840          8.067270675
Nei Mongol            8.14           -0.570          0.325235381
Jiangsu               8.05           -0.660          0.435988322
Jiangxi               6.49           -2.220          4.929705969
Jilin                 4.44           -4.270         18.23541185
Liaoning              3.48           -5.230         27.35597656
Ningxia              10.09            1.380          1.903588322
Qinghai              16.68            7.970         63.51621185
Shaanxi               8.19           -0.520          0.270705969
Shandong              7.96           -0.750          0.562941263
Shanghai              3.97           -4.740         22.47038832
Shanxi                4.42           -4.290         18.40662362
Sichuan              10.24            1.530          2.340000087
Taiwan                3.90           -4.810         23.1389295
Tianjin               3.52           -5.190         26.93915303
Xizang               37.77           29.060        844.466506
Xinjiang              4.64           -4.070         16.5672942
Yunnan               13.29            4.580         20.97370597
Zhejiang              9.36            0.650          0.422117734
Chongqing             7.80           -0.910          0.828635381
Hong Kong             6.50           -2.210          4.885400087
Macao                 8.70           -0.010          0.000105969
Sum                 296.15            0.000       1361.370297
Mean                  8.710294118
Variance             40.04030285
StanDev               6.3277

Variance from Definition Formula: 1361.370 / 34 = 40.04
Variance from Computation Formula: [3940.924 – (296.15 * 296.15)/34] / 34 = 40.04
Standard Deviation = √40.04 = 6.33
Note: data for Taiwan is included

Classic Descriptive Statistics: Univariate Frequency distributions
A count of the frequency with which values occur on a variable
[Figure: bar chart of US population by age group (under 15, 15 to 29, 30 to 44, 45 to 59, 60 to 74, 75 and older), data for 2000; 50 million people age 45-59]
Source: http://www.census.gov/compendia/statab/ US Bureau of the Census: Statistical Abstract of the US
Often represented by the area under a frequency curve. This area represents 100% of the data.
[Figure: the same age-group distribution drawn as a frequency curve, with the area under the curve labeled 100%]
In ArcGIS, you may obtain frequency counts on a categorical variable via:
– ArcToolbox>Analysis>Statistics>Frequency

Frequency Distributions for China Province Data
• Symmetric Distribution
• Skewed Distribution (right skew): “tail” extends to right; mean is “pulled” to the right
Height of bar
shows frequency
Symmetric distribution (percent urban):
– There are 16 provinces with percent urban between 38.4% and 50.8% (mode)
– Mode = (38.1 + 50.8)/2 = 44.5
– Mean = 48.97
– Median = 44.0
– Symmetric distribution: mean = median = mode
Skewed distribution (illiteracy):
– Height of bar shows frequency
– There are 17 provinces with illiteracy between 5.4% and 10.7% (mode)
– Mode = (5.4 + 10.7)/2 = 8.05
– Mean = 8.7
– Median = (7.69 + 7.8)/2 = 7.75
– Skewed distribution: mean > median

Frequency Distributions for China Province Data: Variability
Standard deviation: a measure of “the average” distance of each observation from the mean
• Symmetric Distribution (percent urban): standard deviation = 14.8
• Skewed Distribution (right skew, “tail” extends to right; illiteracy): standard deviation = 6.33
On average, illiteracy values are closer to the mean. There is less “spread” in this data.

Caution: these values are incorrect!
• Why?
• Incorrect to calculate mean for percentages
  – Each percentage has a different base population
• Should calculate weighted mean, where w_i = population of each province:
$$\bar{X} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$
• Very common error in GIS because we use aggregated data frequently

Correct Values!
• Unweighted mean = 8.7
• Weighted mean = 7.75
• Weighted mean is smaller. Why?
• The largest provinces have lower illiteracy
• Highest rates in small provinces

ADMIN_NAME       Illiteracy-Prcnt   Pop2008
Guangdong              4.02         95,440,000
Henan                  7.36         94,290,000
Shandong               7.96         94,172,300
Ningxia               10.09          6,176,900
Qinghai               16.68          5,543,000
Xizang (Tibet)        37.77          2,870,000

Calculation of weighted mean

ADMIN_NAME      Illiteracy-Prcnt (x)   Pop2008 (w)        x*w
Anhui                14.49              61,350,000       888,961,500
Beijing               3.11              22,000,000        68,420,000
Fujian               10.38              36,040,000       374,095,200
Gansu                17.77              26,281,200       467,016,924
Guangdong             4.02              95,440,000       383,668,800
Guangxi               5.61              48,160,000       270,177,600
Guizhou              14.58              37,927,300       552,980,034
Hainan                8.65               8,540,000        73,871,000
Hebei                 4.83              69,888,200       337,560,006
Heilongjiang          4.16              38,253,900       159,136,224
Henan                 7.36              94,290,000       693,974,400
Hubei                 7.69              57,110,000       439,175,900
Hunan                 5.87              63,800,000       374,506,000
Nei Mongol            8.14              24,137,300       196,477,622
Jiangsu               8.05              76,773,000       618,022,650
Jiangxi               6.49              44,000,000       285,560,000
Jilin                 4.44              27,340,000       121,389,600
Liaoning              3.48              43,147,000       150,151,560
Ningxia              10.09               6,176,900        62,324,921
Qinghai              16.68               5,543,000        92,457,240
Shaanxi               8.19              37,620,000       308,107,800
Shandong              7.96              94,172,300       749,611,508
Shanghai              3.97              19,210,000        76,263,700
Shanxi                4.42              34,106,100       150,748,962
Sichuan              10.24              81,380,000       833,331,200
Taiwan                3.90              23,140,000        90,246,000
Tianjin               3.52              11,760,000        41,395,200
Xizang               37.77               2,870,000       108,399,900
Xinjiang              4.64              21,308,000        98,869,120
Yunnan               13.29              45,430,000       603,764,700
Zhejiang              9.36              51,200,000       479,232,000
Chongqing             7.80              31,442,300       245,249,940
Hong Kong             6.50               7,003,700        45,524,050
Macao                 8.70                 542,400         4,718,880
Sum                 296.15           1,347,382,600    10,445,390,141

Unweighted mean: 296.15 / 34 = 8.71
Weighted mean: 10,445,390,141 / 1,347,382,600 = 7.75
Note: we should also calculate a weighted standard deviation

Centrographic Statistics
Descriptive statistics for spatial distributions
• Mean Center
• Centroid
• Standard Distance Deviation
• Standard Distance Ellipse
• Density Kernel Estimation
(Add Frequency Distributions and mapping; use GeoDA to produce)

Centrographic
Statistics

Measures of Centrality
– Mean Center
– Centroid
– Weighted mean center
– Center of Minimum Distance
Measures of Dispersion
– Standard Distance
– Standard Deviational Ellipse
• Two dimensional (spatial) equivalents of standard descriptive statistics for a single variable (univariate).
• Used for point data
  – May be used for polygons by first obtaining the centroid of each polygon
• Best used to compare two distributions with each other
  – 1990 with 2000
  – males with females
(O&U Ch. 4 p. 77-81)

Mean Center
• Simply the mean of the X and the mean of the Y coordinates for a set of points
• Sum of differences between the mean X and all other Xs is zero (same for Y)
• Minimizes sum of squared distances between itself and all points: $\min \sum d_{iC}^2$
• Distant points have large effect: values for Xinjiang will have larger effect
• Provides a single point summary measure for the location of a set of points

Centroid
• The equivalent for polygons of the mean center for a point distribution
• The center of gravity or balancing point of a polygon
• If polygon is composed of straight line segments between nodes, centroid given by “average X, average Y” of nodes (there is an example later)
• Calculation sometimes approximated as center of bounding box
  – Not good
• By calculating the centroids for a set of polygons, can apply Centrographic Statistics to polygons

Centroids for Provinces of China

Warning: Centroid may not be inside its polygon
• For Gansu Province, China, centroid is within neighboring province of Qinghai
• Problem arises with crescent-shaped polygons

Weighted Mean Center
• Produced by weighting each X and Y coordinate by another variable (Wi)
• Centroids derived from polygons can be weighted by any characteristic of the polygon
  – For example,
the population of a province

$$\bar{X} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}, \qquad \bar{Y} = \frac{\sum_{i=1}^{n} w_i y_i}{\sum_{i=1}^{n} w_i}$$

Calculating the centroid of a polygon or the mean center of a set of points
(same example data as for area of polygon)

ID             X      Y
1              2      3
2              4      7
3              7      7
4              7      3
5              6      2
sum           26     22
Centroid/MC    5.2    4.4

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}, \qquad \bar{Y} = \frac{\sum_{i=1}^{n} Y_i}{n}$$

[Figure: the five points (2,3), (4,7), (7,7), (7,3), (6,2) plotted on axes running from 0 to 10]

Calculating the weighted mean center
Note how it is pulled toward the high weight point.

i       X     Y     weight     wX        wY
1       2     3     3,000      6,000     9,000
2       4     7       500      2,000     3,500
3       7     7       400      2,800     2,800
4       7     3       100        700       300
5       6     2       300      1,800       600
sum    26    22     4,300     13,300    16,200
w MC                            3.09      3.77

$$\bar{X} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}, \qquad \bar{Y} = \frac{\sum_{i=1}^{n} w_i y_i}{\sum_{i=1}^{n} w_i}$$

[Figure: the same five points plotted on axes running from 0 to 10, with the weighted mean center marked]

Center of Minimum Distance or Median Center
• Also called point of minimum aggregate travel
• That point (MD) which minimizes sum of distances between itself and all other points (i): $\min \sum d_{iMD}$
• No direct solution. Can only be derived by approximation
• Not a determinate solution. Multiple points may meet this criterion (see next bullet).
• Same as Median center:
  – Intersection of two orthogonal lines (at right angles to each other), such that each line has half of the points to its left and half to its right
  – Because the orientation of the axis for the lines is arbitrary, multiple points may meet this criterion.
Source: Neft, 1966

Median and Mean Centers for US Population
Median Center: Intersection of a north/south and an east/west line drawn so half of population lives above and half below the e/w line, and half lives to the left and half to the right of the n/s line
Mean Center: Balancing point of a weightless map, if equal weights placed on it at the residence of every person on census day.
Source: US Statistical Abstract 2003

Standard Distance Deviation
• Represents the standard deviation of the distance of each point from the mean center
• Is the two dimensional equivalent of standard deviation for a single variable
• Given by:
$$sdd = \sqrt{\frac{\sum_{i=1}^{n} (X_i - X_c)^2 + \sum_{i=1}^{n} (Y_i - Y_c)^2}{N}}$$
which by Pythagoras reduces to:
$$sdd = \sqrt{\frac{\sum_{i=1}^{n} d_{iC}^2}{N}}$$
essentially the average distance of points from the center.
Or, with weights:
$$\sqrt{\frac{\sum_{i=1}^{n} w_i (X_i - X_c)^2 + \sum_{i=1}^{n} w_i (Y_i - Y_c)^2}{\sum_{i=1}^{n} w_i}}$$
Formula for standard deviation of a single variable:
$$\sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{N}}$$
Provides a single unit measure of the spread or dispersion of a distribution.
We can also calculate a weighted standard distance analogous to the weighted mean center.

Standard Distance Deviation Example
Circle with radius = SDD = 2.9

i            X      Y      (X - Xc)^2   (Y - Yc)^2
1            2      3        10.2          2.0
2            4      7         1.4          6.8
3            7      7         3.2          6.8
4            7      3         3.2          2.0
5            6      2         0.6          5.8
sum         26     22        18.8         23.2
Centroid     5.2    4.4
sum of sums                  42.00
divide N                      8.40
sq rt                         2.90

$$sdd = \sqrt{\frac{\sum_{i=1}^{n} (X_i - X_c)^2 + \sum_{i=1}^{n} (Y_i - Y_c)^2}{N}}$$

[Figure: the five points (2,3), (4,7), (7,7), (7,3), (6,2) on axes running from 0 to 10, with a circle of radius 2.9 drawn around the mean center]

Standard Deviational Ellipse: concept
• Standard distance deviation is a good single measure of the dispersion of the points around the mean center, but it does not capture any directional bias
  – doesn’t capture the shape of the distribution.
• The standard deviational ellipse gives dispersion in two dimensions
• Defined by 3 parameters
  – Angle of rotation
  – Dispersion (spread) along major axis
  – Dispersion (spread) along minor axis
The major axis defines the direction of maximum spread of the distribution
The minor axis is perpendicular to it and defines the minimum spread

Standard Deviational Ellipse: calculation
• Formulae for calculation may be found in references such as
  – Lee and Wong pp. 48-49
  – Levine, Chapter 4, pp. 125-128
• Basic concept is to:
  – Find the axis going through maximum dispersion (thus derive angle of rotation)
  – Calculate standard deviation of the points along this axis (thus derive the length (radius) of major axis)
  – Calculate standard deviation of points along the axis perpendicular to major axis (thus derive the length (radius) of minor axis)

Mean Center & Standard Deviational Ellipse: example
There appears to be no major difference between the location of the software and the telecommunications industry in North Texas.
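The centrographic measures above can be sketched in a few lines of Python, using the five example points and weights from the worked slides. The function names are hypothetical, chosen only for this illustration. The ellipse function takes the principal axes of the coordinate covariance matrix as the directions of maximum and minimum spread; that is an assumption of this sketch which follows the "basic concept" bullets, not necessarily the exact formulae in Lee and Wong or Levine (which may apply additional scaling).

```python
import math


def mean_center(points):
    """Mean of the X coordinates and mean of the Y coordinates."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)


def weighted_mean_center(points, weights):
    """Each coordinate weighted by w_i, divided by the sum of the weights."""
    total = sum(weights)
    xc = sum(w * x for (x, _), w in zip(points, weights)) / total
    yc = sum(w * y for (_, y), w in zip(points, weights)) / total
    return (xc, yc)


def standard_distance(points):
    """Square root of the mean squared distance from the mean center."""
    xc, yc = mean_center(points)
    total = sum((x - xc) ** 2 + (y - yc) ** 2 for x, y in points)
    return math.sqrt(total / len(points))


def deviational_ellipse(points):
    """Return (angle, major, minor).

    Assumption of this sketch: axes are the principal axes of the
    covariance matrix. angle is in radians, counterclockwise from the
    X axis; major and minor are standard deviations along each axis.
    """
    xc, yc = mean_center(points)
    n = len(points)
    sxx = sum((x - xc) ** 2 for x, _ in points) / n
    syy = sum((y - yc) ** 2 for _, y in points) / n
    sxy = sum((x - xc) * (y - yc) for x, y in points) / n
    mid = (sxx + syy) / 2.0
    half_gap = math.hypot((sxx - syy) / 2.0, sxy)
    angle = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    major = math.sqrt(mid + half_gap)
    minor = math.sqrt(max(mid - half_gap, 0.0))
    return (angle, major, minor)


# Example data from the worked slides.
points = [(2, 3), (4, 7), (7, 7), (7, 3), (6, 2)]
weights = [3000, 500, 400, 100, 300]

print(mean_center(points))                    # slide value: 5.2, 4.4
print(weighted_mean_center(points, weights))  # slide value: 3.09, 3.77
print(round(standard_distance(points), 2))    # slide value: 2.90
print(deviational_ellipse(points))
```

With this definition of the ellipse, the squared major and minor spreads add up to the squared standard distance, which gives a quick consistency check between the two dispersion measures.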
Implementation in ArcGIS
In ArcToolbox:
• Median Center for a set of points
• Standard deviation ellipse
• Centroid for a set of points
• Standard distance
• To calculate centroid for a set of polygons, with ArcGIS: ArcToolbox>Data Management Tools>Features>Feature to Point (requires ArcInfo)
• To calculate using GeoDA:
  – Tools>Shape>Polygons to Centroids

Density Kernel Estimation
• Commonly used to “visually enhance” a point pattern
• Is an example of “exploratory spatial data analysis” (ESDA)
[Figure: density surfaces for Kernel=10,000 and Kernel=5,000, each with a legend running from low to high]
• SIMPLE Kernel option (see example above)
  – A “neighborhood” or kernel is defined around each grid cell consisting of all grid cells with centers within the specified kernel (search) radius
  – The number of points that fall within that neighborhood is totaled
  – The point total is divided by the area of the neighborhood to give the grid cell’s value
• Density KERNEL option
  – A smoothly curved surface is fitted over each point
  – The surface value is highest at the location of the point, and diminishes with increasing distance from the point, reaching zero at the kernel distance from the point.
  – Volume under the surface equals 1 (or the population value if a population variable is used)
  – Uses quadratic kernel function described in Silverman (1986, p. 76, equation 4.5).
  – The density at each output grid cell is calculated by adding the values of all the kernel surfaces where they overlay the grid cell center.

Implementation in ArcGIS
• If you specify a “population field”, the software calculates as if there are that number of points at that location.
• The search radius:
  – the size of the neighborhood or kernel which is successively defined around every cell (simple kernel) or each point (density kernel)
• Output cell size:
  – size of each raster cell
• Search radius and output cell size are based on measurement units of the data (here it is feet)
• It is good to “round” them (e.g. to 10,000 and 1,000)

What have we learned today?
• We have learned about descriptive spatial statistics, often called Centrographic Statistics
• Next time, we will learn about Inferential Spatial Statistics

Project for you
• The China data on my web site has population data for the provinces of China in 2008
• Obtain population counts for 2000, 1990 and/or any other year
• Calculate the weighted mean center of China’s population for each year
• Be sure to use the same set of geographic units each time
  – For example, if you do not have data for Taiwan or Hong Kong for one year, omit these geographic units for all years

Texts
O’Sullivan, David and David Unwin, 2010. Geographic Information Analysis. Hoboken, NJ: John Wiley, 2nd ed.

Other Useful Books:
Mitchell, Andy 2005. ESRI Guide to GIS Analysis Volume 2: Spatial Measurement & Statistics. Redlands, CA: ESRI Press.
Allen, David W 2009. GIS Tutorial II: Spatial Analysis Workbook. Redlands, CA: ESRI Press.
Wong, David W.S. and Jay Lee 2005. Statistical Analysis of Geographic Information. Hoboken, NJ: John Wiley, 2nd ed.
Ned Levine and Associates, Crime Stat III Manual, Washington, D.C. National Institutes of Justice, 2004 with later updates. http://www.icpsr.umich.edu/CrimeStat/

Density Kernel Estimation
Silverman, B.W. 1986. Density Estimation for Statistics and Data Analysis. New York: Chapman and Hall.
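As a companion to the Density Kernel Estimation slides, here is a minimal Python sketch of the two options described there. It is an illustration under stated assumptions, not a reproduction of the ArcGIS implementation: the function names are hypothetical, the quadratic kernel is written in the form K(u) = (3/π)(1 − u²)² for u < 1 (the form usually quoted from Silverman's equation 4.5), and the simple option approximates the neighborhood by a circle of the search radius rather than by a set of grid cells.

```python
import math


def quadratic_kernel(distance, radius):
    """Kernel surface height at a given distance from a point.

    Highest at the point, zero at and beyond the search radius.
    Dividing by radius**2 makes the volume under each surface equal 1.
    """
    if distance >= radius:
        return 0.0
    u2 = (distance / radius) ** 2
    return 3.0 / (math.pi * radius ** 2) * (1.0 - u2) ** 2


def kernel_density(points, cell_centers, radius, population=None):
    """Density KERNEL option: add up all kernel surfaces at each cell center.

    population (optional) scales each surface, as if that many points
    were located at the same place.
    """
    if population is None:
        population = [1.0] * len(points)
    densities = []
    for cx, cy in cell_centers:
        total = 0.0
        for (px, py), pop in zip(points, population):
            total += pop * quadratic_kernel(math.hypot(cx - px, cy - py), radius)
        densities.append(total)
    return densities


def simple_density(points, cell_centers, radius):
    """SIMPLE option: count points within the radius, divide by its area.

    Assumption: the neighborhood is treated as a circle.
    """
    area = math.pi * radius ** 2
    return [
        sum(1 for px, py in points if math.hypot(cx - px, cy - py) <= radius) / area
        for cx, cy in cell_centers
    ]


# Hypothetical example: three points, cell centers every 1,000 units,
# search radius 5,000 (same units as the coordinates).
example_points = [(2000.0, 2000.0), (3000.0, 2500.0), (9000.0, 9000.0)]
cells = [(x * 1000.0 + 500.0, y * 1000.0 + 500.0) for y in range(10) for x in range(10)]
surface = kernel_density(example_points, cells, radius=5000.0)
print(max(surface), min(surface))
```

A smaller search radius produces a spikier surface and a larger one a smoother surface, which is the contrast between the Kernel=5,000 and Kernel=10,000 maps.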