Sampling
Chapter 7 of the textbook
Pages 215-250

Introduction
Descriptive statistics allow us to describe and summarize our data
Inferential statistics allow us to infer unknown values and/or the probability of events occurring using the data we have (this lets us test hypotheses)
Probability is at the heart of inferential statistics because we base our results on samples

Sampling Error
Sampling error occurs when sample characteristics deviate from population characteristics (i.e., an unrepresentative sample)
Based on random chance
How do we know what the population characteristics are?
How can sampling error be decreased?
“sampling error is not a ‘mistake’…. All samples deviate from the population in some way; thus, sampling error is always present”

Good Points From Textbook
Uncertainty associated with sampling error is the price one pays for using a sample.
“The appeal of statistics is not that it removes uncertainty but rather that it permits inference in the presence of uncertainty.”

Sampling Bias
Occurs when the selection of the sample favors inclusion of members of a population with certain characteristics
Based on sampling procedures (a.k.a. sampling design or sampling scheme)
How can sampling bias be minimized?

Why Sample?
Samples take less time and money to collect – this is particularly true when very detailed measurements are being taken (e.g., in-depth interviews with follow-ups)
A census has error too (contains non-sampling errors)
The population may be infinite
The act of sampling may be destructive
The population may be hypothetical
Populations may change rapidly (i.e., require repeated measurement)

Steps in the Sampling Process
Critical pre-sampling step
– Make sure there is a need to sample (i.e., see if available data may suit your study)
– This can be challenging and can require a lot of effort, but it is usually MUCH easier than collecting a new sample
Rushing out to sample will often come back to haunt you (think before you collect!)

Steps in the Sampling Process
Step 1 – Define your population
– Who or what are you interested in (or not)?
– This is directly connected with the research question(s) you are asking
– Example: Geography Students
  • Does this mean majors or any students in a geography class?
  • Does this mean graduate and undergraduate students?
  • Does this include alumni or only current students?

Steps in the Sampling Process
Step 2 – Construct a sampling frame
– A sampling frame is an exhaustive list of all individuals in a population (i.e., who/what can be sampled)
– A sample is only relevant for the sampling frame; it is not appropriate for making inferences beyond the defined sampling frame
– Differentiate target and sample populations
  • Target populations are all the individuals relevant to a study (i.e., who/what we want to include)
  • The sample population is who/what we actually sample
  • Goal is for these to be equal
  • The target and sample populations can differ when some of the target population can’t be sampled
– Example: A list of all geography majors

Steps in the Sampling Process
Step 3 – Sampling Design
– The procedures we use to select members from the sampling frame for the sample
– Many ways to do this; we’ll discuss several in the coming slides
– Example: how we will go about picking a sample of 20 geography majors

Steps in the Sampling Process
Step 4 – Specify the information to be collected
– Based on pre-testing (a.k.a. pilot testing) the instrument, tools, staff abilities, logistical constraints, etc.
– The data collected relate directly to the research questions
– This step assesses feasibility and corrects any problems before sampling begins
– Example: What questions we will ask the students and about how long it will take to conduct an interview

Steps in the Sampling Process
Step 5 – Data Collection
– The actual collection of the sample
– Data accuracy and quality are determined during this step (i.e., non-sampling error happens here)
– Careful measurement and careful recording of the measurements are key

Types of Samples
Non-Probability Sample – the likelihood of an individual being sampled is unknown
– The sample may be representative or not, but the quality of the sample cannot be determined
Probability Sample – the likelihood of an individual being sampled is known
– Since the probabilities are known, probability theory can be applied to make inferences

Non-Probability Samples
Judgmental – personal judgment is used to determine which individuals should be included in a sample
Convenience – a sample in which only the convenient or accessible individuals are selected
Quota – data obtained from specific subgroups to avoid over- or under-representation
Volunteer – individuals self-select to take part in a study

Probability Samples
Random
Systematic
Stratified Random
Clustered

Random Samples
For finite populations – each possible sample of size n has an equal probability of being selected
For infinite populations – all observations chosen are statistically independent
Sampling with & without replacement
What would a random sample look like for spatial data when mapped?
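
As a rough illustration of the Random Samples slide, the Python sketch below draws a sample of 20 from a sampling frame both without and with replacement, and then draws 20 random X,Y locations. The frame of 200 ID numbers, the 100 x 100 study area, the seed, and all variable names are assumptions made up for this example; only the sample size of 20 comes from the Step 3 example.

```python
import random

# Hypothetical sampling frame: ID numbers standing in for 200 geography majors.
sampling_frame = list(range(1, 201))
n = 20  # sample size from the Step 3 example

rng = random.Random(42)  # fixed seed only so the sketch is repeatable

# Without replacement: each individual can be selected at most once.
without_replacement = rng.sample(sampling_frame, k=n)

# With replacement: every draw is independent, so repeats are possible.
with_replacement = rng.choices(sampling_frame, k=n)

# A simple random spatial sample: independent, uniform X,Y pairs inside a
# hypothetical 100 x 100 study area.
random_points = [(rng.uniform(0, 100), rng.uniform(0, 100)) for _ in range(n)]

print(sorted(without_replacement))
print(sorted(with_replacement))
```

Plotting `random_points` would typically show an irregular scatter with some chance clumps and gaps, which is one way to approach the mapping question on the slide.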

Random Number Generators
This is one mechanism for actually getting a random sample
Key components
– Uniform probability of a number being selected (i.e., 1/10 chance for each digit 0 to 9)
– Independence (i.e., first number has no effect on second number)

Systematic Samples
Selecting every kth element of a sampling frame (e.g., taking every 10th element)
This is effectively random if
– A) the starting point is random
– B) there is no natural periodicity (i.e., pattern) to the data
What would a systematic sample look like for spatial data when mapped?

Stratified Random Samples
The sample frame is first split into fairly homogeneous classes, from which random samples are taken
Why would we do this?
What are the drawbacks of this approach?
What might a stratified random sample look like for spatial data?
Are there other options?

Cluster Samples
Data are grouped into heterogeneous clusters and a census is taken of randomly chosen clusters
Why might we use this approach, particularly for spatial data?
Why can this approach be problematic?
What would cluster samples look like on a map?

How do we decide which design to use?
Above all the sample should be as representative as possible
Choices made to increase efficiency, decrease cost or time, etc. should be made carefully
Many sampling designs are hybrids

Sampling Distributions
Recall from chapter 2 that there are descriptive statistics (e.g., the mean and the standard deviation) for both samples and populations
Recall from chapter 6 that a random variable (X) can be any value (x1 … xn) from a population, each with an associated probability
Therefore, random variables can be defined as functions (f) or probability distributions (e.g., a histogram or a curve) based on the values they can take on

Sampling Distributions
Now extend this concept so that our sample is a set of random variables from a population
What is the mean of the sample?
Just like you’d expect, the mean is the sum of the random variables divided by n
$\bar{X} = \frac{X_1 + X_2 + \dots + X_n}{n}$
BUT, since each random variable is random, the mean itself is also a random variable
Therefore the sample mean can also be defined as a function or distribution

Sampling Distributions
Think of this as a new distribution (curve, function, etc.) where the graphed values relate to the mean values of the random variables (X), noted as $\bar{X}$
Conceptually this is the histogram that you would produce using the mean values from many independent samples from the same population
The sample statistic is the random variable (in this case the mean) based on a sample of random variables
Since the sample statistic is a random variable, it has a distribution, which is known as the sampling distribution

Example
Let the population = {10, 12, 13, 16, 19, 20}
The population mean (μ) = 90/6 = 15
The population standard deviation (σ) = 3.65
Let the sample size (n) = 4
All possible samples (15 total combinations):

Example Continued
(The tables listing the 15 samples and their means are not preserved in this transcript.)

Example Continued
In this case, because we have all possible samples (n = 4) from our population, $\mu_{\bar{X}} = \mu$
$\mu_{\bar{X}} = \frac{12.75 + 13.5 + \dots + 17}{15} = 15$
$\sigma_{\bar{X}} = 1.54$
Why are the standard deviation values different?
The standard deviation of a sampling distribution is known as its standard error

Central Limit Theorem
For a large n (n > 30)
The distribution of $\bar{X}$ will be approximately normal
The peak of the distribution is then the “mean of the means”, and since the distribution is normal we can estimate the actual population mean (μ) with some degree of confidence
The standard deviation of $\bar{X}$ is: $\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$
As n increases the distribution of $\bar{X}$ becomes more peaked (i.e., the variability of $\bar{X}$ decreases and $\bar{X}$ more closely approximates μ)

Central Limit Theorem
How is this theorem useful to us?
“it provides a way of deducing the results of a sample based only on a knowledge of population mean and standard deviation”… and “determine the probability that a sample mean statistics is >, <, or within a given interval”
The key to making this possible is the approximate normal distribution, for which we can easily apply z-values

Example
Height of middle school kids
μ = 60 inches
σ = 10 inches
What is the probability of having a class (n = 30) with a mean height of 70 inches or more?
Remember that $\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$
$z = \frac{\bar{X} - \mu}{\sigma_{\bar{X}}} = \frac{70 - 60}{10 / \sqrt{30}} = \frac{10}{1.826} = 5.477$
$P(Z > 5.477) < 0.0001$

Geographic Sampling
Why might we want to collect a geographic sample?
– Many data are distributed spatially, but you could argue that space can be coded aspatially
– However:
– Recording space in addition to characteristics allows other (independent) variables to be derived as needed
– Space can act as a place holder for things we don’t fully understand
What can we do with geographic samples?
Welcome to the field of spatial analysis….

Geographic Sampling
The sampling frame (i.e., all possible samples) is typically defined using Cartesian coordinates (X and Y values)
Sample sites are then selected by choosing X,Y pairs using some sampling procedure (e.g., random)
Common Geographic Sampling Methods (i.e., the geographic unit/object being sampled)
– Quadrats
– Transects (traverses)
– Point sampling

Quadrats
A square, areal sample (i.e., a polygon)
Quadrat size depends on feature of interest & research question – picking the “right” size can be challenging
Arrangement of quadrats on the landscape can be any of the types previously mentioned
– Random
– Systematic
– Stratified Random
– Clustered
Quadrat orientation can also vary

Transects
Transects sample a geographic area along lines
Placing the sample lines is how randomness, etc. is included in the sample
Options discussed in textbook
– Random
– Systematic
– Stratified random
– Stratified systematic
The n value can be the total length (L) or the number of transects (typically of uniform length)

Point Samples
Sampling at point locations
Locating the points can be done similarly to locating quadrats
Options discussed in textbook
– Random
– Systematic
– Stratified random
– Clustered
– Stratified, systematic, unaligned sample

Example
Carolina Vegetation Survey (CVS)
Paper is on blackboard (.pdf)
This is one example of a real sampling approach
In biogeography, ecology, etc. we often sample using nested quadrats

Summary
Think before you sample
Many sampling approaches (spatial or aspatial) exist; choosing the “right” one takes experience and some knowledge of the population
The Central Limit Theorem is related to the distribution of certain sample statistics (the mean in particular) and is important for inference
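
The two worked examples (the six-value population and the class-height question) can be checked numerically. The Python sketch below uses only the standard library; it enumerates every sample of size 4 from the population {10, 12, 13, 16, 19, 20} and then applies the z-value calculation with $\sigma_{\bar{X}} = \sigma / \sqrt{n}$ to the height example. The function and variable names are illustrative assumptions, not taken from the textbook. Enumeration confirms 15 samples with a mean of means of 15. The spread of those 15 means comes out near 1.15 with the divide-by-N formula, which differs from the 1.54 shown in the example, so that figure may reflect a different calculation. For the height example the standard error is about 1.826, which gives z of about 5.48 and an upper-tail probability on the order of 10^-8.

```python
import math
from itertools import combinations
from statistics import NormalDist, fmean, pstdev

# Population and sample size from the worked example.
population = [10, 12, 13, 16, 19, 20]
n = 4

# Every possible sample of size n drawn without replacement.
samples = list(combinations(population, n))
sample_means = [fmean(s) for s in samples]

mu = fmean(population)               # population mean
sigma = pstdev(population)           # population standard deviation (divide by N)
mean_of_means = fmean(sample_means)  # center of the sampling distribution
std_error = pstdev(sample_means)     # spread of the 15 sample means (divide by 15)


def upper_tail_probability(x_bar, mu, sigma, n):
    """Return (z, P(sample mean >= x_bar)) using the normal approximation."""
    standard_error = sigma / math.sqrt(n)
    z = (x_bar - mu) / standard_error
    return z, 1.0 - NormalDist().cdf(z)


# Height example: mu = 60 in, sigma = 10 in, class of n = 30, mean of 70 in.
z, p = upper_tail_probability(70, 60, 10, 30)

print(len(samples), mu, round(sigma, 2))
print(mean_of_means, round(std_error, 2))
print(round(z, 3), p)
```

Changing `x_bar` in the last call (for example to 62 or 63) gives tail probabilities that are easier to see on a z-table, since a class mean of 70 sits more than five standard errors above the population mean.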