Sampling
Chapter 7 of the textbook
Pages 215-250
Introduction
Descriptive statistics allow us to describe and
summarize our data
Inferential statistics allow us to infer unknown
values and/or the probability of events occurring
using the data we have (this lets us test
hypotheses)
Probability is at the heart of inferential statistics
because we base our results on samples
Sampling Error
Sampling error occurs when sample characteristics deviate
from population characteristics (i.e., an unrepresentative
sample)
Based on random chance
How do we know what the population characteristics are?
How can sampling error be decreased?
“sampling error is not a ‘mistake’…. All samples deviate
from the population in some way; thus, sampling error is
always present”
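One way to see that sampling error is always present, and how its typical size responds to sample size, is a small simulation. This is only a sketch: the population below is invented (1,000 normally distributed values), and the function name `mean_abs_sampling_error` is an illustrative choice, not something from the textbook.

```python
import random
import statistics

# Hypothetical population of 1,000 values, invented purely for illustration.
rng = random.Random(42)  # fixed seed so the run is repeatable
population = [rng.gauss(50, 10) for _ in range(1000)]
population_mean = statistics.fmean(population)

def mean_abs_sampling_error(sample_size, trials=200):
    """Average |sample mean - population mean| over repeated random samples."""
    errors = []
    for _ in range(trials):
        sample = rng.sample(population, sample_size)  # drawn without replacement
        errors.append(abs(statistics.fmean(sample) - population_mean))
    return statistics.fmean(errors)

for n in (10, 100, 500):
    print(f"n = {n:>3}: typical sampling error = {mean_abs_sampling_error(n):.2f}")
```

In runs like this the error never reaches exactly zero, but it shrinks as the sample size grows, which is one answer to the question of how sampling error can be decreased.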
Good Points From Textbook
Uncertainty associated with sampling error
is the price one pays for using a sample.
“The appeal of statistics is not that it
removes uncertainty but rather that it
permits inference in the presence of
uncertainty.”
Sampling Bias
When the selection of the sample favors
inclusion of members of a population with
certain characteristics
Based on sampling procedures (a.k.a.
sampling design or sampling scheme)
How can sampling bias be minimized?
Why Sample?
Samples take less time and money to collect – this is particularly true
when very detailed measurements are being taken (e.g., in-depth
interviews with follow-ups)
A census has error too (contains non-sampling errors)
The population may be infinite
The act of sampling may be destructive
The population may be hypothetical
Populations may change rapidly (i.e., require repeated measurement)
Steps in the Sampling Process
Critical pre-sampling step
– Make sure there is a need to sample (i.e., see if
available data may suit your study)
– This can be challenging and can require a lot of
effort, but it is usually MUCH easier than
collecting a new sample
Rushing out to sample will often come back
to haunt you (think before you collect!)
Steps in the Sampling Process
Step 1 – Define your population
– Who or what are you interested in (or not)?
– This is directly connected with the research question(s)
you are asking
– Example: Geography Students
• Does this mean majors or any students in a geography class?
• Does this mean graduate and undergraduate students?
• Does this include alumni or only current students?
Steps in the Sampling Process
Step 2 – Construct a sampling frame
– A sampling frame is an exhaustive list of all individuals in a
population (i.e., who/what can be sampled)
– A sample is only relevant to its sampling frame; it is not appropriate for
making inferences beyond the defined sampling frame
– Differentiate the target population from the sample population
• Target populations are all the individuals relevant to a study (i.e.,
who/what we want to include)
• The sample population is who/what we actually sample
• Goal is for these to be equal
• The target and sample populations can differ when some of the target
population can’t be sampled
– Example: A list of all geography majors
Steps in the Sampling Process
Step 3 – Sampling Design
– The procedures we use to select members from
the sampling frame for the sample
– There are many ways to do this; we’ll discuss several in
the coming slides
– Example: how we will go about picking a
sample of 20 geography majors
Steps in the Sampling Process
Step 4 – Specify the information to be collected
– Based on pre-testing (a.k.a. pilot testing) the
instrument, tools, staff abilities, logistical constraints,
etc.
– The data collected relate directly to the research
questions
– This step assesses feasibility and corrects any problems
before sampling begins
– Example: What questions we will ask the students and
about how long it will take to conduct an interview
Steps in the Sampling Process
Step 5 – Data Collection
– The actual collection of the sample
– Data accuracy and quality are determined
during this step (i.e., non-sampling error
happens here)
– Careful measurement and careful recording of
the measurements are key
Types of Samples
Non-Probability Sample – the likelihood of an
individual being sampled is unknown
– The sample may be representative or not, but the
quality of the sample cannot be determined
Probability Sample – the likelihood of an
individual being sampled is known
– Since the probabilities are known, probability theory
can be applied to make inferences
Non-Probability Samples
Judgmental – personal judgment is used to determine
which individuals should be included in a sample
Convenience – sample in which only the convenient or
accessible individuals are selected
Quota – data obtained from specific subgroups to avoid
over- or under-representation
Volunteer Sample – individuals self-select to take part in a
study
Probability Samples
Random
Systematic
Stratified Random
Clustered
Random Samples
For finite populations – each possible sample of
size n has an equal probability of being selected
For infinite populations – all observations chosen
are statistically independent
Sampling with & without replacement
What would a random sample look like for spatial
data when mapped?
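The difference between sampling with and without replacement can be sketched with Python's standard `random` module. The sampling frame here (200 made-up student IDs) and the sample size of 20 are assumptions chosen to echo the geography majors example.

```python
import random

rng = random.Random(7)  # fixed seed so the run is repeatable

# Hypothetical sampling frame: 200 made-up student IDs.
sampling_frame = [f"student_{i:03d}" for i in range(1, 201)]

# Without replacement: no individual can be selected twice.
without_replacement = rng.sample(sampling_frame, k=20)

# With replacement: the same individual may be drawn more than once.
with_replacement = rng.choices(sampling_frame, k=20)

print(without_replacement)
print(with_replacement)
```

With `rng.sample`, every possible set of 20 students is equally likely, which matches the finite-population definition above.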
Random Number Generators
This is one mechanism for actually getting a
random sample
Key components
– Uniform probability of a number being selected
(i.e., 1/10 chance for each digit 0 to 9)
– Independence (i.e., first number has no effect
on second number)
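The two key components can be checked informally on a stream of generated digits. The sketch below uses Python's built-in pseudo-random generator; the number of draws and the "consecutive repeats" check are illustrative choices, and neither is a formal test of randomness.

```python
import random
from collections import Counter

rng = random.Random(123)  # seeded pseudo-random generator, for repeatability
digits = [rng.randrange(10) for _ in range(10_000)]

# Uniformity: each digit 0-9 should appear in roughly 1/10 of the draws.
counts = Counter(digits)
proportions = {d: counts[d] / len(digits) for d in range(10)}

# Independence (one crude check): if a draw has no effect on the next one,
# consecutive draws should be equal about 1/10 of the time.
repeats = sum(a == b for a, b in zip(digits, digits[1:]))
repeat_rate = repeats / (len(digits) - 1)

print(proportions)
print(f"share of consecutive repeats: {repeat_rate:.3f}")
```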
Systematic Samples
Selecting every kth element of a sampling frame
(e.g., taking every 10th element)
This is effectively random if
– A) the starting point is random
– B) there is no natural periodicity (i.e. pattern) to the
data
What would a systematic sample look like for
spatial data when mapped?
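A systematic sample with a random starting point can be sketched in a few lines. The frame of 100 numbered units and the interval k = 10 are assumptions for illustration; `systematic_sample` is a hypothetical helper name.

```python
import random

def systematic_sample(frame, k, rng):
    """Take every k-th element of the frame, starting at a random position."""
    start = rng.randrange(k)  # random start among the first k elements
    return frame[start::k]

rng = random.Random(1)
sampling_frame = list(range(1, 101))  # hypothetical frame: units numbered 1..100
sample = systematic_sample(sampling_frame, k=10, rng=rng)
print(sample)
```

If the frame has a periodic pattern whose period matches k, every selected unit falls at the same point in the cycle, which is why condition B matters.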
Stratified Random Samples
The sample frame is first split into fairly homogeneous
classes, from which random samples are taken
Why would we do this?
What are the drawbacks of this approach?
What might a stratified random sample look like for spatial
data?
Are there other options?
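A stratified random sample can be sketched as "split the frame into strata, then draw a simple random sample inside each one." The version below assumes proportional allocation (the same sampling fraction in every stratum), which is only one of several possible allocation rules. The frame, strata labels, and function name are made up for illustration.

```python
import random
from collections import Counter, defaultdict

def stratified_sample(frame, stratum_of, fraction, rng):
    """Draw a simple random sample of the given fraction within each stratum."""
    strata = defaultdict(list)
    for unit in frame:
        strata[stratum_of(unit)].append(unit)
    sample = []
    for label in sorted(strata):
        members = strata[label]
        k = max(1, round(fraction * len(members)))  # at least one unit per stratum
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical frame: (id, stratum) pairs.
frame = (
    [(f"U{i}", "undergraduate") for i in range(120)]
    + [(f"G{i}", "graduate") for i in range(60)]
    + [(f"A{i}", "alumni") for i in range(20)]
)
sample = stratified_sample(frame, stratum_of=lambda unit: unit[1],
                           fraction=0.1, rng=random.Random(3))
counts = Counter(label for _, label in sample)
print(counts)
```

Each stratum is guaranteed representation in proportion to its size, which a simple random sample of 20 would not guarantee.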
Cluster Samples
Data are grouped into heterogeneous clusters and
a census is taken of randomly chosen clusters
Why might we use this approach, particularly for
spatial data?
Why can this approach be problematic?
What would cluster samples look like on a map?
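Cluster sampling reverses the logic of stratification: whole clusters are chosen at random, and every unit inside a chosen cluster is measured. The sketch below uses invented clusters (five city blocks of four households each); `cluster_sample` is a hypothetical helper name.

```python
import random

def cluster_sample(clusters, n_clusters, rng):
    """Randomly choose whole clusters, then take a census of each chosen cluster."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    units = [unit for name in chosen for unit in clusters[name]]
    return chosen, units

# Hypothetical clusters: five city blocks, each listing its households.
clusters = {
    f"block_{b}": [f"block_{b}_house_{h}" for h in range(1, 5)]
    for b in "ABCDE"
}
chosen, units = cluster_sample(clusters, n_clusters=2, rng=random.Random(5))
print(chosen)
print(units)
```

Fieldwork is concentrated in a few places, which saves travel. The sample is only as representative as the chosen clusters are of the whole population.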
How do we decide which design to
use?
Above all the sample should be as
representative as possible
Choices made to increase efficiency,
decrease cost or time, etc. should be made
carefully
Many sampling designs are hybrids
Sampling Distributions
Recall from chapter 2 that there are descriptive statistics
(e.g., the mean and the standard deviation) for both
samples and populations
Recall from chapter 6 that a random variable (X) can be
any value (x1 … xn) from a population, each with an
associated probability
Therefore, random variables can be defined as functions (f)
or probability distributions (e.g., a histogram or a curve)
based on the values they can take on
Sampling Distributions
Now extend this concept so that our sample is a set of random
variables from a population
What is the mean of the sample?
Just like you’d expect, the mean is the sum of the random variables
divided by n
$$\bar{X} = \frac{X_1 + X_2 + \dots + X_n}{n}$$
BUT, since each random variable is random, the mean itself is also a
random variable
Therefore the sample mean can also be defined as a function or
distribution
Sampling Distributions
Think of this as a new distribution (curve, function, etc.)
where the graphed values relate to the mean values of the
random variables (X), denoted $\bar{X}$
Conceptually this is the histogram that you would
produce using the mean values from many independent
samples from the same population
The sample statistic is the random variable (in this case
the mean) based on a sample of random variables
Since the sample statistic is a random variable, it has a
distribution, which is known as the sampling distribution
Example
Let the population = {10,12,13,16,19,20}
The population mean (μ) = 90/6 = 15
The population standard deviation (σ) = 3.65
Let the sample size (n) = 4
All possible samples (15 total combinations):
Example Continued
In this case, because we have all possible samples (n = 4)
from our population, $\mu_{\bar{X}} = \mu$:
$$\mu_{\bar{X}} = \frac{12.75 + 13.5 + \dots + 17}{15} = 15$$
$$\sigma_{\bar{X}} \approx 1.15$$
Why are the standard deviation values different?
The standard deviation of a sampling distribution is known
as its standard error
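The example is small enough to enumerate in full. The sketch below lists all 15 samples of size 4, then prints the mean and standard deviation of the sample means so they can be checked against the slide. The finite population correction in `predicted_se` is an addition not taken from the slides: it is the standard adjustment to σ/√n when sampling without replacement from a small population.

```python
import math
from itertools import combinations
from statistics import fmean, pstdev

population = [10, 12, 13, 16, 19, 20]
n = 4
N = len(population)

samples = list(combinations(population, n))  # every possible sample of size 4
sample_means = [fmean(s) for s in samples]

mu = fmean(population)
sigma = pstdev(population)            # population standard deviation
mean_of_means = fmean(sample_means)
sd_of_means = pstdev(sample_means)    # standard error from the full sampling distribution

# Assumption beyond the slides: for sampling without replacement from a finite
# population, the standard error is sigma / sqrt(n) times a correction factor.
predicted_se = sigma / math.sqrt(n) * math.sqrt((N - n) / (N - 1))

for s, m in zip(samples, sample_means):
    print(s, m)
print(f"mu = {mu}, sigma = {sigma:.2f}")
print(f"mean of sample means = {mean_of_means}")
print(f"SD of sample means = {sd_of_means:.2f}, predicted = {predicted_se:.2f}")
```

The sample means vary less than the individual values because averaging four observations smooths out the extremes.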
Central Limit Theorem
For a large n (n > 30), the distribution of $\bar{X}$ will be approximately normal
The peak of the distribution is then the “mean of the means”, and
since the distribution is normal we can estimate the actual population
mean (μ) with some degree of confidence
The standard deviation of $\bar{X}$ is:
$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$$
As n increases the distribution of $\bar{X}$ becomes more peaked (i.e., the
variability of $\bar{X}$ decreases and $\bar{X}$ more closely approximates μ)
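The theorem can be illustrated by simulation. The sketch below assumes a deliberately skewed, made-up population (exponential with mean 1) and draws samples with replacement so that the σ/√n formula applies directly. The number of trials and the sample sizes are arbitrary choices.

```python
import math
import random
import statistics

rng = random.Random(2024)  # fixed seed so the run is repeatable

# Hypothetical, strongly right-skewed population (exponential with mean 1).
population = [rng.expovariate(1.0) for _ in range(20_000)]
mu = statistics.fmean(population)
sigma = statistics.pstdev(population)

def simulate_sample_means(n, trials=1_000):
    """Means of repeated samples of size n, drawn with replacement."""
    return [statistics.fmean(rng.choices(population, k=n)) for _ in range(trials)]

results = {}
for n in (5, 30, 100):
    means = simulate_sample_means(n)
    observed_mean = statistics.fmean(means)
    observed_sd = statistics.pstdev(means)
    predicted_sd = sigma / math.sqrt(n)
    results[n] = (observed_mean, observed_sd, predicted_sd)
    print(f"n = {n:>3}: mean of means = {observed_mean:.3f} (mu = {mu:.3f}), "
          f"SD of means = {observed_sd:.3f}, sigma/sqrt(n) = {predicted_sd:.3f}")
```

Even though the population is far from normal, the spread of the sample means should track σ/√n closely and narrow as n grows.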
Central Limit Theorem
How is this theorem useful to us?
“it provides a way of deducing the results of a sample
based only on a knowledge of population mean and
standard deviation”… and “determine the probability that a
sample mean statistic is >, <, or within a given interval”
The key to making this possible is the approximate normal
distribution, for which we can easily apply z-values
Example
Height of middle school
kids
μ = 60 inches
σ = 10 inches
What is the probability of
having a class (n = 30)
with a mean height of 70
inches?
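Remember that $\sigma_{\bar{X}} = \sigma / \sqrt{n}$
$$z = \frac{\bar{X} - \mu}{\sigma_{\bar{X}}} = \frac{70 - 60}{10 / \sqrt{30}} = \frac{10}{1.826} \approx 5.477$$
$$P(Z \geq 5.477) \approx 2 \times 10^{-8}$$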
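The arithmetic in this example can be recomputed with the standard library alone. The sketch below reads the question as "a class mean of 70 inches or more" and obtains the upper-tail probability of the standard normal from the complementary error function. The helper name `upper_tail_probability` is an illustrative choice.

```python
import math

def upper_tail_probability(z):
    """P(Z >= z) for a standard normal variable, via the complementary error function."""
    return 0.5 * math.erfc(z / math.sqrt(2))

mu = 60.0      # population mean height (inches)
sigma = 10.0   # population standard deviation (inches)
n = 30         # class size
x_bar = 70.0   # class mean height of interest (inches)

standard_error = sigma / math.sqrt(n)   # standard deviation of the sample mean
z = (x_bar - mu) / standard_error
p_upper = upper_tail_probability(z)

print(f"standard error = {standard_error:.3f}")
print(f"z = {z:.3f}")
print(f"P(Z >= z) = {p_upper:.2e}")
```

Note that the z-score divides by the standard error, not by √n alone.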
Geographic Sampling
Why might we want to collect a geographic sample?
– Many data are distributed spatially, but you could argue that space
can be coded aspatially
– However:
– Recording space in addition to characteristics allows other
(independent) variables to be derived as needed
– Space can act as a place holder for things we don’t fully
understand
What can we do with geographic samples?
Welcome to the field of spatial analysis….
Geographic Sampling
The sampling frame (i.e., all possible sample sites) is typically
defined using Cartesian coordinates (X and Y values)
Sample sites are then selected by choosing X,Y pairs using
some sampling procedure (e.g., random)
Common Geographic Sampling Methods (i.e., the
geographic unit/object being sampled)
– Quadrats
– Transects (traverses)
– Point sampling
Quadrats
A square, areal sample (i.e., a polygon)
Quadrat size depends on feature of interest & research
question – picking the “right” size can be challenging
Arrangement of quadrats on the landscape can be any of
the types previously mentioned
– Random
– Systematic
– Stratified Random
– Clustered
Quadrat orientation can also vary
Transects
Transects sample a geographic area along lines
Placing the sample lines is how randomness etc. is
included in the sample
Options discussed in textbook
– Random
– Systematic
– Stratified random
– Stratified systematic
The n value can be the total length (L) or the number of
transects (typically of uniform length)
Point Samples
Sampling at point locations
Locating the points can be done similarly to locating
quadrats
Options discussed in textbook
– Random
– Systematic
– Stratified random
– Clustered
– Stratified, systematic, unaligned sample
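Three of these point designs can be sketched by generating X,Y pairs over a rectangular study area. Everything here is an assumption for illustration: the 100 x 100 area, the 5 x 5 grid, the function names, and the choice to stratify by grid cell with one random point per cell. The clustered and stratified, systematic, unaligned designs are not implemented.

```python
import random

def random_points(n, width, height, rng):
    """n points placed independently and uniformly over the study area."""
    return [(rng.uniform(0, width), rng.uniform(0, height)) for _ in range(n)]

def systematic_points(nx, ny, width, height, rng):
    """A regular nx-by-ny grid of points with a single random starting offset."""
    dx, dy = width / nx, height / ny
    x0, y0 = rng.uniform(0, dx), rng.uniform(0, dy)  # random start
    return [(x0 + i * dx, y0 + j * dy) for j in range(ny) for i in range(nx)]

def stratified_random_points(nx, ny, width, height, rng):
    """One uniformly random point inside each cell of an nx-by-ny grid."""
    dx, dy = width / nx, height / ny
    return [
        (rng.uniform(i * dx, (i + 1) * dx), rng.uniform(j * dy, (j + 1) * dy))
        for j in range(ny)
        for i in range(nx)
    ]

rng = random.Random(11)
width, height = 100.0, 100.0  # hypothetical study area, arbitrary units
random_pts = random_points(25, width, height, rng)
systematic_pts = systematic_points(5, 5, width, height, rng)
stratified_pts = stratified_random_points(5, 5, width, height, rng)
print(random_pts[:3])
print(systematic_pts[:3])
print(stratified_pts[:3])
```

When mapped, random points tend to show gaps and clumps, systematic points form a regular lattice, and stratified random points cover every cell without lining up.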
Example
Carolina Vegetation Survey (CVS)
Paper is on blackboard (.pdf)
This is one example of a real sampling
approach
In biogeography, ecology, etc. we often
sample using nested quadrats
Summary
Think before you sample
Many sampling approaches (spatial or aspatial)
exist; choosing the “right” one takes experience
and some knowledge of the population
The Central Limit Theorem is related to the
distribution of certain sample statistics (the
mean in particular) and is important for
inference