Sociology 5811:
Lecture 7: Samples, Populations,
The Sampling Distribution
Copyright © 2005 by Evan Schofer
Do not copy or distribute without
permission
Announcements
• Problem Set #2 due today!
Review: Populations
• Population: The entire set of persons, objects, or
events that have at least one common characteristic
of interest to a researcher (Knoke, p. 15)
• Beyond the literal definition, a population is the general group
that we wish to study and gain insight into
• Sample: A subset of a population
• Random Sample: A sample chosen from a
population such that each observation has an equal
chance of being selected (Knoke, p. 77)
• Randomness is one strategy to avoid biased samples.
Review: Statistical Inference
• Statistical inference: making statistical
generalizations about a population from evidence
contained in a sample (Knoke, p. 77)
• When is statistical inference likely to work?
• 1. When a sample is large
• If a sample approaches the size of the population, it is likely to
be a good reflection of that population
• 2. When a sample is representative of the entire
population
• As opposed to a sample that is atypical in some way, and
thus not reflective of the larger group.
Populations and Samples
• Population parameters (μ, σ) are constants
• There is one true value, but it is usually unknown
• Sample statistics (Y-bar, s) are variables
• Up until now we’ve treated them as constants
• But, there are many possible samples
• The values of the mean and S.D. vary depending on which sample
you have
• Like any variable, the mean and S.D. have a
distribution
• Called the “sampling distribution”
• Made up of the values from all possible samples of a given population
Populations and Samples: Overview

                       Population                        Sample
Characteristics:       “parameters”                      “statistics”
Characteristics are:   constant (one for population)     variables (vary for each sample)
Notation:              Greek (μ, σ)                      Roman (Y-bar, s)

• Sample statistics serve as estimates of population parameters
• “Hat” notation (e.g., σ̂) marks a “point estimate” based on a sample
Population and Sample Distributions

[Figure: the population distribution, described by parameters μ and σ, next to a sample distribution, described by statistics Y-bar and s]
Estimating the Mean
• Suppose we want to know the mean of a
population (μ). What do we do?
• Plan A: Spend $100 million to survey our
entire population
• If it is even possible to survey the whole population
• Plan B: Spend $1,000 sampling a few hundred
people.
• Estimate the mean
• Simply use formulas to estimate μ; the estimate is written μ̂
Estimating the Mean
• Question: Given our sample, what is our best
guess of the population mean?
• Answer: The sample mean: Y-bar
• Look at Y-bar, assume that it is a “good guess”
• Thus, we calculate:

$$\hat{\mu} = \bar{Y} = \frac{1}{N}\sum_{i=1}^{N} Y_i$$
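A minimal sketch of this calculation in Python (not part of the original lecture; the data values are made up):

```python
# Point estimate of the population mean:
# mu-hat = Y-bar = (1/N) * sum of the Y_i
sample = [23, 41, 37, 29, 35]  # hypothetical sample values

mu_hat = sum(sample) / len(sample)  # Y-bar, our best guess of mu
print(mu_hat)  # 33.0
```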
Estimating the Mean
• Issue: There are an infinite number of possible
samples that one can take from any population
– Each possible sample has a mean, most of which are
different
• Some are close to the population mean, some not
• Q: How do we know if we got a “good guess”?
• A: We can’t know for sure. We may draw
incorrect conclusions about the mean
• But: We can use probability theory to determine if our
guess is likely to be good!
Estimates and Sampling Distributions
• It is possible to take more than one sample
• And calculate more than one estimate of the mean
• If we took many samples (and calculated many
means), we’d see a range of estimates
• We could even plot a histogram of the many estimates
• Our confidence in our guess depends on how
“spread out” the range of guesses tends to be
• The “standard deviation” of that particular histogram.
Sampling Distributions
• Sampling Distribution: The distribution of
estimates created by taking all possible unique
samples (of a fixed size) from a population
• Example: Take every possible 10-person sample
of sociology graduate students (all combinations)
• 1. Calculate the mean of each sample
• 2. Graph a histogram of all estimates
• This is called “the sampling distribution of the mean”
• Note: The sampling distribution is rarely known
• It is typically thought of as a probability distribution.
Sampling Distribution Notation
• Population mean and S.D. are: μ, σ
• Each sample has a mean and S.D.: Y-bar, s
• The sampling distribution of the mean (i.e., the
distribution of mean-estimates) also has a mean
• And a S.D., aka the “standard error”
• Mean, S.D. of sampling distribution: $\mu_{\bar{Y}}$, $\sigma_{\bar{Y}}$
• Question: Why are they Greek?
• A: Because all possible samples represent a population
• Question: Why is there a sub-Y-bar?
• Because it is the mean of all possible Y-bars (means)
Sampling Distribution of the Mean
• It turns out that under some circumstances, the
shape of the sampling distribution of the mean
can be determined
– Thus allowing one to get a sense of the range of
estimates of the mean one is likely to see
• If distribution is narrow, our guess is probably good!
• If S.D. is large, our guess may be quite bad
• This provides insight into the probable location of
the population mean
• Even if you only have one single sample to look at
• This “trick” lets us draw conclusions!!!
Sampling Distribution Example
• Let’s create a sampling distribution from a small
population, μ = 52. (Sample N = 3)
Case   # of CDs
1      30
2      100
3      20
4      70
5      40

• Note how the mean varies depending on the sample
• Mean of cases 1,2,3 = 50
• Mean of 2,4,5 = 70
• For this population (N=5) we can calculate all possible means based on sample size 3
Sampling Distribution Example
• First, we must calculate every possible mean
Case   # of CDs
1      30
2      100
3      20
4      70
5      40

• 1,2,3 = 50
• 1,2,4 = 66.67
• 1,2,5 = 56.67
• 1,3,4 = 40
• 1,3,5 = 30
• 1,4,5 = 46.67
• 2,3,4 = 63.33
• 2,3,5 = 53.33
• 2,4,5 = 70
• 3,4,5 = 43.33
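These means are easy to reproduce by brute force. A short Python sketch (using the five-case CD population from the slide) that enumerates every size-3 sample:

```python
from itertools import combinations

# Population of 5 cases: case number -> # of CDs owned (from the slide)
population = {1: 30, 2: 100, 3: 20, 4: 70, 5: 40}

# Enumerate all C(5,3) = 10 possible samples of size 3
sampling_dist = []
for cases in combinations(population, 3):
    mean = sum(population[c] for c in cases) / 3
    sampling_dist.append(mean)
    print(cases, round(mean, 2))

# The mean of the sampling distribution equals the population mean
print(sum(sampling_dist) / len(sampling_dist))  # 52.0
```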
Sampling Distribution Example
Sample   Y-bar
1        50
2        66.67
3        56.67
4        40
5        30
6        46.67
7        63.33
8        53.33
9        70
10       43.33
• Here, you can see how the
sample mean is really a variable
• This complete list of all
possible means is the sampling
distribution
• As a probability distribution, this
tells us the probability of picking a
sample with each mean
• Note: Sampling Dist mean = 52
• Same as population mean!
Sampling Distribution Example
• Histogram of Sampling Distribution (N=3):

[Figure: histogram of the 10 sample means, with bins 17-27, 27-37, 37-47, 47-57, 57-67, 67-77, 77-87 and μ = 52 marked]

• Note: The distribution centers around the population mean
• And, it is roughly symmetrical
Sampling Distribution Example
• As a probability distribution, the sampling
distribution gives a sense of the quality of our
estimate of μ
• Probability = Frequency / N
• The probability of picking a sample with a mean that is
within +/- 5 of μ is p = .3 (30%)
• The probability of overestimating μ by more than 15 is
about p = .1 (10%)

[Figure: the same histogram rescaled to probabilities (0 to .4), with μ = 52 marked]

• Q: What is the probability of a “poor” estimate of μ?
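Continuing the Python sketch above, both probabilities can be read straight off the ten enumerated means:

```python
# The 10 possible sample means (the sampling distribution from the slide)
means = [50, 66.67, 56.67, 40, 30, 46.67, 63.33, 53.33, 70, 43.33]
mu = 52

# P(sample mean within +/- 5 of mu), i.e., in [47, 57]
print(sum(1 for m in means if abs(m - mu) <= 5) / len(means))  # 0.3

# P(overestimating mu by more than 15), i.e., mean > 67
print(sum(1 for m in means if m - mu > 15) / len(means))  # 0.1
```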
Sampling Distribution Example
• Note: If the sampling distribution is narrow, most
of our estimates of the mean will be good
• That is, they will be close to μ, the population mean
• If the sampling distribution is wide, the
probability of a “bad” estimate goes up
• A measure of dispersion can help us assess the
sampling distribution
• Recall: the standard deviation of a sampling distribution is
called: the standard error
• It tells us the width of the sampling distribution!
The Central Limit Theorem
• But, how do we know the width of the sampling
distribution?
• Statisticians have shown that the sampling
distribution will have consistent properties, if we
have a large sample
• Several of these properties constitute the “Central
Limit Theorem”
• These properties provide the basis for drawing statistical
inferences about the mean.
The Central Limit Theorem
• If you have a large sample (Large N):
• 1. The sampling distribution of the mean (and
thus all possible estimates of the mean) cluster
around the true population mean
• 2. They cluster as a normal curve
• Even if the population distribution is not normal
• 3. The estimates are dispersed around the
population mean by a knowable standard
deviation (sigma over root N)
The Central Limit Theorem
• Formally stated:
1. As N grows large, the sampling distribution of
the mean approaches normality
2. $\mu_{\bar{Y}} = \mu_Y$
3. $\sigma_{\bar{Y}} = \dfrac{\sigma_Y}{\sqrt{N}}$
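The theorem is easy to see by simulation. A sketch (assuming NumPy; the skewed exponential population is an arbitrary choice) that draws many samples from a non-normal population:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n = 5.0, 5.0, 100  # exponential population: mean = S.D. = 5

# Draw 10,000 samples of size n and compute the mean of each
sample_means = rng.exponential(scale=mu, size=(10_000, n)).mean(axis=1)

# CLT predictions: center = mu, spread = sigma / sqrt(n)
print(sample_means.mean())  # close to 5.0
print(sample_means.std())   # close to 5 / sqrt(100) = 0.5
```

A histogram of `sample_means` would look close to a normal curve, even though the population itself is strongly skewed.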
Central Limit Theorem: Visually
[Figure: the population distribution (μ, σ) and a sample distribution (Y-bar, s), with the sampling distribution of the mean centered at $\mu_{\bar{Y}}$ with standard error $\sigma_{\bar{Y}}$]
Implications of the C.L.T
• What does this mean for us?
• Typically, we only have one sample, and thus
only one estimate of μ
• The actual value of μ is unknown
• So we don’t know the center of the sampling distribution
• All we know for certain is that our estimate falls
somewhere in the sampling distribution
• This is always true by definition
• And, later, we’ll estimate its width.
Implications of the C.L.T
• Visually: Suppose we observe μ̂ = 16

[Figure: four sampling distributions, each with μ̂ = 16 falling at a different position relative to μ. There are many possible locations of μ, but μ̂ always falls within the sampling distribution]
Implications of the C.L.T
• We know that the mean from our sample falls
somewhere in this sampling distribution
• Which has mean μ and standard deviation σ over square root N
• If we can estimate σ, we can estimate σ over
root N... The “Standard Error” of the mean
• We don’t know exactly where the sample falls
• But, laws of probability suggest that we are most likely to
draw a sample w/mean from near the center
• Recall: In a normal curve, about 68% of cases fall within +/- 1 SD
and 95% within +/- 2 SD
• So, we can determine the range around μ in which 95% (or
99%, or 99.9%) of cases will fall.
Implications of the C.L.T
• What is the relation between the Standard Error
and the size of our sample (N)?
• Answer: It is an inverse relationship.
• The standard deviation of the sampling distribution shrinks
as N gets larger
• Formula:

$$\sigma_{\bar{Y}} = \frac{\sigma_Y}{\sqrt{N}}$$
• Conclusion: Estimates of the mean based on
larger samples tend to cluster closer around the
true population mean.
Implications of the CLT
• Visually: The width of the sampling distribution
is an inverse function of N (sample size)
– The distribution of mean estimates based on N = 10
will be more dispersed. Mean estimates based on
N = 50 will cluster closer to μ.

[Figure: two sampling distributions of μ̂ centered at μ; the smaller sample size gives a wide distribution, the larger sample size a narrow one]
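A quick simulation check of this inverse relationship (assuming NumPy; μ = 100 and σ = 10 are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma = 100.0, 10.0

for n in (10, 50):  # the two sample sizes from the figure
    # Spread of 10,000 sample means vs. the formula sigma / sqrt(n)
    means = rng.normal(mu, sigma, size=(10_000, n)).mean(axis=1)
    print(n, round(float(means.std()), 2), round(sigma / n**0.5, 2))
# n = 10 gives a standard error near 3.16; n = 50 gives one near 1.41
```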
Confidence Intervals
• Benefits of knowing the width of the sampling
distribution:
• 1. You can figure out the general range of error
in a given point estimate
• based on the range around the true mean in which the
estimates will fall
• 2. And, this defines the range around an estimate
that is likely to hold the population mean
• A “confidence interval”
• Note: These only work if N is large!
Confidence Interval
• Confidence Interval: “A range of values around a
point estimate that makes it possible to state the
probability that an interval contains the
population parameter between its lower and upper
bounds.” (Bohrnstedt & Knoke p. 90)
• It involves a range and a probability
• Examples:
• We are 95% confident that the mean number of CDs owned
by grad students is between 20 and 45
• We are 50% confident the mean rainfall this year will be
between 12 and 22 inches.
Confidence Interval
• Visually: It is probable that μ falls near μ̂

[Figure: a sampling distribution around μ̂ = 16, showing probable values of μ near the estimate and ranges where μ is unlikely to be. Q: Can μ be this far from μ̂? A: Yes, but it is very improbable]
Confidence Interval
• To figure out the range of “error” in our mean
estimate, we need to know the width of the
sampling distribution
– The Standard Error! (The S.D. of this distribution)
• The Central Limit Theorem provides a formula:
$$\sigma_{\bar{Y}} = \frac{\sigma_Y}{\sqrt{N}}$$
• Problem: We do not know the exact value of
sigma-sub-Y, the population standard deviation!
Confidence Interval
• Question: How do we calculate the standard
error if we don’t know the population S.D.?
• Answer: We estimate it using the information we
have
• Formula for best estimate:
$$\hat{\sigma}_{\bar{Y}} = \frac{s_Y}{\sqrt{N}}$$
• Where N is the sample size and s-sub-Y is the
sample standard deviation
95% Confidence Interval Example
• Suppose a sample of 100 students with mean SAT
score of 1020, standard deviation of 200
• How do we find the 95% Confidence Interval?
• If N is large, we know that:
• 1. The sampling distribution is roughly normal
• 2. Therefore 95% of samples will yield a mean estimate
within 2 standard deviations (of the sampling distribution)
of the population mean ()
• Thus, 95% of the time, our estimates of μ (Y-bar)
are within two “standard errors” of the actual
value of μ.
95% Confidence Interval
• Formula for 95% confidence interval:
$$95\%\ \mathrm{CI}: \bar{Y} \pm 2(\sigma_{\bar{Y}})$$

• Where Y-bar is the mean estimate and $\sigma_{\bar{Y}}$ is the standard error
• Result: two values, an upper and lower bound
• Adding our estimate of the standard error:

$$\bar{Y} \pm 2(\hat{\sigma}_{\bar{Y}}) = \bar{Y} \pm 2\left(\frac{s_Y}{\sqrt{N}}\right)$$
95% Confidence Interval
• Suppose a sample of 100 students with mean SAT
score of 1020, standard deviation of 200
• Calculate:

$$95\%\ \mathrm{CI}: \bar{Y} \pm 2\left(\frac{s}{\sqrt{N}}\right)$$

$$1020 \pm 2\left(\frac{200}{\sqrt{100}}\right) = 1020 \pm 2\left(\frac{200}{10}\right) = 1020 \pm 2(20) = 1020 \pm 40$$

• Thus, we are 95% confident that the population
mean falls between 980 and 1060.
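The slide’s arithmetic in a few lines of Python (the multiplier 2 is the rough normal-curve value; 1.96 would be exact):

```python
from math import sqrt

y_bar, s, n = 1020, 200, 100  # sample mean, sample S.D., sample size

se_hat = s / sqrt(n)          # estimated standard error: 200/10 = 20
print(y_bar - 2 * se_hat, y_bar + 2 * se_hat)  # 980.0 1060.0
```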