Topic 3. Inferential Statistics: Probability, the normal distribution, sampling distributions, estimation,
and hypothesis testing
Inferential Statistics
• Sample data and population data – we usually have sample data
• Inferential statistics – we use sample data to make predictions (to draw inferences) about the population
• When we don't have data for every case, our predictions are probably incorrect (due to sampling error)
• We never know for certain if we are drawing incorrect conclusions
Notation for Sample Statistics and Population Parameters
Population parameters are often displayed as Greek letters (or capital Roman letters) and sample statistics are
displayed as (lowercase) Roman letters or Greek letters with a ‘hat’ (aka caret):
Statistic             Population       Sample
Mean                  μ_y or μ_Y       ȳ (or μ̂)
Standard deviation    σ_y or S_Y       s_y (or σ̂_y)
Proportion            π or P           p (or π̂)
Probability and Probability Distributions
Probability – the ratio of the number of times the desired outcome can occur to the total number of outcomes that can occur over the long run. Probabilities are often expressed as ratios and/or proportions. Our
approach to probabilities falls within the classical or frequentist approach (rather than the Bayesian
approach)…for better or worse.
The probability of a ‘heads’ on one flip of an honest coin = 1/2 = 0.5
• The flip of a coin is a purely random event – we cannot predict the outcome with any certainty
• Elementary outcomes: heads, tails
• This is an equal probability process: each elementary outcome is equally likely
The probability of a ‘6’ on one roll of an honest die = 1/6 = 0.1667
• The roll of a die is a purely random event
• Elementary outcomes: 1, 2, 3, 4, 5, 6
• This is an equal probability process: each elementary outcome is equally likely
The ‘or’ rule – the probability of a 1 or a 6 on one roll of an honest die = 1/6 + 1/6 = 2/6 = 0.333
Elementary outcomes: 1, 2, 3, 4, 5, 6
The ‘and’ rule – the probability of a 6 and a 6 on the roll of two honest dice = 1/6 × 1/6 = 1/36 = 0.02778
Elementary outcomes (6 × 6 = 36):

                        Die 1
          1     2     3     4     5     6
Die 2  1  1,1   1,2   1,3   1,4   1,5   1,6
       2  2,1   2,2   2,3   2,4   2,5   2,6
       3  3,1   3,2   3,3   3,4   3,5   3,6
       4  4,1   4,2   4,3   4,4   4,5   4,6
       5  5,1   5,2   5,3   5,4   5,5   5,6
       6  6,1   6,2   6,3   6,4   6,5   6,6
Probabilities tell us how often things should happen over the long run.
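As a quick check on the ‘or’ and ‘and’ rules above, here is a short simulation (a sketch, not part of the original notes; assumes Python 3). The long-run frequencies should converge on 1/6, 2/6, and 1/36:

```python
import random

random.seed(42)            # reproducible runs
trials = 1_000_000

six = one_or_six = double_six = 0
for _ in range(trials):
    d1 = random.randint(1, 6)
    d2 = random.randint(1, 6)
    six += (d1 == 6)                     # P(6) = 1/6 ~ 0.1667
    one_or_six += (d1 in (1, 6))         # 'or' rule: 1/6 + 1/6 ~ 0.333
    double_six += (d1 == 6 and d2 == 6)  # 'and' rule: 1/6 * 1/6 ~ 0.0278

print(six / trials, one_or_six / trials, double_six / trials)
```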
Four flips of an ‘honest’ coin and the number of heads
• a binomial process
• both outcomes (heads and tails) are equally likely, so there is a uniform distribution for one flip
• there is NOT a uniform distribution for multiple flips – with 4 flips there are 5 possible outcomes but 16 ways to get them:
Number of heads | Number of ways | Elementary outcomes (16 in all)    | ‘Combined’ probability
0               | 1              | TTTT                               | 0.0625
1               | 4              | HTTT, THTT, TTHT, TTTH             | 0.2500
2               | 6              | HHTT, HTHT, HTTH, THHT, THTH, TTHH | 0.3750
3               | 4              | HHHT, HHTH, HTHH, THHH             | 0.2500
4               | 1              | HHHH                               | 0.0625
Sum             | 16             |                                    | 1.0

The probability of any one of these ways = (0.5 × 0.5 × 0.5 × 0.5) = 1/16 = 0.0625; the ‘combined’ probability for each row is the number of ways times 0.0625.

[Bar chart: “Number of Heads on 4 Flips of a Coin” – probability (0 to 0.40) on the y-axis against the number of heads (0 to 4) on the x-axis]
Here is an alternative way to show the elementary outcomes (there are 16 ‘branches’): [tree diagram not reproduced]
Probability distributions, such as the one above, tell us the relative frequency with which we should expect all
possible outcomes to occur over a large number of trials. They are based on theory rather than empirical data.
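The same distribution can be generated by brute-force enumeration. A minimal sketch (not in the original notes; assumes Python 3):

```python
from itertools import product
from collections import Counter

# Enumerate all 16 elementary outcomes of four flips, counting heads in each.
dist = Counter(outcome.count('H') for outcome in product('HT', repeat=4))

for heads in sorted(dist):
    ways = dist[heads]
    print(heads, ways, ways / 16)   # 0.0625, 0.25, 0.375, 0.25, 0.0625
```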
The Normal Distribution (also known as the Gaussian distribution)
The normal distribution is a theoretical probability distribution – it tells us the expected relative frequency of
every possible outcome.
Properties of the normal distribution:
• Bell-shaped
• Perfectly symmetric:
  o mean = median = mode
  o half of the cases are less than the mean and half of the cases are greater than the mean
A normal distribution (Ȳ = 0, S_Y = 100):

[Figure: bell-shaped curve centered on 0; the x-axis runs from −500 to 500 in steps of 50, and the y-axis shows relative frequency from 0 to about 0.0045]
No empirical distribution ever matches this theoretical distribution perfectly, but some come close. This is a
histogram describing the ‘substantive complexity’ of occupations (the 809 cases are occupations): [histogram not reproduced]
Not all normal distributions are shaped exactly the same. The shape (i.e., the peakedness) is determined by the
size of the standard deviation. For example, these three normal distributions have the same mean (0), but
different standard deviations (50, 100, and 150). The larger the standard deviation, the flatter the curve.
Despite this, they are all normal distributions.
3 Normal Distributions

[Figure: three overlaid normal curves, each with mean 0, with standard deviations of 50, 100, and 150; the x-axis runs from −500 to 500; the SD = 50 curve is the tallest and narrowest, the SD = 150 curve the flattest]
z Scores (a.k.a. Standard Scores)
A z score (or standard score) is the number of standard deviations that a raw score is above or below the mean.
Here is the formula for transforming raw scores into z scores:
yy
z
sy
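For instance, a handful of raw scores can be standardized in a few lines (a sketch, not from the original notes; the scores are made up):

```python
from statistics import mean, stdev

scores = [2, 4, 4, 4, 5, 5, 7, 9]           # hypothetical raw scores
y_bar, s_y = mean(scores), stdev(scores)    # sample mean and standard deviation

z_scores = [(y - y_bar) / s_y for y in scores]
print([round(z, 2) for z in z_scores])      # SDs each score sits from the mean
```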
The standard normal distribution
A standard normal distribution is a normal distribution represented in standard scores. For example, if we
displayed a distribution of raw GRE scores, we would have a (nearly) normal distribution. If we converted the
raw scores to z scores and then displayed the distribution of GRE z scores, then we would have a standard
normal distribution. The standard normal distribution has a mean of 0 and a standard deviation of 1.
A normal distribution:

[Figure: bell curve with the x-axis in raw-score units, running from −500 to 500]

A standard normal distribution:

[Figure: the same bell curve with the x-axis in z-score units, running from −5 to 5]
The normal distribution is useful because...
There is a constant area (a constant proportion of cases) under the curve lying between the mean and any given
distance from the mean when measured in standard deviation units:
 68.26% within 1 standard deviation of the mean (±34.13)
 95.44% within 2 standard deviations of the mean (±47.72)
 99.74% within 3 standard deviations of the mean (±49.87)
One application is that we can determine how unusual any given outcome is:
GRE Scores on Verbal Reasoning (N=1,421,856; July 1, 2005-June 30, 2008)
Mean=457
Standard deviation=121
Minimum=200
Maximum=800
How unusual is a GRE score on verbal reasoning of 699?
699 is 242 points or 2 standard deviations above the mean
How many people score 699 or better?
100 − (50 + 47.72) = 2.28%
Only 2.28% of people score higher than 699; this is a rare event.
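The same tail area can be read off a normal curve directly (a sketch, not part of the original notes; assumes Python 3.8+ for statistics.NormalDist):

```python
from statistics import NormalDist

mu, sigma = 457, 121              # GRE verbal mean and SD from above
gre = NormalDist(mu, sigma)

z = (699 - mu) / sigma            # (699 - 457) / 121 = 2.0 standard deviations
tail = 1 - gre.cdf(699)           # proportion expected to score above 699
print(z, round(tail * 100, 2))    # 2.0 and ~2.28%
```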
Sampling Distributions
Let’s get back to the basic idea of inferential statistics: We use a sample statistic to estimate a population
parameter.
• For example, a sample mean is an estimate of a population mean (it is a ‘point estimate’).
• Due to sampling error, we know that the point estimate is probably incorrect.
• How confident can we be in the estimate from any one sample?
• We use a sampling distribution and the standard error to estimate the magnitude of the sampling error and we take this into account when we draw inferences about the population.
Let’s go over an example (see the Excel file):
The population data (N = 6):

Case   Y    Y − Ȳ    (Y − Ȳ)²
1      0    −1.5     2.25
2      1    −0.5     0.25
3      1    −0.5     0.25
4      2     0.5     0.25
5      2     0.5     0.25
6      3     1.5     2.25
Sum    9     0.0     5.5

Ȳ (population mean) = 1.50
σ_Y (population standard deviation) = 1.05

The distribution of scores in the population:

[Bar chart: scores 0 through 3 on the x-axis, percentage of cases (0% to 35%) on the y-axis]
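A few lines verify the table's arithmetic (a sketch, not in the original notes; note that the notes' 1.05 divides the sum of squares by N − 1 rather than N):

```python
Y = [0, 1, 1, 2, 2, 3]                   # the six population scores

mean = sum(Y) / len(Y)                   # 9 / 6 = 1.50
ss = sum((y - mean) ** 2 for y in Y)     # sum of squared deviations = 5.5
sd = (ss / (len(Y) - 1)) ** 0.5          # ~1.05, matching the notes
print(mean, ss, round(sd, 2))
```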
Now let’s pretend that we don’t know the population mean and standard deviation. We usually don’t know
these (if we did, we wouldn’t need to do the study).
Let’s start by focusing on one sample of 3 people (Cases 1, 2, and 3):
• ȳ = 0.67; this sample mean is our point estimate of the population mean (notice that it is incorrect)
This, however, is only one sample; 19 other samples of 3 cases are possible.
How many possible samples of size n are there from the population of size N?
Here is a helpful formula for finding the answer:
(N choose n) = N! / (n! × (N − n)!)
For example, if there are 6 cases in the population (N=6) and you are drawing samples of 3 (n=3), then there are
20 possible samples:
 6
6!
6 * 5 * 4 * 3 * 2 *1 6 * 5 * 4
  


 20
 3  3! (6  3)! 3 * 2 *1(3 * 2 *1) 3 * 2 *1
Here are all 20 possible sample means (this is a sampling distribution of means). Notice that the sample means
vary across samples:
Samples (n = 3):

Sample   Cases      Scores (y)   Sample mean   Sample SD   Est. standard error
1        1, 2, 3    0, 1, 1      0.67          0.58        0.33
2        1, 2, 4    0, 1, 2      1.00          1.00        0.58
3        1, 2, 5    0, 1, 2      1.00          1.00        0.58
4        1, 2, 6    0, 1, 3      1.33          1.53        0.88
5        1, 3, 4    0, 1, 2      1.00          1.00        0.58
6        1, 3, 5    0, 1, 2      1.00          1.00        0.58
7        1, 3, 6    0, 1, 3      1.33          1.53        0.88
8        1, 4, 5    0, 2, 2      1.33          1.15        0.67
9        1, 4, 6    0, 2, 3      1.67          1.53        0.88
10       1, 5, 6    0, 2, 3      1.67          1.53        0.88
11       2, 3, 4    1, 1, 2      1.33          0.58        0.33
12       2, 3, 5    1, 1, 2      1.33          0.58        0.33
13       2, 3, 6    1, 1, 3      1.67          1.15        0.67
14       2, 4, 5    1, 2, 2      1.67          0.58        0.33
15       2, 4, 6    1, 2, 3      2.00          1.00        0.58
16       2, 5, 6    1, 2, 3      2.00          1.00        0.58
17       3, 4, 5    1, 2, 2      1.67          0.58        0.33
18       3, 4, 6    1, 2, 3      2.00          1.00        0.58
19       3, 5, 6    1, 2, 3      2.00          1.00        0.58
20       4, 5, 6    2, 2, 3      2.33          0.58        0.33
Here is the dilemma: if sample estimates vary, how confident can we be in the estimate from any one sample?
The solution to this dilemma is to use a device known as a sampling distribution to estimate how much
variability we think there is across all possible samples. A sampling distribution is a theoretical probability
distribution that lists all possible sample estimates for the statistic in which we are interested as well as how
often we should expect each estimate to occur (‘means’ in this example). The sample means in the table above make up the sampling distribution of means.
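The whole exercise is easy to reproduce by enumerating the 20 samples (a sketch, not part of the original notes; assumes Python 3):

```python
from itertools import combinations
from statistics import mean

Y = {1: 0, 2: 1, 3: 1, 4: 2, 5: 2, 6: 3}   # case number -> score

# Every possible sample of 3 cases, and the mean of each one.
sample_means = [mean(Y[c] for c in cases) for cases in combinations(Y, 3)]

print(sorted(round(m, 2) for m in sample_means))   # the 20 sample means
print(round(mean(sample_means), 2))                # 1.5 - the 'mean of means'
                                                   # equals the population mean
```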
*Note – in reality, it is often impossible to create a sampling distribution. Think of how difficult it would be to
draw all possible random samples of 2,000 people from the US population. We are doing this to illustrate the
idea behind how the sampling distribution works.
(304,059,724 choose 2,000) = 304,059,724! / (2,000! × (304,059,724 − 2,000)!)
(using population estimates from July 1, 2008)
I also want to point out that we have a sampling distribution of the mean. Sampling distributions exist for all
sample statistics.
The sampling distribution is now its own variable with central tendency and variability:
Sample mean   f    %
0.67          1    5
1.00          4    20
1.33          5    25
1.67          5    25
2.00          4    20
2.33          1    5
Sum           20   100

[Bar chart of the sampling distribution: percentage of samples (0% to 30%) on the y-axis against the six possible sample means on the x-axis]
The mean of the sampling distribution (i.e., the ‘mean of means’) is equal to the population mean (1.50). This
suggests that if we generated all possible samples of the same size, our sample estimates would be correct on
average (this is somewhat reassuring).
The standard error (in this example, the standard error of the mean) describes how much variability there is in
the estimate from sample to sample. The bigger the standard error, the more different are the estimates from
sample to sample. A lot of variability from sample to sample is bad for our confidence. We only ever have one
sample. If we think there is little consistency in the estimate from sample to sample (i.e., high variability), then
we have to be less confident in the one sample estimate that we do have. If we think there is consistency in the
estimate from sample to sample (i.e., low variability), then we can be more confident in the one sample estimate
that we do have.
Here is the formula for calculating the standard error of the mean:
σ_Ȳ = σ_Y / √n, where σ_Y is the population standard deviation and n is the sample size.

For our sampling distribution: σ_Ȳ = 1.05 / √3 = 0.61
In reality we don’t know the population standard deviation, so we use the sample standard deviation to estimate
it (see the table). The logic here is that if there is a lot of variability across cases in the scores in our one sample
(as indicated by a larger standard deviation), then there is probably a lot of variability across cases in the scores
in the population, which would cause there to be a lot of variability in the estimates across samples. So we use
an estimate of the variability within one sample to predict how much variability there is across all samples.
For the first sample of 3 cases: s_ȳ = s_y / √n = 0.58 / √3 = 0.33
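Both calculations in a couple of lines (a sketch, not part of the original notes):

```python
from math import sqrt

n = 3
sigma_Y = 1.05                         # population SD (usually unknown)
s_y = 0.58                             # SD within the first sample of 3 cases

print(round(sigma_Y / sqrt(n), 2))     # 0.61 - the true standard error
print(round(s_y / sqrt(n), 2))         # 0.33 - the estimated standard error
```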
When is the standard error small?
1. When there is little variability across cases in the scores within our one sample (i.e., when there is a
small standard deviation).
2. When the sample size is large.
If you generate your sample using an appropriate method, you don’t have control over the amount of variability
in your sample. You do, however, have control over your sample size. So if you want to have confidence in the
estimate from your one sample, generate a large sample!
Notice that the sampling distribution of the means is more normally distributed than the distribution of scores in
the population (to see this, compare the bar graphs). Also, notice that the variability in the sampling distribution (0.61, or the sample-based estimate 0.33) is smaller than the variability in the scores in the population (1.05).
These basic properties are known as the central limit theorem (closely related to the law of large numbers):
If large simple random samples are drawn from a population (with any distribution), then:
1. The sampling distribution will have a normal distribution.
2. The mean of the sampling distribution will be equal to the population mean.
3. The variance in the sampling distribution is equal to the variance of the population divided by the
sample size.
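A small simulation illustrates the theorem using our six-case population (a sketch, not from the original notes; it samples with replacement for simplicity):

```python
import random
from statistics import mean, stdev

random.seed(1)
population = [0, 1, 1, 2, 2, 3]        # the six-case population from above

# Draw many random samples of 30 and record each sample mean.
means = [mean(random.choices(population, k=30)) for _ in range(10_000)]

print(round(mean(means), 2))           # ~1.50 - the population mean
print(round(stdev(means), 3))          # ~0.175 - population SD (0.96) / sqrt(30)
```

Plotting the list of means as a histogram would show the bell shape promised by point 1, even though the population itself is not normal.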
Let’s return to our basic dilemma: if sample estimates vary and if most estimates result in some degree of
sampling error, how confident can we be in our estimate from the sample?
Estimation – Let’s Put Standard Errors to Good Use
We often use a sample statistic as an estimate of the exact value of a population parameter (called a ‘point
estimate’). For example, we can use the GSS sample data to calculate the mean number of hours people work
in a typical week (ȳ = 41.9 hours). We could then use this as our exact estimate for the population mean.
The problem with this approach is that we don’t know how accurate our estimate is (we haven’t yet considered
how much variability there might be in this estimate from sample to sample) and it is not very likely that the
true population mean is exactly equal to this value because of sampling error (the sample mean is the best guess,
but the chance that it is correct is relatively low; see a normal distribution).
The second type of estimation is confidence interval estimation. A confidence interval is a range of values in
which the population parameter is expected to fall. The width of the range, as we shall see, is determined by
(1) the level of confidence we want to have in our estimate and (2) the standard error.
CI = ȳ ± (z × σ_Ȳ)  or  CI = ȳ ± (z × s_ȳ)
In the social sciences, we are usually satisfied to be 95% certain that the confidence interval contains the true
population parameter. If we want to reduce the risk of calculating an interval that does not contain the
population parameter, then we need to increase the width of the confidence interval by selecting a larger z score.
To be 95% certain, we should use a z score of 1.96.
To be 99% certain, we should use a z score of 2.58.
σ_Ȳ = σ_Y / √n,  s_ȳ = s_y / √n
A lot of variability from sample-to-sample in the estimate (represented by a large standard error) forces us to
have a wider interval. If we want to increase the precision of the estimate (i.e., generating a narrower interval)
without increasing the risk of being wrong, we need to generate a larger sample (this will reduce the standard
error). So standard errors allow us to take uncertainty related to sampling error into account when making
predictions about the population.
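Putting the pieces together for the GSS work-hours example used below (a sketch, not part of the original notes; the mean, SD, and n come from the SPSS output later in this section, so tiny rounding differences are expected):

```python
from math import sqrt

y_bar, s_y, n = 41.90, 13.395, 1818    # GSS work-hours figures (see below)
se = s_y / sqrt(n)                     # estimated standard error ~0.314

lower = y_bar - 1.96 * se              # 95% confidence interval
upper = y_bar + 1.96 * se
print(round(se, 3), round(lower, 2), round(upper, 2))   # 0.314, 41.28, 42.52
```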
Estimation of Population Proportions
Here are the formulas for the confidence interval and standard error:
CI = p ± (z × σ_p)  or  CI = p ± (z × s_p)

σ_p = √( π(1 − π) / n ),  s_p = √( p(1 − p) / n )

The standard error for the proportion describes variation in the estimate of the proportion from sample to sample. π is the population proportion and n is the number of cases in the sample. Since we don't know π, we will use p (the sample proportion) as our estimate of π.
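For example (a sketch with made-up numbers, not from the original notes), a sample proportion of 0.55 from 1,000 cases gives a 95% interval of roughly 0.52 to 0.58:

```python
from math import sqrt

p, n = 0.55, 1000                 # hypothetical sample proportion and size
s_p = sqrt(p * (1 - p) / n)       # estimated standard error ~0.0157

print(round(p - 1.96 * s_p, 3), round(p + 1.96 * s_p, 3))   # 0.519 to 0.581
```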
Confidence Intervals in SPSS
Let’s use SPSS and the 2000 General Social Survey to estimate the mean number of hours worked last week in
the US. Let’s estimate confidence intervals at the 95% level.
95% Confidence Level:
• Launch SPSS and open the data file
• Click Analyze, Descriptive Statistics, and Explore
• Insert work hours (HRS1) into the “Dependent List” box
• Click Statistics and notice that the default setting for the confidence interval is 95%
• Click Continue and OK
Here is the output:
Case Processing Summary

                                        Cases
                        Valid            Missing          Total
                        N      Percent   N      Percent   N      Percent
HRS1 NUMBER OF HOURS    1818   64.5%     999    35.5%     2817   100.0%
WORKED LAST WEEK

Descriptives

HRS1 NUMBER OF HOURS WORKED LAST WEEK       Statistic   Std. Error
Mean                                        41.90       .314
95% Confidence Interval     Lower Bound     41.28
for Mean                    Upper Bound     42.51
5% Trimmed Mean                             41.76
Median                                      40.00
Variance                                    179.430
Std. Deviation                              13.395
Minimum                                     3
Maximum                                     89
Range                                       86
Interquartile Range                         9.00
Skewness                                    .212        .057
Kurtosis                                    1.668       .115
Our point estimate of the population parameter is 41.90 hours. We are 95% confident that the mean number of
hours worked last week in the US population (in 2000) is between 41.28 hours and 42.51 hours. We are 95%
confident because in 95 out of 100 samples, our confidence interval would contain the population parameter.
Remember that we will get different means and standard deviations in different samples due to sampling error.
As a result, we will also get different estimates of the standard error and different confidence intervals. Despite
this, 95 samples out of 100 (assuming 95% confidence) should yield a confidence interval that contains the
population parameter.
Hypothesis Testing
We also use standard errors in hypothesis testing.
Is the mean number of work hours per week in the US population 40 hours per week?
Null Hypothesis, H0: μ_Y = 40
Alternative or Research Hypothesis, H1: μ_Y ≠ 40
Sample mean (ȳ) = 41.9
Sample size (valid n) = 1,818
s_ȳ = s_y / √n = 13.395 / √1,818 = 0.314

z = (ȳ − μ_Y) / s_ȳ = (41.9 − 40.0) / 0.314 = 6.051
How often would we get this z score if the null hypothesis is true?
p ≈ 0.0000000014 (two-tailed)
or about 1.4 times per 1,000,000,000 trials
We are either extraordinarily unlucky OR μ_Y = 40 is not a good assumption.
In this example, we used the standard error to take into account uncertainty in our estimate (from sampling
error). If we believe that the estimate of the mean might vary quite a bit across samples (which would be
represented by a large standard error), then the observed z score would be smaller. This would make it more
difficult to reject the null hypothesis (because smaller observed z scores are closer to the middle of the
distribution). In other words, by dividing by the standard error, we are able to take sampling error into account.
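The test is a one-liner once the standard error is in hand (a sketch, not part of the original notes; the notes' z of 6.051 uses the rounded SE of 0.314):

```python
from math import sqrt
from statistics import NormalDist

y_bar, s_y, n, mu_0 = 41.9, 13.395, 1818, 40.0

se = s_y / sqrt(n)                      # ~0.314
z = (y_bar - mu_0) / se                 # ~6.05

p = 2 * (1 - NormalDist().cdf(z))       # two-tailed p-value
print(round(z, 3), p)                   # ~6.048 and p ~1.4e-09
```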
One sample t (and z) test in SPSS (you would get the same results listed above)
• Click Analyze, Compare Means, and One-Sample T Test
• Insert work hours (HRS1) into the “Test Variable(s):” box
• Insert the hypothesized value (in this example, 40) into the “Test Value:” box
• Click OK
In sum, just about every sample statistic that we will calculate has a standard error. All standard errors describe
how much variability in the estimate there might be from sample to sample. We use this information to adjust
our predictions – for example, by making our confidence intervals wider or our z scores smaller; in both cases,
this means being more conservative in our predictions about the population.