Single Sample Statistical Inference, Illustrated
Richard Read Allen, Peak Stat, Evergreen, CO
ABSTRACT
This is a basic statistical tutorial illustrating concepts of statistical inference on a single sample, including the Central
Limit Theorem, Estimation, Confidence Intervals, Hypothesis Testing and Power. SAS proc surveyselect will be used
to generate samples from various populations to illustrate these concepts graphically and conceptually.
INTRODUCTION
Inferential statistics are used to draw inferences about a population from a sample. Sample data, when randomly
selected, has the capability of mirroring the population from which it was collected. Through the use of statistics we
can infer attributes and make predictions about an unknown population, with some degree of confidence, as if we
measured the population itself.
ESTIMATION
In estimation, the sample is used to estimate a parameter and a confidence interval about the estimate is constructed.
With statistics called point estimates, we can estimate the different parameters of a population. The mean of a population is µ (mu), and its estimate from a sample is Ȳ (the sample mean). The standard deviation of the population is σ (sigma), and its estimate in the sample is s.
THE CENTRAL LIMIT THEOREM
In a population with mean µ and standard deviation σ, the sampling distribution of the means
1. will have mean value of µ and standard deviation of σ/√n.
2. will approach a normal distribution as the size of each sample increases and as the number of sample means used to generate the distribution increases.
The underlying population from which the samples were drawn does not need to be normal itself, but the closer it is to normal, the smaller the sample size and number of samples needed to approach normality. Samples of size 30 are generally considered adequate for most populations.
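Before the SAS illustrations, the theorem's first claim can be sanity-checked in a few lines of Python (an illustrative sketch only; the uniform population mirrors the data step that follows):

```python
import random
import statistics

random.seed(1)
# Uniform population of 0..50, 100 copies of each value, like the SAS data step
population = [i for i in range(0, 51) for _ in range(100)]
mu = statistics.mean(population)          # 25
sigma = statistics.pstdev(population)     # about 14.7

n, reps = 30, 5000
# Sample means from repeated samples without replacement (surveyselect's default)
sample_means = [statistics.mean(random.sample(population, n)) for _ in range(reps)]

print(round(statistics.mean(sample_means), 2))    # near mu = 25
print(round(statistics.pstdev(sample_means), 2))  # near sigma/sqrt(n), about 2.69
```

The finite-population correction is negligible here because the population (5,100 values) is much larger than the sample.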
ILLUSTRATIONS
Four different populations of various shapes will be sampled to estimate the population mean and illustrate the
distribution of the sample mean according to the Central Limit Theorem.
/*-----------------------------------------------------*/
/* Create a uniform population                          */
/*-----------------------------------------------------*/
data uniform;
do i=0 to 50;
do j=1 to 100;
x=i;
output;
end;
end;
run;
ods select Moments BasicMeasures;
proc univariate data=uniform;
var x;
run;
proc gchart data=uniform;
vbar x / space=0 midpoints=0 to 50 by 1;
title "Uniform Population Distribution";
run;
quit;
/*-----------------------------------------------------*/
/* Create a triangular population                       */
/*-----------------------------------------------------*/
data triangle;
do i=0 to 99;
do j=1 to 100;
x=i;
if j<=i+1 then output;
end;
end;
run;
ods select Moments BasicMeasures;
proc univariate data=triangle;
var x;
run;
proc gchart data=triangle;
vbar x / space=0 midpoints=0 to 99 by 1;
title "Triangular Population Distribution";
run;
quit;
/*-----------------------------------------------------*/
/* Create a v-shaped population                         */
/*-----------------------------------------------------*/
data v_shaped;
do i=0 to 99;
do j=1 to 100;
x=i;
if j>=i+1 then output;
end;
end;
do i=100 to 199;
do j=101 to 200;
x=i;
if j<=i+1 then output;
end;
end;
run;
ods select Moments BasicMeasures;
proc univariate data=v_shaped;
var x;
run;
proc gchart data=v_shaped;
vbar x / space=0 midpoints=0 to 199 by 1;
title "V-shaped Population Distribution";
run;
quit;
/*-----------------------------------------------------*/
/* Population of hourly wages                           */
/*-----------------------------------------------------*/
ods select Moments BasicMeasures;
proc univariate data=saved.wages;
var wage;
run;
proc gchart data=saved.wages;
vbar wage / space=0 midpoints=0 to 50 by 0.5;
title "Population Distribution of Wages in 1985";
run;
THE FOLLOWING FIGURES SHOW THE DISTRIBUTIONS OF THE POPULATIONS GENERATED BY THE ABOVE DATA
The following macro generates the sampling distribution of the mean from each of the above populations, replicating each sample size (n) 1,000 times. This version of the macro generates the sampling distribution of the mean for the wages population.
proc format;
value $pop
'uniform'='0 to 50 by 1'
'triangle'='0 to 99 by 1'
'v_shaped'='0 to 199 by 1'
'wages'='0 to 50 by 0.5'
;
run;
%macro SampleMeanDistribution(Pop,Reps,Size);
proc surveyselect data=&Pop
out=sample
sampsize=&Size
reps=&Reps;
run;
proc summary data=sample nway;
class replicate;
var x;
output out=SampleMeans(drop=_:) mean=SampleMean;
run;
proc gchart data=SampleMeans;
vbar SampleMean / space=0
midpoints=%sysfunc(putc(&Pop,$pop.));
title1 "Distribution of Sample Means - &pop Population";
title2 "&Reps Replicates of Sample Size &Size" ;
run;
quit;
%mend;
%SampleMeanDistribution(wages,1000,2);
%SampleMeanDistribution(wages,1000,5);
%SampleMeanDistribution(wages,1000,10);
%SampleMeanDistribution(wages,1000,20);
%SampleMeanDistribution(wages,1000,30);
%SampleMeanDistribution(wages,1000,40);
We can use the same macro to generate sampling distributions for the uniform, triangle and v_shaped populations.
Note how in each case, as the sample size increases, the distribution of the sample means approaches normality, just as the Central Limit Theorem assures us. For distributions farthest from normal, larger sample sizes are required, but in all cases by n=30 it is apparent that the distribution of the sample means is approximately normal regardless of the form of the original population:
Recall that the Central Limit Theorem also says that the sampling distribution of the mean not only approaches
normality as the sample size increases, but also for a fixed sample size as the number of sample means used to
generate the distribution increases. We use a sample size of 30 to illustrate this for these populations.
%SampleMeanDistribution(wages,10,30);
%SampleMeanDistribution(wages,100,30);
%SampleMeanDistribution(wages,500,30);
%SampleMeanDistribution(wages,1000,30);
%SampleMeanDistribution(wages,10000,30);
%SampleMeanDistribution(wages,100000,30);
CALCULATING A CONFIDENCE INTERVAL
Since the estimate of a statistic measured on one sample can rarely be expected to exactly equal the population parameter, one should include a statement of the precision of the estimate. A confidence interval gives an estimated range that is likely to include the unknown population parameter, along with the probability (confidence level) that such an interval contains it.
If independent samples are taken repeatedly from the same population and a confidence interval calculated for each
sample, then a certain percentage (confidence level) of the intervals will include the unknown population parameter.
Confidence intervals are usually calculated so that this percentage is 95% but other levels can be chosen.
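This coverage interpretation can be checked quickly outside SAS. Here is a minimal Python sketch using a normal population with known σ and z-based intervals (an assumption for simplicity; the paper's own SAS illustration below uses the wages data and t-based limits):

```python
import random
import statistics

random.seed(4)
mu, sigma, n, reps = 9, 5, 30, 4000          # hypothetical population values
z = statistics.NormalDist().inv_cdf(0.975)   # Z_.025 = 1.96

covered = 0
for _ in range(reps):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    half = z * sigma / n ** 0.5              # half-width of the 95% interval
    m = statistics.mean(sample)
    covered += (m - half) <= mu <= (m + half)

print(covered / reps)   # close to 0.95
```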
The Central Limit Theorem tells us that the sampling distribution of the mean tends to normality with mean of µ and
standard deviation of σ/√n. In the standard normal distribution, 95% of the observations lie within 1.96 standard deviations of the mean. This value is also referred to as Z.025, the critical value that cuts off an area of .025 in each tail; the 5% error (α) is divided equally between the two tails of the symmetric standard normal distribution.
This leads us to say that in 95% of the samples we will observe a standard normal score between -Z.025 and Z.025, or

-Z.025 <= Z <= Z.025

-Z.025 <= (Ȳ - µ) / (σ/√n) <= Z.025

giving us the following 95% confidence limits on the sample mean:

UL = Ȳ + Z.025 (σ/√n)   (upper limit boundary)
LL = Ȳ - Z.025 (σ/√n)   (lower limit boundary)
This is the same as saying that in 95% of the samples the interval above will contain the population mean (µ).
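As a quick numeric sketch of these limits (the values of Ȳ and σ below are hypothetical, loosely echoing the wages data used later, and σ is assumed known):

```python
import statistics

ybar = 9.02      # sample mean (hypothetical)
sigma = 5.14     # population standard deviation (assumed known here)
n = 30
z = statistics.NormalDist().inv_cdf(0.975)   # Z_.025 = 1.96

half_width = z * sigma / n ** 0.5
lower, upper = ybar - half_width, ybar + half_width
print(round(lower, 2), round(upper, 2))      # 7.18 10.86
```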
Usually the standard deviation of the population, σ, is unknown. In this case, we need to use the estimate s in the
standardization formula:
t = (Ȳ - µ) / (s/√n)
This statistic follows the Student’s t-distribution. When it is inverted similar to above, we get the following limits for our
confidence interval:
UL = Ȳ + t.025 (s/√n)   (upper limit boundary)
LL = Ȳ - t.025 (s/√n)   (lower limit boundary)
The t-distributions are symmetric and depend on the degrees of freedom, n-1 in this case. Since t is affected by the variance and the sample size (n), there is a different t-distribution for every size of sample taken. The sampling distribution of t has greater dispersion than the normal distribution. If n is very large, then s can be expected to be very close to σ and the t-distribution very close to normal.
The degrees of freedom represent the number of values in a set of data that are free to vary after certain restrictions have been placed on the data. Suppose one is trying to construct a set of 3 sample values such that the mean of the set is 7. Under this restriction, 2 of the 3 values can take any values; the third is then predetermined because the three must sum to 21. Therefore, once the size of a sample is set and the mean assumes a specific value, only n-1 scores are free to vary.
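The three-value example above, in code:

```python
# Fix the mean of 3 values at 7, choose any 2 freely; the third is forced.
n, target_mean = 3, 7
free = [4, 9]                         # any two values, chosen freely
third = n * target_mean - sum(free)   # the remaining value is determined
print(third)                          # 8, so that (4 + 9 + 8) / 3 == 7
```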
To calculate the upper and lower limits of the confidence interval, we can use options on the proc summary/means statement:
ALPHA=  sets the alpha level (default .05)
LCLM    lower confidence limit of the mean
UCLM    upper confidence limit of the mean
ILLUSTRATION:
The following code generates 1,000,000 samples of various sizes and finds the 95% confidence interval for the mean of each sample. The population mean is compared to this interval to see if it is contained between the limits, and the percentage of intervals containing the mean is recorded for each power-of-10 number of replicates. Notice that as the sample size and the number of repetitions of each sample size grow, the percentage of sample confidence intervals that contain the population mean approaches 95%.
proc summary data=saved.wages nway print;
var wage;
output out=PopMean(drop=_:) mean=PopMean std=sd;
run;
%macro CI(SampSize);
proc surveyselect data=saved.wages(keep=wage)
out=sample
sampsize=&SampSize
rep=1000000;
run;
proc summary data=sample nway;
class replicate;
var wage;
output out=SampleStats(drop=_:) lclm=Lower uclm=Upper;
run;
data CIs;
if _n_=1 then set PopMean;
set SampleStats;
SampSize=&SampSize;
if Lower<=PopMean<=Upper then Contains=1;
else Contains=0;
do i=1 to 6;
Power10=i;
if Replicate<=10**i then output;
end;
run;
proc append base=All_CIs data=CIs;
quit;
%mend;
%CI(10);
%CI(20);
%CI(30);
%CI(40);
%CI(50);
proc report data=All_CIs nowd;
col Power10 SampSize, Contains;
define Power10 / group;
define SampSize / across;
define Contains / analysis mean f=percent12.8 ' ';
run;
The SUMMARY Procedure

Analysis Variable : wage

        N            Mean         Std Dev         Minimum         Maximum
-------------------------------------------------------------------------
      534       9.0240637       5.1390969       1.0000000      44.5000000
-------------------------------------------------------------------------
                                  SampSize
  Power10          10          20          30          40          50
        1   90.000000%  90.000000%  100.00000%  100.00000%  100.00000%
        2   89.000000%  92.000000%  94.000000%  98.000000%  92.000000%
        3   91.800000%  92.900000%  94.600000%  95.400000%  94.500000%
        4   92.090000%  93.280000%  94.030000%  94.590000%  95.090000%
        5   92.116000%  93.351000%  93.941000%  94.661000%  95.078000%
        6   91.875200%  93.283100%  94.077800%  94.663000%  95.036900%
INTRODUCTION TO HYPOTHESIS TESTING
In the most common use of hypothesis testing, a "straw man" null hypothesis is put forward and it is determined
whether the data are strong enough to reject it. This hypothesis can be defined as a statement about a population
parameter in a single population or about a comparison of parameters from two or more populations. Sample
estimates are used to evaluate the hypothesis about the population parameter.
THE NULL AND ALTERNATIVE HYPOTHESES
The null hypothesis is a statement of no effect or no difference in a population parameter. This “straw man” is
generally expected to be rejected.
The alternative hypothesis represents a statistical statement that indicates a presence of an effect or difference. The
researcher generally expects the alternative hypothesis to be supported.
The decision on whether or not to reject the null hypothesis is based on comparing the results of an experiment with
the outcome that one could expect if the null hypothesis were true. This decision is made using the appropriate
inferential statistical test based on a designated test statistic, which is evaluated in reference to the theoretical
sampling distribution of the statistic.
The probabilities in the sampling distribution are based on the assumption that the sample is randomly selected from
the population it represents. The probability value obtained from the sample is compared to a cutoff probability (level of significance, alpha) to weigh the evidence against the null hypothesis. An alpha of .05 is commonly chosen for the level of significance and is normally the default in SAS. The p-values produced by SAS procedures are interpreted as the probability, under the null hypothesis, of obtaining a test statistic at least as extreme as the observed value.
The null hypothesis is only rejected when we have evidence beyond a reasonable doubt that a true difference or
association exists in the population from which we drew our random sample. If the probability or p-value is smaller
than the cutoff alpha, then this means that the test statistic of the sample falls further out in the tail of the sampling
distribution or further away from the hypothesized value of the population parameter. It then would be likely that the
sample is more representative of another alternative population since the value of the test statistic is so unlikely to
occur in the defined population by chance.
TESTS OF HYPOTHESIS ON THE MEAN
The sampling distribution of Ȳ is used as the basis for testing hypotheses about the mean parameter of a population, µ. When the population standard deviation, σ, is unknown, the test statistic is the t statistic

t = (Ȳ - µ0) / (s/√n)

This statistic has n-1 degrees of freedom, is asymptotically normal, and tests µ = µ0.
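A hand computation of this statistic, with a small hypothetical sample:

```python
import statistics

sample = [8.1, 9.5, 7.7, 10.2, 9.0, 8.8, 11.3, 7.9, 9.6, 8.4]  # hypothetical values
mu0 = 10                       # hypothesized population mean
n = len(sample)
ybar = statistics.mean(sample)
s = statistics.stdev(sample)   # sample standard deviation (n-1 denominator)

t = (ybar - mu0) / (s / n ** 0.5)
print(round(t, 2))             # about -2.67: the sample mean sits below mu0
```

Whether this t leads to rejection depends on the critical value of the t-distribution with n-1 = 9 degrees of freedom at the chosen alpha.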
The hypotheses for a two sided test are represented as:
Null Hypothesis          H0: µ = µ0
Alternative Hypothesis   H1: µ ≠ µ0
The two sided test covers both tails of the distribution and does not define which tail of the distribution the test statistic
may fall in. The hypothesis states whether the test statistic does or does not fall between the tails of the sampling
distribution.
The following is a depiction of the two-sided hypotheses:

Null (Ȳ falls inside the central 95% around µ0):

             Ȳ      µ0
    ______(__|_______|__________________)______
     2.5%  ^------------ 95% ----------^  2.5%

Alternative (Ȳ falls in one of the 2.5% tails):

                    µ0                     Ȳ
    ______(_________|___________________)__|___
     2.5%  ^------------ 95% ----------^  2.5%
A one-sided test would state the hypotheses as:

Null Hypothesis          H0: µ >= µ0
Alternative Hypothesis   H1: µ < µ0

This one-sided test rejects the null hypothesis that µ >= µ0 when the sample statistic falls to the left of the cutoff point in the lower tail, making the sample unacceptable as a representative of the null population.
The following is a depiction of the one-sided hypotheses:

Null (Ȳ falls above the lower cutoff):

               Ȳ      µ0
    ________(__|_______|_______________________
       5%    ^-------------- 95% --------------

Alternative (Ȳ falls in the lower 5% tail):

        Ȳ             µ0
    ____|___(__________|_______________________
       5%    ^-------------- 95% --------------
POSSIBILITIES OF ERROR
There are possibilities of error when making a decision about your hypotheses. The error possibilities are called Type
I Error (level of significance, α) and Type II Error (β).
When you decide to reject the null hypothesis because the test statistic falls outside the limits you have set, it is possible that you have simply drawn an unlikely sample from the null population. Your test indicates with 95% confidence that the sample most likely represents another population, but there is a 5% chance that this conclusion is incorrect. This is a Type I Error: no difference exists, but you incorrectly reject the null hypothesis and conclude that one does.
The Type II Error is the failure to reject the null hypothesis of no difference when a difference actually exists. This is equivalent to concluding that a true alternative hypothesis is false and is represented by β. Its complement (1-β) is the likelihood of rejecting a false null hypothesis and is called the power of the statistical test.
For a particular sample size, these two errors are inversely related. As the probability of one type of error decreases,
the other increases. With an increase in sample size, one can decrease the probability of a Type II Error (and
therefore increase power) while keeping the selected probability of Type I Error or alpha value.
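This relationship between sample size and power can be sketched with a small Monte Carlo simulation (Python, assuming a normally distributed population with mean 9 and an approximate z cutoff; the SAS program below uses the actual wages population and exact t tests):

```python
import random
import statistics

random.seed(2)
mu_true, sigma, mu0 = 9, 5, 11   # true mean, SD, and (false) hypothesized mean
cutoff = 1.96                    # approximate two-sided cutoff for alpha = .05

def rejects(n, reps=2000):
    """Fraction of replicates in which H0: mu = mu0 is (correctly) rejected."""
    hits = 0
    for _ in range(reps):
        sample = [random.gauss(mu_true, sigma) for _ in range(n)]
        t = (statistics.mean(sample) - mu0) / (statistics.stdev(sample) / n ** 0.5)
        hits += abs(t) > cutoff
    return hits / reps

low, high = rejects(10), rejects(50)
print(low, high)   # power grows with the sample size
```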
ILLUSTRATION:
The following program illustrates power by choosing 100000 replicates of samples of various sizes, then calculating a
t-test for each replicate for a value of µ different from the actual population value of 9.
The percentage of times the null hypothesis is correctly rejected simulates the power for each test. Notice that the
power increases as the sample size increases for a fixed value of alpha and the null hypothesis.
Power also increases as the hypothesized mean moves further from the actual population mean, i.e., we have more power to detect larger differences. For the same sample sizes and null hypothesis, power decreases if alpha is decreased; the β error increases, showing how the two errors are inversely related.
%macro Power(size=30,h0=10,alpha=0.05);
ods listing close;
proc surveyselect data=saved.wages(keep=wage)
out=sample
sampsize=&size
reps=100000;
run;
ods output Statistics=Stats ttests=pvalues;
proc ttest data=sample h0=&h0 alpha=&alpha;
by replicate;
var wage;
run;
ods listing;
data Pow;
set pvalues;
Size=&size;
Alpha=&alpha;
H0=&h0;
if probt<=&alpha then Reject=1;
else Reject=0;
run;
proc append base=Power data=Pow;
quit;
%mend;
%Power(size=10,h0=11);
%Power(size=20,h0=11);
%Power(size=30,h0=11);
%Power(size=40,h0=11);
%Power(size=50,h0=11);
%Power(size=10,h0=11.5);
%Power(size=20,h0=11.5);
%Power(size=30,h0=11.5);
%Power(size=40,h0=11.5);
%Power(size=50,h0=11.5);
%Power(size=10,h0=11.5,alpha=0.01);
%Power(size=20,h0=11.5,alpha=0.01);
%Power(size=30,h0=11.5,alpha=0.01);
%Power(size=40,h0=11.5,alpha=0.01);
%Power(size=50,h0=11.5,alpha=0.01);
proc report data=Power nowd;
col H0 Alpha Size Reject;
define H0 / group;
define Alpha / group;
define Size / group;
define Reject / analysis mean f=percent8.4;
break after H0 / skip;
break after Alpha / skip;
run;
Output

    H0    Alpha    Size    Reject
    11     0.05      10    31.38%
                     20    44.52%
                     30    54.95%
                     40    67.28%
                     50    74.44%

  11.5     0.01      10    22.80%
                     20    36.59%
                     30    51.78%
                     40    63.65%
                     50    73.07%

           0.05      10    38.24%
                     20    57.96%
                     30    72.01%
                     40    80.62%
                     50    87.71%
ESTIMATION OF A SAMPLE PROPORTION
The sampling distribution of the mean as described in the Central Limit Theorem can also be used to estimate the
proportion of individuals in a population having a certain characteristic. If a value of 1 is assigned to those individuals
in the sample with the characteristic and a value of 0 is assigned to those without, then the sample mean, Ȳ, is an estimate of the population proportion, p.
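A minimal sketch of this 0/1 coding (Python; the p=0.7 population mirrors the SAS binomial data step below):

```python
import random
import statistics

random.seed(3)
# Population with 70% having the characteristic (coded 1) and 30% without (coded 0)
population = [1] * 7000 + [0] * 3000
sample = random.sample(population, 30)
p_hat = statistics.mean(sample)   # the sample mean estimates p = 0.7
print(p_hat)
```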
ILLUSTRATION:
A population is created with p=0.7. This population is sampled to illustrate how the distribution of the sample means
appears for various sample sizes. Samples of each size are replicated 1000 times to obtain a picture of the shape of
the distribution.
data binomial;
do i=1 to 3000;
x=0; output;
end;
do i=1 to 7000;
x=1; output;
end;
run;
%macro SampleBinomialDistribution(Reps,Size);
proc surveyselect data=binomial
out=sample
sampsize=&Size
reps=&Reps;
run;
proc summary data=sample nway;
class replicate;
var x;
output out=SampleProportions(drop=_:) mean=SampleProportion;
run;
proc gchart data=SampleProportions;
vbar SampleProportion / space=0
midpoints=0 to 1 by %sysevalf(1/&Size);
title1 "Distribution of Sample Proportions";
title2 "&Reps Replicates of Sample Size &Size" ;
run;
quit;
%mend;
%SampleBinomialDistribution(1000,10);
%SampleBinomialDistribution(1000,15);
%SampleBinomialDistribution(1000,20);
%SampleBinomialDistribution(1000,25);
%SampleBinomialDistribution(1000,30);
%SampleBinomialDistribution(1000,40);
Note that as the sample size increases, the sampling distribution of the means approaches a normal distribution with
mean at p=0.7.
The Central Limit Theorem also tells us that as the number of samples increases for a fixed sample size (30), the
distribution of the sample means approaches a normal distribution.
CONCLUSION:
The Central Limit Theorem allows us to make inferences about the mean of a population based on the mean of a
sample and its distribution. The sampling distribution of the mean is asymptotically normal no matter what the shape
of the original population from which the sample is taken.
However, the sample that we’ve observed is either representative of the population or not. Any inferences that we
make about that sample are based on what would occur upon repeated sampling from the same population.
REFERENCES:
Dixon, W.J. & Massey, F.J. (1969), Introduction to Statistical Analysis, 3rd edition, McGraw-Hill, Inc.
Sheskin, D.J. (1997), Handbook of Parametric and Nonparametric Statistical Procedures, CRC Press, Inc.
Destiny Corporation (1997), Basic Statistics Using SAS/STAT Software.
SAS Institute Inc. (2005), SAS OnlineDoc® 9.1.3, Cary, NC: SAS Institute Inc.
AUTHOR CONTACT INFORMATION
Richard Read Allen
Peak Statistical Services
Evergreen, CO
www.peakstat.com
email: [email protected]