Single Sample Statistical Inference, Illustrated

Richard Read Allen, Peak Stat, Evergreen, CO

ABSTRACT
This is a basic statistical tutorial illustrating concepts of statistical inference on a single sample, including the Central Limit Theorem, estimation, confidence intervals, hypothesis testing, and power. SAS PROC SURVEYSELECT will be used to generate samples from various populations to illustrate these concepts graphically and conceptually.

INTRODUCTION
Inferential statistics are used to draw inferences about a population from a sample. Sample data, when randomly selected, have the capability of mirroring the population from which they were collected. Through the use of statistics we can infer attributes and make predictions about an unknown population, with some degree of confidence, as if we had measured the population itself.

ESTIMATION
In estimation, the sample is used to estimate a parameter, and a confidence interval about the estimate is constructed. With statistics called point estimates, we can estimate the different parameters of a population. The mean of a population is µ (mu), and the estimate of that population mean using a sample is Ȳ. The standard deviation of a population is σ (sigma), and its estimate in the sample is s.

THE CENTRAL LIMIT THEOREM
In a population with mean µ and standard deviation σ, the sampling distribution of the means
1. will have mean value µ and standard deviation σ/√n, and
2. will approach a normal distribution as the size of each sample increases and as the number of sample means used to generate the distribution increases.
The underlying population from which the samples are drawn does not need to be normal itself, but the closer it is to normal, the smaller the sample size and number of samples necessary to approach normality. Samples of size 30 are generally considered adequate for most populations.
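Both parts of the theorem can be checked with a quick simulation. The sketch below uses Python rather than SAS, purely as a supplementary illustration: it draws repeated samples from a uniform population like the paper's first example and compares the mean and standard deviation of the sample means against µ and σ/√n.

```python
import random
import statistics

random.seed(1)

# Uniform population: the values 0..50, each appearing 100 times
population = [i for i in range(51) for _ in range(100)]
mu = statistics.mean(population)        # population mean (25.0)
sigma = statistics.pstdev(population)   # population standard deviation

n, reps = 30, 5000
sample_means = [statistics.mean(random.sample(population, n))
                for _ in range(reps)]

# Claim 1: the sample means center on mu with spread sigma/sqrt(n)
print(round(statistics.mean(sample_means), 2), round(mu, 2))
print(round(statistics.stdev(sample_means), 2), round(sigma / n**0.5, 2))
```

The printed pairs should nearly agree, and a histogram of `sample_means` would show the bell shape the theorem predicts.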
ILLUSTRATIONS
Four different populations of various shapes will be sampled to estimate the population mean and illustrate the distribution of the sample mean according to the Central Limit Theorem.

/*-----------------------------------------------------*/
/* Create a uniform population                         */
/*-----------------------------------------------------*/
data uniform;
  do i=0 to 50;
    do j=1 to 100;
      x=i;
      output;
    end;
  end;
run;

ods select Moments BasicMeasures;
proc univariate data=uniform;
  var x;
run;

proc gchart data=uniform;
  vbar x / space=0 midpoints=0 to 50 by 1;
  title "Uniform Population Distribution";
run;
quit;

/*-----------------------------------------------------*/
/* Create a triangular population                      */
/*-----------------------------------------------------*/
data triangle;
  do i=0 to 99;
    do j=1 to 100;
      x=i;
      if j<=i+1 then output;
    end;
  end;
run;

ods select Moments BasicMeasures;
proc univariate data=triangle;
  var x;
run;

proc gchart data=triangle;
  vbar x / space=0 midpoints=0 to 99 by 1;
  title "Triangular Population Distribution";
run;
quit;

/*-----------------------------------------------------*/
/* Create a v-shaped population                        */
/*-----------------------------------------------------*/
data v_shaped;
  do i=0 to 99;
    do j=1 to 100;
      x=i;
      if j>=i+1 then output;
    end;
  end;
  do i=100 to 199;
    do j=101 to 200;
      x=i;
      if j<=i+1 then output;
    end;
  end;
run;

ods select Moments BasicMeasures;
proc univariate data=v_shaped;
  var x;
run;

proc gchart data=v_shaped;
  vbar x / space=0 midpoints=0 to 199 by 1;
  title "V-shaped Population Distribution";
run;
quit;

/*-----------------------------------------------------*/
/* Population of hourly wages                          */
/*-----------------------------------------------------*/
ods select Moments BasicMeasures;
proc univariate data=saved.wages;
  var wage;
run;

proc gchart data=saved.wages;
  vbar wage / space=0 midpoints=0 to 50 by 0.5;
  title "Population Distribution of Wages in 1985";
run;

THE FOLLOWING FIGURES SHOW THE DISTRIBUTIONS OF THE POPULATIONS GENERATED BY THE ABOVE DATA STEPS
[Figures omitted: histograms of the uniform, triangular, v-shaped, and wages populations.]

The following macro generates the sampling distribution of the mean from each of the above populations, replicating each sample size (n) 1000 times. The calls shown generate the sampling distribution of the mean for the wages population.

proc format;
  value $pop
    'uniform'  = '0 to 50 by 1'
    'triangle' = '0 to 99 by 1'
    'v_shaped' = '0 to 199 by 1'
    'wages'    = '0 to 50 by 0.5'
  ;
run;

%macro SampleMeanDistribution(Pop,Reps,Size);
  proc surveyselect data=&Pop out=sample sampsize=&Size rep=&Reps;
  run;

  proc summary data=sample nway;
    class replicate;
    var x;
    output out=SampleMeans(drop=_:) mean=SampleMean;
  run;

  proc gchart data=SampleMeans;
    vbar SampleMean / space=0 midpoints=%sysfunc(putc(&Pop,$pop.));
    title1 "Distribution of Sample Means - &Pop Population";
    title2 "&Reps Replicates of Sample Size &Size";
  run;
  quit;
%mend;

%SampleMeanDistribution(wages,1000,2);
%SampleMeanDistribution(wages,1000,5);
%SampleMeanDistribution(wages,1000,10);
%SampleMeanDistribution(wages,1000,20);
%SampleMeanDistribution(wages,1000,30);
%SampleMeanDistribution(wages,1000,40);

We can use the same macro to generate sampling distributions for the uniform, triangle, and v_shaped populations. Note in each case how, as the sample size increases, the distribution of the sample means approaches normality, just as the Central Limit Theorem assures us. For distributions farthest from normal, larger sample sizes are required, but in all cases when n=30 it is apparent that the distribution of the sample means is normal regardless of the form of the original population.

[Figures omitted: sampling distributions of the mean for each population at increasing sample sizes.]

Recall that the Central Limit Theorem also says that the sampling distribution of the mean approaches normality not only as the sample size increases, but also, for a fixed sample size, as the number of sample means used to generate the distribution increases. We use a sample size of 30 to illustrate this for these populations.
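For readers following along without SAS, the core of the macro's logic (draw a number of replicate samples of a given size and average each one) can be sketched in Python; summary statistics stand in for the GCHART histogram, and the triangular population mirrors the earlier DATA step.

```python
import random
import statistics

def sample_mean_distribution(pop, reps, size):
    """Draw `reps` simple random samples of `size` observations
    each and return the mean of every sample."""
    return [statistics.mean(random.sample(pop, size)) for _ in range(reps)]

random.seed(2)
# Triangular population, mirroring the earlier DATA step:
# the value i appears i+1 times for i = 0..99
triangle = [i for i in range(100) for _ in range(i + 1)]

means = sample_mean_distribution(triangle, reps=1000, size=30)
print(round(statistics.mean(means), 1))   # near the population mean of 66
print(round(statistics.stdev(means), 2))  # near sigma / sqrt(30)
```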
%SampleMeanDistribution(wages,10,30);
%SampleMeanDistribution(wages,100,30);
%SampleMeanDistribution(wages,500,30);
%SampleMeanDistribution(wages,1000,30);
%SampleMeanDistribution(wages,10000,30);
%SampleMeanDistribution(wages,100000,30);

CALCULATING A CONFIDENCE INTERVAL
Since the estimate of a statistic measured on one sample can rarely be expected to be exactly the same as the population parameter, one should include a statement of the precision of the estimate. A confidence interval gives an estimated range that is likely to include the unknown population parameter, together with the likelihood of that interval containing the population parameter. If independent samples are taken repeatedly from the same population and a confidence interval is calculated for each sample, then a certain percentage (the confidence level) of the intervals will include the unknown population parameter. Confidence intervals are usually calculated so that this percentage is 95%, but other levels can be chosen.

The Central Limit Theorem tells us that the sampling distribution of the mean tends to normality with mean µ and standard deviation σ/√n. In the standard normal distribution, 95% of the observations lie within 1.96 standard deviations of the mean. This value is also referred to as Z.025, the Z critical value for upper and lower tail areas of .025, the 5% error (α) being divided equally into the two tails of the symmetric standard normal distribution. This leads us to say that in 95% of the samples we will observe a standard normal score between -Z.025 and Z.025, or

   -Z.025 <= Z <= Z.025

   -Z.025 <= (Ȳ - µ) / (σ/√n) <= Z.025

giving us the following 95% confidence limits on the sample mean:

   UL = Ȳ + Z.025 (σ/√n)   (upper limit boundary)
   LL = Ȳ - Z.025 (σ/√n)   (lower limit boundary)

This is the same as saying that in 95% of the samples the interval above will contain the population mean (µ). Usually the standard deviation of the population, σ, is unknown.
In this case, we need to use the estimate s in the standardization formula:

   t = (Ȳ - µ) / (s/√n)

This statistic follows Student's t-distribution. When it is inverted as above, we get the following limits for our confidence interval:

   UL = Ȳ + t.025 (s/√n)   (upper limit boundary)
   LL = Ȳ - t.025 (s/√n)   (lower limit boundary)

The t-distributions are symmetric and depend upon the degrees of freedom, n-1 in this case. Since t is affected by variance and sample size (n), there is a different t-distribution for every size of sample taken. The sampling distribution of t has greater dispersion than the normal distribution. If n is very large, then s can be expected to be very close to σ, and the t-distribution can be expected to be very close to normal.

The degrees of freedom represent the number of values in a set of data that are free to vary after certain restrictions have been placed on the data. Suppose one is trying to construct a set of 3 sample values such that the mean of the set is 7. Under this restriction, 2 of the 3 values can take any value; the third is then predetermined, because the three values must sum to 21. Therefore, once the size of a sample is set and the mean assumes a specific value, only n-1 scores are free to vary.

To calculate the upper and lower limits of the confidence interval, we can use options on the PROC SUMMARY/MEANS statement such as:

   ALPHA=   sets the alpha level; the default is .05
   LCLM     lower confidence limit of the mean
   UCLM     upper confidence limit of the mean

ILLUSTRATION:
The following code generates 1,000,000 samples of various sizes and then finds the 95% confidence interval for the mean of each sample. The population mean is compared to this interval to see whether it is contained between the limits, and the percentage of intervals containing the mean is recorded for each power-of-10 number of replicates within each sample size.
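The t-based limits above are what LCLM and UCLM compute. As a supplementary check, they can also be computed by hand; the Python sketch below uses a hypothetical sample (the saved.wages data set is not reproduced here), with the critical value t.025 = 2.045 for n-1 = 29 degrees of freedom taken from standard tables.

```python
import random
import statistics

random.seed(3)
# Hypothetical stand-in sample of n=30 wage-like values
sample = [round(random.uniform(1, 45), 2) for _ in range(30)]

n = len(sample)
ybar = statistics.mean(sample)   # point estimate of the mean
s = statistics.stdev(sample)     # sample standard deviation (n-1 divisor)
t_crit = 2.045                   # t(.025) with 29 degrees of freedom

half_width = t_crit * s / n**0.5
lower, upper = ybar - half_width, ybar + half_width
print(round(lower, 2), round(ybar, 2), round(upper, 2))
```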
Notice that as the sample size gets larger and the number of repetitions of each sample size gets larger, the percentage of sample confidence intervals that contain the population mean approaches 95%.

proc summary data=saved.wages nway print;
  var wage;
  output out=PopMean(drop=_:) mean=PopMean std=sd;
run;

%macro CI(SampSize);
  proc surveyselect data=saved.wages(keep=wage) out=sample
                    sampsize=&SampSize rep=1000000;
  run;

  proc summary data=sample nway;
    class replicate;
    var wage;
    output out=SampleStats(drop=_:) lclm=Lower uclm=Upper;
  run;

  data CIs;
    if _n_=1 then set PopMean;
    set SampleStats;
    SampSize=&SampSize;
    if Lower<=PopMean<=Upper then Contains=1;
    else Contains=0;
    do i=1 to 6;
      Power10=i;
      if Replicate<=10**i then output;
    end;
  run;

  proc append base=All_CIs data=CIs;
  quit;
%mend;

%CI(10);
%CI(20);
%CI(30);
%CI(40);
%CI(50);

proc report data=All_CIs nowd;
  col Power10 SampSize, Contains;
  define Power10 / group;
  define SampSize / across;
  define Contains / analysis mean f=percent12.8 ' ';
run;

The SUMMARY Procedure
Analysis Variable : wage

     N          Mean       Std Dev       Minimum       Maximum
   534     9.0240637     5.1390969     1.0000000    44.5000000

                                 SampSize
Power10         10          20          30          40          50
   1       90.0000%    90.0000%   100.0000%   100.0000%   100.0000%
   2       89.0000%    92.0000%    94.0000%    98.0000%    92.0000%
   3       91.8000%    92.9000%    94.6000%    95.4000%    94.5000%
   4       92.0900%    93.2800%    94.0300%    94.5900%    95.0900%
   5       92.1160%    93.3510%    93.9410%    94.6610%    95.0780%
   6       91.8752%    93.2831%    94.0778%    94.6630%    95.0369%

INTRODUCTION TO HYPOTHESIS TESTING
In the most common use of hypothesis testing, a "straw man" null hypothesis is put forward, and it is determined whether the data are strong enough to reject it.
This hypothesis can be defined as a statement about a population parameter in a single population or about a comparison of parameters from two or more populations. Sample estimates are used to evaluate the hypothesis about the population parameter.

THE NULL AND ALTERNATIVE HYPOTHESES
The null hypothesis is a statement of no effect or no difference in a population parameter. This "straw man" is generally expected to be rejected. The alternative hypothesis is a statistical statement indicating the presence of an effect or difference. The researcher generally expects the alternative hypothesis to be supported.

The decision on whether or not to reject the null hypothesis is based on comparing the results of an experiment with the outcome one could expect if the null hypothesis were true. This decision is made using the appropriate inferential statistical test, based on a designated test statistic that is evaluated in reference to the theoretical sampling distribution of that statistic. The probabilities in the sampling distribution are based on the assumption that the sample is randomly selected from the population it represents. The probability value obtained from the sample is compared to a cutoff probability (the level of significance, alpha) to weigh the evidence against the null hypothesis. An alpha of .05 is commonly chosen for the level of significance and is normally the default in SAS.

The p-values produced by SAS procedures are interpreted as the probability of obtaining a value that falls further away from the mean of the sampling distribution than the observed value. The null hypothesis is rejected only when we have evidence beyond a reasonable doubt that a true difference or association exists in the population from which we drew our random sample.
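The comparison of a p-value to alpha can be made concrete with a short sketch. The Python code below is only a schematic illustration on hypothetical data: it forms the usual one-sample test statistic and, as a simplifying assumption, uses the standard normal distribution in place of the exact t distribution (reasonable here because n is large).

```python
import random
from statistics import NormalDist, mean, stdev

random.seed(4)
mu0 = 9.0                                               # hypothesized mean
sample = [random.gauss(10.5, 5.0) for _ in range(400)]  # hypothetical data

n = len(sample)
t_stat = (mean(sample) - mu0) / (stdev(sample) / n**0.5)

# Two-sided p-value: chance of a statistic at least this far from zero
p_value = 2 * (1 - NormalDist().cdf(abs(t_stat)))

alpha = 0.05
reject = p_value < alpha   # reject H0 when the p-value falls below alpha
print(round(t_stat, 2), reject)
```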
If the probability, or p-value, is smaller than the cutoff alpha, then the test statistic of the sample falls far out in the tail of the sampling distribution, far away from the hypothesized value of the population parameter. It is then likely that the sample is more representative of some alternative population, since the value of the test statistic is so unlikely to occur in the hypothesized population by chance.

TESTS OF HYPOTHESIS ON THE MEAN
The sampling distribution of Ȳ is used as the basis for testing hypotheses about the mean parameter of a population, µ. The test statistic when the population variance, σ², is unknown is the t statistic

   t = (Ȳ - µ0) / sȲ,   where sȲ = s/√n.

This statistic has n-1 degrees of freedom, is asymptotically normal, and tests µ = µ0. The hypotheses for a two-sided test are represented as:

   Null Hypothesis          H0: µ = µ0
   Alternative Hypothesis   H1: µ ≠ µ0

The two-sided test covers both tails of the distribution and does not specify which tail the test statistic may fall in; the hypotheses state only whether or not the test statistic falls between the tails of the sampling distribution. The following is a depiction of the two-sided hypotheses:

   Null:               Ȳ         µ0
          _______(_____|_________|______________)_______
           2.5%  ^--------------95%-------------^  2.5%

   Alternative:                  µ0                 Ȳ
          ________(______________|______________)___|____
           2.5%  ^--------------95%-------------^  2.5%

A one-sided test would state:

   Null Hypothesis          H0: µ >= µ0
   Alternative Hypothesis   H1: µ < µ0

This one-sided test is for testing the hypothesis that the theorized null distribution has µ >= µ0. It is rejected if the sample statistic falls to the left of the cutoff point on the lower tail and is therefore unacceptable as representative of this population.
The following is a depiction of the one-sided hypotheses:

   Null:            Ȳ         µ0
          ______(___|_________|___________________
            5%  ^--------------95%----------------

   Alternative:  Ȳ            µ0
          ______|___(_________|___________________
            5%      ^--------------95%------------

POSSIBILITIES OF ERROR
There are possibilities of error when making a decision about your hypotheses. The two error possibilities are called Type I Error (the level of significance, α) and Type II Error (β). When you decide to reject your null hypothesis because the test statistic falls outside the limits you have set, it is possible that you have gathered a sample with little likelihood of occurrence, yet it still could have occurred. Your test indicates with 95% confidence that the sample is most likely a representation of another population, but there is a 5% chance of this conclusion being incorrect. A Type I Error occurs when no difference exists but you incorrectly reject the null hypothesis and conclude that one does.

A Type II Error is the failure to reject the null hypothesis of no difference when a difference actually exists. This is equivalent to concluding that a true alternative hypothesis is false and is represented by β. The complement of this (1 - β) is the likelihood of rejecting a false null hypothesis and is called the power of the statistical test.

For a particular sample size, these two errors are inversely related: as the probability of one type of error decreases, the other increases. With an increase in sample size, one can decrease the probability of a Type II Error (and therefore increase power) while keeping the selected probability of a Type I Error (alpha) fixed.

ILLUSTRATION:
The following program illustrates power by choosing 100,000 replicates of samples of various sizes, then calculating a t-test for each replicate against a value of µ different from the actual population mean of approximately 9.
The percentage of times the null hypothesis is correctly rejected simulates the power of each test. Notice that power increases as the sample size increases, for a fixed alpha and null hypothesis. Power also increases as the hypothesized mean moves further from the actual population mean; that is, we have more power to detect larger differences. Power decreases for the same sample sizes and null hypothesis if alpha is decreased. This occurs because the β error increases, showing how the two errors are inversely related.

%macro Power(size=30,h0=10,alpha=0.05);
  ods listing close;
  proc surveyselect data=saved.wages(keep=wage) out=sample
                    sampsize=&size reps=100000;
  run;

  ods output Statistics=Stats TTests=pvalues;
  proc ttest data=sample h0=&h0 alpha=&alpha;
    by replicate;
    var wage;
  run;
  ods listing;

  data Pow;
    set pvalues;
    Size=&size;
    Alpha=&alpha;
    H0=&h0;
    if probt<=&alpha then Reject=1;
    else Reject=0;
  run;

  proc append base=Power data=Pow;
  quit;
%mend;

%Power(size=10,h0=11);
%Power(size=20,h0=11);
%Power(size=30,h0=11);
%Power(size=40,h0=11);
%Power(size=50,h0=11);

%Power(size=10,h0=11.5);
%Power(size=20,h0=11.5);
%Power(size=30,h0=11.5);
%Power(size=40,h0=11.5);
%Power(size=50,h0=11.5);

%Power(size=10,h0=11.5,alpha=0.01);
%Power(size=20,h0=11.5,alpha=0.01);
%Power(size=30,h0=11.5,alpha=0.01);
%Power(size=40,h0=11.5,alpha=0.01);
%Power(size=50,h0=11.5,alpha=0.01);

proc report data=Power nowd;
  col H0 Alpha Size Reject;
  define H0 / group;
  define Alpha / group;
  define Size / group;
  define Reject / analysis mean f=percent8.4;
  break after H0 / skip;
  break after Alpha / skip;
run;

Output:

   H0      Alpha    Size    Reject
   11      0.05      10     31.38%
                     20     44.52%
                     30     54.95%
                     40     67.28%
                     50     74.44%

   11.5    0.01      10     22.80%
                     20     36.59%
                     30     51.78%
                     40     63.65%
                     50     73.07%

           0.05      10     38.24%
                     20     57.96%
                     30     72.01%
                     40     80.62%
                     50     87.71%

ESTIMATION OF A SAMPLE PROPORTION
The sampling distribution of the mean as described in the Central Limit Theorem can also be used to estimate the proportion of individuals in a population
having a certain characteristic. If a value of 1 is assigned to those individuals in the sample with the characteristic and a value of 0 to those without it, then the mean of the sample, Ȳ, is an estimate of the population proportion, p.

ILLUSTRATION:
A population is created with p=0.7. This population is sampled to illustrate how the distribution of the sample means appears for various sample sizes. Samples of each size are replicated 1000 times to obtain a picture of the shape of the distribution.

data binomial;
  do i=1 to 3000;
    x=0;
    output;
  end;
  do i=1 to 7000;
    x=1;
    output;
  end;
run;

%macro SampleBinomialDistribution(Reps,Size);
  proc surveyselect data=binomial out=sample sampsize=&Size rep=&Reps;
  run;

  proc summary data=sample nway;
    class replicate;
    var x;
    output out=SampleProportions(drop=_:) mean=SampleProportion;
  run;

  proc gchart data=SampleProportions;
    vbar SampleProportion / space=0 midpoints=0 to 1 by %sysevalf(1/&Size);
    title1 "Distribution of Sample Proportions";
    title2 "&Reps Replicates of Sample Size &Size";
  run;
  quit;
%mend;

%SampleBinomialDistribution(1000,10);
%SampleBinomialDistribution(1000,15);
%SampleBinomialDistribution(1000,20);
%SampleBinomialDistribution(1000,25);
%SampleBinomialDistribution(1000,30);
%SampleBinomialDistribution(1000,40);

Note that as the sample size increases, the sampling distribution of the means approaches a normal distribution with mean at p=0.7. The Central Limit Theorem also tells us that as the number of samples increases for a fixed sample size (30), the distribution of the sample means approaches a normal distribution.

CONCLUSION:
The Central Limit Theorem allows us to make inferences about the mean of a population based on the mean of a sample and its distribution. The sampling distribution of the mean is asymptotically normal no matter what the shape of the original population from which the sample is taken.
However, any one observed sample is either representative of the population or it is not. The inferences that we make from that sample are based on what would occur upon repeated sampling from the same population.

AUTHOR CONTACT INFORMATION
Richard Read Allen
Peak Statistical Services
Evergreen, CO
www.peakstat.com
email: [email protected]