Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Sampling Distributions—Statistics as RV’s Sampling Distributions ¾ Sample values are measurements or observations of RVs Î the value for a sample statistic will vary in a random manner from sample to sample New Definition Sampling distribution (of a sample statistic)---the distribution of possible values of a statistic for repeated samples of the same size from a population ¾ Sample statistics are RVs because different samples can lead to different values for the sample statistics 1 2 Statistical Inference Want to make conclusions about population parameters on the basis of sample statistics. Min: 1st Qu.: Mean: Median: 3rd Qu.: Max: Total N: NA's : Std Dev.: 80 ¾ Confidence Intervals—interval of values that the researcher is fairly certain will cover the true, unknown value of the population parameter 60 40 ¾ Hypothesis Tests (significance testing)—uses sample data to attempt to reject a hypothesis about the population 0.00000 78.25000 81.00518 84.00000 91.00000 99.00000 198.00000 4.00000 18.34231 20 0 0 11 22 33 3 66 77 88 99 4 n = 10 mean30 mean50 84.45 82.76 84.30 82.22 82.90 82.06 Sample 4 Sample 5 Sample 6 79.90 86.40 80.40 73.15 81.10 82.61 82.14 80.31 81.47 81.06 78.51 82.52 0 83.05 73.16 78.68 50 50 mean20 86.89 77.40 85.90 0 mean10 Sample 1 Sample 2 Sample 3 n = 20 100 100 SD Midterm Score = 18.34 150 150 200 Mean Midterm Score = 81.01 44 55 Midterm 60 70 80 90 65 70 75 80 85 90 mean20.out[1:1000] n = 50 100 150 n = 30 0 0 50 50 100 150 200 250 mean10.out[1:1000] 70 5 75 80 mean30.out[1:1000] 85 74 76 78 80 82 84 mean50.out[1:1000] 86 88 6 Start with what distribution? 300 300 200 n = 20 50 100 0.10 0.08 0 0 50 100 200 n = 10 60 70 80 90 100 50 60 70 80 90 100 mean20.out[1:1000] 300 300 mean10.out[1:1000] 0.06 0.04 n = 50 200 200 n = 30 p(x) 50 50 100 0.00 0 0 50 100 0.02 50 60 70 80 90 100 50 60 mean30.out[1:1000] 70 80 90 0 100 2 4 6 8 10 x 7 mean50.out[1:1000] 8 1000 observations from Lognormal(0,1) Sampling Distributions for means from U(0,10) 0.3 0.4 n=5 n = 10 X ~ Lognormal ( µ ,σ ) ⇒ 500 0.3 Y = log ( X ) ~ N ( µ ,σ ) 0.2 0.2 0.1 400 0.1 0.0 0.0 1.3 2.6 3.9 5.1 6.4 7.7 9.0 0.0 0.0 0.6 0.8 1.3 2.6 3.9 5.1 6.4 7.7 300 9.0 200 0.5 n = 20 100 0.4 0.3 0.2 0.2 0.1 0.0 0.0 n = 30 0.6 0.4 1.3 2.6 3.9 5.1 6.4 7.7 9.0 0.0 0.0 n=5 0 0.0 1.3 2.6 3.9 5.1 6.4 7.7 9.0 0.3 4.5 6.8 9.0 11.3 13.5 15.8 18.0 20.3 22.5 10 Sampling Distributions n=10 Many common, practical problems involve estimating mean values from samples; interest is really in making an inference about the mean, µ , of some population 0.6 0.4 2.3 9 0.4 0.2 0.2 0.1 0.0 0.0 0.9 1.9 2.8 3.8 4.7 5.7 6.6 7.6 8.5 0.8 0.6 What do we know: the sample mean, x , is often a good estimator of µ 0.0 0.0 0.9 1.9 2.8 3.8 4.7 5.7 6.6 7.6 8.5 n=20 n=30 Example: Midterm exam scores—we will consider the class data to represent the population 0.8 0.6 0.4 0.4 0.2 0.2 0.0 0.0 0.9 1.9 2.8 3.8 4.7 5.7 6.6 7.6 8.5 0.0 0.0 0.9 1.9 2.8 3.8 4.7 5.7 6.6 7.6 8.5 Consider histograms for 50 samples 11 x based on n = 10, 20, 30 and 12 } = 81.01 We are using parameter designators since we are taking the full class data to be the population the normal probability distribution approximates the computergenerated sampling distribution very well 82.22 82.76 82.90 80.86 80.71 81.10 80.95 0.04 60 70 80 90 100 70 90 100 90 100 0.15 0.12 0.10 0.0 0.05 0.04 µ = 81.01 80 mean20.out[1:1000] 0.0 60 13 70 80 90 100 60 mean30.out[1:1000] 70 80 mean50.out[1:1000] 14 Sampling Distributions σ = 18.34 Properties of the sampling distribution of x Sample 1 mean10 mean20 mean30 mean50 86.89 83.05 84.45 82.22 Sample 2 Mean of Sampling Distribution based on 1000 draws of size n SD of Sampling Distribution based on 1000 draws of size n Sigma/sqrt(n) 77.40 73.16 82.76 82.90 80.86 80.71 81.10 80.95 5.78 5.80 4.13 4.10 3.09 3.35 2.36 2.59 18.34 / 10 For the midterm scores example: 18 . 34 = 5 . 80 10 18 . 34 σx = = 4 . 10 20 1. Mean of sampling distribution equals mean of sampled population: µx = E ( x ) = µ 2. Standard deviation of sampling distribution equals Standard deviation of sampled population Square root of sample size σx = σ n 15 Sampling Distributions σx = 60 mean10.out[1:1000] 0.08 Mean of Sampling Distribution based on 1000 draws of size n 0.02 84.45 73.16 0.0 83.05 77.40 0.02 mean50 86.89 Sample 2 0.0 mean10 mean20 mean30 Sample 1 0.06 0.06 = 18.34 0.10 µ σ σx = 18 . 34 = 3 . 35 30 18.34 σx = = 2.59 50 16 Central Limit Theorem (CLT) Consider a random sample of n observations selected from a population (any population) with mean µ and standard deviation σ Then, when n is sufficiently large, the sampling distribution of x will be approximately a normal distribution with mean µ x = µ and standard deviation σx = σ n . The larger the sample size, the better will be the normal approximation to the sampling distribution of x If X is normally distributed then x will be normally distributed 17 18 Central Limit Theorem (CLT) Z-scores The sum of a random sample of n observations, ∑ x , will also possess a sampling distribution that is approximately normal for large samples; µ ∑x = nµ; σ ∑x Z≈ = nσ x −µ σ where σ x = σx n approaches Z ⎯⎯⎯⎯ → N ( 0 ,1) as n → ∞ If X ∼ N ( µ ,σ 2 ) then Z ∼ N ( 0 ,1) How large is large? The greater the skewness of the sampled population distribution, the larger the sample size must be before the normal distribution is an adequate approximation for the sampling distribution of x ; for many sampled population, sample sizes of n ≥ 30 will suffice. We do not need to depend on the CLT to “induce” normality; the normality is exact, not approximate 19 20