TOPIC (10) – SAMPLING VARIABILITY AND SAMPLING DISTRIBUTIONS

Recall that we typically cannot census the entire population of interest, so we take a sample from that population in order to make estimates and draw conclusions about the population. The sample mean $\bar{x}$ is the estimator of the unknown population mean $\mu$. Similarly, the sample standard deviation $s$ is the estimator of the unknown population standard deviation $\sigma$.

1) SAMPLING DISTRIBUTION of the Sample Mean $\bar{x}$

Important Point: The value of $\bar{x}$ will vary with each sample taken from the population.

EXAMPLE
Suppose we had a very small population of 5 units with X-values {2, 4, 8, 10, 14}. What is the frequency distribution of the sample mean $\bar{x}$ based on a random sample of 2 units? Here, $\mu = 7.6$ and $\sigma = 4.77$ (this value uses the $n - 1$ divisor; dividing by $N = 5$ instead gives 4.27). Let's take samples of size 2 with replacement. Ignoring the order of selection, the total number of possible samples is 15.

Sample      Sample mean
(2, 4)      3
(2, 8)      5
(2, 10)     6
(2, 14)     8
(4, 8)      6
(4, 10)     7
(4, 14)     9
(8, 10)     9
(8, 14)     11
(10, 14)    12
(2, 2)      2
(4, 4)      4
(8, 8)      8
(10, 10)    10
(14, 14)    14

[Figure: frequency distribution of the sample means for n = 2, with the expected normal curve overlaid; x-axis: Upper Boundaries (x <= boundary), y-axis: No of obs.]

Mean of $\bar{x}$: $\mu_{\bar{x}} = \frac{1}{15}(3 + 5 + \cdots + 10 + 14) = 7.6$

Std. Deviation of $\bar{x}$: $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{4.77}{\sqrt{2}} = 3.376$

We can think of the list of samples (and their $\bar{x}$ values) as a population of samples, each sample with a value for the variable of interest!

Some Things To Note About The Behavior Of Sample Means:

1) $\bar{x}$ varies from sample to sample (called SAMPLING VARIABILITY).
2) The average of the sample means = the average of the population sampled: $\mu_{\bar{x}} = \mu$. The sample mean $\bar{x}$ is said to be UNBIASED for the population mean $\mu$.
3) The frequency distribution of the sample means does not match the distribution of the original population: it is centered in the same place, but the shape and variability (range) are different.
4) Knowing the frequency distribution for the sample means allows us to calculate probabilities about the mean.
5) The variability of the sample means < the variability of the X-values in the population sampled: $\sigma_{\bar{x}} < \sigma$.
6) The frequency distribution of the sample means is called the SAMPLING DISTRIBUTION of $\bar{x}$. Its shape and its variability, $\sigma_{\bar{x}}$, depend on the sample size. Its center, $\mu_{\bar{x}}$, depends on whether the sampling is unbiased or not. All three characteristics depend on the sampling method (i.e. all can change if the method changes).

Effects Of Sample Size And Sampling Method

Let's take samples of size 3 with replacement. Ignoring the order of selection, the total number of possible samples is 35.

Samples with three different values:
(2, 4, 8) (2, 4, 10) (2, 4, 14) (2, 8, 10) (2, 8, 14) (2, 10, 14) (4, 8, 10) (4, 8, 14) (4, 10, 14) (8, 10, 14)

Samples with one value repeated:
(2, 2, 4) (2, 4, 4) (2, 2, 8) (2, 8, 8) (2, 2, 10) (2, 10, 10) (2, 2, 14) (2, 14, 14) (4, 4, 8) (4, 8, 8)
(4, 4, 10) (4, 10, 10) (4, 4, 14) (4, 14, 14) (8, 8, 10) (8, 10, 10) (8, 8, 14) (8, 14, 14) (10, 10, 14) (10, 14, 14)

Samples with all three values equal:
(2, 2, 2) (4, 4, 4) (8, 8, 8) (10, 10, 10) (14, 14, 14)

[Figure: Frequency Distribution of Sample Means, n=3, with the expected normal curve overlaid; x-axis: Upper Boundaries (x <= boundary), y-axis: No of obs.]

Mean of $\bar{x}$: $\mu_{\bar{x}} = 7.6$

Std. Deviation of $\bar{x}$: $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{4.77}{\sqrt{3}} = 2.754$

Increasing the sample size made the shape even more normal and decreased the variability as well.

What is the probability $\Pr(6.6 < \bar{x} < 8.6)$? We can get an approximate answer using the fact that $\bar{x}$ looks normally distributed with a mean of 7.6 and a standard deviation of 2.75.

$\Pr(6.6 < \bar{x} < 8.6) = \Pr\left(\frac{6.6 - 7.6}{2.75} < Z < \frac{8.6 - 7.6}{2.75}\right)$
$= \Pr(-0.36 < Z < +0.36)$
$= \Pr(Z < +0.36) - \Pr(Z < -0.36)$
$= 0.6406 - 0.3594 = 0.2812$

SAMPLING DISTRIBUTION of $\bar{x}$:
Suppose we have a population with a mean $\mu$ and a standard deviation $\sigma$ and we take a sample of size n. As long as the sample is random, and either the sample size is less than 5% of the population or we sample with replacement, the frequency distribution of the sample mean has the following characteristics:

1. $\mu_{\bar{x}} = \mu$
2. $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$
3. The shape of the distribution is
   a) a bell-curve (Normal), if the original population that we sampled has a bell-curve distribution.
   b) (CENTRAL LIMIT THEOREM) a bell-curve if the sample size is relatively large, regardless of the shape of the frequency distribution of the original population. "Relatively large" = 30 or more.

EXAMPLE
In a study of the evolutionary history of the amphipod Gammarus minus, one of the variables used to distinguish subspecies is the length of the first antennae. If the population found in caves only recently separated from the subspecies found in springs, the length of the antennae should be similar in the two groups. Spring animals have an average first antennal length of 2.6 mm and a population standard deviation of 0.7 mm. What is the probability that your sample of 10 cave animals would yield a mean length of 3.1 mm or larger if the two subspecies split off recently?

First we note that the sample size is relatively small, so we need to assume that antennal length is normally distributed (which seems reasonable). Then the sampling distribution of $\bar{x}$ is Normal with mean $\mu_{\bar{x}} = 2.6$ and standard deviation $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{0.7}{\sqrt{10}} = 0.221$.

Then $\Pr(\bar{x} > 3.1) = 1 - \Pr(\bar{x} \le 3.1)$, where

$\Pr(\bar{x} \le 3.1) = \Pr\left(\frac{\bar{x} - 2.6}{0.221} < \frac{3.1 - 2.6}{0.221}\right) = \Pr(Z < 2.26) = 0.9881$

So $\Pr(\bar{x} > 3.1) = 1 - 0.9881 = 0.0119$.

Hence, this event is very unlikely if the two subspecies separated recently. Should your sample actually yield a mean of 3.1 or more, it would be strong evidence against the hypothesis that they split recently.

2) SAMPLING DISTRIBUTION of the Sample Proportion p

If we want to estimate what proportion of the population ($\pi$) is in the category we have defined as a success, we take a random sample from that population and calculate the sample proportion in that category (p). The shape of the sampling distribution for p depends very heavily on the sample size n and the population proportion $\pi$.
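As a rough illustration of how the shape depends on n and the population proportion, the sketch below computes the exact distribution of p for a few settings. It assumes the sampled units are independent, so the number of successes follows a binomial distribution. The function names and the use of skewness as a measure of asymmetry are illustrative choices, not part of the original notes.

```python
from math import comb, sqrt

def proportion_distribution(n, pi):
    """Exact distribution of p = k/n, assuming n independent draws
    with success probability pi (binomial counts; an assumption)."""
    return {k / n: comb(n, k) * pi**k * (1 - pi)**(n - k) for k in range(n + 1)}

def summarize(n, pi):
    """Return (mean, standard deviation, skewness) of the sample proportion."""
    dist = proportion_distribution(n, pi)
    mean = sum(p * w for p, w in dist.items())
    var = sum((p - mean) ** 2 * w for p, w in dist.items())
    sd = sqrt(var)
    # Skewness near 0 indicates a symmetric shape (hypothetical choice of measure).
    skew = sum((p - mean) ** 3 * w for p, w in dist.items()) / sd**3
    return mean, sd, skew

# Illustrative settings only: vary pi at fixed n, then vary n at fixed pi.
for n, pi in [(5, 0.5), (5, 0.1), (50, 0.1)]:
    mean, sd, skew = summarize(n, pi)
    print(f"n={n:3d}  pi={pi:.1f}  mean={mean:.4f}  sd={sd:.4f}  skew={skew:+.3f}")
```

Comparing the rows shows the two effects separately: at a fixed n, moving the proportion away from 1/2 increases the skewness, and at a fixed proportion, a larger n reduces both the skewness and the standard deviation.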
EXAMPLE
Suppose we had repeatedly tossed n = 5 dice where $\pi = 0.5$ for Pr(1). The frequency distribution for the sample proportion is:

[Figure: histogram of VAR1 with the expected normal curve overlaid; x-axis: Upper Boundaries (x <= boundary), y-axis: No of obs.]

The mean of this sampling distribution is 0.5 and the standard deviation is 0.2236.

Important Points: For any given sample size, the closer $\pi$, the population proportion, is to 1/2,
A) the more symmetric the shape of the frequency distribution of the sample proportion p
B) the larger the variability of values of p

Important Points: For any given value of $\pi$, the population proportion, a larger sample size from that population has
A) a more symmetric shape for the frequency distribution of the sample proportion p
B) a smaller variability in the values of p

Let's put what we've learned about sample proportions into one statement:

SAMPLING DISTRIBUTION of p
Suppose we have a population with a binary variable. The proportion of successes in the population is $\pi$ and we take a random sample of size n. As long as the sample is random, so that each sampled unit is independent of any other sampled unit, the frequency distribution of the sample proportion has the following characteristics:

1. $\mu_p = \pi$
2. $\sigma_p = \sqrt{\frac{\pi(1 - \pi)}{n}}$
3. (CENTRAL LIMIT THEOREM) The shape of the distribution is approximately normal when n is large and $\pi$ is not too close to 0 or 1. The further $\pi$ is from 1/2, the larger n has to be in order for the shape to be a bell-curve. A rule-of-thumb is that the CLT holds if both $n\pi \ge 10$ and $n(1 - \pi) \ge 10$.

EXAMPLE
Suppose that the rate of a specific form of birth defect was 1 in 1000 live births around the early 1900s. A researcher claims that better hygiene and health care have decreased the rate to something much smaller (say 1 in 10,000 now). To test this hypothesis, the scientist collects birth records at random for 25,000 children born in 1999. There were 17 children with the birth defect.
What is the probability of observing so few defects, or even fewer, if the 1 in 1000 rate is still true? If $\pi = 1/1000$ is true, then the mean proportion of successes in random samples of 25,000 is $\mu_p = \pi = 0.001$ and the standard deviation for a sample proportion is

$\sigma_p = \sqrt{\frac{\pi(1 - \pi)}{n}} = \sqrt{\frac{0.001(0.999)}{25000}} = 0.0002$

A random sample of 25,000 is sufficiently large for normality, but let's check to make sure: $n\pi = 25000(0.001) = 25$ and of course $n(1 - \pi) = 25000(0.999) = 24975$. Both are bigger than 10, so we can proceed.

$\Pr\left(p \le \frac{17}{25000}\right) = \Pr\left(Z \le \frac{0.00068 - 0.001}{0.0002}\right) = \Pr(Z \le -1.60) = 0.0548$

There is evidence to suggest that the rate has gone down, but it isn't very strong.
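The birth-defect calculation above can be checked numerically. The following is a minimal sketch using the normal approximation from the sampling-distribution statement for p; the function name `prob_proportion_at_most` is hypothetical, and the standard normal CDF comes from Python's `statistics.NormalDist`.

```python
from math import sqrt
from statistics import NormalDist

def prob_proportion_at_most(p_obs, pi, n):
    """Normal-approximation Pr(p <= p_obs) for a sample proportion,
    assuming independent sampled units. Returns (z, probability)."""
    sd = sqrt(pi * (1 - pi) / n)      # sigma_p = sqrt(pi(1 - pi)/n)
    z = (p_obs - pi) / sd             # standardize the observed proportion
    return z, NormalDist().cdf(z)     # lower-tail area under the standard normal

n, pi, defects = 25_000, 0.001, 17

# Rule-of-thumb check from the text: both expected counts should be at least 10.
assert n * pi >= 10 and n * (1 - pi) >= 10

z, prob = prob_proportion_at_most(defects / n, pi, n)
print(f"z = {z:.2f}, Pr(p <= {defects / n:.5f}) = {prob:.4f}")
```

Because the code keeps the unrounded standard deviation rather than 0.0002, the probability it prints may differ from the hand calculation in the last decimal place.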