Chapter 5: Sampling distributions

Outline
1 Introduction
2 Sampling distribution of a proportion
3 Sampling distribution of the mean
4 Normal approximation to the binomial
5 The continuity correction

Introduction: sampling distributions
What does it mean to take a sample of size n? Y1, ..., Yn form a random sample if they are independent and have a common distribution. From a sample we can calculate a sample statistic, such as the sample mean Ȳ. Ȳ is random too, and its distribution is called a sampling distribution.
[Chalkboard: sampling picture; randomness via sampling; Ȳ, p̂.]

Sampling distribution of a proportion

Estimator for a proportion
A dichotomous trait occurs in a population with an unknown proportion p, and you wish to estimate p. Steps:
1 Collect a sample of size n.
2 Let Y = the number of observations that have the trait of interest. Then Y ∼ B(n, p) (binomial).
An estimator for the unknown p is p̂ = Y/n.

Example
What proportion p of the human population has type O blood? (The trait is dichotomous: a person either has type O blood or does not.) You randomly sample 100 people and observe that 42 have type O. Here n = 100 and Y = 42, so the estimate is p̂ = Y/n = 0.42. What would happen if you took a different sample of 100 people? You'd get a different value of p̂.

Sampling distribution of p̂
Y ∼ B(n, p) and p̂ = Y/n.
1 The distribution of Y is binomial.
2 The distribution of p̂ is that binomial divided by n.
[Chalkboard example: write out the distribution of p̂ (a hassle if n is big!).]

Mean and variance of p̂
With Y ∼ B(n, p) and p̂ = Y/n:
1 Moments of Y: E(Y) = np, Var(Y) = np(1 − p).
2 Moments of p̂: E(p̂) = p, Var(p̂) = p(1 − p)/n.
[Chalkboard: motivate E(p̂) = p and Var(p̂) = p(1 − p)/n.]

Example
Cross of two heterozygotes, Aa × Aa. Probability distribution of the offspring's genotype:

Offspring genotype   AA     Aa     aa
Probability          0.25   0.50   0.25

An offspring is dominant if it has genotype AA or Aa. Experiment: get n = 2 offspring, count the number Y of dominant offspring, and calculate the sample proportion p̂ = Y/2.
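The distribution of p̂ for this n = 2 experiment can be tabulated directly from the binomial formula, with p = P(dominant) = 0.25 + 0.50 = 0.75. A minimal sketch (the course materials use R; Python is used here purely for illustration):

```python
from math import comb

# Aa x Aa cross: an offspring is dominant (AA or Aa) with probability
# p = 0.25 + 0.50 = 0.75.  Y = number of dominant offspring out of n = 2.
n, p = 2, 0.75

# P(Y = k) from the binomial formula, keyed by the estimate p-hat = k/n
dist = {k / n: comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)}

for phat, prob in sorted(dist.items()):
    print(f"p-hat = {phat:.1f}: probability {prob:.4f}")
# -> 0.0625, 0.3750, 0.5625
```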
We would like p̂ to be close to the "true" value p = 0.75, but p̂ is random. Distribution of p̂ (from the binomial distribution):

Y             0        1        2
p̂             0.0      0.5      1.0
Probability   0.0625   0.3750   0.5625

Example
Larger sample size: Y = number of dominant offspring out of n = 20, and p̂ = Y/20 is the sample proportion. We still want p̂ to be close to the "true" value p = 0.75, and p̂ is still random. What is the probability that p̂ is within 0.05 of p? Translate into a binomial question:

P{0.70 ≤ p̂ ≤ 0.80} = P{0.70 ≤ Y/20 ≤ 0.80}
                    = P{14 ≤ Y ≤ 16}
                    = P{Y = 14} + P{Y = 15} + P{Y = 16}
                    = 0.56

A sample size of 20 is better than a sample size of 2!

Sampling distribution of the mean

Categorical versus numerical variables
1 Dichotomous categorical variable (e.g. left-handedness): the natural value to consider is the proportion.
2 Numerical variable (e.g. height): the natural value to consider is the mean.

Estimator for the mean
A numerical trait occurs in a population with an unknown mean µ, and you wish to estimate µ. Collect a random sample of size n: Y1, Y2, ..., Yn. An estimator for the unknown µ:

Ȳ = (1/n) Σ_{i=1}^{n} Y_i

Example
1 You want to study the serum cholesterol level of 17-year-old males in the US.
2 Specifically, you are interested in the mean (µ) of the serum cholesterol levels of all 17-year-old males in the US.
3 You collect a sample of size 25.
4 You observe Ȳ = 30.
5 What would happen if you collected a different sample of n = 25 17-year-old males?

Example
Weight of seeds of some variety of beans, sample size n = 4:

Student #   Observations          Sample mean ȳ
1           462  368  607  483    480
2           346  535  650  451    495.5
3           579  677  636  529    605.25

Ȳ is random. How do we know its distribution? We will see 3 key facts.

Key fact #1
If Y1, ..., Yn is a random sample, and if the Yi's have mean µ and standard deviation σ, then Ȳ has mean µȲ = µ and variance Var(Ȳ) = σ²/n, i.e. standard deviation σȲ = σ/√n.

Seed weight example: assume beans have mean µ = 500 mg and σ = 120 mg.
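The n = 20 calculation above can be checked numerically. A sketch (again in Python rather than the course's R; `binom_pmf` is a helper written for this illustration, not a library function):

```python
from math import comb

def binom_pmf(k, n, p):
    """Binomial probability P(Y = k) for Y ~ B(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 20, 0.75

# P(0.70 <= p-hat <= 0.80) = P(14 <= Y <= 16) for Y ~ B(20, 0.75)
prob = sum(binom_pmf(k, n, p) for k in range(14, 17))
print(round(prob, 2))  # -> 0.56
```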
In a sample of size n = 4, the sample mean Ȳ has mean µȲ = 500 mg and standard deviation σȲ = 120/√4 = 60 mg.

Key fact #2
If Y1, ..., Yn is a random sample, and if the Yi's are all from N(µ, σ), then

Ȳ ∼ N(µ, σ/√n)

In fact Y1 + ... + Yn = nȲ is normally distributed too.

Seed weight example: 100 students do the same experiment.
[Figure: histograms of the sample mean for n = 4 and n = 16; horizontal axis "sample mean", 350 to 650.]

Key fact #3: Central limit theorem
If Y1, ..., Yn is a random sample from (almost) any distribution, then, as n gets large, Ȳ is approximately normally distributed. Note: Y1 + ... + Yn is approximately normal too.

How big must n be? Usually n = 30 is big enough, unless the distribution is strongly skewed.

This is a remarkable result! It explains why the normal distribution is so common, so "normal": it is what we get when we average over lots of pieces. Ex: human height, which results from ...
[Figure: beans filtered, discarded if too small: sampling distributions for n = 1, 5, 10, 30.]
[Figure: mixture of 2 bean varieties: sampling distributions for n = 1, 5, 10, 30.]

Exercise
Snowfall Y ∼ N(0.53, 0.21) on winter days (inches). Take the sample mean Ȳ of a random sample of 30 winter days, over the 10 previous years. What is the probability that Ȳ ≤ 0.50 in?

Ȳ has mean 0.53 inches.
Ȳ has standard deviation 0.21/√30 = 0.0383 inches.
Ȳ's distribution is approximately normal, because the sample size is large enough (n = 30).

P{Ȳ ≤ 0.50} = P{(Ȳ − 0.53)/0.0383 ≤ (0.50 − 0.53)/0.0383}
            ≈ P{Z ≤ −0.782} = 0.217

Normal approximation to the binomial

The normal approximation to the binomial
Example: X = number of children with side effects after a vaccine, out of n = 200 children. Probability of a side effect: p = 0.05. So X ∼ B(200, 0.05). What is P{X ≤ 15}?
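The snowfall exercise can be reproduced with the standard normal CDF, which the Python standard library exposes through the error function (a sketch for illustration; the course itself works from normal tables):

```python
from math import erf, sqrt

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

mu, sigma, n = 0.53, 0.21, 30
se = sigma / sqrt(n)    # standard deviation of the sample mean, ~0.0383
z = (0.50 - mu) / se    # ~ -0.782

prob = norm_cdf(z)      # P(Ybar <= 0.50)
print(round(prob, 3))   # -> 0.217
```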
Direct calculation:

P{X = 0} + P{X = 1} + ... + P{X = 15}
  = C(200,0) (0.05)^0 (0.95)^200 + ... + C(200,15) (0.05)^15 (0.95)^185

Or we can use a trick: the binomial might be close to a normal distribution. Pretend X is normally distributed!
[Figure: binomial probability histograms for n = 10, 50, 200 with p = 0.05, and for n = 20 with p = 0.1, 0.5, 0.9.]

The normal approximation to the binomial
X = Y1 + ... + Y200, where Yi = 1 if child #i has side effects and Yi = 0 otherwise. Apply key fact #3: if n (the number of children) is large enough, then Y1 + ... + Yn has an approximately normal distribution. Use the normal distribution with X's mean and standard deviation:

µ = np = 10,    σ = √(np(1 − p)) = 3.08

If X ∼ B(n, p) and n is large enough (rule of thumb: np ≥ 5 and n(1 − p) ≥ 5), then X's distribution is approximately N(np, √(np(1 − p))).

The normal approximation to the binomial
Back to our question: P{X ≤ 15}. Here np = 10 and n(1 − p) = 190 are both ≥ 5, so X ≈ N(10, 3.08).

P{X ≤ 15} = P{(X − 10)/3.08 ≤ (15 − 10)/3.08}
          ≈ P{Z ≤ 1.62} = 0.9474

True value:
> sum( dbinom(0:15, size=200, prob=0.05) )
[1] 0.9556444

The continuity correction
[Figure: three plots of the number of children with side effects (0 to 20), illustrating the continuity correction.]

X is binomial B(200, 0.05), and Y is normal N(10, 3.08). With no continuity correction:

P{X ≤ 15} ≈ P{Y ≤ 15} = P{(Y − 10)/3.08 ≤ (15 − 10)/3.08}
                       = P{Z ≤ 1.62} = 0.9474

The continuity correction gives a better approximation.
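The uncorrected approximation can be compared against the exact binomial sum (a Python sketch of the same computation the R `dbinom` call performs; the slides' 0.9474 uses z rounded to 1.62 and a normal table, so the unrounded result differs slightly):

```python
from math import comb, erf, sqrt

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

n, p = 200, 0.05
mu = n * p                     # 10
sigma = sqrt(n * p * (1 - p))  # ~3.08

# Exact: P(X <= 15) as a sum of binomial probabilities
exact = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(16))

# Normal approximation, no continuity correction
approx = norm_cdf((15 - mu) / sigma)

print(round(exact, 7))   # -> 0.9556444
print(round(approx, 3))  # -> 0.948 (slides: 0.9474, via z rounded to 1.62)
```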
P{X ≤ 15} ≈ P{Y ≤ 15.5} = P{(Y − 10)/3.08 ≤ (15.5 − 10)/3.08}
                         = P{Z ≤ 1.78} = 0.9624

(The true value was 0.9556.)

The continuity correction
X is binomial B(200, 0.05), and Y is normal N(10, 3.08). What is the probability that between 8 and 15 children get side effects?

P{8 ≤ X ≤ 15} ≈ P{7.5 ≤ Y ≤ 15.5}
              = P{(7.5 − 10)/3.08 ≤ (Y − 10)/3.08 ≤ (15.5 − 10)/3.08}
              = P{−0.81 ≤ Z ≤ 1.78}
              = P{Z ≤ 1.78} − P{Z ≤ −0.81}
              = 0.7535

True value:
> sum( dbinom(8:15, size=200, prob=0.05) )
[1] 0.7423397
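The continuity-corrected calculation can be verified the same way (a Python sketch mirroring the R `dbinom` sum above; small differences from the slides' 0.7535 come from rounding z to two decimals there):

```python
from math import comb, erf, sqrt

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

n, p = 200, 0.05
mu = n * p                     # 10
sigma = sqrt(n * p * (1 - p))  # ~3.08

# Exact: P(8 <= X <= 15)
exact = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(8, 16))

# Normal approximation with the continuity correction: P(7.5 <= Y <= 15.5)
approx = norm_cdf((15.5 - mu) / sigma) - norm_cdf((7.5 - mu) / sigma)

print(round(exact, 7))   # -> 0.7423397
print(round(approx, 3))  # -> 0.754
```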