STT200 Ch. 18

Chapter 18. Sampling Distribution and Estimators

For a large number of trials, the Binomial distribution of the number of successes x in n trials can be approximated by a Normal distribution with THE SAME mean and standard deviation:

x is Binomial with mean np and standard deviation √(npq); use x ~ Normal(np, √(npq)) instead.

The Normal model can be used in place of the Binomial if all the original Binomial conditions hold (a finite number of Bernoulli trials) and the Success/Failure condition is met: a Binomial model is approximately Normal if we expect at least 10 successes and 10 failures, that is, np ≥ 10 and nq ≥ 10.

Example: The proportion of people with blood type O-negative is 6%. The Tennessee Red Cross collected blood from 32,000 donors. What is the probability that they have at least 1850 units of O-negative blood?

Solution: (Check conditions: random sample, less than 10% of the population selected, Bernoulli trials with two outcomes per trial and a constant probability of success.) The Normal approximation can be used because np and nq are both at least 10.

p = 0.06
μ = np = 32,000 × 0.06 = 1920
σ = √(npq) = √(32,000 × 0.06 × 0.94) ≈ 42.48
X is approximately N(1920, 42.48)
P(X ≥ 1850) = Normalcdf(1850, 32000, 1920, 42.48) ≈ 0.9503

The Binomial model gives 1 - Binomcdf(32000, 0.06, 1849) = 0.952.

**************************************

Inferential statistics: Using information taken from a sample, we'll learn about the whole population.

The terms (some old, some new):

A parameter: a numerical feature of a population: the proportion, mean, median, range, variance, standard deviation, etc.

A statistic: any numerical measure calculated from data: the proportion, mean, median, range, variance, standard deviation, etc.

An estimator: the statistic used to calculate the estimate: the sample mean, sample variance, sample proportion, etc.

A point estimate: a single number calculated from a sample that can be used as an educated guess for an unknown population parameter.

Usually it is impossible to examine the whole population.
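
The blood-donor example at the start of this chapter can also be checked without a TI calculator. The following is a minimal Python sketch, not part of the original handout: the variable names and the helper binom_pmf are illustrative, statistics.NormalDist plays the role of Normalcdf, and the exact Binomial tail is summed in log space (via math.lgamma) because the binomial coefficients for n = 32,000 are too large for ordinary floating-point arithmetic.

```python
from math import exp, lgamma, log, sqrt
from statistics import NormalDist

# Values taken from the blood-donor example.
n, p = 32_000, 0.06
q = 1 - p

mu = n * p                 # mean of the number of successes, np
sigma = sqrt(n * p * q)    # standard deviation, sqrt(npq)

# Normal approximation to P(X >= 1850); the upper limit 32000 used on
# the calculator adds nothing measurable, so the full upper tail is used.
normal_tail = 1 - NormalDist(mu, sigma).cdf(1850)


def binom_pmf(k, n, p):
    """Binomial probability P(X = k), computed in log space to avoid overflow."""
    log_pmf = (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
               + k * log(p) + (n - k) * log(1 - p))
    return exp(log_pmf)


# Exact Binomial tail: the counterpart of 1 - Binomcdf(32000, 0.06, 1849).
exact_tail = sum(binom_pmf(k, n, p) for k in range(1850, n + 1))

print(f"mean = {mu:.0f}, sd = {sigma:.2f}")
print(f"Normal approximation: {normal_tail:.4f}")
print(f"Exact Binomial:       {exact_tail:.4f}")
```

The two printed probabilities should agree to about two decimal places, which is the point of the Success/Failure condition: with np and nq both far above 10, the Normal model tracks the Binomial closely.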
We take a sample from the population, and our inference about an unknown parameter of the population distribution is based on an appropriate statistic computed from the sample (an estimator).

For example, to estimate the mean μ across the population (a parameter), we may use as the estimator the sample mean x̄ (which is a statistic). Or we can estimate the proportion p across the population (a parameter), using as the estimator the sample proportion p̂ (which is a statistic).

We don't expect a statistic computed from any one sample to be exactly equal to the true population parameter. But we will use some statistics to estimate the value of the parameter.

Statistical inference: a method that converts the information from random samples into reliable estimates of the population parameters.

Notation:

              Parameter   Estimator
Proportion    p           p̂
Mean          μ           x̄
Variance      σ²          s²

Sampling distribution: The probability distribution of a sample statistic is called a sampling distribution.

What is the difference between the distribution of proportions of heads in a two-coin toss and in a 20-coin toss?

For two coins the distribution of the number of successes x is Binomial with p = 0.5 and n = 2; x has mean μ = np = 1 and standard deviation σ = √(npq) = √(2(.5)(.5)) ≈ 0.707.

Example: Experiment: tossing a coin 20 times and observing the proportion of heads. How are the proportions of heads in twenty-coin tosses distributed?

We'll toss 20 coins (or one coin 20 times). The size of our sample is n = 20, and x = observed number of heads. The sample proportion of heads is p-hat = x/n.

Find your p-hat: …………..

Next we'll collect the results obtained by each person. We expect …….% of heads. Find the shape and statistics of the distribution of all p-hats collected so far.

(No coins? No problem: simulate on the calculator. Go to MATH, PRB, 5:randInt and type 0,1,5 ENTER. The calculator will produce five randomly selected zero-one digits at a time. Press ENTER again to get a list of ten numbers altogether.
Count 1 as heads; repeat 3 times to obtain the total of 20 tosses.)

Toss a coin 20 times. Write down the proportion of heads.

Case #   Tally of heads in 20 tosses   Proportion of heads (p-hat)
1        12                            p̂₁ = 12/20 = 0.6
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20

Average of the twenty proportions:
Standard deviation of the twenty proportions:

[Dot plot: our collected results; horizontal axis is the proportion of heads, from 0 (0%) to 1 (100%) in steps of 0.1]

Discuss the shape, mean and standard deviation of this distribution of the sample proportions.

To find the distribution of proportions, divide x, its mean and its standard deviation by n. Under the condition np ≥ 10 and nq ≥ 10, the sample number of heads is x ~ Normal(np, √(npq)), so the sample proportion of heads is

p̂ = x/n ~ N(np/n, √(npq)/n) = N(p, √(pq/n)) = N(0.5, √(0.5 × 0.5/20)) ≈ N(0.5, 0.112)

The mean of the proportions of heads for the 20-coin toss remains the same as for the two-coin toss, but the spread of the distribution is much smaller (for two coins the standard deviation of the proportion is √(0.5 × 0.5/2) ≈ 0.354)!

Because each sample was selected at random, the sample proportion p̂ is a random variable. With many repetitions of twenty tosses, and with more coins in each sample, the distribution of sample proportions resembles the normal model more and more closely.

While we can always find the statistics for any selected sample, in general we don't know the parameters for the entire population.

Goal of Inferential Statistics: Use the sampling distribution of a statistic to estimate the value of a population parameter with a known degree of certainty.

When using a sample statistic to estimate a population parameter, some statistics are good in the sense that they target the population parameter and are therefore likely to give good results. Such statistics are called unbiased estimators. For instance, the sample mean is a good ("unbiased") estimator of the population mean. The sample proportion is a good ("unbiased") estimator of the population proportion.
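
The classroom coin experiment can also be simulated on a computer instead of the calculator. The sketch below is an illustration under stated assumptions, not part of the original handout: it uses Python's random module in place of randInt, the function name sample_proportion and the seed value are arbitrary choices, and 10,000 repetitions stand in for the twenty class results.

```python
import random
from math import sqrt
from statistics import mean, stdev


def sample_proportion(n, p, rng):
    """Toss a coin n times (probability of heads p) and return the proportion of heads."""
    heads = sum(1 for _ in range(n) if rng.random() < p)
    return heads / n


rng = random.Random(200)   # fixed seed so the run is repeatable (arbitrary value)
n, p = 20, 0.5             # twenty tosses of a fair coin, as in the class experiment

# Each entry is one p-hat, the result of one sample of 20 tosses.
p_hats = [sample_proportion(n, p, rng) for _ in range(10_000)]

sim_mean = mean(p_hats)            # should be close to p
sim_sd = stdev(p_hats)             # should be close to sqrt(pq/n)
theory_sd = sqrt(p * (1 - p) / n)  # the model's standard deviation

print(f"mean of p-hats: {sim_mean:.3f}  (model: {p})")
print(f"sd of p-hats:   {sim_sd:.3f}  (model: {theory_sd:.3f})")
```

Changing n from 20 to 2 in this sketch shows the same mean but a much wider spread, which is the comparison between the two-coin and the 20-coin toss made above.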
Generalizing: Sampling Distribution for the Proportions

Notation: p = population proportion, q = 1 - p, p̂ = sample proportion and q̂ = 1 - p̂

Central Limit Theorem for Proportions: For large n (np ≥ 10 and nq ≥ 10) the sampling distribution of p̂ is approximately normal: N(p, √(pq/n)); that is, p̂ is approximately normal with mean μ(p̂) = p and standard deviation σ(p̂) = √(pq/n).

Before you use this theorem to solve a problem, always check the conditions:
1. Randomization Condition: The sample should be a simple random sample of the population.
2. 10% Condition: If sampling has not been made with replacement, then the sample size n must be no larger than 10% of the population.
3. Success/Failure Condition: The sample size has to be big enough so that both np and nq are at least 10.

Example: About 13% of the population is left-handed. A large auditorium has 15 "leftie seats". In a class of 90 students, what is the (approximate) probability that there will not be enough seats for the left-handed students?

The question translates into: What is the probability that in a population with p = 0.13, the sample proportion p̂ computed from a sample of n = 90 students will be more than 0.167?

Solution: Given: n = 90, x = 15, sample proportion p̂ = 15/90 = 16.7%, population proportion p = 13%.

1. Check the assumptions: The sample is as random as any other group of students. Both np = 90 × 0.13 = 11.7 > 10 and nq = 90 × 0.87 = 78.3 > 10. Thus p̂ is approximately normally distributed.

2. Use the CLT: the distribution of the p-hats is N(p, √(pq/n)) = N(0.13, √(0.13 × 0.87/90)) ≈ N(0.13, 0.035).

Answer: P(p̂ > 0.167) = normalcdf(0.167, 10^99, 0.13, 0.035) = 0.145

There is a 14.5% chance that there won't be enough seats for the lefties in that class.

Example: According to the Centers for Disease Control and Prevention, 18.8% of school-aged children, aged 6-11 years, were overweight in 2004. (In 2011 reached 35%!)

(a) Check the conditions. Can the sampling distribution model be used?
(b) In a random sample of 90 school-aged children, aged 6-11 years, what is the probability that at least 19% are overweight?
(c) Suppose a random sample of 90 school-aged children aged 6-11 years results in 24 overweight children. What might you conclude?

Solution

a) Check the conditions and find the mean and standard deviation of the sampling distribution. Can we state that the distribution of all sample proportions of overweight children is approximately normal with the mean = ……… and standard deviation = …………..?

Check: yes. Here np = 90 × 0.188 = 16.92 and nq = 90 × 0.812 = 73.08 are both at least 10, and √(0.188 × 0.812/90) ≈ 0.0412. The model is N(0.188, 0.0412), that is, N(18.8%, 4.12%).

b) Use Normalcdf to answer the question. (How? Use Normalcdf("from", "to", population proportion, standard deviation). In the case of proportions the lowest possible value is 0 and the greatest is 1: there is no need to use positive or negative 10^99, although it is not an error to use them.)

Normalcdf(.19, 1, 0.188, 0.0412) = 0.481

c) What is the probability of observing at least 24 overweight kids in a random group of 90 kids? (Hint: p-hat = x/n = 24/90 and up.) Use Normalcdf (or the table) to answer the question:

Answer: Normalcdf(24/90, 1, 0.188, 0.0412) = 0.0281 or 2.81%

A comment: If the population proportion of overweight kids is truly 0.188, then our observed proportion is unusual: this or a higher proportion of overweight kids has less than a 3% chance of occurring. Less than 3% is an unlikely event! We expect to see only about 3 samples in every 100 such samples with 24 or more overweight kids out of the selected 90 kids. We can conclude that the proportion of overweight or obese children is now more than 18.8%.

End of Ch. 18. Almost.

Class Exercises:

The Central Limit Theorem for the means will be covered in Ch. 23, but the topic matches the subject here.

The CLT requires essentially the same assumptions we saw for sample proportions:

Randomization Condition: The data values must be sampled randomly.
10% Condition: When the sample is drawn without replacement, the sample size n should be no more than 10% of the population.

Large Enough Sample Condition: The CLT doesn't tell us how large a sample we need. The more skewed the population is, the larger the sample that is needed. For normally distributed data, samples can be small.

Conclusions:
1. The distribution of the sample means (x-bars) will, as the sample size increases, approach a normal distribution. If the original distribution was normal, then the sample means have an exactly, not approximately, normal distribution.
2. The average (mean) of all sample means is equal to the population mean: μ(x̄) = μ
3. The standard deviation of all sample means is σ(x̄) = σ/√n

Playing with sampling distributions:
http://media.pearsoncmg.com/aw/aw_mml_shared_1/statistics/West_Applets/sampledist.html

Another source to simulate sampling and get a better understanding of the CLT:
http://www.amstat.org/publications/jse/v6n3/applets/CLT.html

************************************************************************

NOTE: The population standard deviation is usually not known. In this case we need to estimate it. Estimates of the standard deviation of the estimators are called standard errors.

Estimator   Standard Deviation     Standard Error (approx.)
x̄           SD(x̄) = σ/√n           SE(x̄) = s/√n
p̂           SD(p̂) = √(pq/n)        SE(p̂) = √(p̂q̂/n)

Example: Using the CLT to solve a problem

The distribution of women's pulse rates is skewed right with μ = 74 bpm, σ = 13 bpm. If 38 women are randomly selected, find the probability that their average pulse rate is less than 72.

Solution: Check conditions. Random? Independent? Less than 10% of all women? Is the distribution of pulses normal, or do we have a big enough sample? (Yes: the sample was random and small enough, and even though the distribution of pulses is skewed right, the sample size is large enough to use the CLT and find the characteristics of the sampling distribution.)

The average pulse rates for groups of 38 women are approximately normally distributed with mean 74 bpm and standard deviation σ/√n = 13/√38, almost 2.1 bpm.

x̄ ~ N(74, 13/√38) ≈ N(74, 2.1)
P(x̄ < 72) = Normalcdf(0, 72, 74, 2.1) ≈ 0.17

Answer: There is about a 17% chance of finding a random group of 38 women with an average pulse rate of less than 72 bpm.

Note: While the mean and standard deviation of the distribution of a single woman's pulse rate are known (74, 13), we cannot use this information to find P(x < 72) unless we know more about the distribution of single scores.

Hints:

When working with an individual value taken from a normally distributed population, one at a time, use the model N(μ, σ).

When working with a sample of size n > 1 (when the conditions for using the CLT are met), use the model N(μ, σ/√n).

If the original data are not normally distributed, the size of your sample must be large!

In all word problems concerning the use of the Central Limit Theorem we'll use a calculator to eliminate or limit hand computations. With a TI-83/84:

If n = 1: P(a < x < b) = normalcdf(a, b, population mean, population standard deviation)

If n > 1: P(a < x-bar < b) = normalcdf(a, b, population mean, population standard deviation/sqrt(n))
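
For readers without a TI-83/84, the two calculator recipes above can be imitated in Python. This is a minimal sketch under stated assumptions, not the handout's own method: the function names normalcdf and normalcdf_mean are chosen here only to mirror the calculator, and they are built on statistics.NormalDist from the standard library. The pulse-rate call assumes the sample size n = 38 that appears in the standard-deviation computation 13/√38, and the leftie-seat call uses the unrounded standard deviation √(0.13 × 0.87/90), so its result differs slightly from the 0.145 obtained with the rounded value 0.035.

```python
from math import sqrt
from statistics import NormalDist


def normalcdf(lower, upper, mean, sd):
    """P(lower < X < upper) for X ~ N(mean, sd), like the TI normalcdf."""
    dist = NormalDist(mean, sd)
    return dist.cdf(upper) - dist.cdf(lower)


def normalcdf_mean(lower, upper, mean, sd, n):
    """P(lower < x-bar < upper) for the mean of n values, using sd / sqrt(n)."""
    return normalcdf(lower, upper, mean, sd / sqrt(n))


# Pulse-rate example: mu = 74, sigma = 13, assumed n = 38, P(x-bar < 72).
p_pulse = normalcdf_mean(0, 72, 74, 13, 38)

# Leftie-seat example: p = 0.13, n = 90, P(p-hat > 0.167).
sd_p_hat = sqrt(0.13 * 0.87 / 90)
p_lefties = normalcdf(0.167, 1, 0.13, sd_p_hat)

print(f"P(x-bar < 72)   is about {p_pulse:.2f}")
print(f"P(p-hat > .167) is about {p_lefties:.2f}")
```

Keeping the n = 1 and n > 1 cases as two separate functions follows the hints above: the only difference between them is the division of the standard deviation by sqrt(n).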