7 - Sampling Distributions and Confidence Intervals for $\mu$ & $p$

Introduction:
When we take a sample of size n from a population and calculate summary statistics like the sample mean ($\bar{X}$), the sample median (med), the sample variance ($s^2$), the sample standard deviation (s), or the sample proportion ($\hat{p}$), we must realize that these quantities will __________________________________________ and hence are themselves ___________________.

Any random variable in statistics has a probability distribution. We have been talking about two common probability distributions in statistics. When X = # of “successes” in n independent trials we used the binomial distribution to talk about X probabilistically, and when X was continuous and had an approximately bell-shaped distribution we used the normal distribution to calculate probabilities and quantiles associated with X.

Because the summary statistics discussed above are random variables, they also have probability distributions that determine the likelihood of certain values of these statistics being obtained. The distribution of a summary statistic, e.g. the sample mean ($\bar{X}$), is called the ______________________________________. In this handout we explore the sampling distributions of the sample mean ($\bar{X}$) and the sample proportion ($\hat{p}$).

Sampling Distribution of $\bar{X}$
The sample mean ($\bar{X}$) is a random quantity that varies from sample to sample. The probability distribution the sample mean follows is called the sampling distribution of $\bar{X}$. The sampling distribution demo I showed in class is found at the following web address:
http://www.ruf.rice.edu/~lane/stat_sim/sampling_dist/

The Central Limit Theorem for the Sample Mean (CLT) tells us about the sampling distribution of the sample mean ($\bar{X}$). There is also a version (which we will see later) that tells us about the sampling distribution of the sample proportion ($\hat{p}$). The CLT for $\bar{X}$ says the following:
1.
2.
3.
The sampling distribution will be ___________ if either of the conditions below is met:
__________________ or if __________________

We now consider applications of the central limit theorem (CLT).

Applications to Decision Making

Example 1: Cholesterol levels of adult males (50-60 yrs. old)
The mean blood cholesterol level of adult males (50-60 yrs. old) is 200 mg/dl with a standard deviation of 20 mg/dl. Assume also that blood cholesterol levels are approximately normally distributed in this population.
a) What is the probability that, when taking a sample of size n = 25, you would obtain a sample mean greater than 225 mg/dl?
b) Give a range of values in which we would expect the sample mean to fall approximately 95% of the time.
c) Suppose we took a sample of adult males between the ages of 50 and 60 who are also strict vegetarians and obtained a sample mean of $\bar{X} = 188$ mg/dl. Does this provide evidence that the subpopulation of vegetarians has a lower mean cholesterol level than the greater population of men in this age group? Explain.

Example 2: Mercury Levels Found in Boulder Reservoir Walleyes
Fish consumption guidelines suggest you should limit the number of fish you eat with Hg levels above .25 ppm. Is there evidence to suggest that walleyes from Boulder Reservoir have a mean Hg content exceeding .25 ppm?

Confidence Intervals for the Population Mean

Motivating Example:
Suppose we are trying to estimate the mean cholesterol level of adults in the U.S. 20 years or older. A sample of n = 25 adults in this age group was taken, their serum cholesterol levels were determined, and a sample mean of $\bar{X} = 206$ mg/dl was obtained. This is called a _____________________ for the population mean ($\mu$) because it yields a single value for this unknown quantity. A better estimate might be 206 give or take _____ units, i.e. ______ up to _______. This is called an __________________________ as it gives a range or interval of plausible values for the population mean.

How do we know if this is a good interval estimate?
__________________

What properties should a good interval estimate have?


The central limit theorem states that if our sample size (n) is sufficiently large, then

$$\bar{X} \sim N\left(\mu, \frac{\sigma}{\sqrt{n}}\right)$$

which also implies that after standardizing

$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0,1)$$

This means that when we collect our data, the probability that our observed sample mean will fall within two standard errors of the population mean is approximately .95 (a 95% chance). To be more precise, we can use 1.96 standard errors because

$$P(-1.96 \le Z \le 1.96) = .9500$$

which gives us the following:

$$P\left(\mu - 1.96\frac{\sigma}{\sqrt{n}} \le \bar{X} \le \mu + 1.96\frac{\sigma}{\sqrt{n}}\right) = .9500$$

For a 99% chance we use _______ and for 90% we use ________ in place of 1.96.

Starting with the statement

$$P\left(\mu - 1.96\frac{\sigma}{\sqrt{n}} \le \bar{X} \le \mu + 1.96\frac{\sigma}{\sqrt{n}}\right) = .9500$$

we will perform algebraic manipulations to isolate the population mean $\mu$ in the middle of this inequality instead. By doing this we will obtain an interval that has a 95% chance of covering the true population mean.

Algebraic Manipulations of the Inequality Above:


This says that the interval from $\bar{X} - 1.96\frac{\sigma}{\sqrt{n}}$ up to $\bar{X} + 1.96\frac{\sigma}{\sqrt{n}}$ has a 95% chance of covering the true population mean $\mu$. This interval is simply the sample mean plus or minus roughly two standard errors. However, this interval cannot be calculated in practice! WHY?


A “simple fix” would be to replace ____ by the estimated standard deviation from our data _____. The problem with our “simple fix” is that the distribution of $\frac{\bar{X} - \mu}{s/\sqrt{n}}$ is not standard normal, i.e. N(0,1), so the 1.96 value will not necessarily produce the desired level of confidence.

FACT: If the population we are sampling from is approximately normal, then $\frac{\bar{X} - \mu}{s/\sqrt{n}}$ has a t-distribution with degrees of freedom df = n – 1.

What does a t-distribution look like?
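The problem with the “simple fix” described above can be seen in a small simulation. The sketch below is only an illustration under stated assumptions: the population is taken to be normal with the mean (200) and standard deviation (20) from Example 1, while the sample size n = 5, the number of repetitions, the seed, and the function name `coverage` are choices made here and are not part of the handout.

```python
import random
import statistics

def coverage(n, reps, use_sample_sd, mu=200.0, sigma=20.0, z=1.96, seed=1):
    """Fraction of simulated intervals xbar +/- z*SD/sqrt(n) that contain mu."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        xbar = statistics.fmean(sample)
        # Either use the true sigma or the "simple fix" of plugging in s.
        sd = statistics.stdev(sample) if use_sample_sd else sigma
        half_width = z * sd / n ** 0.5
        if xbar - half_width <= mu <= xbar + half_width:
            hits += 1
    return hits / reps

# Small n (an assumption chosen to make the effect visible).
known = coverage(n=5, reps=20000, use_sample_sd=False)
estimated = coverage(n=5, reps=20000, use_sample_sd=True)
print(f"sigma known, z = 1.96:         {known:.3f}")
print(f"sigma replaced by s, z = 1.96: {estimated:.3f}")
```

With sigma known the coverage should land close to .95, while with s in place of sigma and the same 1.96 multiplier it should fall below .90 for a sample this small. That shortfall is what the larger t-table values correct for.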
Facts about the t-distribution:


Examples: Using the t-table to find confidence intervals
a) n = 20 and 95% confidence    t = ______
b) n = 20 and 99% confidence    t = ______
c) n = 50 and 90% confidence    t = ______
d) n = 10 and 95% confidence    t = ______

The basic form of most confidence intervals is:

(estimate) ± (table value) × (SE of estimate)

where the quantity (table value) × (SE of estimate) is the MARGIN OF ERROR.

General Form for a Confidence Interval for the Mean
For the population mean $\mu$ we have

$$\bar{X} \pm (t\text{-table value}) \cdot SE(\bar{X}) \quad \text{or} \quad \bar{X} \pm t\,\frac{s}{\sqrt{n}}$$

The appropriate columns in the t-distribution table for the different confidence levels are as follows:
90% Confidence: look in the .05 column (if n is “large” we can use 1.645)
95% Confidence: look in the .025 column (if n is “large” we can use 1.960)
99% Confidence: look in the .005 column (if n is “large” we can use 2.576)

Example:
Suppose we are trying to estimate the mean cholesterol level of adults over 20 years of age in the United States. A sample of n = 25 individuals is analyzed for serum cholesterol level, and we find a sample mean of $\bar{X} = 206$ mg/dl with a sample standard deviation of s = 21 mg/dl.
a) Use this information to find a 95% CI for the mean cholesterol level of U.S. adults in this age group, assuming that cholesterol levels are approximately normally distributed.

Suppose a sample of n = 25 adults in the same age group from France was taken and yielded a sample mean of $\bar{X} = 235$ mg/dl with a standard deviation of s = 24 mg/dl.
b) Find a 95% confidence interval for the mean cholesterol level of adults over 20 years of age in France.
c) Does this interval, in conjunction with the interval obtained for U.S. adults, provide evidence that the mean cholesterol level for this age group is higher in France?

Example 2 – Time Spent Studying and Gender for WSU Students
Construct 95% confidence intervals for the mean studying times per day of the populations of female and male WSU students. Do these intervals suggest one gender studies more than the other on average?
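As a check on the arithmetic in the cholesterol example above (U.S. and France samples), the general form $\bar{X} \pm t\,s/\sqrt{n}$ can be sketched in a few lines. The function name `t_interval` is hypothetical, and the multiplier 2.064 is the standard t-table entry for df = 24 in the .025 column, typed in by hand rather than computed.

```python
from math import sqrt

def t_interval(xbar, s, n, t_value):
    """Return (lower, upper) for xbar +/- t_value * s / sqrt(n)."""
    margin = t_value * s / sqrt(n)
    return xbar - margin, xbar + margin

# t-table value for df = n - 1 = 24, .025 column (95% confidence).
T_24 = 2.064

us = t_interval(xbar=206, s=21, n=25, t_value=T_24)
france = t_interval(xbar=235, s=24, n=25, t_value=T_24)
print(f"U.S.:   ({us[0]:.1f}, {us[1]:.1f}) mg/dl")
print(f"France: ({france[0]:.1f}, {france[1]:.1f}) mg/dl")
```

With these inputs the intervals come out to roughly (197.3, 214.7) mg/dl for the U.S. sample and (225.1, 244.9) mg/dl for the French sample, which is the pair of intervals part (c) asks you to compare.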
Sampling Distribution of the Sample Proportion ($\hat{p}$)
Just like the sample mean ($\bar{X}$), the sample proportion ($\hat{p}$) is random, as it too varies from sample to sample. The sampling distribution of $\hat{p}$ has the following properties:
1. The mean of the sampling distribution is the population proportion (p).
2. The standard deviation of the sampling distribution, or the standard error of $\hat{p}$, is given by

$$\sigma_{\hat{p}} = SE(\hat{p}) = \sqrt{\frac{p(1-p)}{n}}$$

where p = population proportion (unknown) and n = sample size.
3. The sampling distribution is approximately normal provided n is “sufficiently large”:

$$np \ge 5 \quad \text{and} \quad n(1-p) = nq \ge 5$$

Note: When estimating proportions, large sample sizes are generally used (e.g. n > 100).

APPLICATIONS TO DECISION MAKING

Example: New Method for Treating a Certain Illness/Disease
Suppose the current treatment method for a certain disease has a 70% success rate. A new method has been proposed that will hopefully have a higher success rate. The new method is administered to a sample of n = 50 patients, and 40 have a successful treatment. Can we conclude on the basis of this result that the new method has a higher success rate?

Using the Binomial Table (this is called a Binomial Exact Test)


CONFIDENCE INTERVALS FOR THE POPULATION PROPORTION

Motivating Example – Treating Carpal Tunnel Syndrome
A recent clinical trial examined the effectiveness of different methods for treating carpal tunnel syndrome (a painful wrist condition). One of the methods involved surgery. Of the 88 patients in the study who had surgery, 71 showed improvement. An estimate of the proportion of patients who will experience improvement after surgery is therefore $\hat{p} = 71/88 = .807$ or 80.7%.

A better estimate might be 80.7% give or take 4%, i.e. estimating that the actual percentage of patients having surgery for carpal tunnel syndrome who will experience improvement is between 76.7% and 84.7%.
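Returning to the new-treatment example above (40 successes in n = 50 patients against a current success rate of 70%), the binomial table lookup can be reproduced by summing binomial probabilities directly. This is a minimal sketch: the function name `binomial_upper_tail` is hypothetical, and it assumes the question of interest is the upper-tail probability P(X ≥ 40) when the true success rate is still .70.

```python
from math import comb

def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed term by term."""
    return sum(comb(n, x) * p**x * (1 - p)**(n - x) for x in range(k, n + 1))

n, p0, successes = 50, 0.70, 40

# Chance of seeing 40 or more successes if the success rate were still 70%.
p_value = binomial_upper_tail(successes, n, p0)
print(f"P(X >= {successes} | n = {n}, p = {p0}) = {p_value:.4f}")

# Large-sample conditions from property 3: np >= 5 and n(1 - p) >= 5.
print("np =", n * p0, " n(1-p) =", n * (1 - p0))
```

The probability comes out near .08. How small this probability needs to be before the result counts as evidence for a higher success rate is the judgment the example asks you to make.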
A range such as this is called an “interval estimate”, as it gives a range or interval of plausible values for the population proportion/percentage. As with the population mean discussed earlier, we wish this interval to be narrow enough to provide useful information about this unknown percentage, yet have a high probability or chance of covering the actual population percentage.

The central limit theorem for proportions states that if our sample size (n) is sufficiently large, then

$$\hat{p} \sim N\left(p, \sqrt{\frac{p(1-p)}{n}}\right)$$

This means that when we take our sample and find our sample proportion, $\hat{p}$, the probability that our observed sample proportion will fall within approximately two standard errors of the population proportion is roughly 95%, or more precisely

$$P\left(p - 1.96\sqrt{\frac{p(1-p)}{n}} \le \hat{p} \le p + 1.96\sqrt{\frac{p(1-p)}{n}}\right) = .9500$$

Recall: $P(-1.96 \le Z \le 1.96) = .9500$

Starting with this statement we can perform some algebraic manipulations to isolate the population proportion, p, in the middle of the inequality above. By doing this we will see that the resulting interval will have a 95% chance of covering the true population proportion (p).

After a wonderful algebraic manipulation of the inequality above:


This says that the interval from $\hat{p} - 1.96\sqrt{\frac{p(1-p)}{n}}$ up to $\hat{p} + 1.96\sqrt{\frac{p(1-p)}{n}}$ has a 95% chance of covering the true population proportion p. This interval is simply the sample proportion plus or minus roughly two standard errors, i.e. $\hat{p} \pm 1.96\,SE(\hat{p})$. However, this interval cannot be calculated in practice! WHY?


A simple fix is to replace ______ by our sample-based estimate ________. Provided the sample size is sufficiently large, the resulting interval will still have an approximate 95% chance of covering the true population proportion.
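For reference, the algebraic manipulation mentioned above can be sketched as follows. This is one possible route, written with the shorthand $SE = \sqrt{p(1-p)/n}$, which is introduced here for brevity; each step only adds or subtracts the same quantity on both sides of an inequality.

$$
\begin{aligned}
& P\left(p - 1.96\,SE \le \hat{p} \le p + 1.96\,SE\right) = .9500 \\
& p - 1.96\,SE \le \hat{p} \iff p \le \hat{p} + 1.96\,SE \\
& \hat{p} \le p + 1.96\,SE \iff \hat{p} - 1.96\,SE \le p \\
& \Rightarrow\; P\left(\hat{p} - 1.96\,SE \le p \le \hat{p} + 1.96\,SE\right) = .9500
\end{aligned}
$$

Note that $SE$ in the last line still involves the population proportion p.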
Making this replacement gives what we should technically call the estimated standard error of the proportion, but when we say “standard error of the proportion” it is assumed that this estimated version is the one we are talking about, because in reality the population proportion p is NOT known. If p were known we would not be conducting a study in the first place!

General Form for a Confidence Interval for the Population Proportion (p)

estimate ± (table value) × (estimated standard error of estimate)

$$\hat{p} \pm (\text{normal table value})\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \quad \text{or} \quad \hat{p} \pm z\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$

$$\text{Margin of Error} = z\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$

Normal Table Values:
95% Confidence: we use z = 1.96
90% Confidence: we use z = 1.645
99% Confidence: we use z = 2.576

Example: Treating Carpal Tunnel Syndrome (cont’d)
In the carpal tunnel study, 71 out of 88 patients who had surgery showed improvement. Use this information to construct a 95% confidence interval for the true percentage of carpal tunnel syndrome patients who will show improvement following surgery.

In the same study, 88 patients were treated less invasively using wrist splints. Amongst these patients, 47 showed improvement. Use this information to construct a 95% confidence interval for the percentage of carpal tunnel patients who will show improvement using wrist splints to treat their condition.

Comparing the Improvement Rates
Do the two confidence intervals suggest that the percentage of patients who show improvement following surgery is higher than the percentage of patients who show improvement using wrist splints? Explain.

Example 2 – Arthritis Rates Amongst Men and Women 65 and Older
The Centers for Disease Control (CDC) reported a survey of randomly selected Americans age 65 and older, which found that 411 of 1012 men and 535 of 1062 women suffered from some form of arthritis. Construct 95% confidence intervals for the true percentages of men and women age 65 and older who suffer from some form of arthritis. Does it appear that the arthritis rates are different for men and women in this age group?
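The general form $\hat{p} \pm z\sqrt{\hat{p}(1-\hat{p})/n}$ can be applied to the counts given above with a short sketch like the one below. The function name `proportion_interval` is hypothetical, and the code assumes the large-sample normal interval from this handout with z = 1.96. It uses the surgery counts (71 of 88) and the arthritis counts (411 of 1012 men, 535 of 1062 women).

```python
from math import sqrt

def proportion_interval(successes, n, z=1.96):
    """Return (lower, upper) for p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    p_hat = successes / n
    margin = z * sqrt(p_hat * (1 - p_hat) / n)  # estimated SE times table value
    return p_hat - margin, p_hat + margin

surgery = proportion_interval(71, 88)
men = proportion_interval(411, 1012)
women = proportion_interval(535, 1062)

for label, (low, high) in [("surgery", surgery), ("men", men), ("women", women)]:
    print(f"{label:8s} 95% CI: ({100 * low:.1f}%, {100 * high:.1f}%)")
```

With these counts the intervals are roughly (72.4%, 88.9%) for surgery, (37.6%, 43.6%) for men, and (47.4%, 53.4%) for women. The much larger arthritis samples produce noticeably narrower intervals than the 88-patient surgery group.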