7 - Sampling Distributions and Confidence Intervals for μ & p
Introduction:
When we take a sample of size n from a population and calculate summary statistics like the
sample mean (X̄), the sample median (med), the sample variance (s²), the sample
standard deviation (s), or the sample proportion (p̂), we must realize that these quantities
will __________________________________________ and hence are themselves
___________________.
Any random variable in statistics has a probability distribution. We have been talking
about two common probability distributions in statistics. When X = # of “successes” in
n independent trials we used the binomial distribution to talk about X probabilistically,
and when X was continuous and had an approximate bell-shaped distribution we used the
normal distribution to calculate probabilities and quantiles associated with X.
Because the summary statistics discussed above are random variables they also have a
probability distribution that determines the likelihood of certain values of these statistics
being obtained. The distribution of a summary statistic, e.g. the sample mean (X̄), is
called the ______________________________________.
In this handout we explore the sampling distributions of the sample mean (X̄) and the
sample proportion (p̂).
Sampling Distribution of X̄
The sample mean (X̄) is a random quantity that varies from sample to sample. The
probability distribution the sample mean follows is called the sampling distribution of X̄.
The sampling distribution demo I showed in class is found at the following web address:
http://www.ruf.rice.edu/~lane/stat_sim/sampling_dist/
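The same idea can be tried offline with a small simulation. The sketch below is only an illustration: the exponential population, the sample size n = 30, the number of repetitions, and the seed are all assumptions chosen for the demo, not values from the handout. It draws many samples, records each sample mean, and compares the center and spread of those means with the population mean and with the population standard deviation divided by √n.

```python
import random
from math import sqrt
from statistics import mean, stdev

random.seed(1)           # fixed seed so the run is repeatable (arbitrary choice)

n = 30                   # size of each sample (illustrative assumption)
reps = 5000              # number of repeated samples (illustrative assumption)

# Hypothetical skewed population: exponential with mean 1 and standard deviation 1.
pop_mean, pop_sd = 1.0, 1.0

# Draw many samples and keep the sample mean of each one.
sample_means = [
    sum(random.expovariate(1.0) for _ in range(n)) / n
    for _ in range(reps)
]

mean_of_means = mean(sample_means)   # center of the simulated sampling distribution
sd_of_means = stdev(sample_means)    # spread of the simulated sampling distribution

print(f"mean of sample means: {mean_of_means:.3f}  (population mean {pop_mean})")
print(f"SD of sample means:   {sd_of_means:.3f}  (pop. SD / sqrt(n) = {pop_sd / sqrt(n):.3f})")
```

Rerunning with a larger n should shrink the spread of the sample means, and a histogram of sample_means should look more bell-shaped than the population it came from.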
The Central Limit Theorem for the Sample Mean (CLT) tells us about
the sampling distribution of the sample mean (X̄). There is also a version (which we
will see later) that tells us about the sampling distribution of the sample proportion (p̂).
The CLT for X̄ says the following:
1.
2.
3. The sampling distribution will be ___________ if either of the conditions
below are met:

or if

We now consider applications of the central limit theorem (CLT).
Applications to Decision Making
Example 1: Cholesterol levels of adult males (50-60 yrs. old)
The mean blood cholesterol level of adult males (50-60 yrs. old) is 200 mg/dl with a
standard deviation of 20 mg/dl. Assume also that blood cholesterol levels are
approximately normally distributed in this population.
a) What is the probability that a sample of size n = 25 would yield a
sample mean greater than 225 mg/dl?
b) Give a range of values in which we would expect the sample mean to fall approximately
95% of the time.
c) Suppose we took a sample of adult males between the ages of 50 and 60 who are also
strict vegetarians and obtained a sample mean of X̄ = 188 mg/dl. Does this provide
evidence that the subpopulation of vegetarians has a lower mean cholesterol level than
the greater population of men in this age group? Explain.
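One way to check hand calculations for parts (a) and (b) is sketched below with Python's standard library. It assumes, as the example states, that cholesterol levels are approximately normal with mean 200 and standard deviation 20, so the sample mean for n = 25 is treated as normal with standard error 20/√25 = 4. The variable names are illustrative only, and part (c) is left out because the example does not give the vegetarian sample size.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 200.0, 20.0, 25   # population mean, SD, and sample size from Example 1
se = sigma / sqrt(n)             # standard error of the sample mean: 20 / 5 = 4

# Sampling distribution of the sample mean under the stated normality assumption.
xbar_dist = NormalDist(mu, se)

# a) Probability that the sample mean exceeds 225 mg/dl.
p_above_225 = 1 - xbar_dist.cdf(225)

# b) Range that should contain the sample mean about 95% of the time.
z = NormalDist().inv_cdf(0.975)  # about 1.96
low, high = mu - z * se, mu + z * se

print(f"SE = {se:.1f} mg/dl")
print(f"P(sample mean > 225) = {p_above_225:.2e}")
print(f"95% range for the sample mean: {low:.2f} to {high:.2f} mg/dl")
```

A sample mean of 225 sits more than six standard errors above 200, which is why the probability in part (a) is essentially zero.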
Example 2: Mercury Levels Found in Boulder Reservoir Walleyes
Fish consumption guidelines suggest you should limit the number of fish you eat with Hg
levels above .25 ppm. Is there evidence to suggest that walleyes from Boulder Reservoir
have a mean Hg content exceeding .25 ppm?
Confidence Intervals for the Population Mean μ
Motivating Example: Suppose we are trying to estimate the mean cholesterol level of
adults in the U.S. 20 years or older. A sample of n = 25 adults in this age group was taken,
their serum cholesterol levels were determined, and a sample mean of X̄ = 206 mg/dl
was obtained.
This is called a _____________________ for the population mean (μ) because it yields a
single value for this unknown quantity.
A better estimate might be 206 give or take _____ units, i.e. ______ up to _______.
This is called an __________________________ as it gives a range or interval of
plausible values for the population mean.
How do we know if this is a good interval estimate? __________________
What properties should a good interval estimate have?
• It
The central limit theorem states that if our sample size (n) is sufficiently large, then
$$\bar{X} \sim N\!\left(\mu, \frac{\sigma}{\sqrt{n}}\right),$$
which also implies that after standardizing
$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1).$$
This means that when we collect our data the probability our observed sample mean will
fall within two standard errors of the mean is approximately .95 or a 95% chance, or
being more precise we could use ±1.96 standard errors because
$$P(-1.96 \le Z \le 1.96) = .9500$$
Which gives us the following…
$$P\!\left(\mu - 1.96\frac{\sigma}{\sqrt{n}} \le \bar{X} \le \mu + 1.96\frac{\sigma}{\sqrt{n}}\right) = .9500$$
For a 99% chance we use _______ and for 90% we use ________ in place of 1.96.
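Multipliers for other confidence levels come from the same standard normal calculation that produces 1.96. The sketch below uses `statistics.NormalDist` from the standard library; the helper name `z_critical` is made up for illustration and is not part of the handout.

```python
from statistics import NormalDist

def z_critical(confidence):
    """Return z such that P(-z <= Z <= z) = confidence for a standard normal Z."""
    upper_tail = (1 - confidence) / 2          # area left in each tail
    return NormalDist().inv_cdf(1 - upper_tail)

for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} chance: use z = {z_critical(level):.3f}")
```

The values agree with the large-sample multipliers listed later in the handout next to the t-table columns.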
Starting with the statement,
$$P\!\left(\mu - 1.96\frac{\sigma}{\sqrt{n}} \le \bar{X} \le \mu + 1.96\frac{\sigma}{\sqrt{n}}\right) = .9500$$
we will perform algebraic manipulations to isolate the population mean μ in the middle
of this inequality instead. By doing this we will obtain an interval that has a 95% chance
of covering the true population mean.
Algebraic Manipulations of the Inequality Above:
This says that the interval from $\bar{X} - 1.96\,\sigma/\sqrt{n}$ up to $\bar{X} + 1.96\,\sigma/\sqrt{n}$ has a 95% chance of
covering the true population mean μ. This interval is simply the sample mean plus or
minus roughly two standard errors. However, this interval cannot be calculated in
practice! WHY?
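The "95% chance of covering" claim can be checked by simulation. In the sketch below the population mean and standard deviation are hypothetical values that the program treats as known, which is exactly the situation we do not have with real data. The population shape, n, the number of repetitions, and the seed are likewise assumptions for the demo.

```python
import random
from math import sqrt

random.seed(2)                    # fixed seed so the run is repeatable (arbitrary choice)

mu, sigma, n = 200.0, 20.0, 25    # hypothetical normal population, treated as known here
reps = 4000                       # number of simulated samples (illustrative assumption)
half_width = 1.96 * sigma / sqrt(n)

covered = 0
for _ in range(reps):
    # Draw one sample of size n and compute its mean.
    xbar = sum(random.gauss(mu, sigma) for _ in range(n)) / n
    # Does the interval xbar +/- 1.96 * sigma / sqrt(n) contain the true mean?
    if xbar - half_width <= mu <= xbar + half_width:
        covered += 1

coverage = covered / reps
print(f"Fraction of intervals covering mu: {coverage:.3f}")
```

Each simulated sample gives a different interval, and the fraction of intervals that contain μ should land close to 0.95.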
A “simple fix” to this would be to replace ____ by the estimated standard deviation from
our data _____.
The problem with our “simple fix” is that the distribution of
$$\frac{\bar{X} - \mu}{s/\sqrt{n}}$$
is not standard normal, i.e. N(0,1); therefore the 1.96 value will not necessarily produce the desired level
of confidence.
FACT: If the population we are sampling from is approximately normal then
$$\frac{\bar{X} - \mu}{s/\sqrt{n}}$$
has a t-distribution with degrees of freedom df = n – 1.
What does a t-distribution look like?
Facts about the t-distribution:
•
•
•
Examples: Using the t-table to find confidence intervals
a) n = 20 and 95% confidence t =
b) n = 20 and 99% confidence t =
c) n = 50 and 90% confidence t =
d) n = 10 and 95% confidence t =
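Table lookups like these can be cross-checked numerically. Python's standard library has no t-distribution quantile function, so the sketch below builds one from the t density using Simpson's rule and bisection. This is an illustrative numerical approach, not the method used in class, and the function names are made up for the example. In practice a t-table or a statistics package would normally be used.

```python
from math import exp, lgamma, pi, sqrt

def t_pdf(t, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    const = exp(lgamma((df + 1) / 2) - lgamma(df / 2)) / sqrt(df * pi)
    return const * (1 + t * t / df) ** (-(df + 1) / 2)

def t_cdf(t, df, steps=1000):
    """P(T <= t) for t >= 0, using Simpson's rule on [0, t] (steps must be even)."""
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3       # 0.5 is the area to the left of zero

def t_critical(confidence, n):
    """Table value for a confidence interval based on a sample of size n."""
    df = n - 1
    target = 1 - (1 - confidence) / 2    # e.g. 0.975 for 95% confidence
    lo, hi = 0.0, 100.0
    for _ in range(60):                  # bisection on the CDF
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

for label, conf_level, size in (("a", 0.95, 20), ("b", 0.99, 20), ("c", 0.90, 50), ("d", 0.95, 10)):
    print(f"{label}) n = {size}, {conf_level:.0%} confidence: t = {t_critical(conf_level, size):.3f}")
```

Printed t-tables often skip some degrees of freedom (for example df = 49), so a table lookup may use the nearest listed row and differ slightly in the third decimal.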
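The basic form of most confidence intervals is:
(estimate) ± (table value) × (SE of estimate)
where (table value) × (SE of estimate) is the MARGIN OF ERROR.
General Form for a Confidence Interval for the Mean
For the population mean we have
$$\bar{X} \pm (t\text{-table value}) \cdot SE(\bar{X}) \quad\text{or}\quad \bar{X} \pm t\,\frac{s}{\sqrt{n}}$$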
The appropriate columns in the t-distribution table for the different confidence levels are
as follows:
90% Confidence look in the .05 column (if n is “large” we can use 1.645)
95% Confidence look in the .025 column (if n is “large” we can use 1.960)
99% Confidence look in the .005 column (if n is “large” we can use 2.576)
Example: Suppose we are trying to estimate the mean cholesterol level of adults over 20
years of age in the United States. A sample of n = 25 individuals is analyzed for
serum cholesterol level and we find a sample mean of X̄ = 206 mg/dl with a sample
standard deviation of s = 21 mg/dl.
a) Use this information to find a 95% CI for the mean cholesterol level of U.S. adults in
this age group assuming that cholesterol levels are approximately normally distributed.
Suppose a sample of n = 25 adults in the same age group from France was taken, giving a
sample mean of X̄ = 235 mg/dl with a standard deviation of s = 24 mg/dl.
b) Find a 95% confidence interval for the mean cholesterol level of adults over 20 years
of age in France.
c) Does this interval in conjunction with the interval obtained for U.S. adults provide
evidence that the mean cholesterol level for this age group is higher in France?
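A short sketch of the arithmetic for parts (a) and (b) follows. It applies the sample mean plus or minus t times s/√n with the standard t-table value 2.064 (the .025 column, df = 24). The function and constant names are illustrative only.

```python
from math import sqrt

T_TABLE_DF24_95 = 2.064   # t-table value for 95% confidence with df = 25 - 1 = 24

def mean_ci(xbar, s, n, t_value):
    """Return (lower, upper) for xbar +/- t * s / sqrt(n)."""
    margin = t_value * s / sqrt(n)    # margin of error
    return xbar - margin, xbar + margin

us_ci = mean_ci(206, 21, 25, T_TABLE_DF24_95)       # part (a), U.S. sample
france_ci = mean_ci(235, 24, 25, T_TABLE_DF24_95)   # part (b), France sample

print(f"U.S.:   {us_ci[0]:.2f} to {us_ci[1]:.2f} mg/dl")
print(f"France: {france_ci[0]:.2f} to {france_ci[1]:.2f} mg/dl")
print("Intervals overlap:", us_ci[1] >= france_ci[0])
```

Whether the two intervals overlap is the comparison part (c) asks about.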
Example 2 – Time Spent Studying and Gender for WSU Students
Construct 95% confidence intervals for the mean studying times per day of the
populations of female and male WSU students. Do these intervals suggest one gender
studies more than the other on average?
Sampling Distribution of the Sample Proportion ( p̂ )
Just like the sample mean (X̄), the sample proportion (p̂) is random, as it too varies
from sample to sample. The sampling distribution of p̂ has the following properties:
1. The mean of the sampling distribution is the population proportion (p).
2. The standard deviation of the sampling distribution, or the standard error of
p̂, is given by:
$$\sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}} = SE(\hat{p})$$
where
p = population proportion (unknown)
n = sample size
3. The sampling distribution is approx. normal provided n is “sufficiently large”:
np ≥ 5
n(1 − p) = nq ≥ 5
Note: When estimating proportions large sample sizes are generally used
(e.g. n > 100)
APPLICATIONS TO DECISION MAKING
Example: New Method for Treating a Certain Illness/Disease
Suppose the current treatment method for a certain disease has a 70% success rate. A new
method has been proposed that will hopefully have a higher success rate. The new
method is administered to a sample of n = 50 patients and 40 have successful treatment.
Can we conclude on the basis of this result that the new method has a higher success
rate?
Using the Binomial Table (this is called a Binomial Exact Test)
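The binomial table lookup can be reproduced directly. The sketch below computes the chance of seeing 40 or more successes out of 50 if the new method's success rate were still 70%, which is the tail probability a binomial exact test is based on. The function name is hypothetical.

```python
from math import comb

def binom_upper_tail(k, n, p):
    """P(X >= k) when X counts successes in n independent trials with success chance p."""
    return sum(comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(k, n + 1))

# Chance of 40 or more successes in 50 patients if the true success rate is still 70%.
p_value = binom_upper_tail(40, 50, 0.70)
print(f"P(X >= 40 | n = 50, p = 0.70) = {p_value:.4f}")
```

A small tail probability would mean the observed result is hard to explain if the success rate were unchanged; how small is small enough depends on the cutoff chosen for the decision.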
CONFIDENCE INTERVALS FOR THE POPULATION PROPORTION
Motivating Example – Treating Carpal Tunnel Syndrome
A recent clinical trial examined the effectiveness of different methods for treating
carpal tunnel syndrome (a painful wrist condition). One of the methods
involved surgery. Of the 88 patients in the study who had surgery, 71 showed
improvement. An estimate of the proportion of patients who will experience
improvement after surgery is therefore p̂ = 71/88 = .807 or 80.7%.
A better estimate might be 80.7% give or take 4%, i.e. estimating that the actual
percentage of patients that have surgery for carpal tunnel that will experience
improvement is between 76.7% and 84.7%. This is called an “interval estimate”, as it
gives a range or interval of plausible values for the population proportion/percentage. As
with the population mean discussed earlier, we wish this interval to be narrow enough to
provide useful information about this unknown percentage, yet have a high probability or
chance of covering the actual percentage of patients who will experience improvement
after surgery.
The central limit theorem for proportions states that if our sample size (n) is sufficiently
large, then
$$\hat{p} \sim N\!\left(p, \sqrt{\frac{p(1-p)}{n}}\right).$$
This means that when we take our sample and find our
sample proportion, p̂, the probability our observed sample proportion will fall within
approximately two standard errors of the population proportion is roughly 95%, or more
precisely
$$P\!\left(p - 1.96\sqrt{\frac{p(1-p)}{n}} \le \hat{p} \le p + 1.96\sqrt{\frac{p(1-p)}{n}}\right) = .9500$$
Recall: $P(-1.96 \le Z \le 1.96) = .9500$
Starting with this statement we can perform some algebraic manipulations to isolate the
population proportion, p, in the middle of the inequality above. By doing this we will see
that the resulting interval will have a 95% chance of covering the true population
proportion (p).
After a wonderful algebraic manipulation of the inequality above:
This says that the interval from
$$\hat{p} - 1.96\sqrt{\frac{p(1-p)}{n}} \quad\text{up to}\quad \hat{p} + 1.96\sqrt{\frac{p(1-p)}{n}}$$
has a 95% chance of covering the true population proportion p. This interval is simply the sample
proportion plus or minus roughly two standard errors, i.e. $\hat{p} \pm 1.96\,SE(\hat{p})$. However, this
interval cannot be calculated in practice! WHY?
A simple fix is to replace ______ by our sample based estimate ________. Provided the
sample size is sufficiently large the resulting interval will still have an approximate 95%
chance of covering the true population proportion. This gives what we should technically
call the estimated standard error of the proportion, but when we say “standard error of the
proportion” it is assumed this estimated version is the one we are talking about because in
reality the population proportion p is NOT known. If p were known we would not be
conducting a study in the first place!
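General Form for a CI for the Population Proportion (p)
estimate ± (table value) × (estimated standard error of estimate)
$$\hat{p} \pm (\text{normal table value}) \cdot \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \quad\text{or}\quad \hat{p} \pm z\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$
$$\text{Margin of Error} = z\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$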
Normal Table Values:
95% Confidence we use z = 1.96
90% Confidence we use z = 1.645
99% Confidence we use z = 2.576
Example: Treating Carpal Tunnel Syndrome (cont’d)
In the carpal tunnel study 71 out of 88 patients who had surgery showed improvement.
Use this information to construct a 95% confidence interval for the true percentage of carpal
tunnel syndrome patients who will show improvement following surgery.
In the same study, 88 patients were treated less invasively using wrist splints. Amongst
these patients 47 showed improvement. Use this information to construct a 95%
confidence interval for the percentage of carpal tunnel patients who will show
improvement using wrist splints to treat their condition.
Comparing the Improvement Rates
Do the two confidence intervals suggest that the percentage of patients who show
improvement following surgery is higher than the percentage of patients who show
improvement using wrist splints? Explain.
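The two intervals can be checked with a short sketch that applies the sample proportion plus or minus z times the estimated standard error, with z = 1.96, to the counts given above (71 of 88 for surgery, 47 of 88 for splints). The function name is illustrative only.

```python
from math import sqrt

def proportion_ci(successes, n, z=1.96):
    """Return (lower, upper) for p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    p_hat = successes / n
    se = sqrt(p_hat * (1 - p_hat) / n)   # estimated standard error of p_hat
    return p_hat - z * se, p_hat + z * se

surgery_ci = proportion_ci(71, 88)
splints_ci = proportion_ci(47, 88)

print(f"Surgery: {surgery_ci[0]:.1%} to {surgery_ci[1]:.1%}")
print(f"Splints: {splints_ci[0]:.1%} to {splints_ci[1]:.1%}")
print("Intervals overlap:", splints_ci[1] >= surgery_ci[0])
```

The same function applies to any count-out-of-n setting where n is large enough for the normal approximation, such as the arthritis survey in the next example.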
Example 2 – Arthritis Rates Amongst Men and Women 65 and Older
The Centers for Disease Control (CDC) reported a survey of randomly selected
Americans age 65 and older, which found that 411 of 1012 men and 535 of 1062 women
suffered from some form of arthritis. Construct 95% confidence intervals for the true
percentages of men and women age 65 and older who suffer from some form of arthritis.
Does it appear that the arthritis rates are different for men and women in this age group?