Estimating with Confidence,
Part II
Review
• We use y-bar to estimate a population mean, µ.
• When sampling from a population with true
mean µ, the true mean of the distribution of
y-bar is µ.
• On average, the mean of y-bars from larger
samples should be closer to the true mean than
the mean of y-bars from smaller samples.
• When sampling from a population with true
standard deviation σ, the standard deviation
of the distribution of y-bar is σ/√n.
• Lastly, one nearly magical property of the
sample mean, y-bar, is that it is approximately
normally distributed regardless of the original
distribution of y.
• That is, this holds as long as one of two
conditions is met:
– the original distribution is normal,
– or the sample size is large.
Sample Size
• The rule of thumb is that, in most practical
situations, n = 30 is satisfactory.
• As a practical matter, though, if the original
distribution is severely non-normal, then it
may take many more than 30 observations to
assure us that the sample mean will be
normally distributed.
Central Limit Theorem
• More formally, what we've been discussing
are the implications of the Central Limit
Theorem.
• The CLT is the only theorem we'll cover in
BST 621 (because it's that important).
CLT
• Draw a simple random sample of size
n from any population whatsoever with
mean µ and finite standard deviation
σ.
• When n is large, the sampling
distribution of the sample mean y-bar
is approximately normally distributed
with mean µ and standard deviation
σ/√n (Daniel, p.134)
• It's not surprising that when sampling from a
normal population the means will be normally
distributed.
• It's far more useful to know that no matter what
the underlying distribution is, your means will be
normally distributed, as long as you have
sufficient n.
• How large an n is required? It depends on the
underlying distribution, but the rule of thumb is
30 (the simulation sketch below illustrates this).
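• As an illustration (not part of the original slides), here is a minimal
Python simulation sketch of the CLT; the exponential population, the seed,
and the sample sizes are assumptions chosen only to stand in for a skewed
distribution.

```python
# A minimal simulation sketch (not from the lecture): draw repeated samples
# from a skewed (exponential) population and watch the distribution of the
# sample mean approach a normal with SD = sigma/sqrt(n) as n grows.
import numpy as np

rng = np.random.default_rng(0)
pop_mean = 200.0            # for an exponential, sigma also equals 200
n_experiments = 10_000

for n in (2, 5, 30):
    samples = rng.exponential(scale=pop_mean, size=(n_experiments, n))
    means = samples.mean(axis=1)          # one y-bar per simulated study
    print(f"n={n:2d}  mean of y-bars={means.mean():6.1f}  "
          f"SD of y-bars={means.std(ddof=1):5.1f}  "
          f"sigma/sqrt(n)={pop_mean / np.sqrt(n):5.1f}")
```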
• However, this theorem cannot save us from an
ill-conceived sampling methodology.
• That is, if we draw a simple random sample, then
we can trust that the CLT will hold.
• Say we didn't do a simple random sample; are
we in trouble? We're not in great danger if the
data can plausibly be thought of as observations
taken at random from a population.
• If the data are representative, we're probably
OK.
• However, there is no way to rescue a
study using data collected haphazardly.
• The data will have unknown biases and no
fancy formula can rescue badly produced
data.
• Garbage in, garbage out
• Let’s assume the data are representative.
• So far, our estimation methods have
resulted in point estimates.
• Confidence intervals are even more
useful.
Confidence Intervals
• Confidence intervals use point estimates
and an estimate of dispersion to form
interval estimates.
• Recall that estimating a parameter with an
interval involves three components:
CIs
1. The point estimate of µ.
This is the sample mean y-bar.
2. When the population standard deviation
is known to be σ,
the standard error of y-bar is σ/√n.
3. The reliability coefficient: for 100(1 - α)%
confidence, we use the z value at the
(1 - α/2) percentile.
Reliability Coefficients
• For 90% confidence, use z = 1.645.
• For 95% confidence, use z = 1.96.
• For 99% confidence, use z = 2.575.
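• As a quick check (not part of the original slides), here is a minimal
Python sketch for computing these reliability coefficients; it assumes
SciPy is available.

```python
# A quick sketch (assumes SciPy is installed) of the reliability
# coefficients: for 100(1 - alpha)% confidence, use the (1 - alpha/2)
# quantile of the standard normal distribution.
from scipy.stats import norm

for confidence in (0.90, 0.95, 0.99):
    alpha = 1 - confidence
    print(f"{confidence:.0%} confidence -> z = {norm.ppf(1 - alpha / 2):.3f}")
# Prints roughly 1.645, 1.960, and 2.576.
```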
General Form
Estimate ± (reliability coefficient) × (standard error)
This will yield two values, a lower limit and an
upper limit, around the point estimate.
The confidence interval will, with specified
reliability, contain the true (unknown) population
mean.
Known Variance
• So, if we know the population standard
deviation, σ, then a 95% confidence
interval for the population mean is:
y-bar ± 1.96 × σ/√n
Examples
• In our example population the known σ is
45.9194.
• Using a sample of size n = 9, the first
simulated experiment yielded a y-bar of
217.6:
• [187.6, 247.6]
• Notice that this interval covers the true
mean of 205.7.
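• As an illustration (not part of the original slides), here is a minimal
Python sketch that reproduces this known-σ interval.

```python
# A minimal sketch reproducing the known-sigma interval from the example
# (y-bar = 217.6, sigma = 45.9194, n = 9); 1.96 is the 95% reliability
# coefficient.
import math

y_bar, sigma, n = 217.6, 45.9194, 9
se = sigma / math.sqrt(n)          # standard error of the mean
margin = 1.96 * se                 # reliability coefficient x standard error
lower, upper = y_bar - margin, y_bar + margin
print(f"95% CI: [{lower:.1f}, {upper:.1f}]")   # [187.6, 247.6]
```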
Problem
• There is a major problem with this method
for calculating confidence intervals.
• It requires knowledge of the population
standard deviation, σ .
Unknown Variance
• In practice, we never know σ .
• The obvious solution is to use the estimated
standard deviation, s, we determined from
our sample.
• But this does not work. The problem is that
the reliability coefficient (1.96) is wrong.
• It's wrong because now there are two
random terms entering into the confidence
interval, y-bar and s.
• Both of these are subject to random
fluctuation.
Solution
• Gosset, a statistician who worked at the
Guinness brewery, figured out the solution to
this problem: the t-distribution.
• But to keep from getting fired, he had to publish
the work under a pseudonym "Student."
• Thus, you may have seen a reference to
"Student's t."
• The t-distribution is very close to the z, but the
t distribution has heavier tails, reflecting the extra
variability ignored by z.
• The degrees of freedom for the t-distribution
when estimating a single mean are df = n - 1.
• It's no accident that this is the denominator
used to calculate s, the estimated
standard deviation.
New CI
• So, the correct formula for the 100(1 - α)%
confidence interval on a population mean
when estimating both the mean and
standard deviation is:
y-bar ± t(1 - α/2, df = n - 1) × s/√n
• In Appendix Table E, Daniel gives the
appropriate t-values for various df.
• If you use this table, you want to use the
value labeled t.975 for a 95% CI.
• That is, for a 95% CI, α = 0.05; so,
(1 - α/2) = 0.975.
• Notice as n gets
larger the t value gets
closer to the z value.
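• As an illustration (not part of the original slides), here is a minimal
Python sketch of the t-based interval; it assumes SciPy is available, and
the sample values are hypothetical.

```python
# A minimal sketch of the t-based confidence interval (assumes SciPy; the
# data below are hypothetical, shown only to illustrate the calculation).
import math
from scipy.stats import t

data = [212, 187, 240, 195, 260, 178, 225, 204, 231]   # hypothetical sample
n = len(data)
y_bar = sum(data) / n
s = math.sqrt(sum((y - y_bar) ** 2 for y in data) / (n - 1))  # df = n - 1
se = s / math.sqrt(n)                  # standard error of the mean
t_coef = t.ppf(0.975, df=n - 1)        # reliability coefficient for 95%
lower, upper = y_bar - t_coef * se, y_bar + t_coef * se
print(f"95% CI: [{lower:.1f}, {upper:.1f}]")
```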
Using JMP
• JMP automatically calculates the 95%
confidence interval on the mean and
shows it in the Distribution of Y report
window.
• For instance, consider the Moments report from
the first n = 9 cholesterol sample.
• The 95% confidence interval from this
sample is [173.4, 261.7].
Sample Size and Confidence
• A 95% confidence interval implies that
we're 95% sure that the interval covers the
true (but unknown) mean.
• On the other hand, it also means that 5%
of the intervals we calculate will not cover
the true mean.
• This is true whether we use n = 2 or n =
2,000,000.
• What changes with sample size is the
width of the interval.
• With larger sample sizes the interval is
narrower; we're still going to be wrong 5% of
the time, but by narrower amounts (the sketch
below illustrates this).
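• As a rough illustration (not part of the original slides), here is a
Python sketch of how the width of a 95% interval shrinks with n; it assumes
SciPy, and the fixed sample SD of 50 is hypothetical.

```python
# A rough sketch (assumes SciPy) of how the 95% interval's width shrinks
# with sample size, holding a hypothetical sample SD fixed at s = 50.
import math
from scipy.stats import t

s = 50.0                                  # hypothetical sample SD
for n in (2, 10, 30, 100, 2_000_000):
    half_width = t.ppf(0.975, df=n - 1) * s / math.sqrt(n)
    print(f"n={n:>9,d}  interval width = {2 * half_width:10.3f}")
```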
• Let's look at confidence intervals when the
population is not normal
CI's for Triglyceride
• Now let's look at 95% confidence intervals
using the triglyceride population – the
non-normal population.
• Just as before, we simulate 100 studies,
each with a different sample size.
• Notice how much more variable the widths are.
• The first sample's y-bar estimate was 164.6 and
estimated standard deviation s = 101.2.
• The second sample's y-bar estimate was 323.2
and estimated standard deviation s = 383.3.
• These larger estimates show the effect
that an outlier can have.
• With larger n, the intervals are narrower.
• The effect of outliers is diminished.
• Here, we have sufficient sample size to
trust the Central Limit Theorem.
Summary
• Sample estimates have distributions that are
affected by the underlying distribution and
sample size.
• Estimates may be totally worthless if obtained
from a haphazard "sample" with unknowable
bias.
• But, if the data are representative of the
population then we can rely on the sample mean
to estimate the center of the distribution.
• The sample mean is unbiased.
Summary (cont)
• Further, if the population is known to be
normal, then a sample mean will also be
normal.
• If the population distribution has an
unknown shape then, with a sufficient
sample size, we can rely on the CLT and
trust that a sample mean will also be
normally distributed.
Assessing Normality
• Use the Normal Quantile Plot in JMP to
assess whether a distribution appears
normal.
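• The course uses JMP for this plot; as an aside (not part of the original
slides), a rough Python analogue using SciPy's probplot is sketched below,
with hypothetical data.

```python
# JMP's Normal Quantile Plot has a Python analogue (sketch; assumes SciPy
# and matplotlib are installed): if the points fall near the reference
# line, the data are consistent with a normal distribution.
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
y = rng.normal(loc=200, scale=45, size=50)   # hypothetical sample

stats.probplot(y, dist="norm", plot=plt)     # quantile-quantile plot
plt.title("Normal quantile plot")
plt.show()
```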
SD versus SE
• The standard error of the sample mean
will be smaller with larger samples.
• The standard error describes the variability
of the sample mean, not the variability of
the sample data.
• If the variability of the data is σ, then the
standard error of the mean is σ/√n.
CIs
• The confidence interval on the population
mean, obtained from a sample of n
observations, is:
y-bar ± t(1 - α/2, df = n - 1) × s/√n
• Here, y-bar is the sample mean, s is the sample
standard deviation, and the t reliability
coefficient is the (1 - α/2) percentile of the
t-distribution with df = n - 1.
• When describing a confidence interval in a
sentence or table, be sure to indicate the level of
confidence and the sample size.
• Always be aware that the shape of the
underlying distribution and the size of your
sample will directly affect the believability
of your point and interval estimates.
Example write ups
• In the case where you judge that the
distribution is markedly non-normal
(skewed), say we begin with the following
raw data:
• Since the sample was small and the distribution
was skewed, the sample is described by the
median and range:
• A random sample of n = 20 subjects was
assessed for serum triglycerides. The median
triglyceride was 115 and the values ranged
between 31 and 755. Half of the values were
between 91.25 and 195.0.
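• As an illustration (not part of the original slides), here is a minimal
Python sketch of computing such a summary; the triglyceride values below
are hypothetical stand-ins, not the course data.

```python
# A sketch of the summary used in the write-up above (median, range, and
# quartiles); the values are hypothetical, so the results will differ from
# those reported in the example.
import numpy as np

trig = np.array([31, 48, 60, 75, 88, 95, 102, 115,
                 128, 150, 170, 210, 260, 340, 520, 755])

median = np.median(trig)
q1, q3 = np.percentile(trig, [25, 75])     # middle half of the values
print(f"median = {median}, range = {trig.min()}-{trig.max()}, "
      f"middle half between {q1} and {q3}")
```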
Another example
• One example write-up: A random sample
of n = 20 subjects was assessed for serum
cholesterol. The average cholesterol was
201.8, SD = 53.25. We are 95% confident
that the range 176.8-226.7 includes the
population mean.