Let's further explore the principles of statistical inference: using samples to make estimates about a population.

• Remember that fundamental to statistical inference are the probability principles that allow us to answer the question: what would happen if we repeated this random sample many times?

• According to the laws of probability, each independent, random sample of size n from the same population yields the following:

  true value +/- random error

• The procedure, to repeat, must be a random sample or a randomized experiment in order for probability to operate.

• If not, the use of statistical inference is invalid.

• Remember also that sample means are unbiased estimates of the population mean, & that the standard deviation of sample means can be made smaller by increasing the size n of the random samples.

• Further: remember that means are less variable & more normally distributed than individual observations.

• If the underlying population distribution is normal, then the sampling distribution of the mean will also be normal.

• There's also the Law of Large Numbers.

• And last but perhaps most important, there's the Central Limit Theorem.
The Point of Departure for Inferential Statistics

Here, now, is the most basic problem in inferential statistics: you've drawn a random sample & estimated a sample mean.

• How reliable is this estimate? After all, repeated random samples of the same size n in the same population would be unlikely to give the same sample mean.

• How do you know, then, where the sample mean obtained would be located in the variable's sampling distribution, i.e. on the histogram displaying the sample means for all possible random samples of the same size n in the same population?
• Can't we simply rely on the fact that the sample mean is an unbiased estimator of the population mean?

• No, we can't: that only means that the sample mean of a random sample has no systematic tendency to undershoot or overshoot the population mean.

• We still don't know whether, e.g., the sample mean we obtained is at the very low end or the very high end of the histogram of the sampling distribution, or is located somewhere around the center.

• In other words, a sample estimate without an indication of variability is of little value.

• In fact, what's the worst thing about a sample of just one observation?

Answer

• A sample of one observation doesn't allow us to estimate the variability of the statistic over repeated random samples of the same size in the same population.

• To repeat, a sample estimate without an indication of variability is of little value.

• What must we do?
Introduction to Confidence Intervals
• The solution has to do with the sample mean's standard deviation, which is anchored in the square root of the sample size n.

• The first thing we do is compute the standard error of the mean: the standard deviation divided by the square root of the sample size n.

• What does the result allow us to do?

• The result allows us to situate the mean's variability within the sampling distribution of the sample mean: the distribution of means for all possible random samples of the same size in the same population.

• And to do so in terms of what rule?

• The 68–95–99.7 rule.
• The probability is 68% that x̄ lies within +/- one standard deviation of the population mean (i.e. the true value); 95% that x̄ lies within +/- two standard deviations of the population mean; & 99.7% that x̄ lies within +/- three standard deviations of the population mean. (The standard deviation here is that of the sampling distribution: the standard deviation divided by the square root of n.)
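In symbols, where μ is the population mean, σ the population standard deviation, & n the sample size, the rule applied to the sampling distribution reads:

$$P\left(\mu - k\,\frac{\sigma}{\sqrt{n}} \;\le\; \bar{x} \;\le\; \mu + k\,\frac{\sigma}{\sqrt{n}}\right) \approx 0.68,\ 0.95,\ 0.997 \quad \text{for } k = 1, 2, 3.$$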
• A common practice in statistics is to use the benchmark of +/- two standard deviations: i.e. a range likely to capture 95% of the sample means obtained by repeated random samples of the same size n in the same population.

• We can therefore conclude: we're 95% confident that this sample mean falls within +/- two standard deviations of the population mean, i.e. of the true population value.

• Unfortunately, that also means we still have room for worry: 5% of such samples will not obtain a sample mean within this range, i.e. will not capture the true population value.

• The interval either captures the parameter (i.e. the population mean) or it doesn't.

• What's worse: we never know when the confidence interval captures the parameter and when it doesn't.

• As Freedman et al. put it, a 95% confidence interval is "like buying a used car. About 5% turn out to be lemons."

• Recall that conclusions are always uncertain.
• In any event, we've used our understanding of how the laws of probability work in the long run, with repeated random samples of size n from the same population, to express a specified degree of confidence in the results of this one sample.

• In sum: the language of statistical inference uses facts about what would happen in the long run to express our confidence in the results of any one random sample of independent observations.

• If things are done right, this is how we interpret a 95% confidence interval: "We are 95% confident that the true population value lies between [low-end value] and [high-end value]."

• Or: "This interval was calculated by a method that captures the true population value in 95% of all possible samples."

• To repeat: the confidence interval either captures the parameter (i.e. the true population value) or it doesn't; there's no in between.
More on Confidence Intervals

• Confidence intervals take the following form:

  estimate +/- margin of error

• Margin of error: how accurate we believe our estimate is, based on the variability of the sample mean in repeated random sampling of the same size n in the same population.

• The confidence interval is based on the sampling distribution of sample means: N(μ, σ/√n).

• It is also based on the Central Limit Theorem, which says that the sampling distribution of sample means is approximately normal for large random samples, whatever the underlying population distribution may be.
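Putting these pieces together, the level-C confidence interval for the population mean takes the standard form (with z* chosen so that area C lies between -z* & +z*):

$$\bar{x} \;\pm\; z^{*}\,\frac{\sigma}{\sqrt{n}}$$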
Besides the sampling distribution of sample means & the Central Limit Theorem, the computation of the confidence interval involves two other components:

• C-level: i.e. the confidence level, which we've already considered. It defines the probability that the confidence interval captures the parameter.

• z-score: i.e. the standard score z* defined in terms of the C-level. It is the value on the standard normal curve with area C between -z* & +z*.

Here's how the z-scores & C-levels are related to each other:

  C-level    z-score
  90%        1.645
  95%        1.960
  99%        2.576
• Any normal curve has probability C between the point z* standard deviations below the mean & the point z* standard deviations above the mean.

• Choose the z-score that corresponds to the desired level of confidence (1.645 for 90%; 1.960 for 95%; 2.576 for 99%).

• Multiply the z-score by the standard deviation divided by the square root of the sample size n.

• This anchors the range of estimated values of the confidence interval within the specified probability interval of the sampling distribution of sample means.
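For instance, here is a quick sketch of this arithmetic, using the summary numbers from the cii illustration below (mean 63.1, standard deviation 7.8, n = 200):

. display 1.96*7.8/sqrt(200)
1.0810248

. display 63.1 - 1.0810248
62.018975

. display 63.1 + 1.0810248
64.181025

Note that Stata's ci & cii commands use the t distribution rather than z = 1.960, which is why the interval they report below is slightly wider.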
• [The accompanying slide shows a normal curve with darkened areas at the extremes of the tails of the horizontal axis.]

• Those areas at the extremes demarcate the critical regions that are integral to significance tests, which we'll soon discuss.
How to do it in Stata

. ci write

    Variable |   Obs        Mean    Std. Err.    [95% Conf. Interval]
    ---------+--------------------------------------------------------
       write |   200      52.775    .6702372     51.45332    54.09668
• If the data aren't in memory:

. cii 200 63.1 7.8

    Variable |   Obs        Mean    Std. Err.    [95% Conf. Interval]
    ---------+--------------------------------------------------------
             |   200        63.1    .5515433     62.01238    64.18762
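A usage note on the classic cii syntax shown above: the three arguments are the number of observations, the sample mean, & the sample standard deviation, and the confidence level can be changed with the level() option, e.g.:

. cii 200 63.1 7.8, level(99)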
Review: Confidence Intervals

• The chances are in the sampling procedure, not in the parameter: confidence intervals, & inferential statistics in general, are premised on random sampling & the long-run laws of probability.

• If there's no random sample or randomized experiment, the use of a confidence interval is invalid.

• What if you have data for an entire population? Then there's no need for a confidence interval: congratulations!

• The confidence interval either captures the parameter or it doesn't: it's an either/or matter.

• We're saying that we calculated the numbers according to a method that, according to the laws of probability, will capture the parameter in [90% or 95% or 99%] of all possible random samples of the same size.

• That means, though, that a certain percent of the time (10%, 5%, or 1%) the confidence interval does not capture the parameter.

• And we won't know when it doesn't capture the parameter.
• Through their influence on the sample mean x̄, extreme outliers & strong skewness can have a serious effect on the confidence interval.

• Consider correcting or removing extreme outliers (if doing so can be justified) &/or transforming the data, or else consider using relatively resistant procedures.

• So, always graph a variable's data, checking for pronounced skewness & outliers, before computing a confidence interval.

• To make an informed decision, compute the confidence interval both before & after modifying the data.

• Compounding the problem are non-probability (i.e. non-sampling) causes of bias.

• So we have two sources of uncertainty: probabilistic (sampling) & non-probabilistic (non-sampling).

• All conclusions are uncertain.
How to reduce a confidence interval's margin of error?

• Use a lower level of confidence (smaller C, i.e. a narrower confidence interval).

• Increase the sample size (larger n).

• Reduce the standard deviation (via more precise measurement of variables or a more precise sample design).
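The sample-size option follows directly from the formula: since the margin of error is z*·σ/√n, halving the margin of error requires quadrupling the sample size. A quick check with the same hypothetical numbers used above:

. display 1.96*7.8/sqrt(200)
1.0810248

. display 1.96*7.8/sqrt(800)
.54051242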
Significance Tests

• What is significance testing?

• How do confidence intervals pertain to significance testing?

• Variability is everywhere.

• "… variation itself is nature's only irreducible essence." (Stephen Jay Gould)

• E.g., weighing the same item repeatedly.

• E.g., measuring blood pressure, cholesterol, estrogen, or testosterone levels at various times.

• E.g., performance on standardized tests or in sports events at various times.

• For any given unbiased measurement:

  measured value = true value +/- random error
• How do we distinguish an outcome potentially worth paying attention to from an outcome based on mere random variability?

• That is, how do we distinguish an outcome potentially worth paying attention to from an outcome based on mere chance?

• We do so by using probability to establish that an outcome of a particular magnitude or level could rarely have occurred by mere chance.

• Is the magnitude of the sample mean large enough, relative to its standard deviation divided by the square root of the sample size n, to have rarely occurred by chance alone?

• The scientific method tries to make it hard to establish that such an outcome occurred for reasons other than chance.

• It makes us start out by asserting that there's no effect, or no difference: the null hypothesis.

• Hence significance tests, like confidence intervals, are premised on a variable's sampling distribution, i.e. on what would happen with repeated random samples of the same size in the same population over the very long run.
Significance Tests: Stating Hypotheses

• The null hypothesis, the starting point for a significance test, asserts that any observed effect is due simply to chance: that there's no effect, or no difference, beyond random noise.

• By contrast, the alternative hypothesis asserts that the observed effect is big enough, relative to the variable's standard deviation divided by the square root of n, that it would rarely have occurred by sheer chance: that the observed effect, or difference, needs to be taken seriously.

• The statement being tested in a significance test is called the null hypothesis.

• The significance test is designed to assess the strength of the evidence against the null hypothesis.

• Usually the null hypothesis is a statement of "no effect" or "no difference."

• The alternative to the null hypothesis is the alternative hypothesis. It is what we hope to find evidence to support. The alternative hypothesis may be one-sided or two-sided.

• A one-sided example? A two-sided example?
• The null hypothesis expresses the idea that the observed difference is due merely to chance: that it's a fluke.

• The alternative hypothesis expresses the idea that the observed difference is due to reasons beyond mere chance.

• Hypotheses always refer to some population or model, not to a particular outcome.

• Thus the null hypothesis & the alternative hypothesis must always be stated in terms of population parameters.

• Put differently, a statistical hypothesis is a claim about a population.

• Is the evidence from the sample statistically consistent with the population claim or not?

• E.g., the null & alternative hypotheses for a study on possible cancer risks of living near power lines.

• E.g., the null & alternative hypotheses concerning the relationship of gender to standardized math test scores.

• Depending on the test results, we either fail to reject the null hypothesis or reject the null hypothesis.

• We never accept the null hypothesis: why not?
Test Statistic

• A test statistic measures the compatibility between the null hypothesis & the data.

• For a mean, it is computed as a z-score, but with the standard deviation divided by the square root of the sample size n. This adjustment reflects the fact that the data are drawn from a sample.
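In symbols, for a one-sample test of a mean against a hypothesized value μ₀ (when the standard deviation is estimated from the sample, as in the Stata output below, the statistic is a t rather than a z):

$$z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$$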
Statistical Significance: the P-Value

• The probability that the test statistic would take a value as extreme as, or more extreme than, the value actually observed is called the P-value of the test.

• The smaller the P-value, the stronger the data's evidence against the null hypothesis.

• That is, the smaller the P-value, the stronger the data's evidence in favor of the alternative hypothesis. Why?

• The P-value is small enough to be statistically significant if the magnitude of the sample mean is sufficiently large in relation to its standard deviation divided by the square root of the sample size n.

• The P-value, to repeat, is the observed significance level.

• The P-value is based on the sampling variability of the sample mean.

• A P-value is located within the extremes of the tails of the horizontal axis, as defined by this formula: sample mean +/- z-score × (standard deviation/√n).
• Depending on the form of the alternative hypothesis, the significance test may be two-tailed or one-tailed.
• If the P-value is as small as or smaller than a specified significance level (conventionally .10 or .05 or .01), we say that the data are statistically significant at that level (P-value = …).
How to do it in Stata

. ttest math = 55

One-sample t test
------------------------------------------------------------------------------
Variable |   Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
    math |   200      52.645    .6624493    9.368448     51.33868    53.95132
------------------------------------------------------------------------------
Degrees of freedom: 199

                       Ho: mean(math) = 55

     Ha: mean < 55          Ha: mean ~= 55           Ha: mean > 55
       t = -3.5550            t = -3.5550              t = -3.5550
     P < t = 0.0002        P > |t| = 0.0005          P > t = 0.9998
• For example, two-tailed test: reject the null hypothesis (P-value = .0005, df = 199) in favor of the alternative hypothesis.
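The two-tailed P-value can be checked with Stata's ttail() function, which returns the upper-tail probability of the t distribution with the given degrees of freedom; doubling it returns approximately .0005, matching P > |t| above:

. display 2*ttail(199, 3.5550)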
. ttest read = math

Paired t test
------------------------------------------------------------------------------
Variable |   Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
    read |   200       52.23    .7249921    10.25294     50.80035    53.65965
    math |   200      52.645    .6624493    9.368448     51.33868    53.95132
---------+--------------------------------------------------------------------
    diff |   200       -.415    .5729794    8.103152     -1.54489    .7148905
------------------------------------------------------------------------------

                Ho: mean(read - math) = mean(diff) = 0

 Ha: mean(diff) < 0     Ha: mean(diff) ~= 0     Ha: mean(diff) > 0
    t = -0.7243            t = -0.7243             t = -0.7243
  P < t = 0.2349        P > |t| = 0.4697         P > t = 0.7651
• Two-tailed test: fail to reject the null hypothesis (P-value = .470).

• One-tailed (lower): fail to reject the null hypothesis (P-value = .235).

• One-tailed (upper): fail to reject the null hypothesis (P-value = .765).
What Statistical Significance Isn't, & What It Is

• Statistical significance does not mean theoretical, substantive, or practical significance.

• In fact, statistical significance may accompany a trivial substantive or practical finding.

• Statistical significance means that if the null hypothesis were actually true, a finding of the observed magnitude would be likely to occur by chance no more than ten times, five times, or one time in 100 repeated samples, depending on the specified significance level.

• Depending on the test results, either we fail to reject the null hypothesis or we reject the null hypothesis.

• We never accept the null hypothesis: why not?
Regarding statistical significance, it's useful to think (more or less) in terms of the following scale:

• .10 or less: moderate statistical significance
• .05 or less: strong statistical significance
• .01 or less: very strong statistical significance

• These levels are cultural conventions in statistics & research methods.

• There's really no rigid line between statistical significance & the lack of such significance, or between any of the levels of significance.

• Listing the P-value provides a more precise statement of the evidence.

• E.g.: fail to reject the null hypothesis at the .10 level (P-value = .142).
• Let's remember, moreover: statistical significance does not mean theoretical, substantive, or practical significance.

• In any event, statistical significance, as conventionally defined, is much easier to obtain in a large sample than in a small sample.

• Why?

• Because, according to the formula, a sample statistic's standard deviation divided by the square root of n decreases as the sample size increases.
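To see the effect concretely, consider a hypothetical fixed difference of 2 points with a standard deviation of 9.4. The first calculation below returns roughly 1.50 (an unremarkable test statistic); the second, roughly 9.52 (an overwhelming one): the identical difference becomes "significant" merely because n grew.

. display 2/(9.4/sqrt(50))

. display 2/(9.4/sqrt(2000))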
What does it take to obtain statistical significance?

• A strong linear relationship between two variables.

• A large enough sample size to minimize the role of chance in determining the finding.

• Consequently, a lack of statistical significance may mean that the relationship is nonlinear; or it may simply mean that the sample size is not large enough to downplay the role of chance in determining the finding.

• It might also mean that there are data errors, that the sample is badly designed or executed, or that there are other problems with the study's design.

• Of course, it may indeed mean that the linear relationship between the two variables simply is not strong enough to minimize the role of chance in causing the observed finding.
• Statistical significance does not necessarily mean substantive or practical significance.

• And remember: significance tests are premised on a random sample of independent observations.

• Hence, statistical significance tests are invalid if the sample cannot reasonably be defended as random, or if measurements are obtained for an entire population (the latter being a very good thing, however).

• Without a random sample, the laws of probability can't operate.

• With measurements on an entire population, there is no sampling-based uncertainty to test (or worry about).
• Confidence intervals & two-sided tests: the two-sided test can be directly computed from the confidence interval.

• For a two-sided test, if the hypothesized value falls outside the confidence interval, then we can reject the null hypothesis.

• Why?

• Ho: math equals 55
• Ha: math does not equal 55

. ci math

    Variable |   Obs        Mean    Std. Err.    [95% Conf. Interval]
    ---------+--------------------------------------------------------
        math |   200      52.645    .6624493     51.33868    53.95132

• Reject the null hypothesis at the .05 level: the hypothesized value 55 falls outside the 95% confidence interval.
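The link back to the test statistic can be verified by hand from the summary numbers above; the calculation returns -3.5550, the same t statistic reported by ttest math = 55 earlier:

. display (52.645 - 55)/.6624493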
• In general, P-values are preferable to fixed significance levels: why?
Review: Significance Testing

• Significance testing is premised on a random sample of independent observations: if this premise does not hold, then the significance tests are invalid.

• Statistical significance does not mean theoretical, substantive, or practical significance.

• Any finding of statistical significance may be an artifact of large sample size.

• Any finding of statistical insignificance may be an artifact of small sample size.

• Moreover, statistical significance or insignificance may in any case be an artifact of chance.

• What does a significance test mean?

• What does a significance test not mean?

• What is the procedure for conducting a significance test?

• What is the P-value?

• Why is the P-value preferable to a fixed significance level?

• What are the possible reasons why a finding does not attain statistical significance?

• What are the possible reasons why findings are statistically significant?

• Depending on the test results, we either fail to reject the null hypothesis or reject the null hypothesis.

• We never accept the null hypothesis.

Beware!

• There is no sharp border between "significant" & "insignificant," only increasingly strong evidence as the P-value gets smaller.
• There is no intrinsic reason why the conventional standards of statistical significance must be .10 or .05 or .01.

• Don't ignore a lack of statistical significance: it may yield important insights.

• Beware of searching for significance: by chance alone, a certain percentage of findings will indeed attain statistical significance.

• There's always uncertainty in assessing statistical significance.
Another Problem: Two Types of Error in Significance Tests

• If a finding tests significant, the null hypothesis may be wrongly rejected: a Type I error.

• If a finding tests insignificant, the null hypothesis may be wrongly accepted: a Type II error.

• Power: the power of a significance test measures its ability to detect an alternative hypothesis. Power against a specific alternative is calculated as the probability that the test will reject the null hypothesis when that alternative is true.
ƒ
Type I/II Errors & Power in Stata
ƒ See Stata “help” &/or the
documentation manual for the
command “sampsi.”
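As a hedged sketch of sampsi for the one-sample case (option names per the sampsi documentation; the means & standard deviation are hypothetical, echoing the math example above), this asks for the power of the test of Ho: mean = 55 against an alternative mean of 52.6, with n = 200:

. sampsi 55 52.6, sd(9.37) n1(200) onesample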
Bonferroni adjustment

• When there are multiple hypothesis tests, the Bonferroni adjustment makes it tougher to obtain statistical significance. What's the reason for doing so?

• Divide the selected significance level by the number of hypothesis tests.

• Selected significance level: .05
• Five tests
• .05/5 = .01

• Thus, each test will be judged statistically significant only at P-value = .01 or less.

• There are other such "multiple adjustments," such as Scheffe, Sidak, & Tukey.
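The adjustment itself is one line of arithmetic, & some Stata commands will apply it directly, e.g. oneway's bonferroni option for pairwise comparisons (the variable names here are hypothetical):

. display .05/5
.01

. oneway write prog, bonferroni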
Review Again

• What's the premise of significance tests? What if the premise doesn't hold?

• What is the procedure for conducting a significance test?

• What do significance tests mean? What don't they mean?

• What conditions yield a statistically significant finding? What conditions don't yield such a finding?

• Why is a P-value preferable to a fixed significance level?

• Why are the .10, .05 & .01 significance levels so commonly used?

• How should we treat statistically insignificant findings?

• Why shouldn't we search for statistical significance?

• Why is a finding of statistical significance uncertain? Why is a finding of statistical insignificance uncertain?

• What are Type I errors?

• What are Type II errors?

• What's a Bonferroni adjustment (or other "multiple adjustment")? Why is it used?

• For what various reasons are conclusions inherently uncertain?
Significance Testing: Questions

True or false, & explain:

• A difference that is highly significant must be very important.

• Big samples are bad.

• If the null hypothesis is rejected, the difference isn't trivial. It is bigger than what would occur by chance.

• For one year in one graduate major at a university, 825 men applied & 62% were admitted; 108 women applied & 82% were admitted. Is the difference statistically significant?

• The masses of the inner planets average 0.43 versus 74.0 for the outer planets. Is the difference statistically significant? Does this question make sense?

• A P-value of .047 means something quite different from one of .052.

• According to the U.S. Census, in 1950 13.4% of the U.S. population lived in the West; in 1990 21.2% lived in the West. Is the difference statistically significant? Practically significant?
Morals of the Stories

• Statistical significance says nothing about:
  - practical significance
  - the adequacy of the study's design
  - whether or not the study is based on a random sample of independent observations

• Professional standards of statistical significance are cultural conventions: there's no intrinsic, hard line between statistical significance & insignificance.

• Findings of statistical insignificance may be more insightful than those of statistical significance.

• Finally, confidence intervals & significance tests are based on a random variable's sampling distribution: over all possible random samples of the same size in the same population.