Final Exam Review Slides
Review
Tests of Significance
Single Proportion
• Null hypothesis: Buzz randomly picks a button. (He chooses the correct button 50% of the time, in the long run.) (𝜋 = 1/2)
• Alternative hypothesis: Buzz understands what Doris is communicating to him. (He chooses the correct button more than 50% of the time, in the long run.) (𝜋 > 1/2)

Single Proportion
• Buzz got it right 15 out of 16 times (p̂ = 15/16 = 0.9375).
• This is very unlikely (p-value = 0.0005) to occur by chance alone; a simulation sketch follows.
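Here is a minimal simulation sketch (my illustration, not the slides' applet) of this test in Python with NumPy; only the numbers 16 and 15 come from the slide.

```python
import numpy as np

rng = np.random.default_rng()

# Simulate 100,000 "guessing Buzz" experiments: 16 picks, each correct
# with probability 1/2 under the null hypothesis.
sims = rng.binomial(n=16, p=0.5, size=100_000)

# p-value: the proportion of simulated results at least as extreme
# as the observed 15 correct picks.
p_value = np.mean(sims >= 15)
print(p_value)  # a tiny value, near the slides' reported 0.0005
```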

Single Proportion
• The theory-based test works well when the numbers of successes and failures are both at least 10.
• A normal distribution is used to predict what the null distribution looks like. (It is centered at the proportion under the null hypothesis.) A sketch of the normal-approximation mechanics follows.
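For illustration only, here is a hand-coded one-proportion z-test in Python. Note that Buzz's data actually fail the at-least-10-failures condition above, so this is just to show the formula, not a valid analysis of his data.

```python
import math
from scipy import stats

n, successes, pi0 = 16, 15, 0.5
p_hat = successes / n

# SD of the null distribution of sample proportions.
se_null = math.sqrt(pi0 * (1 - pi0) / n)

# Standardized statistic and one-sided p-value (alternative: pi > 1/2).
z = (p_hat - pi0) / se_null
p_value = stats.norm.sf(z)
print(z, p_value)
```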

Comparing Two Proportions
• Null hypothesis: Swimming with dolphins has no association with whether someone shows substantial improvement. (𝜋dolphins = 𝜋control or 𝜋dolphins − 𝜋control = 0)
• Alternative hypothesis: Swimming with dolphins increases the probability of substantial improvement in depression symptoms. (𝜋dolphins > 𝜋control or 𝜋dolphins − 𝜋control > 0)
Comparing Two Proportions

• Our statistic is the observed difference in proportions: 0.67 − 0.20 = 0.47.

                    Dolphin group   Control group   Total
  Improved          10 (67%)        3 (20%)         13
  Did Not Improve   5               12              17
  Total             15              15              30
Comparing Two Proportions
• If the null hypothesis is true (dolphin therapy is not better), we would have 13 improvers and 17 non-improvers regardless of the group they were in.
• Any differences we see between groups arise solely from the randomness in the assignment to the groups.
• Randomly assign the groups to the improvers and non-improvers and recalculate the statistic many times, as in the sketch below.
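A minimal randomization-test sketch (my illustration of what the applet does), using the counts from the table above:

```python
import numpy as np

rng = np.random.default_rng()

# 13 improvers and 17 non-improvers in all, as in the table above.
outcomes = np.array([1] * 13 + [0] * 17)   # 1 = improved

null_stats = np.empty(1000)
for i in range(1000):
    shuffled = rng.permutation(outcomes)   # re-deal subjects to groups
    null_stats[i] = shuffled[:15].mean() - shuffled[15:].mean()

observed = 10 / 15 - 3 / 15                # about 0.67 - 0.20 = 0.47
p_value = np.mean(null_stats >= observed)
print(p_value)  # the slides' run gave 13/1000 = 0.013
```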
Comparing Two Proportions

• We did 1000 repetitions to develop a null distribution and found that just 13 out of 1000 results had a difference of 0.47 or higher (p-value = 0.013).
Comparing Two Proportions
• Just like with a single proportion, the theory-based test works well when the numbers of successes and failures are at least 10 in each group.
• Again, a normal distribution is used to predict the shape of the null distribution. (It is always centered at 0.) A two-proportion z-test sketch follows.
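For illustration only (the validity condition is not met here: the dolphin data have fewer than 10 failures in one group and fewer than 10 successes in the other), a hand-coded pooled two-proportion z-test in Python:

```python
import math
from scipy import stats

x1, n1 = 10, 15   # dolphin group: improvers, group size
x2, n2 = 3, 15    # control group: improvers, group size

# Pool the groups to estimate the common proportion under the null.
p_pool = (x1 + x2) / (n1 + n2)
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))

# Standardized statistic and one-sided p-value (pi_dolphins > pi_control).
z = (x1 / n1 - x2 / n2) / se
p_value = stats.norm.sf(z)
print(z, p_value)
```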
Comparing Two Means

• Null hypothesis: There is no association between which bike is used and commute time.
  Commute time is not affected by which bike is used. (µcarbon = µsteel OR µcarbon − µsteel = 0)
• Alternative hypothesis: There is an association between which bike is used and commute time.
  Commute time is affected by which bike is used. (µcarbon ≠ µsteel OR µcarbon − µsteel ≠ 0)
Comparing Two Means
Bike type      Sample size   Sample mean   Sample SD
Carbon frame   26            108.34 min    6.25 min
Steel frame    30            107.81 min    4.89 min

• Our statistic is the observed difference in means: 108.34 − 107.81 = 0.53.
Comparing Two Means
[Dotplots on the slide: "The Original Data" and "Shuffled Results"]
• Shuffling assumes the null hypothesis that the bike has no effect on commute times.
• Calculate the simulated statistic after shuffling.
• Repeating this many times develops a null distribution, as in the sketch below.
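A shuffle-test sketch in Python. The slides give only summary statistics, so the arrays below are synthetic stand-ins drawn to roughly match the reported means and SDs; only the group sizes, means, and SDs come from the slide.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in commute times (hypothetical data matching the
# slide's summary statistics).
carbon = rng.normal(108.34, 6.25, size=26)
steel = rng.normal(107.81, 4.89, size=30)

observed = carbon.mean() - steel.mean()
combined = np.concatenate([carbon, steel])

null_stats = np.empty(1000)
for i in range(1000):
    shuffled = rng.permutation(combined)   # re-deal times to the two bikes
    null_stats[i] = shuffled[:26].mean() - shuffled[26:].mean()

# Two-sided p-value: simulated differences at least as far from 0.
p_value = np.mean(np.abs(null_stats) >= abs(observed))
print(observed, p_value)
```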
Comparing Two Means
Strength of Evidence
• 705 of 1000 repetitions are 0.53 or farther away from 0.
• p-value = 0.705.
Comparing Two Means
• A theory-based test works well here when the sample size is at least 20.
• A t-distribution is used to predict the shape of the null distribution, and it is centered at 0. A theory-based sketch from the summary statistics follows.
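Since the slide reports only summary statistics, SciPy's summary-statistic version of the two-sample t-test can reproduce the theory-based result. A sketch (Welch's unpooled variant, my choice):

```python
from scipy import stats

# Theory-based two-sample t-test computed directly from the slide's
# summary statistics (unpooled/Welch version, two-sided).
t_stat, p_value = stats.ttest_ind_from_stats(
    mean1=108.34, std1=6.25, nobs1=26,
    mean2=107.81, std2=4.89, nobs2=30,
    equal_var=False,
)
print(t_stat, p_value)  # p-value lands in the same ballpark as the 0.705 above
```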
Matched Pairs

• H0: µd = 0
  On average, the mean of the differences between the running times (narrow − wide) is 0.
• Ha: µd ≠ 0
  On average, the mean of the differences in running times (narrow − wide) is not 0.
Matched Pairs
• In this type of test the data start off as two separate groups, but there is a natural pairing: in this case, the times for the same person running both paths.
• So we need to look at the differences.
Matched Pairs
Subject        1      2      3      4      5      6      7      8      9      10     …
Narrow angle   5.50   5.70   5.60   5.50   5.85   5.55   5.40   5.50   5.15   5.80   …
Wide angle     5.55   5.75   5.50   5.40   5.70   5.60   5.35   5.35   5.00   5.70   …
Diff           −0.05  −0.05  0.10   0.10   0.15   −0.05  0.05   0.15   0.15   0.10   …

Mean difference is x̄d = 0.075 seconds
Matched Pairs
• The null basically says the running path doesn't matter.
• So we can randomly decide which time goes with which path. (Notice we don't break our pairs.)
• Each time we do this, compute a simulated difference in means.
• We repeat this process many times to develop a null distribution; see the shuffled table and the sketch after it below.
Subject        1      2      3      4      5      6      7      8      9      10     …
Narrow angle   5.55   5.70   5.50   5.50   5.70   5.60   5.40   5.50   5.15   5.70   …
Wide angle     5.50   5.75   5.60   5.40   5.85   5.55   5.35   5.35   5.00   5.80   …
Diff           0.05   −0.05  −0.10  0.10   −0.15  0.05   0.05   0.15   0.15   −0.10  …
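A random-swapping sketch (my illustration) using the ten differences shown above. The full data set on the slides has more subjects, which is why the mean of these ten (0.065) differs slightly from the slides' x̄d = 0.075.

```python
import numpy as np

rng = np.random.default_rng()

# Differences (narrow − wide) for the ten subjects shown above.
diffs = np.array([-0.05, -0.05, 0.10, 0.10, 0.15, -0.05, 0.05, 0.15, 0.15, 0.10])
observed = diffs.mean()

null_stats = np.empty(1000)
for i in range(1000):
    # Randomly flip which time counts as "narrow" within each pair.
    signs = rng.choice([-1, 1], size=diffs.size)
    null_stats[i] = (signs * diffs).mean()

# Two-sided p-value: simulated means at least as extreme as observed.
p_value = np.mean(np.abs(null_stats) >= abs(observed))
print(observed, p_value)
```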
Matched Pairs
• Only 2 of the 1000 repetitions of random swappings gave an x̄d value at least as extreme as 0.075 (p-value = 2/1000 = 0.002).
Matched Pairs
• A theory-based test works well when the sample size is at least 20.
• Like comparing two means, a t-distribution is used to predict the null distribution.
• The data used in this test are the differences, and this is the same test that is used for a single mean. (Except in testing a single mean, we can compare the data to any number, not just 0.) A paired t-test sketch follows.
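A theory-based sketch using SciPy's paired t-test on the ten subjects shown earlier (illustrative only; with just ten pairs, the at-least-20 condition above is not met):

```python
import numpy as np
from scipy import stats

narrow = np.array([5.50, 5.70, 5.60, 5.50, 5.85, 5.55, 5.40, 5.50, 5.15, 5.80])
wide   = np.array([5.55, 5.75, 5.50, 5.40, 5.70, 5.60, 5.35, 5.35, 5.00, 5.70])

# Paired t-test on the two columns...
t_stat, p_value = stats.ttest_rel(narrow, wide)
print(t_stat, p_value)

# ...which is exactly a single-mean t-test on the differences against 0.
print(stats.ttest_1samp(narrow - wide, popmean=0))
```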
Comparing Multiple Proportions

• Null hypothesis: There is no association between the arrival pattern of the vehicle and whether it comes to a complete stop. (𝜋Single = 𝜋Lead = 𝜋Follow)
• Alternative hypothesis: There is an association between the arrival pattern of the vehicle and whether it comes to a complete stop. (Not all these long-term probabilities are the same; at least one is different.)
Comparing Multiple Proportions
                    Complete Stop   Not Complete Stop   Total
Single Vehicle      151 (85.8%)     25 (14.2%)          176
Lead Vehicle        38 (90.5%)      4 (9.5%)            42
Following Vehicle   76 (77.6%)      22 (22.4%)          98
Total               265             51                  316

MAD (mean absolute difference):
MAD = (|0.858 − 0.905| + |0.858 − 0.776| + |0.905 − 0.776|) / 3 = 0.086
Comparing Multiple Proportions
• If there is no association between arrival pattern and whether or not a vehicle stops, it basically means it doesn't matter what the arrival pattern is: some vehicles will stop no matter what the arrival pattern, and some vehicles won't.
• We can model this by shuffling either the explanatory or the response variable (the applet shuffles the response) and recomputing the MAD statistic many times, as in the sketch below.
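A shuffle-test sketch for the MAD statistic (my illustration of what the applet does), rebuilding the raw data from the table's counts:

```python
import numpy as np

rng = np.random.default_rng()

# Rebuild the raw data from the table: a group label and a 0/1 response
# (1 = came to a complete stop) for each of the 316 vehicles.
groups = np.repeat(["single", "lead", "follow"], [176, 42, 98])
stopped = np.concatenate([
    np.repeat([1, 0], [151, 25]),   # single vehicle
    np.repeat([1, 0], [38, 4]),     # lead vehicle
    np.repeat([1, 0], [76, 22]),    # following vehicle
])

def mad(resp):
    # Mean absolute difference of the three group proportions.
    p = [resp[groups == g].mean() for g in ("single", "lead", "follow")]
    return (abs(p[0] - p[1]) + abs(p[0] - p[2]) + abs(p[1] - p[2])) / 3

observed = mad(stopped)                      # 0.086, as on the slide

null_stats = np.empty(1000)
for i in range(1000):
    # Shuffle the response, keeping group sizes fixed.
    null_stats[i] = mad(rng.permutation(stopped))

p_value = np.mean(null_stats >= observed)    # larger MAD = more extreme
print(observed, p_value)                     # the slides' run gave 0.083
```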
Comparing Multiple Proportions
• Simulated values of the statistic for 1000 shuffles.
• p-value = 0.083
Comparing Multiple Proportions
• Theory-based tests work well for multiple proportions if the numbers of successes and failures are at least 10 in each group. (Just like with all proportions.)
• The MAD statistic is not used in the theory-based test; the chi-squared statistic (and hence a chi-squared distribution) is.
• This test is called a chi-squared test of association. A sketch follows.
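A minimal chi-squared sketch using SciPy on the table's counts:

```python
import numpy as np
from scipy import stats

# Observed counts: rows = arrival pattern, columns = (stop, no stop).
table = np.array([
    [151, 25],   # single vehicle
    [38, 4],     # lead vehicle
    [76, 22],    # following vehicle
])

chi2, p_value, dof, expected = stats.chi2_contingency(table)
print(chi2, p_value, dof)
```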
Comparing Multiple Means
• Null: There is no association between whether and when a picture was shown and comprehension of the passage. (µno picture = µpicture before = µpicture after)
• Alternative: There is an association between whether and when a picture was shown and comprehension of the passage. (At least one of the mean comprehension scores is different.)
Comparing Multiple Means
Means: 3.37, 3.21, 4.95
MAD = (|3.21 − 4.95| + |3.21 − 3.37| + |4.95 − 3.37|) / 3 = 1.16
Comparing Multiple Means
• Simulated values of the statistic for 5000 shuffles.
• p-value = 0.0008
Comparing Multiple Means
• Since we have a small p-value, we can conclude that at least one of the mean comprehension scores is different.
• We can do pairwise confidence intervals to find which means are significantly different from the other means.
Comparing Multiple Means
• Theory-based tests work well when we have a sample size of at least 20 in each group. (Like all tests with means.)
• The MAD statistic is not used, but an F-statistic (and hence an F distribution) is.
• Just like with the MAD, the larger the F-statistic, the stronger the evidence (and hence the smaller the p-value).
• This test is called Analysis of Variance, or ANOVA. A sketch follows.
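An ANOVA sketch with SciPy. The slides give only the three group means (and the extracted text does not tie each mean to its group), so the arrays below are synthetic stand-in scores; the group sizes and SD are my assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Synthetic comprehension scores centered at the slide's three means
# (3.37, 3.21, 4.95); sizes and spread are made up for illustration.
group1 = rng.normal(3.37, 1.0, size=20)
group2 = rng.normal(3.21, 1.0, size=20)
group3 = rng.normal(4.95, 1.0, size=20)

# One-way ANOVA: one overall F-statistic and p-value.
f_stat, p_value = stats.f_oneway(group1, group2, group3)
print(f_stat, p_value)
```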
Correlation/Regression
• Null: There is no association between heart rate and body temperature. (ρ = 0 or β = 0)
• Alternative: There is a positive linear association between heart rate and body temperature. (ρ > 0 or β > 0)
Correlation/Regression
HR    72    69    72    71    80    81    68    82    68    65
Tmp   98.3  98.2  98.7  98.5  97.0  98.8  98.5  98.7  99.3  97.8

HR    71    79    86    82    58    84    73    57    62    89
Tmp   98.2  99.9  98.6  98.6  97.8  98.4  98.7  97.4  96.7  98.0

r = 0.378
Correlation/Regression
• If there is no association, we can break apart the temperatures and their corresponding heart rates by scrambling one of the variables, just like we did in previous tests.
• After each scramble, we will compute the appropriate statistic: either the correlation or the slope of the regression equation.
• Repeat this many times to develop a null distribution, as in the sketch below.
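A scramble-test sketch (my illustration) using the twenty data pairs above:

```python
import numpy as np

rng = np.random.default_rng()

# Heart rates and body temperatures from the table above.
hr = np.array([72, 69, 72, 71, 80, 81, 68, 82, 68, 65,
               71, 79, 86, 82, 58, 84, 73, 57, 62, 89])
tmp = np.array([98.3, 98.2, 98.7, 98.5, 97.0, 98.8, 98.5, 98.7, 99.3, 97.8,
                98.2, 99.9, 98.6, 98.6, 97.8, 98.4, 98.7, 97.4, 96.7, 98.0])

observed = np.corrcoef(hr, tmp)[0, 1]   # about 0.378, as on the slide

null_stats = np.empty(1000)
for i in range(1000):
    # Scramble one variable to break any real association.
    null_stats[i] = np.corrcoef(hr, rng.permutation(tmp))[0, 1]

p_value = np.mean(null_stats >= observed)   # one-sided: positive association
print(observed, p_value)                    # the slides' run gave 68/1000
```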
Correlation/Regression

We found that
68/1000 times we
had a simulated
correlation
greater than or
equal to 0.378.
Correlation/Regression
• Theory-based tests work well when the values of the response variable are normally distributed for each value of the explanatory variable and these normal distributions have similar variability.
• We can use either the correlation or the slope of the regression line as the statistic.
• A t-distribution, centered at 0, is used. A theory-based sketch follows.
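A theory-based sketch with SciPy on the same twenty pairs. pearsonr reports a two-sided p-value; since r is positive, halving it gives the one-sided p-value for ρ > 0.

```python
import numpy as np
from scipy import stats

hr = np.array([72, 69, 72, 71, 80, 81, 68, 82, 68, 65,
               71, 79, 86, 82, 58, 84, 73, 57, 62, 89])
tmp = np.array([98.3, 98.2, 98.7, 98.5, 97.0, 98.8, 98.5, 98.7, 99.3, 97.8,
                98.2, 99.9, 98.6, 98.6, 97.8, 98.4, 98.7, 97.4, 96.7, 98.0])

# Correlation version of the test.
r, p_two_sided = stats.pearsonr(hr, tmp)
print(r, p_two_sided / 2)

# Slope version of the same test, via least-squares regression.
fit = stats.linregress(hr, tmp)
print(fit.slope, fit.pvalue / 2)
```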
Review
Confidence Intervals
Confidence Intervals
• Tests of significance answer yes/no questions.
  Is there strong evidence that Buzz is not just guessing?
  Is there strong evidence that swimming with dolphins helps reduce depression symptoms?
• Sometimes we might just want an estimate of a population parameter, e.g., what proportion of the voters will vote in the next election?
Confidence Intervals
• Confidence intervals are interval estimates of a population parameter.
• A population parameter is some fixed measurement for a population, such as a proportion (or long-term probability), a difference in two proportions, a mean, a difference in means, or the slope of a regression equation.
• These intervals give plausible (believable, credible) values for the parameter.
2SD Confidence Intervals
• The observed statistics we found are used as the centers of these intervals.
• We used 2 standard deviations of an appropriate null distribution as our margin of error to give us a 95% confidence interval:

  Observed statistic ± 2SD

• Remember the observed statistic can be a single mean or proportion, the slope of a regression line, or a difference in two means or proportions.
Supersize Drinks
• A survey found 46% of 1093 randomly selected NYC voters supported the ban on large soft drinks.
• What is our estimate of the population proportion that supports the ban?
  0.46 ± 2(0.015) or 0.46 ± 0.03
  43% to 49%
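A quick check of that arithmetic. The 0.015 on the slide is presumably the SD of the simulated sampling distribution; estimating the SD from the sample proportion, as below, gives essentially the same number.

```python
import math

p_hat, n = 0.46, 1093

# Estimated SD of the sample proportion.
sd = math.sqrt(p_hat * (1 - p_hat) / n)   # about 0.015, as on the slide

# 2SD method: observed statistic ± 2·SD gives roughly a 95% interval.
lower, upper = p_hat - 2 * sd, p_hat + 2 * sd
print(round(lower, 2), round(upper, 2))   # 0.43 to 0.49
```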
Meaning of a confidence interval
• What does 95% confidence mean?
• If we resampled 1093 NYC voters over and over and each time produced a 95% confidence interval, 95% of the time we would capture the true proportion of all NYC voters that favor the ban.
• The interval (like the observed proportion) is random. The population parameter is fixed.
Theory-based confidence intervals
• Using theory-based techniques, confidence intervals can easily be found and the confidence levels easily adjusted.
• The same validity conditions we use for tests of significance should also be used for confidence intervals.
What affects the width of a CI?
• As the level of confidence increases, the width of the confidence interval increases.
  The wider the interval, the more confident we are that we captured the parameter. (The wider the net, the more confident we are that we captured the fish.)
• As the sample size increases, the width of the confidence interval decreases.
  Larger sample sizes give us more information, thus we can be more accurate.
Connecting confidence intervals and tests of significance
• A small p-value means that the value under the null will not be contained in the confidence interval.
  H0: 𝜋 = 0.5, p-value = 0.02, CI (0.52, 0.59)
• A large p-value means that the value under the null will be contained in the confidence interval.
  H0: 𝜋1 − 𝜋2 = 0, p-value = 0.42, CI (−0.28, 0.57)
Significance level and confidence level
• Suppose H0: 𝜋 = 0.5, and the corresponding two-sided p-value = 0.03. Will 0.5 be contained in a:
  90% confidence interval? No (p = 0.03 < 𝛼 = 0.10)
  95% confidence interval? No (p = 0.03 < 𝛼 = 0.05)
  99% confidence interval? Yes (p = 0.03 > 𝛼 = 0.01)
• If the p-value is large (greater than 𝛼), the value under the null will be contained in a 100(1 − 𝛼)% confidence interval.
Review
Big Ideas
Terminology
• The population is the entire set of observational units we want to know something about.
• The sample is the subgroup of the population on which we actually record data.
• A statistic is a number calculated from the observed data.
• A parameter is the same type of number as the statistic, but represents the underlying process or the population from which the sample was selected.
Terminology
• Standard deviation (SD) is the most common measure of variability.
• We can think of standard deviation as the average distance values are from their mean.
• A distribution is skewed to the right if the right side extends much farther than the left side.
Hypotheses and Null Distribution
• The null hypothesis (H0) is the chance explanation. (=)
• The alternative hypothesis (Ha) is what you are trying to show is true. (<, >, or ≠)
• A null distribution is the distribution of simulated statistics that represents the chance outcome.
Significance and p-value
• Results are statistically significant if they are unlikely to arise by random chance alone (that is, when the null hypothesis is true).
• The p-value is the proportion of the simulated statistics in the null distribution that are at least as extreme as the value of the observed statistic.
• The smaller the p-value, the stronger the evidence against the null.
Guidelines for evaluating strength of evidence from p-values
• p-value > 0.10: not much evidence against the null hypothesis
• 0.05 < p-value < 0.10: moderate evidence against the null hypothesis
• 0.01 < p-value < 0.05: strong evidence against the null hypothesis
• p-value < 0.01: very strong evidence against the null hypothesis
Three S Strategy
• Statistic: Compute the statistic from the observed data.
• Simulate: Identify a model that represents a chance explanation. Repeatedly simulate values of the statistic that could have happened when the chance model is true and form a distribution. (Null distribution)
• Strength of evidence: Consider whether the value of the observed statistic is unlikely to occur when the chance model is true. (p-value)
Standardized statistics and 2-sided tests
• The standardized statistic is the number of standard deviations the observed statistic is above (or below) the mean of the null distribution. A sketch follows this list.
• Two-sided tests increase the p-value. (It about doubles in simulation-based tests and exactly doubles in theory-based tests.)
• Two-sided tests are said to be more conservative: more evidence is needed to conclude the alternative.
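A minimal helper (my illustration) for computing a standardized statistic from a simulated null distribution:

```python
import numpy as np

def standardized_statistic(observed, null_stats):
    """Number of SDs the observed statistic sits above/below the null mean."""
    null_stats = np.asarray(null_stats)
    return (observed - null_stats.mean()) / null_stats.std()

# Example: Buzz's 15/16 against a simulated null of sample proportions.
rng = np.random.default_rng()
null = rng.binomial(16, 0.5, size=1000) / 16
print(standardized_statistic(15 / 16, null))
```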
Biased / Simple Random Sampling
• A sampling method is biased if statistics from samples consistently over- or under-estimate the population parameter.
• A simple random sample is the easiest way to ensure that your sample is unbiased. Taking SRSs allows us to infer our results to the population from which the sample was drawn.
• Simple random sampling is a way of selecting members of a population so that every sample of a certain size from the population has the same chance of being chosen.
Types of Variables
• When two variables are involved in a study, they are often classified as explanatory and response.
• Explanatory variable (independent, predictor): the variable we think is "explaining" the change in the response variable.
• Response variable (dependent): the variable we think is being impacted or changed by the explanatory variable.
Random assignment / Causation
• Confounding variables are controlled in experiments due to the random assignment of subjects to treatment groups, since this tends to balance out all other variables between the groups.
• Thus, cause-and-effect conclusions are possible in experiments through random assignment. (It must be a well-run experiment.)
Random vs. Random
• With observational studies, random sampling is often done. This allows us to make inferences from the sample to the population from which the sample was drawn.
• With experiments, random assignment is done. This allows us to conclude causation.
Overall Test
• We used one overall test (chi-squared test or ANOVA) when comparing more than two proportions or more than two means.
• We do these overall tests since performing many individual tests increases the possibility of making a Type I error (false positive, or false alarm).
• If significance is found in an overall test, then we can follow up with individual tests or confidence intervals.
Correlation
• Correlation measures the strength and direction of a linear association between two quantitative variables.
• Correlation is a number between −1 and 1.
  With positive correlation, one variable increases as the other increases.
  With negative correlation, one variable decreases as the other increases.
• The closer it is to either −1 or 1, the closer the points fit to a line.
Regression
• The least-squares regression line is the most common way of getting a mathematical model (linear equation) for an association between two quantitative variables.
• Slope is the predicted change in the response variable for a one-unit change in the explanatory variable.