Hypothesis tests, confidence intervals,
and bootstrapping
Business Statistics 41000
Fall 2015
1
Topics
1. Hypothesis tests
   - Testing a mean: H0 : µ = µ0
   - Testing a proportion: H0 : p = p0
   - Testing a difference in means: H0 : µ1 − µ2 = 0
   - Testing a difference in proportions: H0 : p1 − p2 = 0
   - Testing a difference in means: H0 : µ1 − µ2 = 0 (paired sample)
   - Testing a difference in means: H0 : µ1 − µ2 = 0 (same variance)
   - Simulating from a null distribution
2. Confidence intervals
3. Bootstrap confidence intervals
Read chapter 15 from Kaplan, chapter 9 in Naked Statistics and chapters
4-6 of OpenIntro
2
Homeless guys don’t wear nice shoes
The guy asking for your change outside Alinea is wearing a posh pair of
loafers. Would you be willing to conclude that he’s not actually homeless
on the basis of this evidence?
To make things numerical, assume we recognize the shoes and know for a
fact they cost $285, or 5.65 log dollars.
Also assume that the distribution of log-prices of homeless guys’ shoes is
described by X ∼ N(3.7, 0.6²). Then we find P(X > 4.69) = 0.05 (using
qnorm(0.05, 3.7, 0.6, lower.tail = FALSE) in R).
So, if we call out all supposed homeless guys with shoes worth more than
exp(4.69) = $108 we’ll only do so incorrectly 5% of the time.
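The slides compute this cutoff with R’s qnorm; the same check can be sketched with Python’s standard library (a stand-in for the course’s R, not its actual code):

```python
from math import exp
from statistics import NormalDist

# Null distribution of log shoe prices for homeless guys: N(3.7, 0.6^2)
null = NormalDist(mu=3.7, sigma=0.6)

# Cutoff leaving 5% probability in the right tail; this is what
# qnorm(0.05, 3.7, 0.6, lower.tail = FALSE) returns in R
cutoff = null.inv_cdf(0.95)
print(round(cutoff, 2))    # ~4.69 log dollars
print(exp(cutoff))         # ~$108-109 on the dollar scale
```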
3
Homeless guys don’t wear nice shoes
[Figure: “Shoe price distribution for homeless dudes” — density vs. price in dollars, 0 to 150.]
4
Homeless guys don’t wear nice shoes
[Figure: “Homeless guy shoe prices in log-dollars” — density vs. x, 1 to 7.]
5
Homeless guys don’t wear nice shoes
To turn this classification problem into a hypothesis testing problem,
we must phrase the question in terms of probability distributions and
their parameters.
Assume the data we observe — the shoe price — was a draw from a
normal probability distribution with an unknown mean, µ, and a known
variance (for now), σ² = 0.6².
If the guy were homeless, then µ = 3.7. So we want to test the
hypothesis
H0 : µ = µ0
where µ0 = 3.7.
6
Logic of hypothesis tests
Consider a normal random variable
X ∼ N(µ, σ²).
[Figure: density of X, x-axis marked from µ − 3σ to µ + 3σ.]
7
Logic of hypothesis tests
Imagine observing a single draw of this random variable, call it x.
Assume the variance σ² is known, but the mean parameter µ is not.
[Figure: density of X, x-axis marked from µ − 3σ to µ + 3σ.]
8
Logic of hypothesis tests
Intuitively, this single observed value x tells us something about the
unknown parameters: more often than not, the observed value will tend
to be close to the parameter value.
[Figure: density of X, x-axis marked from µ − 3σ to µ + 3σ.]
Then again, sometimes it will not be. But it will only rarely be too far off.
9
Logic of hypothesis tests
Assume we have a guess in mind for our true parameter value. We
denote this guess by µ0 , pronounced “mew-naught”.
We refer to this as the “null hypothesis”, which we write:
H0 : µ = µ0 .
The symbol µ is the “true value” and µ0 is the “hypothesized value”.
10
Logic of hypothesis tests
Hypothesis testing asks the following question: if the true value were µ0,
is my data in an unlikely region?
[Figure: density under the null, x-axis from µ0 − 3σ to µ0 + 3σ, with the rejection region shaded.]
If we consider it too unlikely, we decide not to believe our hypothesis and
we “reject the null hypothesis”.
11
Logic of hypothesis tests
On the other hand, if the data falls in a likely region, we decide our
hypothesis was plausible and we “fail to reject” the null hypothesis.
[Figure: density under the null, x-axis from µ0 − 3σ to µ0 + 3σ.]
12
Level of tests
Where do we put the rejection region? In general, it depends on the
problem (more on that in a minute).
[Figure: density under the null, x-axis from µ0 − 3σ to µ0 + 3σ, with a shaded rejection region.]
But one thing is always true: the probability of the rejection region (the
area under the curve) dictates how often we will falsely reject the null
hypothesis. This is called the level of the test.
13
Level of tests
Because when the null hypothesis is true, we still end up in unusual areas
sometimes. How often this happens is exactly the level of the test.
[Figure: density under the null, x-axis from µ0 − 3σ to µ0 + 3σ, with the rejection region shaded.]
14
Where to put the rejection region
One way to think about rejection regions is in terms of alternative
hypotheses, such as
HA : µ > µ0.
[Figure: density under the null with a right-tail rejection region, x-axis from µ0 − 3σ to µ0 + 3σ.]
I prefer to think of it the other way around: where we place our rejection
region dictates what the alternative hypothesis is, because it determines
what counts as unusual.
15
Where to put the rejection region
For HA : µ < µ0 the rejection region is on the other side.
[Figure: density under the null with a left-tail rejection region, x-axis from µ0 − 3σ to µ0 + 3σ.]
In all of the pictures so far, the level of the test has been α = 0.05.
There is nothing special about that number.
16
Where to put the rejection region
We could even have a rejection region in a small sliver around the null
hypothesis value.
[Figure: density under the null with a narrow rejection region around µ0, x-axis from µ0 − 3σ to µ0 + 3σ.]
Perhaps this would reflect evidence of cheating of some sort: the data fit
too well.
17
More than one observation
To apply this logic to more than one data point, we simply collapse our
data into a single number, or statistic, and figure out the sampling
distribution of this statistic. Then we proceed as before.
In this lecture we will use sample means as our test statistic.
Conveniently, if we have n samples, each drawn independently as
Xi ∼ iid N(µ, σ²), we have the result that
X̄ ∼ N(µ, σ²/n)
where X̄ = n⁻¹ Σⁿᵢ₌₁ Xi.
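A quick simulation (sketched in Python here; the course itself uses R) confirms that averaging n independent draws shrinks the standard deviation by a factor of √n:

```python
import random
from statistics import stdev

random.seed(41000)  # seed chosen arbitrarily for reproducibility
mu, sigma, n = 3.7, 0.6, 25

# Simulate many sample means, each built from n iid N(mu, sigma^2) draws
xbars = [sum(random.gauss(mu, sigma) for _ in range(n)) / n
         for _ in range(20_000)]

# The spread of the sample means should be close to sigma / sqrt(n) = 0.12
print(round(stdev(xbars), 3))
```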
18
Did something change?
You have implemented a new incentive policy with your sales force and
you want to measure if the new policy is translating to increased sales.
Previously, sales hovered around $50,000 a week, with a standard deviation
of $6,000. The first five weeks under the new policy have produced the
following sales figures (in thousands of dollars):
[61, 52, 48, 43, 65].
Do you reject the null hypothesis that nothing has changed?
19
Did something change?
Our test statistic is X̄. Under the null distribution,
X̄ ∼ N(50, 6²/5).
We observe x̄ = 269/5 = 53.8.
We want our “unusual” region to be unusually high sales. At a level of
10% the rejection region starts at 53.44, so we reject.
At a level of 5%, the rejection region starts at 54.4, so we fail to reject.
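These rejection thresholds can be checked with a short computation (a Python sketch of what the slides do with R’s qnorm):

```python
from statistics import NormalDist

# Null distribution of the sample mean: N(50, 6^2/5)
null = NormalDist(mu=50, sigma=6 / 5 ** 0.5)

cut10 = null.inv_cdf(0.90)   # start of the 10% rejection region (~53.44)
cut05 = null.inv_cdf(0.95)   # start of the 5% rejection region  (~54.41)
xbar = sum([61, 52, 48, 43, 65]) / 5   # observed mean, 53.8

print(round(cut10, 2), round(cut05, 2), xbar)
print(cut10 < xbar < cut05)   # reject at 10%, but not at 5%
```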
20
Did something change?
[Figure: null density of X̄, x-axis 40 to 60, with the rejection regions marked.]
The empirical or sample mean falls in the 10% rejection region (but not
the 5% rejection region).
21
p-values
The smallest level at which we would reject given our observed value is
called the p-value of the data.
In other words, the p-value is the probability of seeing data as, or more,
extreme than the data actually observed.
The p-value therefore depends on the shape of the rejection region.
A p-value larger than the level of a test implies that you fail to reject;
a p-value smaller than the level of a test implies you reject.
22
Application to a proportion
We ask n = 50 cola drinkers if they prefer Coke to Pepsi; 28 say they do.
Can we reject the null hypothesis that the two brands have evenly split
the local market?
We can approach this problem using a normal approximation to the
binomial. Under the null distribution, the proportion of Coke drinkers has
an approximate
N(0.5, 0.5²/50)
distribution.
We observe x̄ = 28/50 = 0.56. The p-value is the area under the curve
below 0.44 plus the area above 0.56.
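This two-sided p-value is easy to compute directly; a minimal Python sketch (the helper name `two_sided_p` is mine, not the course’s):

```python
from statistics import NormalDist

def two_sided_p(p_hat, p0, n):
    """Two-sided p-value for H0: p = p0 under the normal approximation."""
    null = NormalDist(mu=p0, sigma=(p0 * (1 - p0) / n) ** 0.5)
    # Area below p0 - |p_hat - p0| plus the equal area above p0 + |p_hat - p0|
    return 2 * null.cdf(p0 - abs(p_hat - p0))

print(round(two_sided_p(0.56, 0.5, 50), 2))   # ~0.40
print(round(two_sided_p(0.56, 0.5, 200), 2))  # ~0.09 at the larger sample size
```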
23
Coke vs Pepsi
The p-value is an uncompelling 40%.
[Figure: null density of the sample proportion with the two-sided p-value area shaded.]
What would happen if we had the same observed proportion of Coke
drinkers, but a sample size of 200?
24
Coke vs Pepsi
At n = 200 we have reduced our standard deviation by a factor of 2.
[Figure: null density of the sample proportion at n = 200; the curve is twice as tight.]
Our p-value drops to 0.09.
25
Variance unknown
So far we have been considering normal hypothesis tests when the
variance σ 2 is known. Very often it is unknown.
But if we have a sample of reasonable size (say, more than 30), then we
can use a plug-in estimate without much inaccuracy.
That is, we use the empirical standard deviation (the sample standard
deviation) as if it were our known standard deviation: we treat σ̂ as if it
were σ.
26
Are mountain people taller?
It is claimed that individuals from the Eastern mountains are much taller,
on average, than city dwellers who are known to have an average height
of 67 inches.
Of course, some mountain people are hobbits, so obviously there is a lot
of variability.
Based on a sample of 35 mountain people we measured, we find x̄ = 73
and σ̂ = 12.6.
Can we reject the null hypothesis that there is no difference in height?
27
Are mountain people taller?
We assume that our test statistic is distributed X̄ ∼ N(67, 12.6²/35)
under the null distribution.
[Figure: null density of X̄, x-axis 60 to 85.]
Our p-value is 0.00242.
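The p-value on this slide can be reproduced with the plug-in standard deviation (a Python stand-in for the course’s R computation):

```python
from statistics import NormalDist

n, xbar, mu0, s = 35, 73, 67, 12.6   # sample size, sample mean, null mean, plug-in sd
null = NormalDist(mu=mu0, sigma=s / n ** 0.5)

# One-sided p-value: probability of a sample mean at least this large
p = 1 - null.cdf(xbar)
print(round(p, 5))   # 0.00242
```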
28
Z scores
In normal hypothesis tests where the rejection region is in the tail, we are
essentially measuring the distance of our observed measurement from the
mean under the null distribution. “How far is too far” is determined by
the level of our test and by the standard deviation under the null.
To get a sense of how far into the tail an observation is, we can
standardize our observation.
If X ∼ N(µ, σ²), then (X − µ)/σ ∼ N(0, 1). Applying this idea to a normal
test statistic tells us how many standard deviations away from the mean our
observed value is.
In this last example we would get z = (x̄ − 67)/(12.6/√35) = 2.82.
29
Z scores
The usefulness of this approach is mainly that we can remember a few
special rejection regions.
P(Z > 2.33) = 1%
P(Z > 1.64) = 5%
P(Z > 1.28) = 10%
This defines rejection regions for one-sided tests at the 1%, 5%, and
10% levels respectively. (Include a negative sign as the circumstances require.)
The analogous two-sided thresholds are given by
P(Z > 2.57) = 0.5%
P(Z > 1.96) = 2.5%
P(Z > 1.64) = 5%.
We arrive at these numbers by dividing the test level by 2 and putting
half of it in the left tail and half of it in the right tail.
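These memorable cutoffs are easy to verify against the standard normal (a quick Python check, standing in for R’s pnorm):

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal N(0, 1)

# Tail probabilities behind the one-sided cutoffs quoted on the slide
tails = {z: 1 - Z.cdf(z) for z in (2.33, 1.64, 1.28)}
for z, p in tails.items():
    print(z, round(p, 3))   # roughly 1%, 5%, and 10% respectively
```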
30
Difference of two means
A common use of hypothesis testing is to compare the means between
two groups based on observed data from each group.
For example, we may want to compare a drug to a placebo pill in terms
of how much it reduces a patient’s weight. In this case we have
Xi ∼ N(µX , σX²)
and
Yj ∼ N(µY , σY²)
for i = 1, . . . , n and j = 1, . . . , m.
Our test statistic in this case will be X̄ − Ȳ , the difference in the
observed sample means.
31
Better than a placebo?
Our test is
H0 : µX − µY = 0,
HA : µX > µY ,
which defines a rejection region in the right tail.
The test statistic has null distribution of
X̄ − Ȳ = D̄ ∼ N(0, σX²/n + σY²/m)
which we approximate as
N(0, σ̂X²/n + σ̂Y²/m).
32
Better than a placebo?
We observe that 34 patients receiving treatment have a mean reduction
in weight of 5 pounds with standard deviation of 4 pounds. The 60
patients in the placebo group show a mean reduction in weight of 3
pounds with a standard deviation of 6 pounds.
Can we reject the null hypothesis at the 5% level?
In this case
z = ((5 − 3) − 0) / √(4²/34 + 6²/60) = 1.933,
so we reject at the 5% level because P(Z > 1.933) < 5%.
If this were a 5% two-sided test, would we reject?
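The arithmetic above can be replayed in a few lines (a Python sketch; the slides work in R):

```python
from statistics import NormalDist

# Treatment: n = 34, mean drop 5 lb, sd 4; placebo: m = 60, mean drop 3 lb, sd 6
se = (4 ** 2 / 34 + 6 ** 2 / 60) ** 0.5
z = ((5 - 3) - 0) / se
p = 1 - NormalDist().cdf(z)   # one-sided p-value

print(round(z, 3))        # 1.933
print(p < 0.05)           # reject at the one-sided 5% level
print(2 * p < 0.05)       # the two-sided 5% question: compare 2p to 0.05
```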
33
Difference in proportions
Suppose we try to address the Coke/Pepsi local market share with a
different kind of survey in which we conduct two separate polls and ask
each person either “Do you regularly drink Coke?” or “Do you regularly
drink Pepsi?”.
With this set up we want to know if pX = pY .
Suppose we ask 40 people the Coke question and 53 people the Pepsi
question. In this case the observed difference in proportions has
approximate distribution
D ∼ N(0, s²)
where
s = √(p1(1 − p1)/40 + p2(1 − p2)/53).
34
Difference in proportions
In practice we have to use
ŝ = √(p̂1(1 − p̂1)/40 + p̂2(1 − p̂2)/53).
If 30/40 people say that they regularly drink Coke and 30 out of 53
people say they regularly drink Pepsi, do we reject the null hypothesis at
the 10% level?
We find
ŝ = √(0.75(1 − 0.75)/40 + 0.566(1 − 0.566)/53) = 0.09655,
so z = ((0.75 − 0.566) − 0)/ŝ = 1.905.
Do we reject at the 5% level?
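Here is the same computation spelled out (a Python stand-in for the slides’ R arithmetic):

```python
from statistics import NormalDist

p1, n1 = 30 / 40, 40   # Coke poll
p2, n2 = 30 / 53, 53   # Pepsi poll

# Plug-in standard error for the difference in proportions
s_hat = (p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2) ** 0.5
z = (p1 - p2) / s_hat

print(round(s_hat, 5), round(z, 3))   # ~0.09655 and ~1.905
# Compare z with 1.28 (one-sided 10%), 1.64 (one-sided 5%), 1.96 (two-sided 5%)
print(1 - NormalDist().cdf(z))        # one-sided p-value
```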
35
Paired samples
As a final variant, sometimes data from two groups comes paired, which
changes yet again how we approximate our variance term.
Suppose I want to know which route to work is faster. On each day in
February I take the Lakeshore route and my colleague takes the MLK
route.
I average 22.2 minutes with a standard deviation of 3.75 minutes and he
averages 20.8 minutes with a standard deviation of 3.96 minutes. Which
way is faster?
36
Paired sample
Because the samples are paired (in the sense of each happening on the
same day) we can directly approximate the variance of the difference
Di = Xi − Yi by
σ̂D² = n⁻¹ Σⁿᵢ₌₁ (di − d̄)²
where di = xi − yi .
So the extra bit of information we need is that the standard deviation of
the daily difference between the commute times was 4 minutes.
In this case we have, under the null,
D̄ = X̄ − Ȳ ∼ N(0, 4²/28)
and the observed difference d̄ = 1.4 leads to z = (d̄ − 0)/(4/√28) = 1.85.
We cannot reject at the 5% level.
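A short sketch of the paired computation (in Python rather than the course’s R; note that "cannot reject at 5%" here matches a two-sided comparison of z = 1.85 against 1.96):

```python
from statistics import NormalDist

# Paired commutes: n = 28 February days, mean daily difference 1.4 minutes,
# sd of the daily differences 4 minutes (given on the slide)
n, dbar, sd_diff = 28, 1.4, 4

z = (dbar - 0) / (sd_diff / n ** 0.5)
p = 2 * (1 - NormalDist().cdf(z))   # two-sided: either route could be faster

print(round(z, 2))   # 1.85
print(p > 0.05)      # True: cannot reject at the 5% level
```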
37
Practical vs. Statistical significance
Suppose that the average commute times differ by 2 seconds. Do we
care?
If we have enough observations, we will eventually reject the null
hypothesis and conclude that they are “statistically significantly
different”.
However, that result says nothing of the effect size — the magnitude of
the difference.
This inability of the statistical machinery to distinguish between the two
types of significance is sometimes called Lindley’s paradox:
With enough data, you’ll reject any hypothesis at all!
38
Power
See section 4.6 in OpenIntro.
[Figure: power curves (probability of rejecting vs. the underlying mean) for a
two-sided and a one-sided right-tail test at the 5% level, with se = 1.]
The probability of rejecting the null hypothesis is called the power of the
test. It will depend on the actual underlying value. The level of a test is
precisely the power of the test when the null hypothesis is true.
39
Power
[Figure: the same power curves (two-sided and one-sided right tail, 5% level) with se = 0.1.]
The power function gets more “pointed” around the null hypothesis value
as the sample size gets larger (which makes the standard error smaller).
40
Other test statistics
Nothing obliges us to use the sample mean as our test statistic, other
than convenience.
A manufacturing facility produces 15 golf carts a day. Some days it
produces more, some days less.
We can model this variability with a normal distribution with mean 15
and standard deviation 3.
The last 25 production days saw very low production numbers. Has the
production facility changed: has the N(15, 3²) description of the
production variability become something left-skewed?
41
Skewed golf cart production
To test this hypothesis we consider the difference between the mean
number of carts produced in the last period (13.96) and the median (15).
We simulate (using R) from the null hypothesis (nothing has changed)
and find that the distribution of the difference D has the following
quantiles:
quantile   0.5%    1%      2.5%    5%
D          −1.15   −1.03   −0.87   −0.72
How many total carts were produced in the last production period?
What is the statistic used in this hypothesis test?
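The slides run this null simulation in R; a by-hand Python sketch of the same idea (the seed and number of replications are my choices, and the simulated quantiles only approximate the table above):

```python
import random
from statistics import mean, median, quantiles

random.seed(41000)

# Simulate the null: 25 production days, each N(15, 3^2), and record
# the statistic D = sample mean - sample median for each simulated period
sims = []
for _ in range(20_000):
    days = [random.gauss(15, 3) for _ in range(25)]
    sims.append(mean(days) - median(days))

q005 = quantiles(sims, n=200)[0]   # 0.5% quantile of the null distribution of D
q05 = quantiles(sims, n=20)[0]     # 5% quantile
print(round(q005, 2), round(q05, 2))   # near the slide's -1.15 and -0.72
```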
42
Skewed golf cart production
quantile   0.5%    1%      2.5%    5%
D          −1.15   −1.03   −0.87   −0.72
Is a one-sided test or a two-sided test more appropriate?
Would we reject the null hypothesis at the 1% level?
43
Confidence intervals
A confidence interval consists of two numbers, a lower bound and an
upper bound.
The idea is to give a range of possible values for the
true/unobserved/underlying parameter. The basic goal is to achieve
coverage: we want our interval to “capture” the true value more often
than not.
One way to guarantee this is to make our intervals huge, but we can be
more clever by using ideas from hypothesis testing.
44
Confidence intervals
We consider normal confidence intervals for simplicity. Let X ∼ N(µ, σ 2 ).
We know the following facts (with a bit of algebra)
P(µ − 1.96σ < X < µ + 1.96σ) = 0.95
P(−1.96σ < X − µ < 1.96σ) = 0.95
P(−X − 1.96σ < −µ < −X + 1.96σ) = 0.95
P(X − 1.96σ < µ < X + 1.96σ) = 0.95.
This tells us that the random interval will overlap the mean 95% of the
time.
45
Confidence interval
This means that for normally distributed data, if we construct our
interval estimate as
x ± 1.96σ
where x is our observed data, that we will cover the true value 95% of
the time.
Naturally, we can apply this to X̄ as well, yielding an interval of the form
x̄ ± 1.96 σ/√n.
46
Confidence interval
If we use a different number instead of 1.96, we can get different levels of
coverage. For instance, an interval of the form
x ± 1.64σ
has 90% coverage.
You will notice a straightforward relationship between confidence
intervals and hypothesis tests:
- A null hypothesis value µ0 outside the confidence interval implies an
  observation inside the rejection region.
- A null hypothesis value µ0 inside the confidence interval implies an
  observation outside of the rejection region.
47
Asymmetric confidence intervals
Confidence intervals don’t have to be symmetric.
By the same algebra as before
P(µ − a < X < µ + b) = 0.95
P(−a < X − µ < b) = 0.95
P(−X − a < −µ < −X + b) = 0.95
P(X − b < µ < X + a) = 0.95.
48
Asymmetric confidence intervals
Such a confidence interval is based on an asymmetric rejection region.
[Figure: density with an asymmetric rejection region, x-axis from µ − 3σ to µ + 3σ.]
49
Simulation demo
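The demo itself is not in the transcript; one natural simulation, sketched here in Python (the course would use R), checks the coverage claim directly: build the interval x̄ ± 1.96 σ/√n over and over and count how often it captures the true µ.

```python
import random

random.seed(41000)
mu, sigma, n = 67, 12.6, 35   # true mean, known sd, sample size
reps = 10_000

covered = 0
for _ in range(reps):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    xbar = sum(sample) / n
    half = 1.96 * sigma / n ** 0.5
    # Does the interval xbar +/- 1.96*sigma/sqrt(n) capture the true mu?
    if xbar - half < mu < xbar + half:
        covered += 1

print(covered / reps)   # close to 0.95
```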
50
Bootstrapping
Recall the idea of bootstrapping introduced last lecture. To get a sense
of our sampling variability, we simply resample our data (with
replacement) to get a sample of the same size n.
We compute our estimate on this sample over and over (thousands of
times) and visualize the results in a histogram.
We can use this approach to construct a bootstrap confidence interval:
an interval which contains 1 − α of the bootstrap estimates.
51
Bootstrapping a mean
[Figure: histogram (frequency vs. height in inches, 50 to 90) of the 35 sampled heights.]
A normal confidence interval would be 73 ± 1.96(12.6)/√35.
52
Bootstrap samples
Here is the code.
There are dedicated R packages for bootstrapping, but this one is “by
hand”.
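The R code from this slide did not survive the transcript. Here is a by-hand bootstrap sketch in Python instead, using a simulated stand-in for the 35 height measurements (the real data is not in the transcript, so the exact interval endpoints will differ from the slide’s):

```python
import random
from statistics import mean, quantiles

random.seed(41000)

# Stand-in for the 35 measured heights (the actual data is not available here)
heights = [random.gauss(73, 12.6) for _ in range(35)]

# Bootstrap "by hand": resample with replacement, recompute the mean, repeat
boot_means = [
    mean(random.choices(heights, k=len(heights)))
    for _ in range(10_000)
]

# 95% bootstrap confidence interval: the central 95% of the bootstrap means
cuts = quantiles(boot_means, n=40)    # 2.5%, 5%, ..., 97.5%
lo, hi = cuts[0], cuts[-1]
print(round(lo, 2), round(hi, 2))
```

The dedicated R packages (or Python libraries) do the same resampling loop with more bells and whistles; nothing here goes beyond sampling with replacement and taking quantiles.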
53
Bootstrap confidence intervals
Here is a side-by-side comparison of the two results.
[Figure: the two intervals plotted side by side on a scale from 68 to 78 inches.]
We find (68.83, 77.17) with the standard approach (black) and
(68.75, 77.01) with the bootstrap approach (red).
54