2 Hypothesis testing as the scientific path for significant results
Hypothesis testing
TRIBE statistics course
Split, spring break 2016
Goal
Concept of the null hypothesis H0
Know the procedure of hypothesis testing
Errors of type 1 (false positive) and type 2 (false negative)
2
When to take action
Auric Goldfinger (in James Bond 'Goldfinger'):
'Mr. Bond, they have a saying in Chicago:
Once is happenstance.
Twice is coincidence.
The third time it is enemy action.'
3
Trouble with knowledge
From which certainty level on do you claim to know rather than believe?
Maybe never
=> hardly any knowledge at all
Personal choice
=> preferences matter
Varying personal α level in different settings: dice versus lottery
In statistics
Only negative results (rejections) can be certain
The p-value is the threshold that settles the result for every α level
Never sure despite statistically significant results
4
Standard procedure
1. Formulate a null hypothesis H0
2. Identify a test statistic
3. Compute the p-value
4. Compare p-value versus α level
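The four steps can be sketched on a coin-toss example. This is only an illustration under stated assumptions: the sample (60 heads in 100 tosses), the α level, and the function name `binomial_two_sided_p` are hypothetical and not part of the course material.

```python
from math import comb

def binomial_two_sided_p(heads: int, tosses: int) -> float:
    """Exact two-sided p-value for H0: P(heads) = 0.5."""
    expected = tosses / 2
    observed_distance = abs(heads - expected)
    # Sum the H0 probabilities of all outcomes at least as far from the
    # expected count as the observed one.
    return sum(
        comb(tosses, k) * 0.5 ** tosses
        for k in range(tosses + 1)
        if abs(k - expected) >= observed_distance
    )

# 1. Null hypothesis H0: the coin is fair.
# 2. Test statistic: number of heads in the sample.
heads, tosses = 60, 100          # hypothetical sample
# 3. p-value under H0.
p_value = binomial_two_sided_p(heads, tosses)
# 4. Compare with the alpha level chosen beforehand.
alpha = 0.05
reject_h0 = p_value < alpha
print(f"p-value = {p_value:.4f}, reject H0 at alpha = {alpha}: {reject_h0}")
```

In this hypothetical sample the p-value lies slightly above 5%, so H0 is not rejected at α = 5% but would be at α = 10%.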
5
The H0 world
Virtual world: omission of everything unnecessary
Model: connections between variables, distributions, parameters, ε
Not necessarily wrong: else, rejecting it hardly an achievement
6
Model with no error = a definition
Examples
Kelvin versus °C
linear with slope 1
Fahrenheit versus °C
linear
Variance versus standard deviation
quadratic
Measurement errors still possible
7
The falsification principle for H0
An outcome (of a test statistic) in the sample which is too extreme
(= less likely ex ante than α in percent)
leads to a rejection of the null hypothesis
If you live in an H0 world,
you wrongly reject the null in α (in percent) of independent samples
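A small simulation may make the last statement tangible. It is a sketch under assumptions that are not in the slides: normally distributed data with known σ, a two-sided z-test, and freely chosen values for `n`, `runs`, and the random seed.

```python
import random
from statistics import NormalDist

random.seed(1)                      # reproducible illustration
alpha, n, runs = 0.05, 30, 2000     # hypothetical settings
mu0, sigma = 0.0, 1.0               # H0 world: N(0, 1), sigma known
z_crit = NormalDist().inv_cdf(1 - alpha / 2)

rejections = 0
for _ in range(runs):
    sample = [random.gauss(mu0, sigma) for _ in range(n)]
    z = (sum(sample) / n - mu0) / (sigma / n ** 0.5)
    if abs(z) > z_crit:             # too extreme => reject although H0 holds
        rejections += 1

type1_rate = rejections / runs
print(f"share of wrong rejections: {type1_rate:.3f} (alpha = {alpha})")
```

The share of wrong rejections should land close to α, although every single sample was generated by the H0 world.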
8
Statistics can prove something wrong
Absolute
Realizations outside the distribution, like a 7 with a standard die
With any (freely chosen) degree of conviction but never certainty
Realizations that would have been unlikely ex ante
(corresponds to the standard hypothesis testing)
Failure
Wrong decision about the null hypothesis due to random effects
(errors of type 1 and 2)
9
Statistics cannot prove something to be correct
Without (model or measurement) errors, there is no need for statistics
With errors, the result could (almost always) result from those
(depending on the possible outcomes of the error under the null)
Even if the sample outcome is 'likely' under the null hypothesis,
it could truly result from another distribution,
and this other distribution must satisfy no other condition than
assigning a positive probability to the outcome in the sample
=> Prove, no – Support, yes
10
1-sided versus 2-sided tests
Choice depends on the alternative to H0
If you suspect that the true value of your test statistic exceeds the average of this test statistic under the null hypothesis, a relatively low value in the sample does not support your alternative H1
Once the direction of the deviation is given by the sample, of course the 2-sided test sets a stricter threshold for the rejection of the H0
Story matters ex ante, not only ex post (otherwise, the choice of 'only' a 1-sided test might be considered as fiddling)
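The stricter threshold of the 2-sided test can be illustrated with a standard normal test statistic. The value `z = 1.8` and the α level are hypothetical, and the normal distribution is an assumption made for simplicity.

```python
from statistics import NormalDist

z = 1.8  # hypothetical standardized test statistic, above the H0 average
std_normal = NormalDist()

p_one_sided = 1 - std_normal.cdf(z)             # H1: true value exceeds H0 value
p_two_sided = 2 * (1 - std_normal.cdf(abs(z)))  # H1: true value differs from H0 value

alpha = 0.05
print(f"1-sided p = {p_one_sided:.4f}, reject: {p_one_sided < alpha}")
print(f"2-sided p = {p_two_sided:.4f}, reject: {p_two_sided < alpha}")
```

With these numbers the same sample leads to rejection in the 1-sided test only, which is why the direction has to be justified before looking at the data.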
11
Hypothesis testing calculator (example)
12
Hypothesis testing in EViews
SeriesViewDescriptive Statistics & TestsSimple hypothesis tests
Hypothesis Testing for HEIGHT
Date: 04/05/16 Time: 17:41
Sample (adjusted): 1 364
Included observations: 363 after adjustments
Test of Hypothesis: Mean = 183.0000
Sample Mean = 184.0937
Sample Std. Dev. = 10.91945
Method
t-statistic
Value
1.908256
13
Probability
0.0571
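The t-statistic in the EViews output above can be recomputed from the reported sample mean, standard deviation, and number of observations. The sketch below uses only these figures; the p-value relies on a normal approximation (an assumption, since EViews uses the t-distribution), so it matches the reported 0.0571 only roughly.

```python
from math import sqrt
from statistics import NormalDist

n, sample_mean, sample_sd = 363, 184.0937, 10.91945  # from the output above
h0_mean = 183.0

std_err = sample_sd / sqrt(n)
t_stat = (sample_mean - h0_mean) / std_err

# Assumption: with 362 degrees of freedom the t-distribution is close to the
# standard normal, so the normal p-value only approximates the reported one.
p_approx = 2 * (1 - NormalDist().cdf(abs(t_stat)))
print(f"t = {t_stat:.4f}, approximate two-sided p = {p_approx:.4f}")
```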
Limits to H1
None in principle but
statements only possible about the sample in relation to H0
H1 bound to changes in the H0 model parameters in most tests
no specific indication for the choice among alternative Hx
What to choose as the new null hypothesis after rejection
Rejection usually just indicates a region for better parameter values
(like mean > 0 instead of mean = 0)
Lower/upper bounds: the parameter values that are just rejected when taken as H0
Confidence intervals as a result (specific to the sample, not to H0)
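A confidence interval collects all H0 values that the sample would not reject. As a sketch, the interval for the HEIGHT sample from the EViews example can be computed as follows; the normal approximation for the critical value is an assumption.

```python
from math import sqrt
from statistics import NormalDist

n, sample_mean, sample_sd = 363, 184.0937, 10.91945  # HEIGHT sample from the slides
alpha = 0.05
std_err = sample_sd / sqrt(n)
z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # normal approximation (assumption)

# All H0 means inside [lower, upper] would not be rejected at this alpha level.
lower = sample_mean - z_crit * std_err
upper = sample_mean + z_crit * std_err
print(f"{1 - alpha:.0%} confidence interval: [{lower:.2f}, {upper:.2f}]")
```

The tested value 183 falls just inside the interval, in line with a p-value slightly above 5%.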
14
Type 1 error
Situation
H0 is true
The sample exhibits an extreme test statistic
H0 is therefore rejected
'Extreme' is a matter of opinion
The type 1 error probability is therefore set by the investigator
=> significance level α (the confidence level is 1 - α)
15
Type 2 error
Situation
H0 is not true
By chance, the sample test statistic does not
classify as 'extreme' under the null hypothesis
H0 is therefore not rejected
Type 2 error occurrence is usually the result of
the α level and the assumptions about H1
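How the α level and the assumed alternative drive the type 2 error can be sketched for a 1-sided z-test. The function name `type2_error`, the true shift of 0.3, σ = 1, and n = 50 are hypothetical choices for illustration.

```python
from math import sqrt
from statistics import NormalDist

def type2_error(true_shift: float, sigma: float, n: int, alpha: float) -> float:
    """P(no rejection) for a 1-sided z-test of H0: mean = 0 vs H1: mean > 0."""
    z_crit = NormalDist().inv_cdf(1 - alpha)
    # Under H1 the z-statistic is normal with mean true_shift * sqrt(n) / sigma.
    return NormalDist().cdf(z_crit - true_shift * sqrt(n) / sigma)

for alpha in (0.10, 0.05, 0.01):
    beta = type2_error(true_shift=0.3, sigma=1.0, n=50, alpha=alpha)
    print(f"alpha = {alpha:.2f} -> type 2 error = {beta:.3f}")
```

A stricter α level raises the type 2 error for the same assumed H1, so the investigator controls it only indirectly.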
16
Alternatives to the H0 tests
17
To do list
Acknowledge if you are still not sure
Be aware of the assumptions that your H0 implies
Choose and justify your new null hypothesis in case of rejection
Do not chase rejection by
data selection
indiscriminate adjustment of your theory to the data
lowering the requirements (higher α level)
Explain what (no) rejection of H0 means in your setting
Make sure that your null hypothesis is not obviously wrong
18
Questions?
19
Conclusion
Hypothesis testing works as follows
1. Formulate a null hypothesis H0
2. Identify a test statistic
3. Compute the p-value
4. Compare p-value and α level
More data usually helps
No rejection ≠ no effect
Choice of the α level (type 1 error), indirect control only over type 2
No real alternative to H0 hypothesis testing
20
H0 formulation
Goal
Formulate the desired result
Formulate the desired result in a testable way
Meet the requirements for a meaningful H0 and H1
22
Scientific approach: make your statement testable
Replication by repetition
at least in theory (some datasets are hard to replicate)
H0 provides a benchmark for every new sample
The more samples are tested, the more likely at least one rejection occurs even under the null
(with a predictable and hence testable frequency)
Predictions
best result of a theory if they come true
stronger in a different setting (X variables outside the first sample)
For equally not rejected hypotheses, trust the more convincing story
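The predictable frequency of rejections under the null follows from simple arithmetic. The sketch below assumes independent samples; the function name and the chosen numbers of samples are hypothetical.

```python
def prob_at_least_one_rejection(alpha: float, samples: int) -> float:
    """P(at least one rejection) across independent samples when H0 is true."""
    return 1 - (1 - alpha) ** samples

for m in (1, 5, 20, 100):
    print(m, round(prob_at_least_one_rejection(0.05, m), 3))
```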
23
Types of data stories (example online)
1. Change over time
2. Contrast
3. Drill down
4. Factors
5. Intersections
6. Outliers
7. Zoom out
8. …and more
24
Use existing work: statistics as a tool
'Standing on the shoulders of giants' (Isaac Newton)
Confirmation in a new setting
Country
Time
Topic
Extension of an existing model
Variables
Structure (parameters)
Error
Green field model
25
Model structure
X only
Correlation
Independence
Time series
X/Y
Form of the relationship (linear, logarithmic, etc.)
Parameters (number, flexibility, interaction)
Error distribution
Omitted variables of no or not enough relevance
26
Assumptions
Again
Model type
Correlations
Error term
How much do deviations from the assumptions hurt?
Check on the parameters by significance tests
Check on the error term by distribution and independence tests
Check on practical consequences by the explanatory content
Consequences in the real world?
27
Justification of the assumptions
Generally
Approximation, closeness confirmed by tests and words
Law of large numbers helps
Example: normally distributed sample means
High explanatory content helps
Error relatively unimportant
28
Interpretation of the model
Does the quantified version (= the model) represent the idea?
Interpretation error (misspecification)
Example: face recognition (Black people not detected)
=> The algorithm may have looked for optically dark features on a bright
face, while for some people the relation is inverted
Something seemingly unusual
which is actually 'normal'
If X and/or Y are proxies, how are they linked to the ideal measure?
29
Admissible interpretation after significant results
H0 rejected
acknowledging that the result may be driven by chance
at the α level (no certainty)
and at any stricter α level down to the p-value
Support for H1, indeed for any alternative not rejected when taken as H0
Inappropriate
H1 is true
Generalizing ('model is wrong' when only parameters are tested)
Any statement about the assumptions
30
The reality check
Appropriateness
Would H0 make sense?
Insight
Does H1 make sense?
Relevance
Does the rejection of the H0 change anyone's behavior?
31
Fix it in theory
New story
Transformation of
Y
effect of the explanatory variables on different aspects of Y
(absolute values, growth rate, etc.)
X
change of the relations between the explanatory variables
ε
as a consequence only
(the error term should not explain anything)
Transformations monotone in order to preserve the order
32
Fix it in practice
More data
Broader coverage
(application, geography, or time)
Clearer statistical results
(higher N)
Robustness
(more potential variables)
Predictive prior research results
(justification)
Story
(theoretical explanation)
33
Transferability
Transfer
geographically
outside the sample region of the x-variable
over time
to a technically analogous y-variable (similar behavior)
to an analogous y-variable in terms of content (similar explanation)
Valuable for predictions
Stability of the assumptions needed (model type, parameters, ε)
34
Data availability
Access
Awareness
Costs
Coverage
Format (tractability)
Extension later on
Permission to use, especially publication
Reliability
Size
Time
35
Simplification
Acceptable if the results are clear enough
Low p-value
High explanatory content (R²)
Everything that is significant and relevant should be modeled
Unequally distributed (but correlated) outside factors lead to distortions
Parameters change with other variables even if p stays below α
36
Application on other data sets
Statistics as the lowest hurdle
Methods transferable
Assumptions and interpretations matter
Advantage when building on previous work
37
To do list
Acknowledge if the desired H0/H1 combination cannot be rejected
ex ante: no existing test for the prevailing configuration
ex ante: appropriate data not available
ex post: sample lacks the required properties
Anticipate the distribution but not the realization of the sample
Be aware of the assumptions that your H0 implies (again)
Justify the α level you require
38
Questions?
39
Conclusion
Statistics of no help – H0 formulation is a purely conceptual process
Aim at H1 and choose H0 accordingly
Formulate H0 and H1, and only implementation & justification remain
Ask specific questions
Tests useful if you gain insights from (at least one possible) result
40
Mean comparison
Goal
Answer to 'Is there a difference on average?'
42
Why should we care about mean comparisons?
Usually meant when asking 'Is there a difference?'
Mean = expected (= 'true') value of the average
Applies to other statistics as well
Example: expected value of the variance
Basis for marginal effects in regressions
How much does the outcome change if the input increases by 1 unit
Easy application
43
Dice roller online (example)
Roll 1, roll 2, roll 52 – What tendency does the average have? How
likely do the extreme realizations (all 1 or all 6) seem?
44
Approaching the normal distribution
Roll 1 die => uniform ('equal') distribution, often associated with 'fair'
Roll 2 dice => eyes sum up to unequally likely numbers: symmetric,
higher probability of realizations in the middle
More dice (n)
Distribution gets wider
Support proportional to n (distance from minimum to maximum)
(for finite distributions: no possible realization of +/-infinity)
Width of the bulge (standard deviation of the sum) proportional to √n (= volatility)
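The two growth rates can be checked with the exact distribution of the sum of n fair dice. This is a minimal sketch; the helper names `dice_sum_distribution` and `std_dev` are made up for the illustration.

```python
from math import sqrt

def dice_sum_distribution(n: int) -> dict[int, float]:
    """Exact distribution of the sum of n fair six-sided dice."""
    dist = {0: 1.0}
    for _ in range(n):
        new_dist: dict[int, float] = {}
        for total, prob in dist.items():
            for face in range(1, 7):
                new_dist[total + face] = new_dist.get(total + face, 0.0) + prob / 6
        dist = new_dist
    return dist

def std_dev(dist: dict[int, float]) -> float:
    """Standard deviation of a discrete distribution given as {value: prob}."""
    mu = sum(x * p for x, p in dist.items())
    return sqrt(sum((x - mu) ** 2 * p for x, p in dist.items()))

for n in (1, 2, 4, 16):
    d = dice_sum_distribution(n)
    print(n, "range:", max(d) - min(d), "std:", round(std_dev(d), 3))
```

The range grows by 5 per additional die, while the standard deviation only doubles when the number of dice quadruples.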
45
Law of large numbers
Message
Sample mean → µ for larger n
Imagine sampling the entire population (n = N) => average (= sample mean) and µ coincide
The higher n, the fewer possibilities to 'drive the average away from µ'
(not true strictly speaking in a distribution with infinite realizations)
Types
Strong Law of Large Numbers
Weak Law of Large Numbers (Bernoulli's theorem is its special case for binary outcomes)
Law of Truly Large Numbers
(a consequence, no mathematical law)
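The message of the law of large numbers can be watched in a simulation of die rolls. The seed and the checkpoints are arbitrary choices; the expected value of one roll is 3.5.

```python
import random

random.seed(7)  # reproducible illustration
total = 0
for n in range(1, 100_001):
    total += random.randint(1, 6)        # one roll of a fair die
    if n in (10, 100, 1_000, 10_000, 100_000):
        print(f"n = {n:>6}: sample mean = {total / n:.4f}")

final_mean = total / 100_000             # expected value of one roll: 3.5
```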
46
Law of large numbers online (example)
47
The Central Limit Theorem (statement)
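Requires some mathematical expertise for a full appreciation
For higher n, the distribution of the standardized sample mean √n (x̄ - µ)/σ approaches the standard normal distribution N(0, 1)
(the convergence of the sample mean to the true µ itself is the law of large numbers)
Application: the average of a large sample is approximately normally distributed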
48
Central Limit Theorem (example)
49
CLT message
For large enough n, we can approximate the distribution of the sample mean arbitrarily well
by the normal distribution N(µ, σ²/n)
no matter what the criterion for 'well' is (a bold statement)
no matter what the distribution of X looks like (also a bold statement)
Main restriction: finite variance (and hence also existence of a mean), given independent draws
Consequence
Complete distribution of the statistic (here: sample mean) known
Knowledge above despite limited information (only the sample)
about the underlying distribution
More data solves any problem (here)
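The CLT message can be probed with a clearly non-normal distribution. The sketch assumes an exponential distribution with rate 1 (so µ = σ = 1); sample size, number of repetitions, and seed are hypothetical.

```python
import random
from math import sqrt

random.seed(3)                     # reproducible illustration
n, runs = 50, 4000                 # hypothetical sample size and repetitions
mu = sigma = 1.0                   # exponential distribution with rate 1

inside = 0
for _ in range(runs):
    sample_mean = sum(random.expovariate(1.0) for _ in range(n)) / n
    # Normal approximation N(mu, sigma^2 / n): about 95% within 1.96 std. errors
    if abs(sample_mean - mu) <= 1.96 * sigma / sqrt(n):
        inside += 1

share_inside = inside / runs
print(f"share within 1.96 standard errors: {share_inside:.3f}")
```

If the normal approximation works, the share should be close to 95% even though single draws are strongly skewed.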
50
Mean comparison thanks to the CLT
Likelihood assessment for the joint realizations of the sample results
Transformation possible to one statistic with a single distribution
Look at the difference µA - µB (H0: difference equals zero)
Independence (between the subsamples) helpful
Use of calculation rules for combined distributions
That way, one returns to the standard H0 testing
51
Standard test by reframing
Comparison of two distributions
Focus on 1 aspect (the mean)
2 subsamples have (potentially different) means => no clear H0
Solution by H0 = mean differences equal to zero
Setup of the test crucial (again)
Desired result with respect to content
Formulation of H0 and hence information about the test statistic
1-sided or 2-sided test for '≠, >, or <' according to the story
52
Mean comparison online (example)
53
Mean comparison in EViews
Quick / Group Statistics / Descriptive Statistics / Individual Samples
Choose the series, then in the group window View / Tests of equality
Test for Equality of Means Between Series
Date: 04/06/16 Time: 10:01
Sample: 1 10000
Included observations: 10000
Method                        df             Value      Probability
t-test                        361            26.03290   0.0000
Satterthwaite-Welch t-test*   328.5568       25.96266   0.0000
Anova F-test                  (1, 361)       677.7121   0.0000
Welch F-test*                 (1, 328.557)   674.0599   0.0000
*Test allows for unequal cell variances
Analysis of Variance
Source of Variation   df    Sum of Sq.   Mean Sq.
Between               1     28161.76     28161.76
Within                361   15001.05     41.55417
Total                 362   43162.82     119.2343
Category Statistics
Variable   Count   Mean       Std. Dev.   Std. Err. of Mean
HEIGHT_M   208     191.6971   6.395167    0.443425
HEIGHT_F   155     173.8903   6.514288    0.523240
All        363     184.0937   10.91945    0.573122
54
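The two t-statistics of the EViews output can be recomputed from the category statistics alone (count, mean, and standard deviation of HEIGHT_M and HEIGHT_F). The sketch uses the textbook formulas for the pooled and the Satterthwaite-Welch test; that EViews computes them in exactly this way is an assumption, although the figures agree.

```python
from math import sqrt

# Category statistics from the output above
n_m, mean_m, sd_m = 208, 191.6971, 6.395167   # HEIGHT_M
n_f, mean_f, sd_f = 155, 173.8903, 6.514288   # HEIGHT_F
diff = mean_m - mean_f                        # H0: difference equals zero

# Pooled t-test (assumes equal variances in both groups)
pooled_var = ((n_m - 1) * sd_m**2 + (n_f - 1) * sd_f**2) / (n_m + n_f - 2)
t_pooled = diff / sqrt(pooled_var * (1 / n_m + 1 / n_f))

# Satterthwaite-Welch t-test (allows unequal variances)
a, b = sd_m**2 / n_m, sd_f**2 / n_f
t_welch = diff / sqrt(a + b)
df_welch = (a + b) ** 2 / (a**2 / (n_m - 1) + b**2 / (n_f - 1))

print(f"pooled t = {t_pooled:.4f} (df = {n_m + n_f - 2})")
print(f"Welch  t = {t_welch:.4f} (df = {df_welch:.2f})")
```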
Comparison with a fixed value
Test just a special case of the mean comparison
Mean of the second group equal to the fixed value
Standard deviation (and variance) of the second group equal to zero
55
Third factors
Improper conclusions possible
No similarity required as to the distribution of XA and XB separately
Independence across groups required in standard tests
Outside factors could drive the differences in the sample means
Solution
Eliminate (suspected) third factors by forming uniform subgroups
Incorporate additional effects => regression models like OLS
Choice depending on the story and intended message
56
More than 2 groups
All together
ANOVA = ANalysis Of VAriance
Decomposition of the observed variance into components that stem
from different sources of variation across the subgroups
works for mean comparison as well despite the name
Pairwise: as before
With other explanatory variables: regression
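The variance decomposition can be sketched with the group statistics of the EViews height example (HEIGHT_M and HEIGHT_F). Only counts, means, and standard deviations are used; the variable names are chosen for the illustration.

```python
# Category statistics (count, mean, standard deviation) from the EViews example
groups = {"HEIGHT_M": (208, 191.6971, 6.395167),
          "HEIGHT_F": (155, 173.8903, 6.514288)}

n_total = sum(n for n, _, _ in groups.values())
grand_mean = sum(n * m for n, m, _ in groups.values()) / n_total

# Between: variation of the group means around the grand mean
ss_between = sum(n * (m - grand_mean) ** 2 for n, m, _ in groups.values())
# Within: variation of the observations around their own group mean
ss_within = sum((n - 1) * sd ** 2 for n, _, sd in groups.values())

df_between, df_within = len(groups) - 1, n_total - len(groups)
f_stat = (ss_between / df_between) / (ss_within / df_within)
print(f"between = {ss_between:.2f}, within = {ss_within:.2f}, F = {f_stat:.2f}")
```

With only two groups, the F-statistic equals the squared pooled t-statistic, which is why ANOVA and the t-test agree in that case.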
57
To do list
Reformulate your question in order to apply mean comparison
assumptions hardly needed (at the core)
distributional information about the test statistic as sine qua non
widely understood across audiences
Justify the assumptions if your test statistic exhibits a joint distribution
Think of third factors that could jointly influence your subgroups
58
Questions?
59
Conclusion
Mean comparisons are the typical research question
The law of large numbers roughly states that sample averages tend
towards the mean of the underlying distribution for larger samples
Zero difference between sample average and mean of the H0
in terms of expectations results from the LLN already
The central limit theorem roughly states that sample means exhibit
more and more a normal distribution as the sample gets larger
The CLT therefore provides an approximate full (!) distribution of the
sample average as a test statistic for hypothesis testing
Samples do not automatically contain all relevant information
=> The story still matters
60
Significance
Goal
Understand what happened in case of significance
Interpretation of statistical significance
See t-values and p-values as two sides of the same coin
62
Motivation
The art of applied statistics is
not getting a result – any method almost always yields a result
not the properties of the result – that depends on the data
but justifying why you may interpret the data the way you do
Significant results build the quantitative basis for your story
Open question
How much can one interpret into the quantitative result?
63
Standard levels
***  significance at the 0.1% level
**   significance at the 1% level
*    significance at the 5% level
†    significance at the 10% level
Alternative meanings widespread
=> Use legends when reporting
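A small helper shows how the legend translates into reporting. It follows the levels listed above, which are one convention among several; the function name `significance_label` and the example p-values are hypothetical.

```python
def significance_label(p_value: float) -> str:
    """Map a p-value to the symbols listed above (one convention among several)."""
    if p_value < 0.001:
        return "***"
    if p_value < 0.01:
        return "**"
    if p_value < 0.05:
        return "*"
    if p_value < 0.10:
        return "†"
    return ""

for p in (0.0004, 0.03, 0.0571, 0.4):
    print(f"p = {p}: '{significance_label(p)}'")
```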
64
What happened in the case of significance
Realization of a test statistic outside the (1-α) region
That is all
at the extreme(s) of the distribution for the sake of consistency
(otherwise, even more extreme outcomes would not imply rejection)
Bell-shape and only 1 dimension just for demonstration:
65
Chebyshev's inequality
Probability(|X - µ| ≥ kσ) ≤ 1/k² for k > 0
In words: The probability for the absolute distance of a realization to
the population mean to exceed k times the standard deviation is
never higher than the inverse of the squared value of k
Consequence of the limited area available under a probability
density function (namely 1 = 100%)
Only assumes the existence of µ and σ
Consequence:
Lower bound (1 - 1/k²) for the share of realizations within k standard deviations, upper bound (1/k²) for the share outside
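How loose the bound is for a well-behaved distribution can be seen by comparing it with the normal distribution. The comparison with the normal case is an added illustration, not part of the inequality itself.

```python
from statistics import NormalDist

def chebyshev_bound(k: float) -> float:
    """Upper bound on P(|X - mu| >= k * sigma), capped at 1."""
    return min(1.0, 1 / k ** 2)

for k in (1, 2, 3):
    normal_tail = 2 * (1 - NormalDist().cdf(k))   # exact share for a normal X
    print(f"k = {k}: bound = {chebyshev_bound(k):.4f}, normal = {normal_tail:.4f}")
```

The bound holds for any distribution with finite µ and σ, which is why it is far from tight for the normal distribution.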
66
T-statistics
Reports how many (estimated) standard deviations away
from the H0 value an estimated parameter lies
Often in standard tests, the standardized parameter estimates follow a t-distribution
a kind of normal distribution broadened to fit finite samples
degrees of freedom = parameter for 'fat tails'
Tails shrink with more degrees of freedom (1 to infinity)
Leads to significance if it exceeds the upper or falls below the lower threshold
67
P-value
The p-value
denotes the α level corresponding to indifference between rejection and no rejection of the H0
equals the minimum α level (the strictest requirement) at which the H0 is still rejected
represents the likelihood of an H0 world to produce a sample result at least as extreme as the sample data
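The last reading of the p-value can be taken literally in a simulation: draw the test statistic many times from the H0 world and count the results at least as extreme as the observed one. The standard normal H0 distribution, the observed value of 1.9, and the seed are assumptions for the sketch.

```python
import random
from statistics import NormalDist

random.seed(11)                  # reproducible illustration
observed = 1.9                   # hypothetical standardized sample result
runs = 20_000

# Draw the test statistic repeatedly from the H0 world (standard normal here)
# and count how often it is at least as extreme as the observed value.
extreme = sum(abs(random.gauss(0, 1)) >= observed for _ in range(runs))
p_simulated = extreme / runs
p_exact = 2 * (1 - NormalDist().cdf(observed))

print(f"simulated p = {p_simulated:.4f}, exact p = {p_exact:.4f}")
```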
68
T-statistics or p-values?
Equivalent at full information (degrees of freedom) in appropriate
settings (t-distribution prevails)
Historical reason for the reporting of t-statistics: tables for t-distributions
=> experts familiar with the thresholds of common degrees of freedom
Arguments in favor of the p-value
Immediate precise comparison to ANY alpha level (no calculation),
but requires several positions after the separator at high significance
Independent of the distribution type of the test statistic
T-statistic implicitly given as well
Adapt to the journal practice, otherwise report p-values
(exact without auxiliary means)
69
Improper interpretation of significance
Proves…
the proposed model to be right/wrong
that there is an effect
…
Consequences (repetition)
H1 is true
Generalizing ('model is wrong' when only parameters are tested)
Any statement about the assumptions
70
Interpretation of insignificant results
No significance ≠ no effect
Bad luck with the present sample
Effect of a different type (model) or size (parameter) than H1
Type 2 error (failure to reject a false H0)
No probability statement about the alternative hypotheses
(distributional assumptions only valid for the H0)
71
Interpretation of significant results
H0 rejected
about the test statistic, not necessarily the whole setting
at the α level and at any stricter α level down to the p-value
Support for H1 or any Hx not rejected when taken as H0
For single elements
model type not tested (possible indication via explanatory content)
parameters most likely the estimated ones from the sample
(for standard null hypotheses)
Story behind may set the new null at slightly different parameters
(usually round ones like 1 instead of 0.997)
72
A related concept: Value at Risk
Value on the x-axis that delimits the α region
VaR needs an ordered outcome, an α level, plus a time horizon
Key figure in risk management
Alternative: Profit at Risk
= absolute distance of the VaR threshold to the expected value
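As a rough sketch, both key figures can be read off an ordered list of outcomes. The simulated daily profits, the 1-day horizon, and the simple empirical quantile are assumptions for illustration, not a risk-management standard.

```python
import random

random.seed(5)                                   # reproducible illustration
# Hypothetical daily profits and losses over 1000 days (time horizon: 1 day)
outcomes = sorted(random.gauss(0.5, 10.0) for _ in range(1000))

alpha = 0.05
var_threshold = outcomes[int(alpha * len(outcomes))]   # empirical alpha quantile
expected_value = sum(outcomes) / len(outcomes)
profit_at_risk = expected_value - var_threshold        # distance to the mean

print(f"VaR threshold = {var_threshold:.2f}, Profit at Risk = {profit_at_risk:.2f}")
```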
73
The lure of highly reliable tests
Message: 'We get over 99.99% right'
Easy when the incidence rate is very low
Even trivial tests can accomplish that
Example
Test on 'Identical name as myself?'
among the world population
=> Already a plain 'No' (= no actual test
at all) gets many correct results
=> Important to know what any quality label
refers to exactly
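The arithmetic behind the example is short. The population size and the number of namesakes below are hypothetical round numbers chosen only to show the effect of a very low incidence rate.

```python
population = 1_000_000        # hypothetical population size
namesakes = 10                # hypothetical number of people sharing the name

# Trivial "test": always answer "No" without looking at anyone.
true_negatives = population - namesakes
false_negatives = namesakes
accuracy = true_negatives / population
sensitivity = 0 / namesakes   # share of actual namesakes that are detected

print(f"accuracy = {accuracy:.4%}, sensitivity = {sensitivity:.0%}")
```

The accuracy exceeds 99.99% although not a single namesake is found, so the quality label says little without knowing what it refers to.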
74
Size matters
Effect existence often anticipated
from the beginning (research plan)
Actual question often
How large is a particular effect?
How sure are we about this size?
(σ of the parameter)
How much is explained (R²)?
Interpretation relies on
significance
size
relevance
75
Wording
'Significant' reserved for test results
Without test, use
considerable
substantial
…
Avoid 'extraordinary' and the like because it implies an H0
which is neither formulated nor tested
76
Assumptions again
Significance results
from a model
including assumptions
confronted with sample data
Essentially, the assumed distribution of the error term in the model
determines the distribution of the estimated parameters
and hence the incidence of significance
Significant results do not make up for a badly set up model
77
To do list
Justify your α level
Use p-values instead of t-statistics
Use 'significant' only after tests
Verify ex ante that you could make sense out of significant results
Optional homework: develop a hypothesis how one variable in your
data might explain (or even cause) another one
78
Questions?
79
Conclusion
Significance is the driver of almost all quantitative statements
beyond descriptive statistics
using a test
and comes with an α level attached
Levels usually labeled by asterisks, no universal standard
Larger samples make results more reliable
Interpretation depends on what is effectively measured and tested
One's 'statistically significant other' exhibits special characteristics
along at least one dimension – However, this alone does not make
him or her necessarily 'the one'
80