Hypothesis testing
TRIBE statistics course, Split, spring break 2016

Goal
* Concept of the null hypothesis H0
* Know the procedure of hypothesis testing
* Errors of type 1 (false positive) and type 2 (false negative)

When to take action
* Auric Goldfinger (in the James Bond film 'Goldfinger'): 'Mr. Bond, they have a saying in Chicago: Once is happenstance. Twice is coincidence. The third time it is enemy action.'

Trouble with knowledge
* From which level of certainty on do you claim to know rather than believe?
  * Maybe never => hardly any knowledge at all
  * Personal choice => preferences matter
  * Varying personal α level in different settings: dice versus lottery
* In statistics
  * Only negative results for sure
  * Results for all α levels via the threshold given by the p-value
  * Never sure, despite statistically significant results

Standard procedure
1. Formulate a null hypothesis H0
2. Identify a test statistic
3. Compute the p-value
4. Compare the p-value to the α level

The H0 world
* Virtual world: omission of everything unnecessary
* Model: connections between variables, distributions, parameters, ε
* Not necessarily wrong: otherwise, rejecting it would hardly be an achievement

Model with no error = a definition
* Examples
  * Kelvin versus °C: linear with slope 1
  * Fahrenheit versus °C: linear
  * Variance versus standard deviation: quadratic
* Measurement errors still possible

The falsification principle for H0
* An outcome (of a test statistic) in the sample that is too extreme (= less likely ex ante than α percent) leads to a rejection of the null hypothesis
* If you live in an H0 world, you wrongly reject the null in α percent of independent samples

Statistics can prove something wrong
* Absolutely: realizations outside the distribution, like a 7 with a standard die
* With any (freely chosen) degree of conviction, but never with certainty: realizations that would have been unlikely ex ante (corresponds to standard hypothesis testing)
* Failure: wrong decision about the null hypothesis due to random effects (errors of type 1 and 2)
Statistics cannot prove something to be correct
* Without (model or measurement) errors, there is no need for statistics
* With errors, the result could (almost always) stem from those errors (depending on the possible outcomes of the error under the null)
* Even if the sample outcome is 'likely' under the null hypothesis, it could truly result from another distribution, and this other distribution must satisfy no condition other than assigning a positive probability to the outcome in the sample
* => Prove, no – support, yes

1-sided versus 2-sided tests
* The choice depends on the alternative to H0
* If you suspect that the true value of your test statistic exceeds its average under the null hypothesis, a relatively low value in the sample does not support your alternative H1
* Once the direction of the deviation is given by the sample, the 2-sided test of course sets a stricter threshold for the rejection of H0
* The story matters ex ante, not only ex post (otherwise, the choice of 'only' a 1-sided test might be considered fiddling)

Hypothesis testing calculator (example)

Hypothesis testing in EViews
* Series -> View -> Descriptive Statistics & Tests -> Simple hypothesis tests

  Hypothesis Testing for HEIGHT
  Date: 04/05/16  Time: 17:41
  Sample (adjusted): 1 364
  Included observations: 363 after adjustments
  Test of Hypothesis: Mean = 183.0000
  Sample Mean = 184.0937
  Sample Std. Dev. = 10.91945

  Method        Value      Probability
  t-statistic   1.908256   0.0571

Limits to H1
* None in principle, but statements are only possible about the sample in relation to H0
* H1 is bound to changes in the H0 model parameters in most tests; no specific indication for the choice among alternative Hx
* What to choose as the new null hypothesis after rejection?
  * Rejection usually just indicates a region of better parameter values (like mean > 0 instead of mean = 0)
  * Lower/upper bounds given by parameters that result in rejection when taken as H0
  * Confidence intervals as a result (specific to the sample, not to H0)

Type 1 error
* Situation
  * H0 is true
  * The sample exhibits an extreme test statistic
  * H0 is therefore rejected
* 'Extreme' is a matter of opinion
* The type 1 error rate is therefore set by the investigator => α confidence level

Type 2 error
* Situation
  * H0 is not true
  * By chance, the sample test statistic does not classify as 'extreme' under the null hypothesis
  * H0 is therefore not rejected
* The type 2 error rate usually results from the α level and the assumptions about H1

Alternatives to the H0 tests

To do list
* Acknowledge if you are still not sure
* Be aware of the assumptions that your H0 implies
* Do not chase rejection by
  * data selection
  * indiscriminate adjustment of your theory to the data
  * lowering the requirements (higher α level)
* Choose and justify your new null hypothesis in case of rejection
* Explain what (no) rejection of H0 means in your setting
* Make sure that your null hypothesis is not obviously wrong

Questions?

Conclusion
* Hypothesis testing works as follows:
  1. Formulate a null hypothesis H0
  2. Identify a test statistic
  3. Compute the p-value
  4. Compare the p-value to the α level
* More data usually helps
* No rejection ≠ no effect
* Choice of the α level (type 1 error); indirect control only over type 2
* No real alternative to H0 hypothesis testing

H0 formulation

Goal
* Formulate the desired result
* Formulate the desired result in a testable way
* Meet the requirements for a meaningful H0 and H1

Scientific approach: make your statement testable
* Replication by repetition, at least in theory (some datasets are hard to replicate)
  * H0 provides a benchmark for every new sample
  * The more samples, the more likely rejection occurs under the null (also with a predictable and hence testable frequency)
* Predictions
  * best result of a theory if they come true
  * stronger in a different setting (X variables outside the first sample)
* For equally non-rejected hypotheses, trust the more convincing story

Types of data stories (example online)
1. Change over time
2. Contrast
3. Drill down
4. Factors
5. Intersections
6. Outliers
7. Zoom out
8. …and more

Use existing work: statistics as a tool
* 'Standing on the shoulders of giants' (Isaac Newton)
* Confirmation in a new setting
  * Country
  * Time
  * Topic
* Extension of an existing model
  * Variables
  * Structure (parameters)
  * Error
* Green-field model

Model structure
* X only
  * Correlation
  * Independence
  * Time series
* X/Y
  * Form of the relationship (linear, logarithmic, etc.)
  * Parameters (number, flexibility, interaction)
  * Error distribution
* Omitted variables of no or insufficient relevance

Assumptions
* Again
  * Model type
  * Correlations
  * Error term
* How much do deviations from the assumptions hurt?
  * Check the parameters by significance tests
  * Check the error term by distribution and independence tests
  * Check the practical consequences by the explanatory content
* Consequences in the real world?
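The suggested check of the error term by a distribution test can be illustrated with the Jarque-Bera statistic, which flags non-normality when sample skewness and kurtosis deviate too far from the normal benchmark (skewness 0, kurtosis 3). A minimal sketch with simulated residuals; the p-value uses the fact that the chi-squared survival function with 2 degrees of freedom simplifies to exp(-x/2):

```python
import math
import random

def jarque_bera(resid):
    """Jarque-Bera normality statistic from sample skewness and kurtosis."""
    n = len(resid)
    mean = sum(resid) / n
    m2 = sum((x - mean) ** 2 for x in resid) / n
    m3 = sum((x - mean) ** 3 for x in resid) / n
    m4 = sum((x - mean) ** 4 for x in resid) / n
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2
    jb = n / 6 * (skew ** 2 + (kurt - 3) ** 2 / 4)
    p = math.exp(-jb / 2)      # chi-squared(2) survival function
    return jb, p

random.seed(42)
normal_resid = [random.gauss(0, 1) for _ in range(500)]      # well-behaved error term
skewed_resid = [random.expovariate(1) for _ in range(500)]   # clearly non-normal errors

jb_norm, p_norm = jarque_bera(normal_resid)   # usually a large p-value
jb_skew, p_skew = jarque_bera(skewed_resid)   # a p-value near zero: normality rejected
print(round(p_norm, 4), round(p_skew, 6))
```

The exponential residuals are strongly skewed, so their Jarque-Bera statistic explodes and normality is rejected at any conventional α level.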
Justification of the assumptions
* Generally: approximation, with closeness confirmed by tests and words
* The law of large numbers helps
  * Example: normally distributed sample means
* High explanatory content helps
  * Error relatively unimportant

Interpretation of the model
* Does the quantified version (= the model) represent the idea?
* Interpretation error (misspecification)
  * Example: face recognition (black persons ignored) => the algorithm may have looked for optically dark features on a bright face, while for some persons the relation is reversed
  * Something seemingly unusual which is actually 'normal'
* If X and/or Y are proxies, how are they linked to the ideal measure?

Admissible interpretation after significant results
* H0 rejected
  * acknowledging that the result may be driven by chance at the α level (no certainty)
  * up to alternative α levels equal to the p-value
* Support for H1, indeed for any alternative not rejected when taken as H0
* Inappropriate
  * 'H1 is true'
  * Generalizing ('the model is wrong' when only parameters are tested)
  * Any statement about the assumptions

The reality check
* Appropriateness: would H0 make sense?
* Insight: does H1 make sense?
* Relevance: does the rejection of H0 change anyone's behavior?

Fix it in theory
* New story
  * Transformation of Y: effect of the explanatory variables on different aspects of Y (absolute values, growth rate, etc.)
  * X: change of the relations between the explanatory variables
  * ε as a consequence only (the error term should not explain anything)
* Transformations monotone in order to preserve the order

Fix it in practice
* More data
  * Broader coverage (application, geography, or time)
  * Clearer statistical results (higher N)
  * Robustness (more potential variables)
* Predictive prior research results (justification)
* Story (theoretical explanation)

Transferability
* Transfer
  * geographically outside the sample region of the x-variable
  * over time
  * to a technically analogous y-variable (similar behavior)
  * to an analogous y-variable in terms of content (similar explanation)
* Valuable for predictions
* Stability of the assumptions needed (model type, parameters, ε)

Data availability
* Access
* Awareness
* Costs
* Coverage
* Format (tractability)
* Extension later on
* Permission to use, especially for publication
* Reliability
* Size
* Time

Simplification
* Acceptable if the results are clear enough
  * Low p-value
  * High explanatory content (R2)
* All that is significant and relevant should be modeled
  * Unequally distributed (but correlated) outside factors lead to distortions
  * Parameters change with other variables, even if p stays below α

Application to other data sets
* Statistics as the lowest hurdle
  * Methods transferable
  * Assumptions and interpretations matter
* Advantage when building on previous work

To do list
* Acknowledge if the desired H0/H1 combination cannot be tested
  * ex ante: no existing test for the prevailing configuration
  * ex ante: the appropriate data are not available
  * ex post: the required sample properties are absent
* Anticipate the distribution, but not the realization, of the sample
* Be aware of the assumptions that your H0 implies (again)
* Justify the α level you require

Questions?
Conclusion
* Statistics is of no help here – H0 formulation is a purely conceptual process
* Aim at H1 and choose H0 accordingly
* Formulate H0 and H1, and only implementation & justification remain
* Ask specific questions
* Tests are useful if you gain insights from (at least one possible) result

Mean comparison

Goal
* Answer the question 'Is there a difference on average?'

Why should we care about mean comparisons?
* Usually meant when asking 'Is there a difference?'
* Mean = expected (= 'true') value of the average
* Applies to other statistics as well
  * Example: expected value of the variance
* Basis for marginal effects in regressions
  * How much does the outcome change if the input increases by 1 unit?
* Easy application

Dice roller online (example)
* Roll 1, roll 2, roll 52 – what tendency does the average have?
* How likely do the extreme realizations (all 1s or all 6s) seem?

Approaching the normal distribution
* Roll 1 die => uniform ('equal') distribution, often associated with 'fair'
* Roll 2 dice => the sums of the eyes are unequally likely: symmetric, with a higher probability of realizations in the middle
* More dice (n)
  * The distribution gets larger
  * Support proportional to n (distance from minimum to maximum) (for finite distributions: no possible realization of +/- infinity)
  * Width of the bulge proportional to √n (= volatility)

Law of large numbers
* Message
  * Sample mean → µ for larger n
  * Imagine sampling N => the average (= sample mean) and µ coincide
  * The higher n, the fewer possibilities to 'drive the average away from µ' (not true strictly speaking in a distribution with infinite realizations)
* Types
  * Strong law of large numbers
  * Weak law of large numbers (= Bernoulli's theorem)
  * Law of truly large numbers (a consequence, not a mathematical law)

Law of large numbers online (example)

The central limit theorem (statement)
* Requires some mathematical expertise for a full appreciation
* Almost always, the sample mean converges to µ (= the true value) for higher n
* Application: the average of large samples is normally distributed

Central limit theorem (example)

CLT message
* We can approximate the distribution of the sample mean arbitrarily well by the normal distribution N(µ, σ)
  * no matter what the criterion for 'well' is (a bold statement)
  * no matter how the distribution of X looks (also a bold statement)
  * Only restriction: finite variance (and hence also the existence of a mean)
* Consequence
  * The complete distribution of the statistic (here: the sample mean) is known
  * This knowledge despite limited information (only the sample) about the underlying distribution
  * More data solves any problem (here)

Mean comparison thanks to the CLT
* Likelihood assessment for the joint realizations of the sample results
* Transformation possible to one statistic with a single distribution
  * Look at the difference µA - µB (H0: the difference equals zero)
  * Independence (between the subsamples) helpful
  * Use of calculation rules for combined distributions
* That way, one returns to standard H0 testing

Standard test by reframing
* Comparison of two distributions
  * Focus on 1 aspect (the mean)
  * 2 subsamples with (potentially) different means => no clear H0
  * Solution: H0 = the mean difference equals zero
* Setup of the test crucial (again)
  * Desired result with respect to content
  * Formulation of H0 and hence information about the test statistic
  * 1-sided or 2-sided test for '≠, >, or <' according to the story

Mean comparison online (example)

Mean comparison in EViews
* Quick -> Group Statistics -> Descriptive Statistics -> Individual Samples, or choose the series, then in the group window View -> Tests of equality

  Test for Equality of Means Between Series
  Date: 04/06/16  Time: 10:01
  Sample: 1 10000
  Included observations: 10000

  Method                       df             Value      Probability
  t-test                       361            26.03290   0.0000
  Satterthwaite-Welch t-test*  328.5568       25.96266   0.0000
  Anova F-test                 (1, 361)       677.7121   0.0000
  Welch F-test*                (1, 328.557)   674.0599   0.0000
  *Test allows for unequal cell variances

  Analysis of Variance
  Source of Variation   df    Sum of Sq.   Mean Sq.
  Between               1     28161.76     28161.76
  Within                361   15001.05     41.55417
  Total                 362   43162.82     119.2343

  Category Statistics
  Variable   Count   Mean       Std. Dev.   Std. Err. of Mean
  HEIGHT_M   208     191.6971   6.395167    0.443425
  HEIGHT_F   155     173.8903   6.514288    0.523240
  All        363     184.0937   10.91945    0.573122

Comparison with a fixed value
* This test is just a special case of the mean comparison
  * Mean of the second group equal to the fixed value
  * Standard deviation (and variance) of the second group equal to zero

Third factors
* Improper conclusions possible
  * No similarity required as to the distributions of XA and XB separately
  * Independence across groups required in standard tests
  * Outside factors could drive the differences in the sample means
* Solution
  * Eliminate (suspected) third factors by forming uniform subgroups
  * Incorporate additional effects => regression models like OLS
  * Choice depending on the story and intended message

More than 2 groups
* All together
  * ANOVA = ANalysis Of VAriance
  * Decomposition of the observed variance into components that stem from different sources of variation across the subgroups
  * works for mean comparison as well, despite the name
* Pairwise: as before
* With other explanatory variables: regression

To do list
* Reformulate your question in order to apply mean comparison
  * assumptions hardly needed (at the core)
  * distributional information about the test statistic as the sine qua non
  * widely understood across audiences
* Justify the assumptions if your test statistic exhibits a joint distribution
* Think of third factors that could jointly influence your subgroups

Questions?

Conclusion
* Mean comparisons are the typical research question
* The law of large numbers roughly states that sample averages tend towards the mean of the underlying distribution for larger samples
  * Zero difference between the sample average and the mean under H0, in terms of expectations, results from the LLN already
* The central limit theorem roughly states that sample means exhibit more and more of a normal distribution as the sample gets larger
  * The CLT therefore provides an approximate full (!) distribution of the sample average as a test statistic for hypothesis testing
* Samples do not automatically contain all relevant information => the story still matters

Significance

Goal
* Understand what happened in case of significance
* Interpretation of statistical significance
* See t-values and p-values as two sides of the same coin

Motivation
* The art of applied statistics is
  * not getting a result – almost any method yields a result
  * not the properties of the result – those depend on the data
  * to justify why you may interpret the data the way you do
* Significant results build the quantitative basis for your story
* Open question: how much can one read into the quantitative result?

Standard levels
* *** significance at the 0.1% level
* ** significance at the 1% level
* * significance at the 5% level
* † significance at the 10% level
* Alternative meanings are widespread => use legends when reporting

What happened in the case of significance
* Realization of a test statistic outside the (1-α) region
* That is all
* At the extreme(s) of the distribution for the sake of consistency (otherwise, even more extreme outcomes would not imply rejection)
* Bell shape and only 1 dimension just for demonstration

Chebyshev's inequality
* Probability(|X-µ| ≥ kσ) ≤ 1/k² for k > 0
* In words: the probability that the absolute distance of a realization from the population mean exceeds k times the standard deviation is never higher than the inverse of the squared value of k
* A consequence of the limited surface available to a probability density function (namely 1 = 100%)
* Only assumes the existence of µ and σ
* Consequence: lower & upper bounds for the realized shares within k standard deviations

T-statistics
* Report how many (estimated) standard deviations away from the H0 value an estimated parameter realizes
* In standard tests, the estimated parameters often follow a t-distribution
  * kind of a normal distribution, broadened to fit finite samples
  * degrees of freedom = parameter for 'fat tails'
  * Tails shrink with more degrees of freedom (1 to infinity)
* Leads to significance if it surpasses or falls short of certain thresholds

P-value
* The p-value
  * denotes the α level corresponding to indifference between rejection and no rejection of H0
  * equals the maximum α level allowed to still reject H0
  * represents the likelihood that an H0 world produces a sample result as extreme as the sample data

T-statistics or p-values?
* Equivalent
  * at full information (degrees of freedom)
  * in appropriate settings (a t-distribution prevails)
* Historical reason for the reporting of t-statistics: tables for t-distributions => experts familiar with the thresholds for common degrees of freedom
* Arguments in favor of the p-value
  * Immediate, precise comparison to ANY α level (no calculation), but requires several positions after the separator at high significance
  * Independent of the distribution type of the test statistic
  * T-statistic implicitly given as well
* Adapt to the journal practice; otherwise report p-values (exact without auxiliary means)

Improper interpretation of significance
* Proves…
  * the proposed model to be right/wrong
  * that there is an effect
  * …
* Consequences (repetition)
  * 'H1 is true'
  * Generalizing ('the model is wrong' when only parameters are tested)
  * Any statement about the assumptions

Interpretation of insignificant results
* No significance ≠ no effect
  * Bad luck with the present sample
  * Effect of a different type (model) or size (parameter) than H1
  * Type 2 error (failure to reject a false H0)
* No probability statement about the alternative hypotheses (distributional assumptions are only valid for H0)

Interpretation of significant results
* H0 rejected
  * about the test statistic, not necessarily the whole setting
  * at the α level and up to alternative α levels equal to the p-value
* Support for H1 or any Hx not rejected when taken as H0
* For single elements
  * model type not tested (possible indication via the explanatory content)
  * parameters most likely the estimated ones from the sample (for standard null hypotheses)
  * The story behind may set the new null at slightly different parameters (usually round ones, like 1 instead of 0.997)

A related concept: Value at Risk
* Value on the x-axis that delimits the α region
* VaR needs an ordered outcome, an α level, plus a time horizon
* Key figure in risk management
* Alternative: Profit at Risk = absolute distance of the VaR threshold to the expected value

The lure of highly reliable tests
* Message: 'We get over 99.99% right'
  * Easy when the incidence rate is very low
  * Even trivial tests can accomplish that
* Example: a test on 'Identical name as myself?' among the world population
  * => Already a plain 'No' (= no actual test at all) gets many correct results
* => Important to know what exactly any quality label refers to

Size matters
* Effect existence often anticipated from the beginning (research plan)
* The actual questions often are
  * How large is a particular effect?
  * How sure are we about this size? (σ of the parameter)
  * How much is explained (R2)?
* Interpretation relies on
  * significance
  * size
  * relevance

Wording
* 'Significant' is reserved for test results
* Without a test, use
  * considerable
  * substantial
  * …
* Avoid 'extraordinary' and the like, because it implies an H0 which is neither formulated nor tested

Assumptions again
* Significance results from a model (including assumptions) confronted with sample data
* Essentially, the assumed distribution of the error term in the model determines the distribution of the estimated parameters and hence the incidence of significance
* Significant results do not make up for a badly set up model

To do list
* Justify your α level
* Use p-values instead of t-statistics
* Use 'significant' only after tests
* Verify ex ante that you could make sense of significant results
* Optional homework: develop a hypothesis on how one variable in your data might explain (or even cause) another one

Questions?
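The Value at Risk idea can be illustrated with a historical-simulation sketch: order the observed outcomes and read off the empirical α-quantile. The daily P&L figures below are hypothetical, and the quantile convention (rounding down to the worst included observation) is one of several in use:

```python
# Hypothetical daily profit-and-loss figures
pnl = [-4.2, 1.3, 0.8, -0.5, 2.1, -1.7, 0.9, 1.5, -2.8, 0.4,
       1.1, -0.9, 3.0, 0.2, -1.2, 0.7, -3.5, 1.8, 0.6, -0.3]

def historical_var(pnl, alpha=0.05):
    """Empirical alpha-quantile of the P&L distribution (a loss threshold)."""
    ordered = sorted(pnl)
    k = max(int(alpha * len(ordered)) - 1, 0)   # index of the alpha-quantile (rounds down)
    return ordered[k]

var_5 = historical_var(pnl, 0.05)               # threshold delimiting the 5% loss region
par = sum(pnl) / len(pnl) - var_5               # Profit at Risk: distance to the expected value
print(var_5, round(par, 3))
```

With 20 observations and α = 5%, the VaR threshold is simply the single worst outcome; larger samples would make the quantile estimate far more stable, echoing the point that more data usually helps.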
Conclusion
* Significance is the driver of almost all quantitative statements
  * beyond descriptive statistics
  * using a test
  * and it comes with an α level attached
* Levels are usually labeled by asterisks; there is no universal standard
* Sample size makes results more reliable
* Interpretation depends on what is effectively measured and tested
* One's 'statistically significant other' exhibits special characteristics along at least one dimension – however, this alone does not necessarily make him or her 'the one'
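As a closing illustration, the two-sample mean comparison reported in the EViews output of the mean-comparison section can be reproduced from its summary statistics alone: both the pooled and the Satterthwaite-Welch t-statistics follow from the group means, standard deviations, and counts.

```python
import math

# Summary statistics as reported in the EViews equality-of-means output
n_m, mean_m, sd_m = 208, 191.6971, 6.395167   # HEIGHT_M
n_f, mean_f, sd_f = 155, 173.8903, 6.514288   # HEIGHT_F

diff = mean_m - mean_f

# Pooled (equal-variance) t-test
pooled_var = ((n_m - 1) * sd_m**2 + (n_f - 1) * sd_f**2) / (n_m + n_f - 2)
t_pooled = diff / math.sqrt(pooled_var * (1 / n_m + 1 / n_f))

# Satterthwaite-Welch t-test (allows for unequal cell variances)
t_welch = diff / math.sqrt(sd_m**2 / n_m + sd_f**2 / n_f)

print(round(t_pooled, 4), round(t_welch, 4))  # ~26.0329 and ~25.9627, matching EViews
```

The two values agree with the reported t-test (26.03290) and Satterthwaite-Welch t-test (25.96266) to rounding precision, confirming that the EViews output is nothing more than these textbook formulas applied to the category statistics.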