Introduction
TRIBE statistics course, Split, spring break 2016

Goal
- 'Make them love statistics!'
- 3 methods to apply: tests for differences (χ² and mean comparison), ordinary least squares regression (OLS), error analysis
- Interest in and knowledge about statistics for biomedical research
- Practical application in EViews/Excel
- Preparation, analysis, interpretation, and presentation of data (basics)

Setting
- Doctoral students in biomedicine
- Limited and diverse statistical background
- More understanding, less theory; practical application
- Support of individual work with data for the thesis
- First try as an experiment: 1 week during the spring break
- Teaching by means of lectures with applications in Excel/EViews
- Classes according to schedule on short notice
- No test

Agenda
Modules:
1. Preparation: master standard descriptive statistics and identify their tender points
2. Analysis: formulate your problem statement and understand 'significance'
3. Interpretation: discern correlation and causality and set up a meaningful OLS model
4. Visualization: use the power of depiction to make your point
Structure: lecture, practical work, discussion
Homework: optional

Lecturer

Audience
About you:
- Name
- Research topic
- Data use yes/no
- Method for the analysis
- 1 expectation for this course
- …and more

Our first data set: height/weight
We need:
- Dimensions: x-axis, y-axis
- Direction in which values increase
- Units in order to measure differences
First questions:
- Precision needed/possible?
- Truthfulness/bias?
- Properties of either dimension? Relation? Explanation?

Statistics versus …metrics
Statistics
- 'Statistics is the study of the collection, analysis, interpretation, presentation, and organization of data.'
- More than mathematical methods for the treatment of data
- Focus on analytical methods and properties
…metrics
- …metrics bridge the gap between statistical methods and practical interests within a specific area
- Still, …metricians develop their own methods along with statisticians
- Econometrics focuses on time series and causality
- What do you call your field of …metrics?

Our sample data set
- Plus, alternatively, your own data

Data retrieval
- Access, collection, search, validation (external)
- Backup of the original data set including date and source = raw data
- Websites as .mht files including link and date
- Rights and sensitivity
- Import to Excel / EViews / …, revision => your data set

Fast run through all

Module 1
- Realization & descriptive statistics: moments, structure, correlation
- Estimated underlying distribution: probability and cumulative density
- Distribution tests: expected versus actual realizations

Module 2
- Hypothesis testing: potential and limitation of statistics, plus significance
- Problem statement: formulating the desired result in a testable way
- Central limit theorem: the magic of normally distributed sample means

Module 3
- Correlation: causality from content, not statistics
- Linear regression: standard ordinary least squares (OLS)
- Error term: model change and transformations for ideal characteristics

Module 4
- From data to visualization
- Message specific to the audience
- Review

Not covered this time
- Finite sample properties: unbiasedness, consistency, efficiency, distribution (some rule-of-thumb minima, though)
- Survival analysis: analogous principle as in OLS regression and hypothesis testing
- Time series: autocorrelation, (conditional) heteroscedasticity, regimes

In between
During class:
- Listen, process data, formulate questions
- Replicate with your data
- Surf the web to the links from the slides
Outside class:
- Between lecture hours: have a break and/or discuss
- Between classes: get an appointment
- For the next day: 1 optional homework

Afterwards
- Nothing mandatory
- Feedback
  to each other, to the lecturer, to the program manager(s)
- Contact the lecturer with ideas or questions

Questions?

Conclusion
- Meaningful statistical tests are based on assumptions => you need to know something about the topic discussed
- Statistics only work with a question behind them => formulate what you would like to demonstrate
- Replication (of statistics, not data) is imperative => keep a backup of your complete raw data with date and source

Descriptive statistics

Goal
- Understand that descriptive statistics are about SAMPLE properties
- Use descriptive statistics for validation of the data
- Understand descriptive output (example online)

Sample size
- More is better
- Subsets of N: N used in the reported statistics (complete information, revised), N of the sample (= raw data), N of the population
- Differences: Is the data representative? Direction of the bias? Is generalization of the results admissible?

Moments
- Expectation of X^k = E[X^k] = k-th moment
- Estimator in samples = the (unweighted arithmetic) average of x^k
- Moments with names: 1. mean, 2. variance (standard deviation for its square root), 3. skewness, 4. kurtosis
- Any moment E[X^n] (in some cases) by moment generating functions

Mean
- For a population = the expected value; sample estimator = the average

Variance σ²
- Variance = the average of the squared differences from the mean
- First measure of average spread in a distribution; first measure of uncertainty
- Part of the 'family' of moments
- Standard deviation (= square root of the variance): a kind of average deviation, same unit as the data
- Squaring the individual distances to the mean avoids cancelling of positive and negative ones, plus marks unequal deviations as 'larger'

Existence of moments
- Distributions without (some) moments exist: no mean (Cauchy distribution), no variance (some t-distributions)
- All distributions with a defined variance also have a defined mean
- For many distributions, a formula for the moments exists
- ANY sample has a sample mean and variance (in short, all sample moments)

Data requirements
- Metric variables are necessary for most descriptive statistics to make sense – often implicitly assumed or approximated by the according interpretation of adjacent categories
- Ordinal variables (like rankings): distances with less or no meaning
- Nominal (or categorical) data: often transformed to dummies (value of 0 or 1); more than two categories can be captured by more dummies; dummies allow a quantitative distinction of effects

Median
- Median = 50% quantile
- Skewness matters; under symmetry, median = mean
- Standard use of quantiles for data with extreme outliers, income/wealth statements, votes (majority rules)
- Still another figure: the mode

Data structure
- Boundaries: minimum, maximum, range = maximum minus minimum
- Outliers: no universal definition; rule of thumb: more than 2-3 standard deviations away from the mean
- Limited number when defined by σ, since the probability of realizations beyond k times σ decreases at least quadratically in k (Chebyshev's inequality: P(|X-µ| ≥ kσ) ≤ 1/k² for k > 0)

Validation of the data import
- Frequent mistakes: blanks as zeros; format issues: decimal separators (. ↔ ,), 12'345 ↔ 12.345, numbers stored as text; percentage versus percentage points
- Simple quality check: compare expectations to realizations
- Complete data: serious collection or the contrary
- Numbers: always the same entries, extreme values, sums
- Surveys: enforced answers, strategic answering, wrongly understood questions

Precision
- Measure height to meters instead of centimeters => everyone is 2 meters tall
- The required precision relates to the question asked
- Different levels of precision complicate the replication of results
- If highly precise, some statistics imply more information than is available (e.g. a median height of 185.00 cm when the data is in cm only)

Missing data
- Reduces the available data set for analyses that rely on all information
- Fewer data points need clearer outcomes for significant results
- Handling is easiest with specialized software
- Potential (and often likely) bias at the omissions
- Treatment: data samples large enough to take some missing data into account
- Do not replace missing output by a model prediction (no gain, but spuriously reduced variance due to the assumed zero error)
- Possibility to replace missing data in an input matrix (but then correlation matters)

Data structure
- Sorting: A-Z usually okay, sorting by time not (autocorrelation); helpful to get rid of missing data, no need in statistical software; if at all, then for all variables equally (for cross-variable relations)
- Expansion of the data set (additional variables, often dummies)
- Beware of implicit assumptions ('A + B = Total': maybe there is a C)
- Explanatory content (also non-linear) by construction
- Keep track of the construction

Histogram
[Figure: histogram of HEIGHT, 363 observations; mean 184.0937, median 185.0000, maximum 205.0000, minimum 158.0000, std. dev. 10.91945, skewness -0.149977, kurtosis 2.065546; Jarque-Bera 14.56803, probability 0.000686]
In EViews: Group/Series → Descriptive Statistics & Tests → Histogram and Stats

Box plot
- Information about the distribution
- Whiskers show the range to the farthest point that is still not an outlier
- No standard for the (far) outliers; EViews uses 1.5 (respectively 3) times the interquartile range
In EViews: Group/Series → View → Graph → Boxplot
[Figure: box plots of HEIGHT_M and HEIGHT_F, axis roughly 155-210 cm]

Correlation
- Descriptive over more than one series, also across dimensions
- First measure of 'connection' between univariate data
- Usually stated in terms of linear correlation
- Basis for regressions
- Autocorrelation (especially for time series) over one or many periods
- ρ for the population, r for the sample; ρ higher than 80% is considered strong, below 50% weak

To do list
- Keep the original raw data
- Design the data collection with reserves for potentially missing data
- Use descriptive statistics to validate the sample data
- Refrain from 'obvious' improvements (categories, rounding, sorting)
- Maximum precision in the calculations (as long as it does not slow down the process by too much), a reasonable one in the presentation
- Attention at further transport or transformation, for example: date difference EViews/Excel = 693593 (date '0' in EViews means 01Jan0001, date '0' in Excel means 01Jan1900)

Questions?
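The descriptive measures from this module can be reproduced in a few lines. A minimal Python sketch (the course itself works in EViews/Excel; the height values below are invented, not the course's data set):

```python
import math

def descriptives(xs):
    """Mean, n-1 corrected variance, skewness and kurtosis of a sample."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / (n - 1)  # n-1 corrects the bias
    sd = math.sqrt(var)
    skew = sum(((x - mean) / sd) ** 3 for x in xs) / n   # 3rd standardized moment
    kurt = sum(((x - mean) / sd) ** 4 for x in xs) / n   # 4th; equals 3 for a normal
    return mean, var, skew, kurt

def median(xs):
    """50% quantile; equals the mean only under symmetry."""
    s, n = sorted(xs), len(xs)
    return s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2

def outliers(xs, k=2.5):
    """Rule of thumb from the slides: points more than 2-3 sigma from the mean."""
    mean, var, _, _ = descriptives(xs)
    return [x for x in xs if abs(x - mean) > k * math.sqrt(var)]

heights = [158, 172, 175, 180, 184, 185, 186, 190, 195, 205]  # made-up cm values
```

Chebyshev's inequality guarantees that, for any distribution, at most 1/k² of the probability mass lies more than k·σ from the mean, which is why the σ-based outlier rule can only flag a limited number of points.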
Conclusion
- Tender points of the descriptive statistics (sensitivity to outliers)
- Misdirection when drawing upon the wrong descriptive statistics
- Like any software, statistical packages take some time to get accustomed to
- Complete data can be a sign of good or bad quality
- Rather use robust test methods than trying to fix the data
- Large samples reduce the issues of bias and missing data
- Expectations about the descriptive statistics form a first hypothesis that is 'tested' by eye inspection of the realized values of the data

Underlying distribution

Goal
- What do we think we have in general (and not just in the sample)?
- Figure out what SHOULD be there (and with which likelihood)
- Realize that a known distribution does not imply certain outcomes

Sample distribution (histogram)
- 'Where do we have how many realizations along each dimension?'
- Example of a discrete probability function
- With no assumptions, this is the most likely underlying distribution
[Figure: histogram of HEIGHT as on the descriptive statistics slide; 363 observations, mean 184.0937, median 185.0000, std. dev. 10.91945, Jarque-Bera 14.56803 (p = 0.000686)]

Discrete probability function
- From histogram to distribution: x-axis = dimension of the realization (cm, kg, €, …); y-axis = probability of the realization on the x-axis at this point
- Standardization to an area of 1 (= 100%), also for comparison
- Properties of standardized histograms as probability functions: any surface of size 100% corresponds to a distribution; height is limited by the width of the category (100% maximum surface); no 'left/right' boundaries necessary (but the probability goes to 0)

Continuous function when n increases
- For sufficient precision of the measurement, the steps disappear with n (example: measure length in km, meters, cm, mm, µm, …)
- Not the case for truly discrete functions: coin toss, rolled dice, lotteries
- Smooth development allows approximation by continuous functions
- Mapping: data implies function and function implies data
- Analytical representation of continuous distribution functions
- Approximation of discrete by continuous facilitates calculations (fewer parameters, predictions for the full support, smoothness)

Continuous versus discrete distribution
- No continuous distribution in ANY sample (with necessarily a finite number of observations and finite precision)
- Hence always the question: 'close enough as an approximation?'
- Classical approximation: the normal distribution (= Gaussian curve)

Full distribution
- Probability density function P(x) and cumulative distribution function D(x) = integral of P(x), often available as an explicit function (analytical solutions, easy calculation)
- Indicates the likelihood of realizations within ANY interval
- Normal distribution: sometimes less information (like mean and variance) suffices

Distribution versus realization
- Exact distribution (binomial, coin, dice, normal, …) ≠ sure realization
- Real-life realization = data set

Calculation rule for the mean
- E[a·X + b·Y + c] = a·E[X] + b·E[Y] + c
- a, b, c constants; X, Y stochastic; E[·] the operator for expectations

Calculation rules for the (co)variance
- Variance of X = Var(X) = E[(X - E[X])²]
- Var(a·X + b) = a²·Var(X)
- Var(a·X ± b·Y + c) = a²·Var(X) + b²·Var(Y) ± 2·a·b·Cov(X,Y)
- Covariance of X and Y = Cov(X,Y) = E[(X - E[X])·(Y - E[Y])]
- Cov(X,X) = Var(X), Cov(a,X) = 0, Cov(a·X, b·Y) = a·b·Cov(X,Y)
- Cov(X+Y, Z) = Cov(X,Z) + Cov(Y,Z) for stochastic Z
- Variance adds up in an n-step combination (for example over time) => volatility (= standard deviation σ) increases with √t over time

Estimated underlying distribution
- Generally: the (corrected) average as the estimator for the expected value
- First and second moment: average of the sample = estimator for the mean; sample variance with n-1 correction for bias
- The correction is needed since the sample average leads to the lowest possible variance but is not necessarily the true mean
- Properties (unbiasedness, consistency, efficiency, distribution) of alternative estimators not considered here

Optimization: mathematics versus preferences
- 'Better' along one dimension is in most cases easy to define
- With several dimensions, however, tradeoffs arise => preferences
- Better fit with more parameters
- Reasons for choosing distributions with few parameters: fewer or no tradeoffs between the effects of the single parameters; lack of data points is none of the arguments, since usually n >> k; smoothness (behavior at the extremes, calculation, comparability)

Measurement errors
- 'Wer misst, misst Mist' (who measures, measures rubbish)
- Precision: ex post seemingly precise data which is really rounded
- No protection (by tests), but eye inspection might help
- Data out of the allowed/expected range
- Suspicious frequencies, patterns, or repetitions
- Knowledge about the topic is crucial

Measurement errors, qualitatively
- Implicitly supposed (often linear) correlations like 'money makes happy'
- Wealth instead of satisfaction (and the origin may matter as well)
- Income instead of wealth
- Average income instead of individual ones

Bias
- Experimental setting: truly equal conditions? Selection bias (participation)? Willingness to share (political correctness, wealth, etc.)? Same consequences as in real life?
- Opposing directions towards 'better': expect 159 cm/60 kg from a person reporting 160/59, to put it mildly
- Again: expectations, not realization
- The measurement unit may matter itself: financial markets with floors and ceilings at levels with 'round' numbers

Outliers
- Skip or praise?

Uncertainty about the distribution
- Already 1 realization can exclude some distributions
- Distribution tests for the data and the estimated errors of a model
- Larger samples mitigate uncertainty about and within a distribution
- Most often replaced by (implicit) assumptions (like µ = 0)
- Often, the assumptions are not explicitly stated or motivated
- 'Theoretical results in econometrics rely on assumptions/conditions that have to be satisfied. If they're not, then don't be surprised by the empirical results that you obtain.'
  (Dave Giles)

To do list
- Assume a distribution for your theory that fits your story
- This sounds like cheating, but it drives the behavior of the test statistics in the data samples that are used to confirm or reject the theory
- Be aware of the variety and properties of alternative distributions: Gallery of Distributions; Probability, Mathematical Statistics, Stochastic Processes; Statlect – The Digital Textbook
- Keep track of your (implicit) assumptions (mean, support, etc.)

Questions?

Conclusion
- Theoretical distributions provide complete quantile information
- Closeness of the approximation matters
- Properties of combined distributions are deducible
- Still, exact distribution ≠ sure realization
- Measurement errors and unrecognized bias may annihilate the results
- Outliers: more likely the result of mistakes than standard data points; still, some should exist for standard distributions in large samples; disproportionate influence on most estimations (least squares)

Distribution tests

Goal
- Be able to check whether a sample distribution fits a theoretical one
- Know the difference between distribution and independence tests
- Understand that there are alternatives to the discussed tests

Actual versus expected categorical distribution
- Realization in categories (also called bins in this setting): free number of categories; free size, even unequal ones; several dimensions at the same time possible
- Prior expectations about the outcome within a sample
- Contrast with the actual sample: differences can be random; differences can arise because the expectations were wrong

H0 rejection: a glimpse ahead
- Null hypothesis = assume a distribution for the stochastic variable X, coded H0; alternatives (usually just one): H1, H2, …
- As a consequence, test statistics (like the sample mean) also exhibit a certain distribution
- Statistical tests assess how likely the sample outcome is under the null (p-value), consider extremes on both or only one side (1- or 2-sided tests), and reject the null if the sample exhibits 'too extreme' properties

Distribution comparisons: the principle (χ²)
- The (assumed) underlying distribution determines the expected number of realizations in each bin, increases these numbers proportionally with the sample size n, and thus fixes the chances for each single bin => binomial distributions
- Differences are to be expected, follow a χ² distribution when squared and added up, and rarely exceed certain threshold levels (if H0 is true)

χ² tables online

Independence test (χ²)
- Identical underlying distributions must lead to similar samples
- Differences are not extraordinary; that is the nature of stochastics
- Any realization above the expectation leaves less in another bin: each dimension of categories is not independent of the total
- Number of categories and (potentially) independent variables: more bins mean more chances for deviation; the effect is captured by the degrees of freedom of the χ² distribution
- #Degrees of freedom = (#columns - 1)·(#rows - 1)

χ² test for independence online (example)

Independence in our test sample

Actual       150-170 cm   170-190 cm   190-210 cm   Total
Men                   0           90          118     208
Women                52          103            0     155
Total                52          193          118     363

Expected (in case of independence)
             150-170 cm   170-190 cm   190-210 cm   Total
Men          29.7961433   110.589532   67.6143251     208
Women        22.2038567   82.4104683   50.3856749     155
Total                52          193          118     363

Squared deviations
             150-170 cm   170-190 cm   190-210 cm        Total
Men          887.810153   423.928815   2538.71624   3850.45521
Women        887.810153   423.928815   2538.71624   3850.45521
Total        1775.62031    847.85763   5077.43248   7700.91041

Test
Rows                   2
Columns                3
Degrees of freedom     2
Significance level     5%
Test statistic         7700.91041
Critical value         5.99146455
p value                0

Distribution tests
- Difference to the independence test: comparison to an assumed (not sampled) distribution
- Same principle in general: no 'too extreme' differences
- Same principle for χ² tests with bins
- The assumed distribution can be the one of the total sample.
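The independence test worked through in the table above can be reproduced in a few lines. A Python sketch (the course works in EViews/Excel); note that the slide tabulates raw squared deviations, whereas the standard Pearson statistic divides each squared deviation by its expected count — both lead to rejection here:

```python
# Observed counts from the slides' height-by-sex table
observed = {
    "men":   [0, 90, 118],   # 150-170, 170-190, 190-210 cm
    "women": [52, 103, 0],
}

def chi2_independence(table):
    """Pearson chi-squared statistic and degrees of freedom for a contingency table."""
    rows = list(table.values())
    row_tot = [sum(r) for r in rows]
    col_tot = [sum(c) for c in zip(*rows)]
    grand = sum(row_tot)
    stat = 0.0
    for i, r in enumerate(rows):
        for j, o in enumerate(r):
            e = row_tot[i] * col_tot[j] / grand  # expected under independence
            stat += (o - e) ** 2 / e             # Pearson contribution
    df = (len(rows) - 1) * (len(col_tot) - 1)
    return stat, df

stat, df = chi2_independence(observed)
CRIT_5PCT_DF2 = 5.991  # chi-squared critical value, df = 2, 5% level
```

With these counts the statistic is far above the critical value, so independence of height category and sex is rejected, matching the p value of (practically) 0 on the slide.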
- Subgroups must then be tested separately against this assumed distribution
- H0 rejection occurs less frequently the larger a subgroup is, because it dominates the sample (and hence its distribution) more and more
- One often rather chooses the distribution of the largest group as H0

Distribution test: 1 dimension
- Are the players' birthdays equally distributed over the 12 months?
- Pro: birthday does not matter for current performance in sports; relevant effects offset each other (= a terrible explanation); …
- Contra: earlier-born persons have an advantage in their teenage peer group; astrology; …

Precision of the hypothesis matters
- In the example of a uniform distribution of birthdays over the 12 months: 12 equal bins for the calendar months do not consider the diverse lengths of the months, from 28 to 31 days; leap years occur; effect of the relative phase of leap years (how many are relevant?)
- Be careful: What do the variables measure? What does the H0 specify? What do you test exactly?

Expansion to several dimensions
- Joint distribution = distribution of several simultaneous outcomes
- Calculation rules analogous to the ones for one dimension (additive constants and means for independent distributions etc.)
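The one-dimensional calculation rules referenced here — E[aX + bY + c] = a·E[X] + b·E[Y] + c and, for independent X and Y, Var(aX + bY + c) = a²·Var(X) + b²·Var(Y) — can be checked by simulation. A sketch with arbitrarily chosen constants:

```python
import random
import statistics

random.seed(42)
n = 200_000
X = [random.gauss(0, 1) for _ in range(n)]   # Var(X) = 1
Y = [random.gauss(0, 2) for _ in range(n)]   # Var(Y) = 4, independent of X

a, b, c = 3.0, -2.0, 5.0
Z = [a * x + b * y + c for x, y in zip(X, Y)]

# E[aX + bY + c] = 3*0 - 2*0 + 5 = 5
# Var(aX + bY + c) = 9*1 + 4*4 = 25, so sd(Z) = 5  (Cov(X, Y) = 0)
mean_Z = statistics.fmean(Z)
var_Z = statistics.variance(Z)  # n-1 corrected sample variance
```

The simulated mean and variance land close to the theoretical 5 and 25; with n draws, the sampling error of the mean shrinks with 1/√n, which is the √t volatility logic from the earlier slide run in reverse.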
- Dependence matters
- The test works analogously: subgroups as additional categories

Single elements versus the whole distribution
- A test for a specific distribution with bins almost always fails, since the data would have to match EVERY bin
- Faster exclusion by rejecting the hypothesis that (a combination of) single characteristics of the sample matches those of a distribution
- Natural candidates: test for an equal (also fixed) mean; comparison of moments in general; third and fourth moments (skewness and kurtosis) in Jarque-Bera

Jarque-Bera
- Comparison of (the third and fourth) moments with those of the normal distribution
- Test statistic = (n/6)·(S² + (K-3)²/4); n = sample size, S = skewness, K = kurtosis
- Test statistic asymptotically χ² distributed with 2 degrees of freedom
- Alternatives: Shapiro-Wilk (linear models); Kolmogorov-Smirnov (continuous functions); Anderson-Darling (modification of Kolmogorov-Smirnov)

Distribution tests in EViews
Series → View → Descriptive Statistics & Tests → Empirical distribution test

[EViews output, 04/04/16: Empirical Distribution Test for HEIGHT; hypothesis: normal; sample (adjusted) 1-364, 363 observations included after adjustments]
Method                    Value      Adj. Value   Probability
Lilliefors (D)            0.072849   NA           0.0001
Cramer-von Mises (W2)     0.492771   0.493450     0.0000
Watson (U2)               0.482381   0.483046     0.0000
Anderson-Darling (A2)     2.867359   2.873333     0.0000

Method: Maximum Likelihood - d.f. corrected (Exact Solution)
Parameter   Value      Std. Error   z-Statistic   Prob.
MU          184.0937   0.573122     321.2118      0.0000
SIGMA       10.91945   0.405818     26.90725      0.0000
Log likelihood -1382.343; no. of coefficients 2; mean dependent var. 184.0937; S.D. dependent var. 10.91945

Standardization and transformation
- Standardization (changes only location and scale, not the type of the distribution): the transformed x variables are labeled z = (x - µ)/σ; z denotes how many standard deviations x lies away from the mean; makes deviations comparable across different dimensions
- Transformation (to match a specific distribution better): changes the interpretation of the variable; compresses/stretches the distribution unequally with respect to x; must preserve the order to make sense

To do list
- Carefully specify your research question (which may change)
- Make sure that you know what you test
- Do not confuse sample distribution and underlying distribution
- Standardize values to compare distributions of different dimensions
- Optional homework: prepare your own data for use in EViews/Excel

Questions?

Conclusion
- Independence tests check for equal distributions among subsamples
- Distribution tests check whether the sample distribution matches a specific distribution well enough
- What is compared matters: absolute values or z-scores; number of dimensions and bins; precision of the hypothesis
- Standardization allows comparisons across dimensions
- Transformation allows a closer approximation to a desired distribution
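The Jarque-Bera statistic and the z-standardization from this module can be sketched directly from their formulas (a Python illustration, not the EViews implementation):

```python
import math

def jarque_bera(xs):
    """JB = (n/6) * (S^2 + (K-3)^2 / 4), asymptotically chi-squared with 2 df."""
    n = len(xs)
    mean = sum(xs) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / n)
    s = sum(((x - mean) / sd) ** 3 for x in xs) / n  # skewness, 0 for a normal
    k = sum(((x - mean) / sd) ** 4 for x in xs) / n  # kurtosis, 3 for a normal
    return (n / 6) * (s ** 2 + (k - 3) ** 2 / 4)

def standardize(xs):
    """z = (x - mean) / sd: mean 0, sd 1, distribution type unchanged."""
    n = len(xs)
    mean = sum(xs) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / n)
    return [(x - mean) / sd for x in xs]
```

A large JB value (compared with the χ² critical value for 2 degrees of freedom, 5.991 at the 5% level) rejects normality — exactly the logic behind the Jarque-Bera line in the histogram output shown earlier.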