Descriptive information as the gauge for hypothetical distributions

Introduction
TRIBE statistics course
Split, spring break 2016
Goal
'Make them love statistics!'
3 methods to apply
Tests for differences (χ² and mean comparison)
Ordinary least squares regression (OLS)
Error analysis
Interest in and knowledge about statistics for biomedical research
Practical application in EViews/Excel
Preparation, analysis, interpretation, and presentation of data (basics)
Setting
Doctoral students in biomedicine
Limited and diverse statistical background
More understanding, less theory
Practical application
Support of individual work with data for the thesis
First try as an experiment
1 week during the spring break
Teaching by means of lectures with applications in Excel/EViews
Classes according to schedule on short notice
No test
Agenda
Modules
1. Preparation
Master standard descriptive statistics and identify their weak points
2. Analysis
Formulate your problem statement and
understand 'significance'
3. Interpretation
Discern correlation and causality and set up
a meaningful OLS model
4. Visualization
Use the power of depiction to make your point
Structure
Lecture, practical work, discussion
Homework
Optional
Lecturer
Audience
About you:
Name
Research topic
Data use yes/no
Method for the analysis
1 expectation for this course
…and more
Our first data set: height/weight
We need
Dimension
x-axis, y-axis
Direction
in which values increase
Units
in order to measure differences
First questions
Precision needed/possible?
Truthfulness/bias?
Properties of either dimension?
Relation?
Explanation?
Statistics versus …metrics
Statistics
'Statistics is the study of the collection, analysis, interpretation,
presentation, and organization of data.'
More than mathematical methods for the treatment of data
Focus on analytical methods and properties
…metrics
…metrics bridge the gap between statistical methods and practical
interests within a specific area
Still, …metricians develop their own methods along with statisticians
Econometrics focuses on time series and causality
What do you call your field of …metrics?
Our sample data set
Plus alternatively your own data
Data retrieval
Access
Collection
Search
Validation (external)
Backup of the original data set including date and source = raw data
Websites as .mht files including link and date
Rights and sensitivity
Import to Excel / EViews / …, revision => Your data set
Fast run through all
Module 1
Realization & descriptive statistics: moments, structure, correlation
Estimated underlying distribution: probability and cumulative density
Distribution tests: expected versus actual realizations
Module 2
Hypothesis testing: potential and limitation of statistics plus significance
Problem statement: formulating the desired result in a testable way
Central limit theorem: the magic of normally distributed sample means
Module 3
Correlation: causality from content, not statistics
Linear regression: standard ordinary least squares (OLS)
Error term: model change and transformations for ideal characteristics
Module 4
From data to visualization
Message specific to the audience
Review
Not covered this time
Finite sample properties
Unbiasedness, consistency, efficiency, distribution
Some rule of thumb minima, though
Survival
Analogous principle as in OLS regression and hypothesis testing
Time series
Autocorrelation
(Conditional) heteroscedasticity
Regimes
In between
During class
Listen
Process data
Formulate questions
Replicate with your data
Follow the web links from the slides
Outside class
Between lecture hours: have a break and/or discuss
Between classes: get an appointment
For the next day: 1 optional homework
Afterwards
Nothing mandatory
Feedback
to each other
to the lecturer
to the program manager(s)
Contact the lecturer with ideas or questions
Questions?
Conclusion
Meaningful statistical tests are based on assumptions
=> You need to know something about the topic discussed
Statistics only works when there is a question behind it
=> Formulate what you would like to demonstrate
Replication (of statistics, not data) = imperative
=> Keep a backup of your complete raw data with date and source
Descriptive statistics
TRIBE statistics course
Split, spring break 2016
Goal
Understand that descriptive statistics are about SAMPLE properties
Use descriptive statistics for validation of the data
Understand descriptive output (example online):
Sample size
More is better
Subsets of N
N used in the reported statistics (complete information, revised)
N of the sample (= raw data)
N of the population
Difference
Data representative?
Direction of the bias?
Generalization of the results admissible?
Moments
Expectation of X^k = E[X^k] = k-th moment
Estimator in samples = the (unweighted arithmetic) average of x^k over the sample
Moments with names:
1. Mean
2. Variance (its square root is the standard deviation)
3. Skewness
4. Kurtosis
Any moment E[X^n] obtainable (in some cases) via moment generating functions
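As a hedged aside (not from the course material), the four named moments can be computed for any numeric sample, for example in Python; the sample values below are made up, and fisher=False reports kurtosis the way EViews does (normal distribution = 3):

```python
# Minimal sketch: first four sample moments with NumPy/SciPy (made-up sample).
import numpy as np
from scipy import stats

height = np.array([172.0, 181.0, 165.0, 190.0, 178.0, 185.0])  # hypothetical heights in cm

mean = height.mean()                              # 1st moment
variance = height.var(ddof=1)                     # 2nd central moment, n-1 corrected
std_dev = np.sqrt(variance)                       # same unit as the data
skewness = stats.skew(height)                     # 3rd standardized moment
kurtosis = stats.kurtosis(height, fisher=False)   # 4th; not excess, so normal = 3

print(mean, variance, std_dev, skewness, kurtosis)
```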
Mean
For a population = the expected value, sample estimator = average
Variance σ2
Variance = the average of the squared differences from the mean
First measure of average spread in a distribution
First measure of uncertainty
Part of the 'family' of moments
Standard deviation (= square root of the variance)
Kind of average deviation
Same unit as the data
Squaring the individual distances to the mean
avoids cancelling of positive and negative ones
and marks unequal deviations as 'larger': deviations of ±2, ±2 give an average squared deviation of 4, while ±3, ±1 (same average absolute deviation) give 5
Existence of moments
Distributions without (some) moments exist
No mean: Cauchy distribution
No variance: some t-distributions
All distributions with a defined variance also have a defined mean
For many distributions, closed-form expressions for the moments exist
ANY sample has a sample mean and variance (indeed, all sample moments)
Data requirements
Metric variables are necessary for most descriptive statistics to make sense
– often implicitly assumed, or approximated by treating adjacent categories
as equally spaced
Ordinal variables (like rankings): distances with less or no meaning
Nominal (or categorical) data
Often transformed to dummies (value of 0 or 1)
More than two categories can be captured by more dummies
Dummies allow a quantitative distinction of effects
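As a hedged illustration (the column and category names are invented), nominal data can be turned into 0/1 dummies with pandas:

```python
# Sketch: encode a nominal variable as 0/1 dummy variables (hypothetical data).
import pandas as pd

df = pd.DataFrame({"blood_group": ["A", "B", "0", "AB", "A"]})
dummies = pd.get_dummies(df["blood_group"], prefix="bg", dtype=int)
# k categories give k dummy columns; drop one (drop_first=True) to avoid
# perfect collinearity when all dummies enter a regression with a constant.
df = pd.concat([df, dummies], axis=1)
print(df)
```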
Median
Median = 50% quantile
Skewness matters
Symmetry => median = mean
Standard use of quantiles for
Data with extreme outliers
Income/wealth statements
Votes (majority rules)
Still another figure: the mode
Data structure
Boundaries
Minimum
Maximum
Range = maximum minus minimum
Outliers
No universal definition
Rule of thumb: more than 2-3 standard deviations away
Limited number when defined by σ, since the probability of realizations
more than k standard deviations away from the mean decreases at least quadratically in k
(Chebyshev's inequality: Probability(|X − µ| ≥ kσ) ≤ 1/k² for k > 0)
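A small illustrative sketch (synthetic data, not the course sample) comparing the observed share of realizations at least k standard deviations from the mean with Chebyshev's bound 1/k²:

```python
# Sketch: empirical tail share versus Chebyshev's bound 1/k^2 (synthetic data).
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=180, scale=10, size=10_000)   # hypothetical heights

mu, sigma = x.mean(), x.std()
for k in (2, 3):
    share = np.mean(np.abs(x - mu) >= k * sigma)
    print(f"k={k}: observed {share:.4f} <= bound {1 / k**2:.4f}")
```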
Validation of the data import
Frequent mistakes
Blanks as zeros
Format: decimal separators (.↔ ,), 12'345↔12.345, numbers as text
Percentage versus percentage points
Simple quality check: compare expectations to realizations
Complete data: serious collection or the contrary
Numbers: always the same entries, extreme values, sums
Surveys: enforced answers, strategy, wrongly understood questions
Precision
Measure height in whole meters instead of centimeters => everyone is '2 meters' tall
Required precision relates to the question asked
Different levels of precision complicate the replication of results
Reported with high precision, some statistics imply more information than is actually available
(e.g. a median height of 185.00 cm when the data are recorded in whole cm only)
Missing data
Reduces the available data set for analyses that rely on all information
Fewer data points need clearer outcomes for significant results
Handling easiest with specialized software
Potential (and often likely) bias at the omissions
Treatment
Make data samples large enough to allow for some missing data
Do not replace missing output by a model prediction (no gain but
spuriously reduced variance due to the assumed zero error)
Possibility to replace missing data in an input matrix (but then
correlation matters)
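A hedged pandas sketch (variable names are invented) of the listwise deletion that most estimators apply anyway; as stated above, missing outputs should not be filled in with model predictions:

```python
# Sketch: inspect and drop missing data with pandas (hypothetical variables).
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "height": [172, 181, np.nan, 190, 178],
    "weight": [68, 75, 80, np.nan, 72],
})

print(df.isna().sum())     # number of missing values per variable
complete = df.dropna()     # listwise deletion: keep only complete rows
print(len(df), "->", len(complete), "observations")
# Do not impute the dependent variable with model predictions: it adds no
# information and spuriously shrinks the error variance.
```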
Data structure
Sorting
Sorting A-Z usually okay; do not re-order time series data (autocorrelation)
Helpful for removing missing data by hand; not needed in statistical software
If at all, then for all variables equally (for cross-variable relations)
Expansion of the data set (additional variables, often dummies)
Beware of implicit assumptions ('A + B = Total': maybe there is a C)
Explanatory content (also non-linear) by construction
Keep track of the construction
Histogram
[Histogram of HEIGHT; x-axis 160-205 cm, y-axis frequency 0-40]
Series: HEIGHT
Sample: 1 10000
Observations: 363
Mean 184.0937, Median 185.0000, Maximum 205.0000, Minimum 158.0000
Std. Dev. 10.91945, Skewness -0.149977, Kurtosis 2.065546
Jarque-Bera 14.56803, Probability 0.000686
In EViews:
Group/Series → Descriptive Statistics & Tests → Histogram and Stats
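The course works in EViews; purely as an illustrative sketch, a similar histogram with summary statistics can be produced in Python (HEIGHT is replaced by synthetic placeholder data here, so the numbers will not reproduce the slide):

```python
# Sketch: histogram plus summary statistics, similar to "Histogram and Stats".
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(1)
height = rng.normal(184, 11, size=363)   # placeholder for the HEIGHT series

jb_stat, jb_p = stats.jarque_bera(height)
print(f"Mean {height.mean():.2f}  Median {np.median(height):.2f}  "
      f"Std.Dev. {height.std(ddof=1):.2f}")
print(f"Skewness {stats.skew(height):.3f}  "
      f"Kurtosis {stats.kurtosis(height, fisher=False):.3f}  "
      f"Jarque-Bera {jb_stat:.2f} (p = {jb_p:.4f})")

plt.hist(height, bins=range(155, 211, 5), edgecolor="black")  # 5 cm bins
plt.xlabel("height (cm)")
plt.ylabel("frequency")
plt.show()
```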
Box plot
Information about the distribution
Whiskers show the range to the farthest
point still not an outlier
No standard for the (far) outliers
EViews uses 1.5 (respectively 3) times
the interquartile range
In EViews:
Group/Series → View → Graph → Boxplot
[Box plots of HEIGHT_M and HEIGHT_F; scale approximately 155-210 cm]
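Again as a sketch outside EViews: matplotlib draws the same kind of box plot, and its default whiskers also use 1.5 times the interquartile range; the two series below are synthetic placeholders for HEIGHT_M and HEIGHT_F:

```python
# Sketch: box plots by group; whis=1.5 puts whiskers at 1.5 x IQR (the default).
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
height_m = rng.normal(190, 8, size=208)   # placeholder for HEIGHT_M
height_f = rng.normal(176, 8, size=155)   # placeholder for HEIGHT_F

plt.boxplot([height_m, height_f], whis=1.5)
plt.xticks([1, 2], ["HEIGHT_M", "HEIGHT_F"])
plt.ylabel("height (cm)")
plt.show()
```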
Correlation
Descriptive statistic over more than one series, also across dimensions
First measure of 'connection' between univariate data
Usually stated in terms of linear correlation
Basis for regressions
Autocorrelation (especially for time series) over one or many periods
ρ for the population, r for the sample: ρ = Cov(X, Y) / (σX ∙ σY)
|ρ| higher than 80% is considered strong, below 50% weak
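An illustrative sketch (synthetic data) of the sample correlation r, computed both from the definition Cov(X, Y)/(σX ∙ σY) and with NumPy:

```python
# Sketch: sample correlation r, "by hand" and via NumPy (synthetic data).
import numpy as np

rng = np.random.default_rng(3)
height = rng.normal(180, 10, size=363)
weight = 0.9 * height + rng.normal(0, 8, size=363)   # hypothetical relation

r_manual = np.cov(height, weight, ddof=1)[0, 1] / (
    height.std(ddof=1) * weight.std(ddof=1))
r_numpy = np.corrcoef(height, weight)[0, 1]
print(r_manual, r_numpy)   # identical up to rounding
```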
To do list
Keep the original raw data
Design the data collection with reserves for potentially missing data
Use descriptive statistics to validate the sample data
Refrain from 'obvious' improvements (categories, rounding, sorting)
Maximum precision in the calculations (as long as it does not slow
down the process too much), reasonable precision in the presentation
Pay attention when further transporting or transforming the data, for example:
Date difference EViews/Excel = 693593
Date '0' in EViews means 01Jan0001
Date '0' in Excel means 01Jan1900
Questions?
Conclusion
Weak points of the descriptive statistics (e.g. sensitivity to outliers)
Misdirection when drawing upon the wrong descriptive statistics
Like any software, statistical packages take some time to get used to
Complete data can be a sign of good or bad quality
Rather use robust test methods than try to fix the data
Large samples reduce the issues of bias and missing data
Expectations about the descriptive statistics form a first hypothesis
that is 'tested' by eye inspection of the realized values of the data
Underlying distribution
TRIBE statistics course
Split, spring break 2016
Goal
What do we think we have in general (and not just the sample)?
Figure out what SHOULD be there (and with which likelihood)
Realize that a known distribution does not imply certain outcomes
Sample distribution (histogram)
'Where do we have how many realizations along each dimension?'
Example of a discrete probability function
With no assumptions, this is the most likely underlying distribution
[Histogram of HEIGHT as on the earlier slide: Series HEIGHT, Sample 1 10000, 363 observations; Mean 184.0937, Median 185.0000, Maximum 205.0000, Minimum 158.0000, Std. Dev. 10.91945, Skewness -0.149977, Kurtosis 2.065546, Jarque-Bera 14.56803 (Probability 0.000686); x-axis 160-205 cm]
Discrete probability function
From histogram to distribution
x-axis = dimension of the realization (cm, kg, €, …)
y-axis = probability of the realization on the x-axis at this point
Standardization to an area of 1 (= 100%), also for comparison
Properties of standardized histograms as probability functions
Any surface of size 100% corresponds to a distribution
Height limited by the width of the category (100% maximum surface)
No 'left/right' boundaries necessary (but probability goes to 0)
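As a hedged sketch (synthetic data), the standardization to an area of 1 corresponds to matplotlib's density option:

```python
# Sketch: histogram scaled so that the bar areas sum to 1 (a discrete density).
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(4)
x = rng.normal(184, 11, size=363)   # placeholder sample

heights, edges, _ = plt.hist(x, bins=20, density=True, edgecolor="black")
print("total area:", np.sum(heights * np.diff(edges)))   # 1.0 by construction
plt.show()
```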
Continuous function when n increases
With sufficiently precise measurement, the steps disappear as n grows
Example: measure length in km, meters, cm, mm, µm, …
Not the case for truly discrete functions: coin tosses, dice rolls, lotteries
Smooth development allows approximation by continuous functions
Mapping
Data implies function and function implies data
Analytical representation of continuous distribution functions
Approximation of discrete by continuous facilitates calculations
(fewer parameters, predictions for the full support, smoothness)
Continuous versus discrete distribution
No continuous distribution in ANY sample
(which necessarily has a finite number of observations and finite precision)
Hence always the question 'close enough as an approximation?'
Classical approximation: the normal distribution (= Gaussian curve)
Full distribution
Probability density function and cumulative probability function
D(x) = integral of P(t) from −∞ to x, often available as an explicit function
(analytical solutions, easy calculation)
Indicates the likelihood of realizations within ANY interval
Normal distribution: P(x) = 1/(σ√(2π)) ∙ exp(−(x − µ)²/(2σ²))
Sometimes, less information (like mean or variance) suffices
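A small scipy sketch of this idea; the normal parameters are the sample estimates reported for HEIGHT, and the interval bounds are arbitrary examples:

```python
# Sketch: interval probabilities from the cumulative distribution D(x).
from scipy.stats import norm

mu, sigma = 184.09, 10.92           # estimated mean and std. dev. of HEIGHT
dist = norm(loc=mu, scale=sigma)

# Probability of a height between 175 cm and 195 cm under the fitted normal:
p = dist.cdf(195) - dist.cdf(175)   # D(195) - D(175)
print(f"P(175 <= X <= 195) = {p:.3f}")
```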
Distribution versus realization
Exact distribution (binomial, coin, dice, normal, …) ≠ sure realization
Real life realization = data set
Calculation rule mean
E[a∙X + b∙Y + c] = a∙E[X] + b∙E[Y] + c
a,b,c constants; X,Y stochastic; E[∙] as the operator for expectations
Calculation rules (co)variance
Variance of X = Var(X) = E[(X − E[X])²]
Var(a∙X + b) = a²∙Var(X)
Var(a∙X ± b∙Y + c) = a²∙Var(X) + b²∙Var(Y) ± 2∙a∙b∙Cov(X,Y)
Covariance of X and Y = Cov(X, Y) = E[(X − E[X])(Y − E[Y])]
Cov(X, X) = Var(X), Cov(a, X) = 0, Cov(a∙X, b∙Y) = a∙b∙Cov(X, Y)
Cov(X+Y, Z) = Cov(X, Z) + Cov(Y, Z) for stochastic Z
Variance adds up in an n-step combination (for example over time)
=> Volatility (= standard deviation σ) increases with √t over time
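An illustrative Monte Carlo check of these rules (all numbers are made up); the last two lines show the √t scaling of volatility over independent steps:

```python
# Sketch: check Var(aX + bY + c) = a^2 Var(X) + b^2 Var(Y) + 2ab Cov(X, Y).
import numpy as np

rng = np.random.default_rng(5)
n = 200_000
x = rng.normal(0, 2, n)
y = 0.5 * x + rng.normal(0, 1, n)    # correlated with x
a, b, c = 3.0, -2.0, 7.0

lhs = np.var(a * x + b * y + c, ddof=1)
rhs = (a**2 * np.var(x, ddof=1) + b**2 * np.var(y, ddof=1)
       + 2 * a * b * np.cov(x, y, ddof=1)[0, 1])
print(lhs, rhs)                      # agree up to simulation noise

# Variances add over independent steps, so volatility grows with sqrt(t):
steps = rng.normal(0, 1, size=(n, 9))          # 9 independent periods
print(np.std(steps.sum(axis=1)), np.sqrt(9))   # both close to 3
```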
Estimated underlying distribution
Generally: (corrected) average as the estimator for the expected value
First and second moment
Average of the sample = estimator for the mean
Sample variance with n-1 correction for bias
Correction needed since the sample average leads to the lowest
possible variance but is not necessarily the true mean
Properties (unbiasedness, consistency, efficiency, distribution) of
alternative estimators not considered here
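A minimal sketch of the n−1 correction with made-up numbers; NumPy divides by n unless ddof=1 is requested:

```python
# Sketch: biased (divide by n) versus unbiased (divide by n-1) sample variance.
import numpy as np

x = np.array([172.0, 181.0, 165.0, 190.0, 178.0])
n = len(x)

biased = np.var(x)               # divides by n   (population formula)
unbiased = np.var(x, ddof=1)     # divides by n-1 (sample estimator)
print(biased, unbiased, biased * n / (n - 1))   # last two are identical
```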
Optimization: mathematics versus preferences
'Better' along one dimension in most cases easy to define
With several dimensions, however, tradeoffs arise => preferences
Better fit with more parameters
Reasons for choosing distributions with few parameters
Fewer or no tradeoffs between the effects of the single parameters
A lack of data points is usually not one of the arguments, since typically n >> k
Smooth (behavior at the extremes, calculation, comparability)
Measurement errors
'Wer misst, misst Mist'
(Who measures, measures rubbish)
Precision: ex post seemingly precise data which is really rounded
No protection (by tests), but eye inspection might help
Data out of the allowed/expected range
Suspicious frequencies, patterns, or repetitions
Knowledge about the topic crucial
Measurement errors, qualitatively
Implicitly assumed (often linear) relations like 'money makes you happy'
Wealth instead of satisfaction (and the origin may matter as well)
Income instead of wealth
Average income instead of individual ones
Bias
Experimental setting: Truly equal conditions?
Selection bias (participation)
Willingness to share (political correctness, wealth, etc.)
Same consequences as in real life?
Opposing directions towards 'better':
Expect 159cm/60kg from a person reporting 160/59, to put it mildly
Again: expectations, not realization
The measurement unit may matter itself:
Financial markets with floors and ceilings at levels with 'round' numbers
Outliers
Skip or praise?
Uncertainty about the distribution
A single realization can already exclude some distributions
Distribution tests for the data and the estimated errors of a model
Larger samples mitigate uncertainty about and within a distribution
Most often replaced by (implicit) assumptions (like µ=0)
Often, the assumptions are not explicitly stated or motivated
'Theoretical results in econometrics rely on assumptions/conditions that
have to be satisfied. If they're not, then don't be surprised by the
empirical results that you obtain.' (Dave Giles)
To do list
Assume a distribution for your theory that fits your story
This sounds like cheating, but it drives the behavior of test statistics
in data samples that are used to confirm or reject the theory
Be aware of the variety and properties of alternative distributions
Gallery of distributions
Probability, Mathematical Statistics, Stochastic Processes
Statlect – The Digital Textbook
Keep track of your (implicit) assumptions (mean, support, etc.)
Questions?
Conclusion
Theoretical distributions provide complete quantile information
Closeness of the approximation matters
Properties of combined distributions can be deduced
Still, exact distribution ≠ sure realization
Measurement errors and unrecognized bias may invalidate the results
Outliers
More likely the result of mistakes than standard data points
Still, some should exist for standard distributions in large samples
Disproportional influence on most estimations (least squares)
Distribution tests
TRIBE statistics course
Split, spring break 2016
Goal
Be able to check whether a sample distribution fits a theoretical one
Know the difference between distribution and independence test
Understand that there are alternatives to the discussed tests
Actual versus expected categorical distribution
Realization in categories (also called bins in this setting)
Free number of categories
Free size, even unequal ones
Several dimensions at the same time possible
Prior expectations about the outcome within a sample
Contrast with the actual sample
Differences can be random
Differences can arise because the expectations were wrong
H0 rejection: a glimpse ahead
Null hypothesis = assume a distribution for the stochastic variable X
Notation: H0
Alternatives (usually just one) H1, H2, …
As a consequence, test statistics (like the sample mean) also exhibit
a certain distribution
Statistical tests
assess how likely the sample outcome is under the null (p-value)
consider extremes on both or only on one side (1- or 2-sided tests)
reject the null if the sample exhibits 'too extreme' properties
Distribution comparisons: the principle (χ²)
The (assumed) underlying distribution
determines the expected number of realizations in each bin
increases these numbers proportionally with the sample size n
thus fixes the chance for each single bin => binomial distributions
Differences
are to be expected
follow a χ² distribution when squared and added up
rarely exceed certain threshold levels (if H0 is true)
χ² tables online
Independence test (χ²)
Identical underlying distributions must lead to similar samples
Differences are not extraordinary, that is the nature of stochastics
Any realization above the expectation leaves less in another bin
Each dimension of categories is not independent of the total
Number of categories and (potentially) independent variables
More bins mean more chances for deviation
Effect captured by the degrees of freedom of the χ² distribution
#Degrees of freedom = (#columns − 1)∙(#rows − 1)
χ² test for independence online (example)
Independence in our test sample
Actual
              150-170 cm   170-190 cm   190-210 cm   Total
Men                    0           90          118     208
Women                 52          103            0     155
Total                 52          193          118     363

Expected (in case of independence)
              150-170 cm   170-190 cm   190-210 cm   Total
Men           29.7961433   110.589532   67.6143251     208
Women         22.2038567   82.4104683   50.3856749     155
Total                 52          193          118     363

Squared deviations
              150-170 cm   170-190 cm   190-210 cm        Total
Men           887.810153   423.928815   2538.71624   3850.45521
Women         887.810153   423.928815   2538.71624   3850.45521
Total         1775.62031    847.85763   5077.43248   7700.91041

Test
Rows                    2
Columns                 3
Degrees of freedom      2
Significance level      5%
Test statistic          7700.91041
Critical value          5.99146455
p value                 0
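As a hedged cross-check (not part of the slides), scipy reproduces the expected counts and the 2 degrees of freedom from the tables above; note that scipy.stats.chi2_contingency reports Pearson's statistic, in which each squared deviation is divided by its expected count, so its value differs from the raw sum of squared deviations tallied above, while the conclusion (rejection of independence at the 5% level) is the same:

```python
# Sketch: chi-squared independence test for the sex x height-bin table above.
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[0, 90, 118],     # men:   150-170, 170-190, 190-210 cm
                     [52, 103, 0]])    # women: 150-170, 170-190, 190-210 cm

stat, p_value, dof, expected = chi2_contingency(observed)
print(dof)              # 2 = (rows - 1) * (columns - 1)
print(expected)         # matches the expected counts in the table above
print(stat, p_value)    # Pearson chi-squared: sum of (O - E)^2 / E
```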
Distribution tests
Difference to the independence test
Comparison to an assumed (not sampled) distribution
Same principle in general: not 'too extreme' differences
Same principle for χ² tests with bins
The assumed distribution can be that of the total sample. Then
subgroups must be tested separately against this H0
rejection occurs less frequently the larger a subgroup is, because it
dominates the sample (and hence its distribution) more and more
one often rather chooses the distribution of the largest group as H0
Distribution test: 1 dimension
Are the players' birthdays equally distributed over the 12 months?
Pro
Birthday does not matter for current performance in sports
Relevant effects offset each other (= terrible explanation)
…
Contra
Earlier born persons with an advantage in their teenage peer group
Astrology
…
Precision of the hypothesis matters
In the example of a uniform distribution of birthdays over the 12 months:
12 equal bins for the calendar months do not account for the varying
length of the months (28 to 31 days)
Leap years occur
Effect of the relative phase of leap years (How many are relevant?)
Be careful
What do the variables measure?
What does the H0 specify?
What do you test exactly?
Expansion to several dimensions
Joint distribution = distribution of several simultaneous outcomes
Calculation rules analogously to the ones for one dimension
(additive constants and means for independent distributions etc.)
Dependence matters
Test works analogously: subgroups as additional categories
Single elements versus the whole distribution
Test for a specific distribution with bins almost always a failure
since the data would have to match EVERY bin
Faster exclusion by rejecting the hypothesis that (a combination of)
single characteristics of the sample matches those of a distribution
Natural candidates
Test for equal (also fixed) mean
Comparison of moments in general
Third and fourth moment (skewness and kurtosis) in Jarque-Bera
Jarque-Bera
Comparison of (the third and fourth) moments with those of the normal distribution
Test statistic = (n/6)∙(S² + (K − 3)²/4)
n = sample size, S = skewness, K = kurtosis
Test statistic asymptotically χ² distributed with 2 degrees of freedom
Alternatives
Shapiro-Wilk (linear models)
Kolmogorov-Smirnov (continuous functions)
Anderson-Darling (modification of Kolmogorov-Smirnov)
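An illustrative sketch (synthetic placeholder data) computing Jarque-Bera both from the formula above and with scipy; the two values should agree:

```python
# Sketch: Jarque-Bera test statistic by hand versus scipy.
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
x = rng.normal(184, 11, size=363)            # placeholder sample

n = len(x)
s = stats.skew(x)
k = stats.kurtosis(x, fisher=False)          # raw kurtosis, normal = 3
jb_manual = (n / 6) * (s**2 + (k - 3)**2 / 4)

jb_scipy, p_value = stats.jarque_bera(x)
print(jb_manual, jb_scipy, p_value)          # chi-squared with 2 d.o.f. under H0
```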
Distribution tests in EViews
SeriesViewDescriptive Statistics&TestsEmpirical distribution test
Empirical Distribution Test for HEIGHT
Hypothesis: Normal
Date: 04/04/16 Time: 11:05
Sample (adjusted): 1 364
Included observations: 363 after adjustments
Method
Value
Lilliefors (D)
Cramer-von Mises (W2)
Watson (U2)
Anderson-Darling (A2)
0.072849
0.492771
0.482381
2.867359
Adj. Value
Probability
NA
0.493450
0.483046
2.873333
0.0001
0.0000
0.0000
0.0000
Method: Maximum Likelihood - d.f. corrected (Exact Solution)
Parameter
Value
Std. Error
z-Statistic
Prob.
MU
SIGMA
184.0937
10.91945
0.573122
0.405818
321.2118
26.90725
0.0000
0.0000
Log likelihood
No. of Coefficients
-1382.343
2
Mean dependent var.
S.D. dependent var.
76
184.0937
10.91945
Standardization and transformation
Standardization (changes only location and scale, not the type of the distribution; see the sketch below)
transformed x values labeled as z = (x − µ)/σ
z denotes how many standard deviations away from the mean x is
makes deviations comparable across different dimensions
Transformation (to match a specific distribution better)
changes the interpretation of the variable
compresses/stretches the distribution unequally with respect to X
must preserve the order to make sense
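A minimal sketch of standardization with made-up height and weight values:

```python
# Sketch: z-scores make deviations comparable across units and dimensions.
import numpy as np

height = np.array([172.0, 181.0, 165.0, 190.0, 178.0])   # cm
weight = np.array([68.0, 75.0, 61.0, 88.0, 72.0])        # kg

z_height = (height - height.mean()) / height.std(ddof=1)
z_weight = (weight - weight.mean()) / weight.std(ddof=1)
# Both z-series now have mean 0 and standard deviation 1; only location and
# scale change, the shape of the distribution does not.
print(z_height.round(2), z_weight.round(2))
```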
To do list
Carefully specify your research question (which may change)
Make sure that you know what you test
Do not confuse sample distribution and underlying distribution
Use standardized values to compare distributions across different dimensions
Optional homework: prepare your own data for use in EViews/Excel
Questions?
Conclusion
Independence tests check for equal distributions among subsamples
Distribution tests check whether the sample distribution matches a
specific distribution well enough
What is compared matters
absolute values or z-scores
number of dimensions and bins
precision of the hypothesis
Standardization allows comparisons across dimensions
Transformation allows closer approximation to a desired distribution