Quantitative Methods – Week 6: Inductive Statistics I: Standard Errors and Confidence Intervals
Roman Studer, Nuffield College
[email protected]

Repetition: Fitting the Regression Line
The regression line predicts the values of Y based on the values of X.
The best line therefore minimises the deviations between the predicted and the actual values (the errors, e).
[Figure: scatter plot with the regression line IP = a + b·Wage; the error for the UK is $e_{UK} = Y_{UK} - \hat{Y}_{UK}$]

Repetition: The Goodness of Fit
[Figure: regression line; the deviation of an observation y_i from the mean of y is split into an explained part (ESS) and an unexplained part (USS), which together make up TSS]
Total variation = explained variation + unexplained variation
TSS = ESS + USS
R² = ESS/TSS

Homework
Country       Y       Yi - Mean   X    Xi - Mean   Product of deviations   Deviation squared (X)
Norway        54360   24534       81   30          736020                  900
Switzerland   49660   19834       49   -2          -39668                  4
US            39430   9604        83   32          307328                  1024
Brazil        3340    -26486      21   -30         794580                  900
Iran          2340    -27486      21   -30         824580                  900
Sum                   0                0           2622840                 3728
The regression coefficient is therefore b = 2622840 / 3728 = 703.551502 (the sum of the products of the deviations divided by the sum of the squared deviations of X).
The intercept is a = 29826 - (703.551502 × 51) = -6055.12661.

Homework (II)
Total sum of squares
Country       Xi   Yi      Yi - Mean   Squares
Norway        81   54360   24534       601917156
Switzerland   49   49660   19834       393387556
US            83   39430   9604        92236816
Brazil        21   3340    -26486      701508196
Iran          21   2340    -27486      755480196
Total sum of squares                   2544529920

Explained variation
Country       Xi   Yi      Yi - Mean   Y Predicted   Explained variation
Norway        81   54360   24534       50932.5451    445486244.6
Switzerland   49   49660   19834       28418.897     1979938.865
US            83   39430   9604        52339.6481    506864349.4
Brazil        21   3340    -26486      8719.45494    445486244.6
Iran          21   2340    -27486      8719.45494    445486244.6
Explained sum of squares                             1845303022

Unexplained variation
Country       Xi   Yi      Yi - Mean   Y Predicted   Unexplained variation
Norway        81   54360   24534       50932.5451    11747447.34
Switzerland   49   49660   19834       28418.897     451184456.8
US            83   39430   9604        52339.6481    166659013.3
Brazil        21   3340    -26486      8719.45494    28938535.4
Iran          21   2340    -27486      8719.45494    40697445.28
Residual sum of squares                              699226898.1

The coefficient of determination is therefore R² = 1845303022 / 2544529920 = 0.7252039.
This means that education accounts for about 72% of the variation in GDP
per person.
Complete data set: coefficient of determination R² = 0.5446; a = -2523.28; b = 467.31

Inductive Statistics: Introduction
So far, we have only looked at samples, and most often we will only have samples rather than entire populations.
We have described and analysed these samples and computed means, standard deviations, correlation coefficients, regression coefficients, etc.
However, because of "the luck of the draw", the estimated parameters will deviate from the 'true' parameters of the whole population (sampling error).
We now move from descriptive statistics to inductive statistics: we no longer only describe samples, but draw conclusions about characteristics of the entire statistical population based on our sample.
Chapters 5 & 6 provide the tools necessary to make inferences from a sample.

Inductive Statistics: Introduction (II)
What can we infer from a sample?
If we know the sample mean, how good an estimator is it of the population mean?
If we calculate the correlation and regression coefficients from a sample of observations, how good are they as estimators of the 'true' correlation and regression coefficients?
How reliable are our estimates?

Sample Biases
As a first step, especially when working with historical data, we need to ascertain whether our sample is likely to be representative or whether it may suffer from serious bias problems.
Is the sample of records that has survived representative of the full set of records that was originally recorded?
  • Business records, household inventories
  • Did all records have an equal chance to make their way to the archive (success bias)?
Is the sample drawn from the records representative of the information in those records?
  • Should you computerise information of people whose surname begins with W? Is B possibly a better choice?
  • Rate of return on equity (survivorship bias)
Is the information in the records representative of a wider population than that covered by the records?
  • Height records of recruits
  • Tax records (selection bias)
Sampling will affect the inferences we (can) draw.

Sampling Distribution
Sampling distribution refers to the distribution of the parameters that would be obtained if a large number of random samples of a given size were drawn from a given population; it is a hypothetical distribution.
Example: We draw a sample of 20 rabbits and calculate the mean ear length, then set the rabbits free. We repeat this 100 times, which gives 100 estimates of mean ear length based on 100 samples of 20 rabbits. The distribution may look like this:
[Figure: histogram of the sample means; x-axis: Mean Ear Lengths of Brown Hare Rabbits (in cm), y-axis: Frequency]
• 4 times, we calculated 52.5 < mean <= 55
• 15 times, we calculated 55 < mean <= 57.5
• 34 times, we calculated 57.5 < mean <= 60

Sampling Distribution (II)
[Figure: sampling distribution of the sample mean; x-axis: sample mean estimates $\bar{X}$, y-axis: Probability; the curve is centred on μ and its spread is $SE(\bar{X})$]
• μ: population mean
• $\bar{X}$: sample means
The standard error is the estimated standard deviation of the sampling distribution.

Central Limit Theorem
1. Regardless of the shape of the population distribution, as the sample size (of the samples used to create the sampling distribution of the mean) increases, the shape of the sampling distribution becomes normal.
2. The mean of the sampling distribution will be equal to the 'true' but unknown population mean. On average, the known sample mean $\bar{X}$ will be equal to μ, the unknown population mean.
3. The standard deviation of the sample (s) can be taken as the best estimate of the population standard deviation (σ). The standard error (SE) of the sample mean, i.e.
the standard deviation of the sampling distribution, is therefore $SE(\bar{X}) = \sigma_{\bar{X}} = s / \sqrt{N}$

Standard Normal Probability Distribution
With the mean ($\bar{X}$) and the standard deviation (SE) of the sampling distribution, we have all the information about the distribution.
However, we now want to standardise this sampling distribution using
$Z = \dfrac{\bar{X} - \mu_{\bar{X}}}{\sigma_{\bar{X}}}$ with $\sigma_{\bar{X}} = SE(\bar{X}) = \dfrac{s}{\sqrt{N}}$
The distribution of Z always has a mean of zero and a standard deviation of 1.
The proportion of the area under the curve up to or beyond any specific value of Z can now be obtained from a published table.

Standard Normal Probability Distribution (II)
A standard normal distribution is a normal distribution N(0,1) with mean μ = 0 and standard deviation σ = 1.
[Figure: standard normal curve with σ = 1; 95% of cases lie between -1.96 and +1.96, with 2.5% of cases in each tail]

Student's t-distribution
Student's t-distribution is very similar to the standard normal Z-distribution, but adjusts for the degrees of freedom (df):
$Z = \dfrac{\bar{X} - \mu_{\bar{X}}}{\sigma_X / \sqrt{N}}$, $t = \dfrac{\bar{X} - \mu_{\bar{X}}}{s_X / \sqrt{N - 1}}$
As the sample size N tends to infinity, the t-distribution approximates the standard normal Z-distribution.
We know the proportion of cases beyond a certain t-value, e.g. 2.5% of the cases lie above t = 1.98 for N - 1 = 120, and above t = 1.96 as N approaches infinity.

Confidence Intervals
We now come back to the question asked before: how good are our estimates of parameters obtained from the sample?
How good an estimator is, say, the sample mean $\bar{X}$ of what we really want to know, the population mean μ?
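The standard error and the standardisation step described above can be sketched in Python using only the standard library. This is a minimal illustration, not part of the original slides: the function names `standard_error` and `z_score` are hypothetical, and `statistics.stdev` is assumed as the estimate of s (it divides by N - 1).

```python
from math import sqrt
from statistics import NormalDist, stdev


def standard_error(sample):
    """SE of the sample mean: sample standard deviation divided by sqrt(N).

    Assumption: s is computed with statistics.stdev (divisor N - 1).
    """
    return stdev(sample) / sqrt(len(sample))


def z_score(sample_mean, population_mean, se):
    """Standardise a sample mean: distance from the population mean in SE units."""
    return (sample_mean - population_mean) / se


# Standard normal distribution N(0, 1)
std_normal = NormalDist(mu=0.0, sigma=1.0)

# Z-value that leaves 2.5% of cases in the upper tail (approximately 1.96)
z_crit = std_normal.inv_cdf(0.975)

# Share of cases between -1.96 and +1.96 (approximately 0.95)
central_share = std_normal.cdf(1.96) - std_normal.cdf(-1.96)

if __name__ == "__main__":
    print("critical Z for 95%:", round(z_crit, 3))
    print("share within +/-1.96:", round(central_share, 3))
```

The standard library has no t-distribution, so critical t-values for small samples still have to come from a published table or a statistics package.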
The sample mean can be taken as an estimate of the unknown population mean.
Though correct on average, a single estimate from an individual sample might differ from the true mean to some extent.
We can generate an interval in which the "true" (population) mean is located with a specified probability:
• 90% CI: With a probability of 90%, the interval includes μ
• 95% CI: In 95 cases out of 100, the interval includes μ
• 99% CI: There is a 99% probability that the interval includes μ

Confidence Intervals (II)
How many standard errors do we have to add either side of the sample mean to achieve a degree of confidence of 95%?
The t-distribution gives the exact value!
We know the proportion of cases beyond a certain t-value, e.g. 2.5% of the cases lie above t = 1.98 for N - 1 = 120, and above t = 1.96 as N approaches infinity.
$\bar{X} \pm t_{0.025} \, SE(\bar{X})$
Example: Birth rate in English parishes
• N = 214 parishes
• The mean is 15.636 births per 100 families
• The standard error $SE(\bar{X}) = s / \sqrt{N}$ is 0.308 births
• The $t_{0.025}$ value of the t-distribution for 213 degrees of freedom is 1.971
The 95% confidence interval for the mean birth rate of the population is therefore:
15.636 ± (1.971 × 0.308) = 15.636 ± 0.607, i.e. from 15.029 to 16.243 births per 100 families

Computer Class: Repetition & Confidence Intervals
Exercises
Weimar elections: Unemployment and votes for the Nazi party
Get the dataset about the Weimar elections of 1930-33 at http://www.nuff.ox.ac.uk/users/studer/teaching.htm
• Look at the variables (votes for the Nazi party, level of unemployment) in turn
• Get a first visualisation of the data; does it look normally distributed?
• Compute the mean, median, standard deviation, coefficient of variation, kurtosis and skewness for the voting share of the Nazi party and the level of unemployment
• Estimate the following regression for two of the four elections (09/30 and 03/33): Nazi = a + b·Unemployment
• Explain in words what the two regressions tell you
• Draw the respective scatter plots and draw the regression lines
• Calculate the 90%, 95% and 99% confidence intervals for a and b
• Are b and the explanatory power of the regression the same for the election in 1930 and the one in 1933?

Homework
Readings:
• Feinstein & Thomas, Ch. 6
• Repeat what we have learned today
Problem Set 5: Finish the exercises from today's computer class if you haven't done so already. Include all the results and answers in the file you send me.
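As a starting point for the exercises, the two calculations worked through in these slides (fitting a regression line from sums of deviations, and building a confidence interval around a sample mean) can be sketched in Python. This is only an illustration under stated assumptions: the function names `fit_line` and `mean_confidence_interval` are hypothetical, the data are the five-country homework values and the birth-rate figures quoted above, and the critical t-value is passed in by hand rather than computed.

```python
def fit_line(xs, ys):
    """Least-squares intercept a, slope b and R^2 from sums of deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations and sum of squared deviations of X
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    # Total and explained sums of squares
    tss = sum((y - mean_y) ** 2 for y in ys)
    ess = sum((a + b * x - mean_y) ** 2 for x in xs)
    return a, b, ess / tss


def mean_confidence_interval(mean, se, t_crit):
    """Interval mean +/- t_crit * SE; t_crit is taken from a t-table."""
    margin = t_crit * se
    return mean - margin, mean + margin


# Homework data from the slides: X = education measure, Y = GDP per person
# (Norway, Switzerland, US, Brazil, Iran)
xs = [81, 49, 83, 21, 21]
ys = [54360, 49660, 39430, 3340, 2340]
a, b, r_squared = fit_line(xs, ys)

# Birth-rate example: mean 15.636, SE 0.308, t(0.025, 213 df) = 1.971
low, high = mean_confidence_interval(15.636, 0.308, 1.971)

if __name__ == "__main__":
    print("a =", round(a, 2), "b =", round(b, 2), "R^2 =", round(r_squared, 4))
    print("95% CI:", round(low, 3), "to", round(high, 3))
```

For the 90% and 99% intervals only `t_crit` changes; the values for the relevant degrees of freedom need to be looked up, since they are not given in the slides.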