Sampling and hypothesis testing
Eoin O’Malley
School of Law and Government, DCU

Hypotheses
• A hypothesis is a proposal or a candidate model
• In statistics, null and alternative hypotheses are used
  – H0 and HA
  – e.g. H0: X = Y and HA: X ≠ Y

Normal distributions
Galton’s Bean Machine

Z scores
• Remember that the normal distribution has some properties that allow us to say how unusual a value is
• We know that 95% of the data is within 1.96 SDs of the (population) mean, etc.
• The 1.96 comes from the Z distribution
• So if we want to say with 95% confidence that a sample is (un)likely to come from a (known) population, we can use Z scores

Z scores: a (hypothetical) example
• Suppose support for the largest party in Ireland is 40.5%
• Say the mean for all west European lead parties is 29% (N = 410 incl. Ire.), std dev = 4.9
• Is Ireland unusual?

Z-score example (cont.)
The formula to work out a z-score is:
  Z = (x - μ) / σ
Here, this is
  Z = (40.5 - 29) / 4.9 ≈ 2.347
The z-table tells us that only about .009 of cases will have a higher score
Is Ireland unusual?

What if we want to compare means?
• We may have a sample from the population, say ten Irish post-war top party support levels
  – (This obviously isn’t a random sample, but what we might want to test is whether it may as well be random)
• Then we use the Standard Error of the Mean
• This bit is NB!!!

The standard deviation and standard error of the mean

Central Limit Theorem
• (Roughly stated) says that as sample sizes get larger, the sampling distribution of the mean for a variable X approaches a normal distribution with a mean equal to the population mean, and a standard deviation equal to the population standard deviation divided by the square root of the sample size – S. Lynch 2000

Sampling distribution
• When we collect a sample (n = 1 or n = 1000) and take the mean, we have one data point in the sampling distribution of the mean
• When we do it many times we have the distribution, which becomes normally distributed as the sample size increases

Central Limit Theorem
[Figures: Sampling Distribution for n=1, n=2, n=10 and n=100 (x-axis: mean, 0 to 1; y-axis: freq)]

Central Limit Theorem
[Figure: Empirical and Theoretical Standard Deviations of Sampling Distributions for U(0,1) by Sample Size (x-axis: Sample Size; y-axis: S.D.; series: Empirical s.d., Theoretical s.d.)]

Z-test
  Z = (x̄ - μ) / (σ / √n)
• In our example
  Z = (38.4 - 29) / (4.9 / √10) ≈ 9.4 / 1.55 ≈ 6.07
• The z-tables give a p value < .001

What does this mean?
• The p value is a probability; the alpha level (α) is the cut-off we compare it with
• What is the probability of having a sample statistic of at least this magnitude if the null hypothesis is true?
• So it is quite improbable that, if Ireland came from the European population, you’d get data like this by chance. Ireland is probably different
• We say we reject the null hypothesis that Ireland = Europe at the .05 level for α

For samples
• Student t-test: the t-statistic is worked out by
  t = (x̄ - μ0) / (s / √n)
  where s is the sample standard deviation
  – We then look up the t distribution like we did the Z
  – However, there are many t-distributions, one for each number of degrees of freedom

Example (comparing two groups)
• Suppose we didn’t have time or resources to collect all the data on first-placed parties
• Instead we took a random sample (incl. Ireland)
• The figures are
  – Mean (Ire) = 40.5 (n = 5), sd = 6.6
  – Mean (rest) = 29.6 (n = 70), sd = 9.3

Example
Here we are comparing the (unpaired) difference of two means, so…
  t = (x̄1 - x̄2) / √(s1²/n1 + s2²/n2)
The Stata output that follows assumes equal variances, so it replaces s1 and s2 with a single pooled standard deviation

Output in Stata
. ttest europe = ireland, unpaired

Two-sample t test with equal variances

Variable |   Obs       Mean   Std. Err.   Std. Dev.
  europe |    70    29.5959    1.111507    9.299532
 ireland |     5     40.504    2.941091    6.576479

Ho: mean(europe) - mean(ireland) = diff = 0

  Ha: diff < 0         Ha: diff ~= 0         Ha: diff > 0
  t = -2.5693          t = -2.5693           t = -2.5693
  P < t = 0.0061       P > |t| = 0.012       P > t = 0.9939

Types of errors
• Type I and Type II
• Type I is claiming a relationship that in fact doesn’t exist (convict an innocent man)
• Type II is failing to detect a relationship that in fact exists (release a guilty man)
• Type I is usually thought of as worse than Type II

In Stata
• Open nes2004.dta in Stata
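
Worked check: z-score and z-test
The z-score and z-test arithmetic from the Ireland example can be checked with a few lines of code. This is only a sketch in Python (the slides themselves use Stata); the function names z_score, z_test and upper_tail are made up for illustration, and the upper-tail share is taken from the standard normal distribution in Python's statistics module rather than from a printed z-table.

```python
from math import sqrt
from statistics import NormalDist


def z_score(x, mu, sigma):
    """Distance of a single value x from the population mean, in SDs."""
    return (x - mu) / sigma


def z_test(sample_mean, mu, sigma, n):
    """z statistic for a sample mean, using the standard error sigma / sqrt(n)."""
    return (sample_mean - mu) / (sigma / sqrt(n))


def upper_tail(z):
    """Share of a standard normal distribution lying above z."""
    return 1 - NormalDist().cdf(z)


# Figures taken from the slides: Ireland 40.5, European mean 29, SD 4.9
z_ireland = z_score(40.5, 29, 4.9)
# Sample mean 38.4 from n = 10 Irish results
z_sample = z_test(38.4, 29, 4.9, 10)

print(f"single value: z = {z_ireland:.3f}, upper tail = {upper_tail(z_ireland):.4f}")
print(f"sample mean:  z = {z_sample:.3f}, upper tail = {upper_tail(z_sample):.2e}")
```

The single-value z comes out at about 2.35, with just under 1% of the distribution above it. For the sample mean, dividing by the unrounded standard error 4.9/√10 ≈ 1.55 (rather than 1.5) gives z ≈ 6.07, and the p value is still far below .001.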
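
Worked check: Central Limit Theorem by simulation
The Central Limit Theorem slides compare empirical and theoretical standard deviations of the sampling distribution for U(0,1). A rough way to reproduce that comparison is to simulate it. The sketch below is an assumption-laden illustration, not the code behind the original figures: the number of repetitions (5000), the seed and the function names are all chosen here for convenience.

```python
import random
from math import sqrt
from statistics import pstdev


def sampling_distribution(n, draws=5000, seed=0):
    """Means of `draws` independent samples, each of size n, from U(0, 1)."""
    rng = random.Random(seed)
    return [sum(rng.random() for _ in range(n)) / n for _ in range(draws)]


def theoretical_sd(n):
    """Population SD of U(0, 1), which is sqrt(1/12), divided by sqrt(n)."""
    return sqrt(1 / 12) / sqrt(n)


for n in (1, 2, 10, 100):
    means = sampling_distribution(n)
    print(f"n={n:>3}  empirical s.d. = {pstdev(means):.4f}  "
          f"theoretical s.d. = {theoretical_sd(n):.4f}")
```

The two columns should track each other closely and shrink as n grows, which is the point of the standard error: larger samples give sample means that cluster more tightly around the population mean.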
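
Worked check: the two-sample t statistic
The Stata output is headed "Two-sample t test with equal variances", so its t statistic is based on a pooled standard deviation. The sketch below recomputes it from the summary figures in that output (means, standard deviations and numbers of observations). It is written in Python for illustration only; pooled_t is a hypothetical helper, not a Stata or library function, and it returns only the statistic and degrees of freedom, not a p value.

```python
from math import sqrt


def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Equal-variance two-sample t statistic and its degrees of freedom."""
    df = n1 + n2 - 2
    # Pooled variance: the two sample variances weighted by their own df
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    # Standard error of the difference between the two means
    se_diff = sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / se_diff, df


# Summary figures as reported in the Stata output (europe first, then ireland)
t, df = pooled_t(29.5959, 9.299532, 70, 40.504, 6.576479, 5)
print(f"t = {t:.4f} on {df} degrees of freedom")
```

The result agrees with the t = -2.5693 reported by Stata, on 70 + 5 - 2 = 73 degrees of freedom, which confirms that the output uses the pooled version of the formula.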