GV207 – Political Analysis, Week 08
Department of Government, University of Essex

Testing Differences in Means: The t-test

Introduction:
Previously we looked at comparing a sample mean for a variable to some assumed/hypothesised “true value” of the mean for that variable.[1] To do so we calculated a z-score, which told us how likely we would be to observe our sample mean given our assumption about the population mean. This week we move on from this and construct a test statistic for the comparison of two sample means. Some potential applications of this could be:

- Is the mean GDP per capita (statistically) significantly higher in democracies than autocracies?
- Do countries with civil wars have a lower level of human development than those that do not?

In order for us to test these and other claims, we will use a t-test.

The t-test:
The logic of the t-test is the same as what we have done in previous weeks using the z-score and the χ2 test statistic:

1. State your null hypothesis;
2. Choose the level of significance (i.e. 5%, which corresponds to a 95% confidence level, so we are looking for a p-value < 0.05);
3. Calculate the test-value given your sample (i.e. what is the value of z, χ2, etc.);
4. Calculate the p-value given the test-value and its distribution, and check if this is less than 0.05.

And that’s it. Simple. Well, nothing is that simple: we have an added complication given that we want to compare two sample means to one another.[2] The assumption embedded in the standard t-test in Stata is that the variances of the two groups whose means we compare are equal. This assumption is rarely, if ever, met in real life. As a result, the t-test we will use includes a correction for unequal variances. To do this we use the unequal option of the ttest command.

Testing differences in means using Stata:
Let’s say we are interested in seeing whether the mean of GDP per capita is significantly higher for democracies compared to autocracies.
To compute our t-test we need the variable we calculate the means for, GDP per capita (gdppc2000), and the variable that groups the countries into democracies and autocracies (aclp_democ2000). Below is the command and the resulting output from Stata.

    . ttest gdppc2000, by(aclp_democ2000) unequal

    Two-sample t test with unequal variances
    ------------------------------------------------------------------------------
       Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
    ---------+--------------------------------------------------------------------
    autocrac |      47    4234.298    650.2365    4457.797     2925.44    5543.156
    democrac |      94    10927.03    1059.029    10267.66    8824.011    13030.05
    ---------+--------------------------------------------------------------------
    combined |     141    8696.121    783.6388    9305.195    7146.825    10245.42
    ---------+--------------------------------------------------------------------
        diff |           -6692.734    1242.718               -9150.128    -4235.34
    ------------------------------------------------------------------------------
        diff = mean(autocrac) - mean(democrac)                        t =  -5.3856
    Ho: diff = 0                     Satterthwaite's degrees of freedom =  136.979

        Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
     Pr(T < t) = 0.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 1.0000

Question: What do we see from these results? Could we have relied upon the equal variance assumption here?

How to interpret the output:
These are the basic steps for interpreting this output:

1. First look at the means for each group. Do they look different? How much do they differ?
2. What’s the direction of the difference? Is the difference positive or negative?[3]
3. If the difference is positive (i.e. t is positive), look at the p-value reported under Ha: diff > 0. If the difference is negative (i.e. t is negative), look at the p-value reported under Ha: diff < 0.
4. Is there a statistically significant positive or negative difference (i.e. is the p-value less than 0.05)?

Once you answer all these questions you’ve interpreted the output. Draw inferences at your own risk.

What about controlling for a third variable (Z)?
As we already saw last week, it is very rare that there is only one variable that affects our dependent variable.[4] As we did with crosstabs, we can use an if condition in Stata to conduct the t-tests for different groups within Z. In this case we will see what effect democracy has on GDP per capita, comparing African countries and non-African countries (using the region variable).[5]

African Countries (i.e. region == 1):

    . ttest gdppc2000 if region == 1, by(aclp_democ2000) unequal

    Two-sample t test with unequal variances
    ------------------------------------------------------------------------------
       Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
    ---------+--------------------------------------------------------------------
    autocrac |      22    2762.545    711.5586    3337.506    1282.778    4242.313
    democrac |      19    2840.526    751.9999    3277.892    1260.633     4420.42
    ---------+--------------------------------------------------------------------
    combined |      41    2798.683    510.4889    3268.724    1766.946    3830.419
    ---------+--------------------------------------------------------------------
        diff |           -77.98086    1035.287               -2173.224    2017.262
    ------------------------------------------------------------------------------
        diff = mean(autocrac) - mean(democrac)                        t =  -0.0753
    Ho: diff = 0                     Satterthwaite's degrees of freedom =  38.3267

        Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
     Pr(T < t) = 0.4702         Pr(|T| > |t|) = 0.9403          Pr(T > t) = 0.5298

Non-African Countries (i.e. region != 1):

    . ttest gdppc2000 if region != 1, by(aclp_democ2000) unequal

    Two-sample t test with unequal variances
    ------------------------------------------------------------------------------
       Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
    ---------+--------------------------------------------------------------------
    autocrac |      25     5529.44     992.139    4960.695    3481.766    7577.114
    democrac |      75    12975.61     1204.64    10432.49    10575.32    15375.91
    ---------+--------------------------------------------------------------------
    combined |     100    11114.07    989.0374    9890.374    9151.605    13076.53
    ---------+--------------------------------------------------------------------
        diff |           -7446.173    1560.608               -10548.47   -4343.879
    ------------------------------------------------------------------------------
        diff = mean(autocrac) - mean(democrac)                        t =  -4.7713
    Ho: diff = 0                     Satterthwaite's degrees of freedom =   86.179

        Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
     Pr(T < t) = 0.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 1.0000

Question: What conclusions can we draw from these t-tests?

Additional reading:
If this still isn’t sinking in, try this web resource with annotated output. It may help.
http://www.ats.ucla.edu/stat/stata/output/ttest_output.htm

Exercise:
As in the last few weeks we will be using the data set Democracy small.dta.

1. Let us analyse the relationship between a country’s regime type (aclp_democ2000) and its school enrolment rate (educ2001). First, use the summarize command with the appropriate if condition to calculate the mean of the school enrolment variable for each of the two categories of the regime type dummy variable. What do you see? Do you think this relationship will be statistically significant?
2. Use the ttest command with the unequal option, as above, to test whether there is a statistically significant relationship between a country’s regime type and its school enrolment rate.
3. First check whether we would have had a problem with the equal variance assumption if we had conducted the t-test without the unequal option. Are the variances equal or not? And by how much do they differ?
4. Now interpret the results of the t-test. Is the relationship statistically significant?
5. Next, let us control for a country’s level of development, which might influence the relationship between regime type and school enrolment. In order to do so, create a new dummy variable which groups countries into low income countries (i.e. GDP per capita < US$ 5000) and high income countries (i.e. GDP per capita >= US$ 5000):

        * start with every country set to missing
        generate incomecat = .
        replace incomecat = 1 if gdppc2000 < 5000
        * Stata treats missing as larger than any number, so exclude it explicitly
        replace incomecat = 2 if gdppc2000 >= 5000 & gdppc2000 != .

6. Use the appropriate if conditions to calculate the t-test separately for each of the two categories of the income variable.
7. Interpret the results. Do your previous conclusions about the relationship between regime type and school enrolment change? Is there an interaction?

The following are some more difficult tasks that test skills you have already learnt:

8. Create a kernel density plot (kdensity) of the school enrolment variable for each category of the income dummy variable. You will need to use if conditions to do so. How does the distribution of school enrolment differ across the two income categories?
9. For a real challenge, try to get these two kernel density plots on the same graph.[6] You will need to use the addplot option in order to do so.
10. Find an interval-level variable in the data set that you think also has an effect upon school enrolment. Transform this variable into a dummy variable based on some cut-off of your choice.
11. Conduct a t-test using your new dummy variable and the school enrolment variable. Interpret.

[1] See notes from week 5, where we worked with the example of a sample mean of Labour support compared to a hypothesised true population mean.
[2] Nothing Stata can’t handle though.
[3] It’s always good to look at the diff equation just below the table and above the p-values. In this case our difference = mean(autocracy) - mean(democracy).
[4] In our specific example we have a whole host of other problems, as there’s probably reverse causality too. Let’s just ignore that elephant in the room.
[5] You can come up with some ad-hoc theory if you want.
[6] Also, if you can do so, it will make the comparison between the kernel density plots much easier.
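
Worked check of the first output:
To see where the t and the Satterthwaite degrees of freedom in the first Stata output (GDP per capita by regime type, all countries) come from, the calculation can be sketched by hand from the group means and standard errors in the table. The handout does not print these formulas; what follows is the standard unequal-variance t statistic with Satterthwaite's approximation for the degrees of freedom. The notation is an assumption made here for illustration: group 1 is autocracies, group 2 is democracies, and SE_g is the value in the Std. Err. column for group g.

```latex
\begin{align*}
t &= \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{SE_1^2 + SE_2^2}},
     \qquad SE_g = \frac{s_g}{\sqrt{n_g}} \\[4pt]
  &= \frac{4234.298 - 10927.03}{\sqrt{650.2365^2 + 1059.029^2}}
   = \frac{-6692.73}{1242.72} \approx -5.386 \\[8pt]
\nu &= \frac{\left(SE_1^2 + SE_2^2\right)^2}
            {\dfrac{SE_1^4}{n_1 - 1} + \dfrac{SE_2^4}{n_2 - 1}}
     = \frac{\left(650.2365^2 + 1059.029^2\right)^2}
            {\dfrac{650.2365^4}{46} + \dfrac{1059.029^4}{93}}
     \approx 137.0
\end{align*}
```

Both values agree, up to rounding, with the t = -5.3856 and 136.979 degrees of freedom that Stata reports. Note that the degrees of freedom are below the n_1 + n_2 - 2 = 139 that the equal-variance test would use, which is the price paid for not assuming equal variances.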