Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Class 5 ANOVA, correlations and causal inferences The t-test for comparing means of independent samples can be used when there are two groups. What if we are interested in comparing more than two group means? We can carry out comparisons between two groups at a time. But this is likely to be tedious when there are many groups. The other problem is that when many significance tests are carried out, we are likely to find a few tests significant by chance as compared to carrying out only one test. The one-way analysis of variance (ANOVA) is a test that is suitable for testing the hypothesis that H 0 : 1 2 3 ... k The null hypothesis is that all group means are equal. Rejection of the null hypothesis means that at least one group mean is not equal to the others. One can regard one-way ANOVA as testing the equality of all group means simultaneously. Typically, the variable for which the group mean is compared should be continuous, and the group variable is categorical. For example, compare mean mathematics achievement across countries where mathematics achievement is a continuous variable while country is a categorical variable. The idea of one-way analysis of variance is to compare the variance (amount of variation) between group means, and the variance within each group. The two variances are computed as a ratio, and an F-test based on an F-distribution is used to test the statistical significance of the hypothesis that all group means are equal. If the group means are actually equal, then the variance of the group means will be close to zero. The following is an example of using a subset of PISA 2006 data to look at differences between the average mathematics achievement for four countries. We should make clear that the analyses used below assume that the samples are simple random samples. This is actually not the case in PISA. So to analyse PISA data appropriately we need to use more complex procedures. In this document, the analyses are just used as illustrations. The data set is called pisa4c.csv, containing data for Australia, Germany, Japan and Mexico. Each line in the data file shows data from one student. There are 11 variables. These variables are: Variable name country gender family hisei fisced mmins homepos math read science weight Explanation Name of the country where the student is from Girl or boy. 1=girl; 2=boy Family structure. 1=single parent; 2=both parents Highest parental (mother or father) occupational status Educational level of father Minutes of mathematics lessons per week Index of home possession Mathematics achievement Reading achievement Science achievement Student sampling weight The following R code is for reading the data: setwd("C:/G_MWU/Taiwan/DrTam/2014/NovClass/Class5") pisadata <- read.csv("PISA4c.csv") head(pisadata) attach(pisadata) The “attach” statement means: The data set “pisadata” is attached to the R search path. This means that the data set is searched by R when evaluating a variable, so objects in the database can be accessed by simply giving their names, e.g., “country” instead of “pisadata$country”. This simplifies the variable names. Before examining country differences, we will compute some descriptive statistics: table(country) mean(math, na.rm=TRUE) We get the results: > table(country) country Australia Germany Japan 5446 4660 4707 > mean(math,na.rm=TRUE) [1] 493.1241 Mexico 4950 The “table” command tells us that there are around 5000 students in each country. For computing country mean scores, we use the option “na.rm=TURE” to remove missing values. In R, missing values are coded as NA (not available). In real data sets, we will nearly always encounter missing responses. There is no built-in function to compute standard errors in R, so we will write a simple function for standard error: stderr <- function(x){sqrt(var(x,na.rm=TRUE)/length(na.omit(x)))} In the above command, we define a function called stderr. To call this function, we simply use stderr(x) where x is a vector of data values. For example, stderr(math) [1] 0.7608169 In defining the standard error function, we have made sure that missing values, NA, are omitted. To compute mean scores for each country, we use the “aggregate” command: aggregate(math, list(country), mean, na.rm=TRUE) aggregate(math, list(country), stderr) aggregate(math, list(country), function(x) {length(na.omit(x))}) > aggregate(math,list(country), mean, na.rm=TRUE) Group.1 x 1 Australia 522.5141 2 Germany 508.4461 3 Japan 532.9815 4 Mexico 408.4641 > aggregate(math,list(country), stderr) Group.1 x 1 Australia 1.318838 2 Germany 1.476764 3 Japan 1.455165 4 Mexico 1.133703 > aggregate(math,list(country), function(x) {length(na.omit(x))}) Group.1 x 1 Australia 5446 2 Germany 4660 3 Japan 4707 4 Mexico 4950 Before testing the equality of the country mean scores, use boxplot to get a visual representation of the differences in mathematics between the four countries. boxplot(math~country, main="mathematics achievement") Judging from the boxplot and the mean scores of the four countries, we will probably guess that there are differences between the four country means. To carry out a statistical significance test, an analysis of variance can be used. > m1 <- aov(math~country) > summary(m1) country Residuals Df Sum Sq Mean Sq F value Pr(>F) 3 48753927 16251309 1811 <2e-16 *** 19759 177316642 8974 An F-test is carried out and the p-value is less than 2e-16. The symbol of three asterisks (“***”) means that the p-value is extremely small. So the conclusion is that we reject the hypothesis that the country means are all the same. While this information from ANOVA may have answered our question of whether the four countries have similar mathematics achievement, it is not extremely helpful, as we don’t know whether the four countries are all different, or, the difference is just between one pair of countries. A pair-wise comparison may help answer the question of which countries are different. These pair-wise tests are called post-hoc tests. There are many different post-hoc tests. The main purpose of these tests is to adjust for the p-values because of multiple comparisons. In this example, we will use Tukey’s HSD (honest significant difference) > TukeyHSD(m1) Tukey multiple comparisons of means 95% family-wise confidence level Fit: aov(formula = math ~ country) $country diff lwr upr Germany-Australia -14.06801 -18.924879 -9.211137 Japan-Australia 10.46739 5.623604 15.311177 Mexico-Australia -114.05002 -118.829611 -109.270436 Japan-Germany 24.53540 19.505793 29.565004 Mexico-Germany -99.98202 -104.949823 -95.014207 Mexico-Japan -124.51741 -129.472430 -119.562397 p adj 0e+00 2e-07 0e+00 0e+00 0e+00 0e+00 Compare Tukeys’ HSD with the t-tests: > t.test(math[country=="Germany"],math[country=="Australia"]) Welch Two Sample t-test data: math[country == "Germany"] and math[country == "Australia"] t = -7.1053, df = 9748.399, p-value = 1.285e-12 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: -17.94910 -10.18691 sample estimates: mean of x mean of y 508.4461 522.5141 A t-test will show more significant results than pair-wise multiple comparisons. In the above example, the confidence interval from t-test is narrower than the confidence interval from Tukey’s HSD test. Correlations and regression It is unfortunate that in regression analysis in statistics, the variables are termed explanatory (X) and dependent (Y) variables, in the regression equation Y = a + bX. Such nomenclature suggests a causal relationship, i.e., X has an impact on Y. But in fact if we reverse the equation and fit the model X = a + bY, we obtain exactly the same statistical significance result, as illustrated in the example below. Consider the relationship between mathematics achievement (math) and home possession (homepos) for Japan only. The command in R for regression is of the form “lm(y~x)”. So lm(math~homepos) will use math as dependent variable and homepos as independent variable. To do the regression for Japan only, we add “country==Japan”) > m5 <- lm(math[country=="Japan"]~homepos[country=="Japan"]) > summary(m5) Call: lm(formula = math[country == "Japan"] ~ homepos[country == "Japan"]) Residuals: Min 1Q -399.98 -67.36 Median 5.12 3Q 69.96 Max 332.19 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 5.329e+02 1.461e+00 364.875 <2e-16 *** homepos[country == "Japan"] 7.454e-03 1.649e-02 0.452 0.651 --Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 99.84 on 4705 degrees of freedom Multiple R-squared: 4.342e-05, Adjusted R-squared: 0.0001691 F-statistic: 0.2043 on 1 and 4705 DF, p-value: 0.6513 - To reverse the regression equation, we use lm(homepos~math): > m6 <- lm(homepos[country=="Japan"]~math[country=="Japan"]) > summary(m6) Call: lm(formula = homepos[country == "Japan"] ~ math[country == "Japan"]) Residuals: Min 1Q Median -11.45 -8.38 -7.87 3Q Max -7.36 993.76 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 4.422123 6.988931 0.633 0.527 math[country == "Japan"] 0.005826 0.012889 0.452 0.651 Residual standard error: 88.27 on 4705 degrees of freedom Multiple R-squared: 4.342e-05, Adjusted R-squared: 0.0001691 F-statistic: 0.2043 on 1 and 4705 DF, p-value: 0.6513 - The results of significance test between model m5 and m6 are the same. In fact, if we compute the correlation between the two variables, we get the same results in p-value. > cor.test(homepos[country=="Japan"],math[country=="Japan"]) Pearson's product-moment correlation data: homepos[country == "Japan"] and math[country == "Japan"] t = 0.452, df = 4705, p-value = 0.6513 alternative hypothesis: true correlation is not equal to 0 95 percent confidence interval: -0.02198345 0.03515223 sample estimates: cor 0.006589764 What the above tells us is that statistics alone does not tell us about causal relationships. Statistics only tells us about correlation. It is up to the researchers to make causal inferences. Are babies delivered by storks? (see pdf document) Two variables can often be correlated through another variable called mediating variabl. For example, it has been found that ice cream sales are correlated with crime rates. This is not because there is any real relationship between ice cream sales and crime rates, but because there is an increased crime rate in summer, and at the same time, an increase in ice cream sale in summer. So in this case we call the “time of the year” a mediating variable.