Class 5
ANOVA, correlations and causal inferences
The t-test for comparing means of independent samples can be used when there are
two groups. What if we are interested in comparing more than two group means? We
can carry out comparisons between two groups at a time. But this is likely to be
tedious when there are many groups. The other problem is that when many
significance tests are carried out, we are likely to find a few tests significant by
chance as compared to carrying out only one test.
The one-way analysis of variance (ANOVA) is a test that is suitable for testing the
hypothesis that
H 0 : 1  2  3  ...  k
The null hypothesis is that all group means are equal. Rejection of the null hypothesis
means that at least one group mean is not equal to the others. One can regard one-way
ANOVA as testing the equality of all group means simultaneously.
Typically, the variable for which the group mean is compared should be continuous,
and the group variable is categorical. For example, we may compare mean mathematics
achievement across countries, where mathematics achievement is a continuous
variable and country is a categorical variable.
The idea of one-way analysis of variance is to compare the variance (amount of
variation) between the group means with the variance within each group. The ratio of
these two variances is computed, and an F-test based on the F-distribution is used to
test the statistical significance of the hypothesis that all group means are equal. If the
group means are actually equal, then the variance between the group means will be
close to zero.
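To make the between/within comparison concrete, here is a small sketch (not from the handout; the data are simulated) that computes the F ratio by hand and checks it against R's built-in aov():

```r
# Simulated example: three groups with equal true means.
set.seed(1)
y <- c(rnorm(10, mean = 5), rnorm(10, mean = 5), rnorm(10, mean = 5))
g <- factor(rep(c("A", "B", "C"), each = 10))

k <- nlevels(g)          # number of groups
n <- length(y)           # total sample size
grand <- mean(y)
means <- tapply(y, g, mean)

# Between-group sum of squares: variation of group means around the grand mean
ss_between <- sum(table(g) * (means - grand)^2)
# Within-group sum of squares: variation of observations around their group mean
ss_within  <- sum((y - means[g])^2)

F_ratio <- (ss_between / (k - 1)) / (ss_within / (n - k))
F_ratio
summary(aov(y ~ g))[[1]][["F value"]][1]   # same value as the hand computation
```

If the group means are close, ss_between is small and the F ratio is near 1 or below; a large F ratio is evidence against the null hypothesis.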
The following is an example of using a subset of PISA 2006 data to look at
differences between the average mathematics achievement for four countries. We
should make clear that the analyses used below assume that the samples are simple
random samples. This is actually not the case in PISA. So to analyse PISA data
appropriately we need to use more complex procedures. In this document, the
analyses are just used as illustrations.
The data set is called pisa4c.csv, containing data for Australia, Germany, Japan and
Mexico. Each line in the data file shows data from one student. There are 11
variables. These variables are:
Variable name   Explanation
country         Name of the country where the student is from
gender          Girl or boy. 1=girl; 2=boy
family          Family structure. 1=single parent; 2=both parents
hisei           Highest parental (mother or father) occupational status
fisced          Educational level of father
mmins           Minutes of mathematics lessons per week
homepos         Index of home possession
math            Mathematics achievement
read            Reading achievement
science         Science achievement
weight          Student sampling weight
The following R code is for reading the data:
setwd("C:/G_MWU/Taiwan/DrTam/2014/NovClass/Class5")
pisadata <- read.csv("PISA4c.csv")
head(pisadata)
attach(pisadata)
The “attach” statement attaches the data set “pisadata” to the R search path.
This means that R searches the data set when evaluating a variable name, so objects in
the data set can be accessed simply by giving their names, e.g., “country” instead of
“pisadata$country”. This simplifies the variable names.
Before examining country differences, we will compute some descriptive statistics:
table(country)
mean(math, na.rm=TRUE)
We get the results:
> table(country)
country
Australia   Germany     Japan    Mexico
     5446      4660      4707      4950
> mean(math, na.rm=TRUE)
[1] 493.1241
The “table” command tells us that there are around 5000 students in each country. For
computing country mean scores, we use the option “na.rm=TRUE” to remove missing
values. In R, missing values are coded as NA (not available). In real data sets, we will
nearly always encounter missing responses.
There is no built-in function to compute standard errors in R, so we will write a
simple function for standard error:
stderr <- function(x){sqrt(var(x,na.rm=TRUE)/length(na.omit(x)))}
In the above command, we define a function called stderr. To call this function, we
simply use stderr(x) where x is a vector of data values. For example,
stderr(math)
[1] 0.7608169
In defining the standard error function, we have made sure that missing values, NA,
are omitted.
To compute mean scores for each country, we use the “aggregate” command:
aggregate(math, list(country), mean, na.rm=TRUE)
aggregate(math, list(country), stderr)
aggregate(math, list(country), function(x) {length(na.omit(x))})
> aggregate(math, list(country), mean, na.rm=TRUE)
    Group.1        x
1 Australia 522.5141
2   Germany 508.4461
3     Japan 532.9815
4    Mexico 408.4641
> aggregate(math, list(country), stderr)
    Group.1        x
1 Australia 1.318838
2   Germany 1.476764
3     Japan 1.455165
4    Mexico 1.133703
> aggregate(math, list(country), function(x) {length(na.omit(x))})
    Group.1    x
1 Australia 5446
2   Germany 4660
3     Japan 4707
4    Mexico 4950
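The three aggregate() calls can also be combined into a single summary table. The sketch below uses simulated data (the PISA file is not reproduced here), re-defining stderr so it is self-contained; the group names and numbers are illustrative, not from the real data:

```r
# Simulated scores for two hypothetical groups
set.seed(2)
score <- c(rnorm(50, 520, 90), rnorm(50, 410, 80))
grp   <- rep(c("A", "B"), each = 50)

# Standard error function, as defined earlier in the handout
stderr <- function(x) sqrt(var(x, na.rm = TRUE) / length(na.omit(x)))

# One data frame with mean, standard error and n per group
desc <- data.frame(
  group = sort(unique(grp)),
  mean  = as.vector(tapply(score, grp, mean, na.rm = TRUE)),
  se    = as.vector(tapply(score, grp, stderr)),
  n     = as.vector(tapply(score, grp, length))
)
desc
```

Collecting the descriptive statistics in one data frame makes it easier to print or export them together.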
Before testing the equality of the country mean scores, we use a boxplot to get a visual
representation of the differences in mathematics achievement between the four countries.
boxplot(math~country, main="mathematics achievement")
Judging from the boxplot and the mean scores of the four countries, we will probably
guess that there are differences between the four country means. To carry out a
statistical significance test, an analysis of variance can be used.
> m1 <- aov(math ~ country)
> summary(m1)
               Df    Sum Sq  Mean Sq F value Pr(>F)
country         3  48753927 16251309    1811 <2e-16 ***
Residuals   19759 177316642     8974
An F-test is carried out and the p-value is less than 2e-16. The three asterisks
(“***”) indicate that the p-value is extremely small. So the conclusion is that
we reject the hypothesis that the country means are all the same.
While this information from ANOVA may have answered our question of whether the
four countries have similar mathematics achievement, it is not very informative, as
we do not know whether all four countries differ from one another, or whether the
difference lies between just one pair of countries.
A pair-wise comparison may help answer the question of which countries are
different. These pair-wise tests are called post-hoc tests. There are many different
post-hoc tests. The main purpose of these tests is to adjust the p-values for
multiple comparisons. In this example, we will use Tukey’s HSD (honest significant
difference) test.
> TukeyHSD(m1)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = math ~ country)

$country
                        diff         lwr         upr p adj
Germany-Australia  -14.06801  -18.924879   -9.211137 0e+00
Japan-Australia     10.46739    5.623604   15.311177 2e-07
Mexico-Australia  -114.05002 -118.829611 -109.270436 0e+00
Japan-Germany       24.53540   19.505793   29.565004 0e+00
Mexico-Germany     -99.98202 -104.949823  -95.014207 0e+00
Mexico-Japan      -124.51741 -129.472430 -119.562397 0e+00
Compare Tukey’s HSD with the t-tests:
> t.test(math[country=="Germany"], math[country=="Australia"])

        Welch Two Sample t-test

data:  math[country == "Germany"] and math[country == "Australia"]
t = -7.1053, df = 9748.399, p-value = 1.285e-12
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -17.94910 -10.18691
sample estimates:
mean of x mean of y
 508.4461  522.5141
An unadjusted t-test will tend to show more significant results than pair-wise
multiple comparisons. In the above example, the confidence interval from the t-test
is narrower than the corresponding confidence interval from Tukey’s HSD test.
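The effect of adjusting for multiple comparisons can also be seen with R's pairwise.t.test(), which runs all pair-wise t-tests and optionally adjusts the p-values. The sketch below uses simulated data (group names and means are made up, not the PISA countries):

```r
# Three simulated groups with modest mean differences
set.seed(3)
y <- c(rnorm(40, 500, 90), rnorm(40, 510, 90), rnorm(40, 530, 90))
g <- rep(c("A", "B", "C"), each = 40)

# Unadjusted pair-wise t-tests
pairwise.t.test(y, g, p.adjust.method = "none")

# Bonferroni adjustment: each p-value is multiplied by the number of
# comparisons (capped at 1), so adjusted p-values are never smaller
pairwise.t.test(y, g, p.adjust.method = "bonferroni")
```

Like Tukey's HSD, the adjustment makes each individual comparison more conservative so that the family-wise error rate is controlled.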
Correlations and regression
It is unfortunate that in regression analysis the variables in the regression equation
Y = a + bX are termed the explanatory (X) and dependent (Y) variables.
Such nomenclature suggests a causal relationship, i.e., that X has an impact on Y. But in
fact, if we reverse the equation and fit the model X = a + bY, we obtain exactly the
same statistical significance result, as illustrated in the example below.
Consider the relationship between mathematics achievement (math) and home
possession (homepos) for Japan only.
The command in R for regression is of the form “lm(y~x)”. So lm(math~homepos)
will use math as the dependent variable and homepos as the independent variable. To run
the regression for Japan only, we subset both variables with the condition country=="Japan":
> m5 <- lm(math[country=="Japan"]~homepos[country=="Japan"])
> summary(m5)
Call:
lm(formula = math[country == "Japan"] ~ homepos[country == "Japan"])
Residuals:
    Min      1Q  Median      3Q     Max
-399.98  -67.36    5.12   69.96  332.19

Coefficients:
                             Estimate Std. Error t value Pr(>|t|)
(Intercept)                 5.329e+02  1.461e+00 364.875   <2e-16 ***
homepos[country == "Japan"] 7.454e-03  1.649e-02   0.452    0.651
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 99.84 on 4705 degrees of freedom
Multiple R-squared: 4.342e-05,  Adjusted R-squared: -0.0001691
F-statistic: 0.2043 on 1 and 4705 DF, p-value: 0.6513
To reverse the regression equation, we use lm(homepos~math):
> m6 <- lm(homepos[country=="Japan"]~math[country=="Japan"])
> summary(m6)
Call:
lm(formula = homepos[country == "Japan"] ~ math[country == "Japan"])
Residuals:
    Min      1Q  Median      3Q     Max
 -11.45   -8.38   -7.87   -7.36  993.76

Coefficients:
                         Estimate Std. Error t value Pr(>|t|)
(Intercept)              4.422123   6.988931   0.633    0.527
math[country == "Japan"] 0.005826   0.012889   0.452    0.651

Residual standard error: 88.27 on 4705 degrees of freedom
Multiple R-squared: 4.342e-05,  Adjusted R-squared: -0.0001691
F-statistic: 0.2043 on 1 and 4705 DF, p-value: 0.6513
The significance test results for models m5 and m6 are the same. In fact, if we
compute the correlation between the two variables, we also get the same p-value.
> cor.test(homepos[country=="Japan"],math[country=="Japan"])
Pearson's product-moment correlation
data: homepos[country == "Japan"] and math[country == "Japan"]
t = 0.452, df = 4705, p-value = 0.6513
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.02198345 0.03515223
sample estimates:
cor
0.006589764
What the above tells us is that statistics alone does not tell us about causal
relationships. Statistics only tells us about correlation. It is up to the researchers to
make causal inferences.
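The symmetry between the two regressions is not a coincidence: the slope t statistic in either direction equals r√(n−2)/√(1−r²), where r is the correlation. A small sketch with simulated data (not the PISA variables) makes this visible:

```r
# Simulated pair of weakly related variables
set.seed(4)
x <- rnorm(200)
y <- 0.3 * x + rnorm(200)

n <- length(x)
r <- cor(x, y)

# Slope t statistic from each direction of regression
t_xy <- summary(lm(y ~ x))$coefficients[2, "t value"]
t_yx <- summary(lm(x ~ y))$coefficients[2, "t value"]

# t statistic computed directly from the correlation
t_r <- r * sqrt(n - 2) / sqrt(1 - r^2)

c(t_xy, t_yx, t_r)   # all three are identical
```

Since the t statistic (and hence the p-value) depends only on r and n, regressing Y on X, regressing X on Y, and testing the correlation are statistically equivalent tests.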
Are babies delivered by storks? (see pdf document)
Two variables can often be correlated through a third variable, usually called a
confounding (or lurking) variable. For example, it has been found that ice cream sales
are correlated with crime rates. This is not because there is any real relationship
between ice cream sales and crime rates, but because crime rates increase in summer,
and at the same time, ice cream sales also increase in summer. So in this case we call
the “time of the year” a confounding variable.
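This kind of spurious correlation is easy to simulate. In the hypothetical sketch below, a season indicator drives both variables; the two outcomes are correlated, but the correlation disappears once the season is controlled for (all numbers are invented for illustration):

```r
# Hypothetical daily data: summer raises both ice cream sales and crime
set.seed(5)
summer   <- rbinom(365, 1, 0.5)                  # 1 = summer day
icecream <- 100 + 50 * summer + rnorm(365, 0, 10)
crime    <-  20 + 10 * summer + rnorm(365, 0, 5)

# Raw correlation: clearly positive
cor(icecream, crime)

# Partial correlation controlling for season: residualise both variables
# on the season indicator, then correlate the residuals (near zero)
cor(resid(lm(icecream ~ summer)),
    resid(lm(crime ~ summer)))
```

Once the common cause is removed, no association remains, which is exactly why statistics alone cannot establish a causal link between the two variables.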