Research Methods of Applied
Linguistics and Statistics (11)
Correlation and multiple regression
By Qin Xiaoqing
1
Pearson Correlation
The Pearson correlation allows us to
establish the strength of relationships
between continuous variables.
To show the relationship, the first step is to
draw a scatterplot or scattergram, which
can help us to obtain a preliminary
understanding of this relationship.
The scatterplot can be described in terms
of direction, strength and linearity.
2
Correlation and SPSS
 Pearson product-moment coefficient is designed for
interval level (continuous) variables. It can also be used
if you have one continuous variable (e.g., scores on a
measure of self-esteem) and one dichotomous variable
(e.g., sex: M/F).
 Spearman rank order correlation is designed for use with
ordinal level or ranked data.
 SPSS will calculate two types of correlation. First, it will
give a simple bivariate correlation (which just means
between two variables), also known as zero-order
correlation. SPSS will also explore the relationship
between two variables, while controlling for another
variable. This is known as partial correlation.
3
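The same three coefficients can be obtained outside SPSS as well. Below is a minimal Python sketch (the scores, sample size, and variable names are invented for illustration): SciPy supplies the zero-order Pearson and Spearman coefficients, and the partial correlation controlling for a third variable is computed from the standard first-order formula.

```python
# A minimal sketch (not SPSS) of the three correlations discussed above,
# using SciPy and NumPy with made-up data and variable names.
import numpy as np
from scipy.stats import pearsonr, spearmanr

self_esteem = np.array([32, 28, 41, 35, 27, 38, 30, 44, 29, 36])   # continuous
anxiety     = np.array([18, 25, 12, 16, 27, 14, 22, 10, 24, 15])   # continuous
age         = np.array([19, 20, 18, 21, 20, 22, 19, 23, 18, 21])   # control variable

# Zero-order (bivariate) Pearson correlation
r, p = pearsonr(self_esteem, anxiety)

# Spearman rank-order correlation (for ordinal / ranked data)
rho, p_rho = spearmanr(self_esteem, anxiety)

# Partial correlation between self-esteem and anxiety, controlling for age,
# computed from the three zero-order correlations.
r_xz, _ = pearsonr(self_esteem, age)
r_yz, _ = pearsonr(anxiety, age)
partial = (r - r_xz * r_yz) / np.sqrt((1 - r_xz**2) * (1 - r_yz**2))

print(f"Pearson r = {r:.2f} (p = {p:.3f})")
print(f"Spearman rho = {rho:.2f}")
print(f"Partial r (controlling for age) = {partial:.2f}")
```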
Direction
 Positive relationships represent relationships in which an
increase in one variable is associated with an increase in a
second.
 Negative relationships represent relationships in which an
increase in one variable is associated with a decrease in a
second.
4
Strength
 Strong relationships
appear as those in which
the dots are very close to
a straight line
 Weak relationships
appear as those in which
the dots are more
scattered about a
straight line, or farther
away from that line.
5
Linearity
 Linear relationships are
indicated when the
pattern of dots on the
scatter diagram appears
to be straight, or if the
points could be
represented by drawing a
straight line through them.
6
Steps for computation
1. List the score for each S in parallel columns
on a data sheet.
2. Square each score and enter these values in
the columns labeled X² and Y².
3. Multiply the scores and enter this value in the
XY column.
4. Add the values in each column.
5. Insert the values in the formula of correlation
coefficient.
$r_{xy} = \dfrac{\sum (z_X z_Y)}{N - 1}$
7
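A minimal sketch of this computation in Python, using the example data from the next slide; the z scores are formed with the sample SD (ddof=1) so that dividing by N − 1 reproduces Pearson's r.

```python
# A small sketch of the z-score formula r = Σ(z_x · z_y) / (N − 1),
# using the example data from the next slide.
import numpy as np

X = np.array([12, 10, 11, 9, 8, 7, 7, 5, 4, 3])   # e.g. L2 proficiency
Y = np.array([8, 12, 5, 8, 4, 13, 7, 3, 8, 5])    # e.g. short-term memory
N = len(X)

# Convert each score to a z score using the sample SD (ddof=1),
# so that dividing by N - 1 reproduces Pearson's r.
z_x = (X - X.mean()) / X.std(ddof=1)
z_y = (Y - Y.mean()) / Y.std(ddof=1)

r = np.sum(z_x * z_y) / (N - 1)
print(f"r = {r:.2f}")        # about .25, matching the worked example
```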
Example
S      X    Y    X²    Y²    XY
1     12    8   144    64    96
2     10   12   100   144   120
3     11    5   121    25    55
4      9    8    81    64    72
5      8    4    64    16    32
6      7   13    49   169    91
7      7    7    49    49    49
8      5    3    25     9    15
9      4    8    16    64    32
10     3    5     9    25    15
Total 76   73   658   629   577
8
Scatterplot
[Figure: scatterplot of L2 proficiency (X axis) against short-term memory (Y axis) for the example data]
9
Interpretation of scatterplot
 Checking for outliers
 Inspecting the distribution of data points:
Are the data points spread all over the place? This suggests a
very low correlation.
Are all the points neatly arranged in a narrow cigar shape? This
suggests quite a strong correlation.
Could you draw a straight line through the main cluster of points,
or would a curved line better represent the points? If a curved
line is evident (suggesting a curvilinear relationship), then
Pearson correlation should not be used.
What is the shape of the cluster? Is it even from one end to the
other? Or does it start off narrow and then get fatter? If this is
the case, the data may be violating the assumption of
homogeneity of variance.
 Determining the direction of the relationship between the
variables
10
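A quick matplotlib sketch of these visual checks, using the example data from slide 8 (the axis labels follow the scatterplot above); overlaying the least-squares line makes curvature or a funnel shape easier to spot.

```python
# A quick matplotlib sketch for the visual checks listed above:
# direction, spread, outliers, and rough linearity of the cloud of points.
import numpy as np
import matplotlib.pyplot as plt

X = np.array([12, 10, 11, 9, 8, 7, 7, 5, 4, 3])   # L2 proficiency (example data)
Y = np.array([8, 12, 5, 8, 4, 13, 7, 3, 8, 5])    # short-term memory

fig, ax = plt.subplots()
ax.scatter(X, Y)

# Overlay the least-squares line so curvature and funnel shapes are easier to see.
b, a = np.polyfit(X, Y, deg=1)          # slope and intercept
xs = np.linspace(X.min(), X.max(), 100)
ax.plot(xs, a + b * xs)

ax.set_xlabel("L2 proficiency")
ax.set_ylabel("Short-term memory")
plt.show()
```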
Formula of r for raw scores

$r_{xy} = \dfrac{N\sum XY - (\sum X)(\sum Y)}{\sqrt{[N\sum X^2 - (\sum X)^2][N\sum Y^2 - (\sum Y)^2]}}$

Totals from the example: $\sum X = 76$, $\sum Y = 73$, $\sum X^2 = 658$, $\sum Y^2 = 629$, $\sum XY = 577$

$r_{xy} = \dfrac{10(577) - (76)(73)}{\sqrt{[(10)(658) - (76)^2][(10)(629) - (73)^2]}} = \dfrac{222}{\sqrt{(804)(961)}} \approx .25$
11
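The same raw-score formula, checked in Python on the example data (sums of 76, 73, 658, 629, and 577 as in the table above):

```python
# Checking the raw-score formula on the example data (slides 8 and 11).
import numpy as np

X = np.array([12, 10, 11, 9, 8, 7, 7, 5, 4, 3])
Y = np.array([8, 12, 5, 8, 4, 13, 7, 3, 8, 5])
N = len(X)

num = N * np.sum(X * Y) - np.sum(X) * np.sum(Y)          # 10(577) - (76)(73) = 222
den = np.sqrt((N * np.sum(X**2) - np.sum(X)**2) *
              (N * np.sum(Y**2) - np.sum(Y)**2))          # sqrt(804 * 961)
r = num / den
print(f"r = {r:.2f}")   # ≈ .25
```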
Assumptions underlying Pearson
correlation
1. The data are measured as scores or
ordinal scales that are truly continuous.
2. The scores on the two variables, X and Y,
are independent.
3. The data should be normally distributed
through their range.
4. The relationship between X and Y must
be linear.
12
Interpreting the correlation coefficient
1. When r = .60, the
variance overlap
between the 2
measures is .36.
2. The overlap tells us that
the 2 measures
provide similar
information. In other
words, the magnitude of r²
indicates the amount
of variance in X
which is accounted for
by Y, or vice versa.
13
Correlation coefficient
If you hope 2 tests measure basically the
same thing, .71 isn’t very strong; .80 or .90
may be desirable.
A correlation of .30 or lower may appear
weak, but in educational research such a
correlation might be very important.
Significance level: p < .05 or .01, with df = N − 2
14
r=.10 to .29 or r=–.10 to –.29 small
r=.30 to .49 or r=–.30 to –.49 medium
r=.50 to 1.0 or r=–.50 to –1.0 large
15
Presenting the results from correlation
16
Comparing the correlation coefficients for
two groups
Sometimes when doing correlational
research you may want to compare the
strength of the correlation coefficients for
two separate groups.
17
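The slides do not name a specific test, but one common approach is Fisher's r-to-z transformation; the sketch below assumes two independent groups and uses made-up values for r and n.

```python
# One common way (not spelled out on the slide) to compare r between two
# independent groups is Fisher's r-to-z transformation.  Sketch with
# made-up values: r1, n1 for group 1 and r2, n2 for group 2.
import numpy as np
from scipy.stats import norm

def compare_correlations(r1, n1, r2, n2):
    """Two-tailed test of H0: the two population correlations are equal."""
    z1, z2 = np.arctanh(r1), np.arctanh(r2)          # Fisher transform
    se = np.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    z_obs = (z1 - z2) / se
    p = 2 * (1 - norm.cdf(abs(z_obs)))
    return z_obs, p

z_obs, p = compare_correlations(r1=.45, n1=60, r2=.30, n2=55)
print(f"z_obs = {z_obs:.2f}, p = {p:.3f}")
```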
Factors affecting correlation
 If you have a restricted range of scores on either
of the variables, this will reduce the value of r, e.g.,
age (18-20) and success on an exam.
 The existence of extreme outliers in
the data.
 The presence of extremely high and extremely low
scores on a variable with little in the middle.
 Reliability of the data.
 Non-linear relationship. Always check the
scatterplot, particularly if you obtain low values of r.
18
Correlation versus causality
 Correlation provides an indication that there is a
relationship between two variables. It does not,
however, indicate that one variable causes the
other. The correlation between two variables (A
and B) could be due to the fact that A causes B,
that B causes A, or (just to complicate matters)
that an additional variable (C) causes both A
and B. The possibility of a third variable that
influences both of your observed variables
should always be considered.
19
Statistical vs practical significance
 Don’t get too excited if your correlation coefficients are
‘significant’. With large samples, even quite small
correlation coefficients can reach statistical significance.
Although statistically significant, the practical
significance of a correlation of .2 is very limited. You
should focus on the actual size of Pearson’s r and the
amount of shared variance between the two variables.
To interpret the strength of your correlation coefficient
you should also take into account other research that
has been conducted in your particular topic area. If other
researchers in your area have only been able to predict
9 per cent of the variance (a correlation of .3) in a
particular outcome (e.g., anxiety), then your study that
explains 25 per cent would be impressive in comparison.
In other topic areas, 25 per cent of the variance
explained may seem small and irrelevant.
20
Linear regression
Multiple regression
21
Understanding regression
Regression is a way of predicting
performance on the dependent variable via
one or more independent variables.
In simple regression, we predict scores on
one variable on the basis of scores on a
second.
In multiple regression, we expand the
possible sources of prediction and test to
see which of many variables and which
combination of variables allow us to make
the best prediction.
22
Linear regression
 Regression and correlation are related
procedures. The correlation coefficient is central
to simple linear regression. While we can’t make
causal claims on the basis of correlation, we can
use correlation to predict one variable from
another.
 We can’t just throw variables into a multiple
regression and hope that, magically, answers
will appear.
 We should have a sound theoretical or
conceptual reason for the analysis and, in
particular, the order of variables entering the
equation.
23
Uses of multiple regression
how well a set of variables is able to
predict a particular outcome;
which variable in a set of variables is the
best predictor of an outcome; and
whether a particular predictor variable is
still able to predict an outcome when the
effects of another variable are controlled
for.
24
Assumptions of multiple regression
 Sample size
Stevens (1996) recommends that ‘for social science
research, about 15 subjects per predictor are needed for
a reliable equation’.
Tabachnick and Fidell (1996, p. 132) give a formula for
calculating sample size requirements, taking into
account the number of independent variables that you
wish to use: N > 50 + 8m (where m = number of
independent variables). If you have five independent
variables you would need 90 cases.
More cases are needed if the dependent variable is
skewed.
For stepwise regression there should be a ratio of forty
cases for every independent variable.
25
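A tiny helper that encodes the two rules of thumb above (Tabachnick and Fidell's N > 50 + 8m, and roughly 40 cases per predictor for stepwise regression); the function name is just for illustration.

```python
# A tiny helper for Tabachnick and Fidell's rule of thumb N > 50 + 8m,
# plus the stepwise rule of roughly 40 cases per predictor mentioned above.
def required_sample_size(m, stepwise=False):
    """m = number of independent variables."""
    return 40 * m if stepwise else 50 + 8 * m

print(required_sample_size(5))                 # 90 cases for 5 predictors
print(required_sample_size(5, stepwise=True))  # 200 cases for stepwise entry
```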
 Multicollinearity. It exists when the
independent variables are highly correlated (r=.9
and above). Multiple regression doesn’t like
multicollinearity, and it certainly doesn’t
contribute to a good regression model, so
always check for this problem before you start.
 Outliers. Multiple regression is very sensitive to
outliers (very high or very low scores).
 Normality, linearity
26
MLAT and language learning
The closer r is to ±1, the smaller the error will be in
predicting performance on one variable from that of the
second. The closer r is to 0, the greater the error.
27
Predicting scores using regression
Four pieces of information are needed:
 the mean for scores on one variable (X);
 the mean for scores on the second variable (Y);
 the S's score on X; and
 the slope of the best-fitting straight line of the joint
distribution.
With this information, we can predict the S’s score
on Y from X on a mathematical basis. By ‘regressing’
Y on X, predicting Y from X will be possible.
28
Regression line
 Vertical lines drawn from each data point to the straight line in the
scatterplot show the amount of error. Suppose we square each of these
errors and then find the mean of the sum of these
squared errors. The best-fitting straight line is called the
regression line and is technically defined as the line
that results in the smallest mean of the sum of the
squared errors.
 We can think of the regression line as being that
which is closest to all the dots but, more precisely, it is
the one that results in a mean of the squared errors
that is less than any other line we might produce.
29
Determining the slope
 Convert the MLAT and language learning scores to z scores for comparability.
 Then plot the intersection of each S's z score on the MLAT and
on the test. As the z scores on the MLAT increase they form a
'run', the horizontal side of a triangle. At the same time, the z
scores on the test increase to form a 'rise', the vertical side.
 The slope (b) of the regression line is shown as we connect
these 2 sides to form the third side of the triangle.
30
Regression coefficient with known r and
SD
 In the diagram, an increase of, say, 6 units on the run
(MLAT) would equal 2 units of increase on the rise.
 The slope is the rise divided by the run. The result is
a fraction. That fraction is the correlation coefficient.
 The correlation coefficient is the same as the slope
of the best-fitting line in a z-score scatterplot. In the
triangle, the slope of the regression line was 2÷6,
so r for the two is .33. Suppose the SDs are 8 and
10 respectively for Y and X.
 To obtain the slope, we multiply the correlation
coefficient by the standard deviation of Y over the
standard deviation of X:
$b = r_{xy}\dfrac{s_Y}{s_X} = .33 \times \dfrac{8}{10} \approx .26$
31
Regression coefficient with raw data
With r and SD, it is very easy to find the
slope. With raw data, the formula for slope
follows:
$b = \dfrac{\sum (X - \bar{X})(Y - \bar{Y})}{\sum (X - \bar{X})^2} = \dfrac{N\sum XY - (\sum X)(\sum Y)}{N\sum X^2 - (\sum X)^2}$
32
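A short NumPy check, using the example data from slide 8, that the deviation-score and raw-score forms of the slope formula give the same value:

```python
# Both forms of the slope formula give the same value; a small check in NumPy.
import numpy as np

X = np.array([12, 10, 11, 9, 8, 7, 7, 5, 4, 3])
Y = np.array([8, 12, 5, 8, 4, 13, 7, 3, 8, 5])
N = len(X)

# Deviation-score form
b_dev = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean())**2)

# Raw-score form
b_raw = (N * np.sum(X * Y) - np.sum(X) * np.sum(Y)) / (N * np.sum(X**2) - np.sum(X)**2)

print(b_dev, b_raw)   # identical slopes
```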
Example: using TSE to predict TOEFL
 Mean on TOEFL=540, SD=40. Mean on TSE=30,
SD=4. r=.80, b=8.0
 A student achieved 36 on the TSE, 6 higher than the
mean. Multiplying that by the slope, we get 8×6=48.
So our prediction of TOEFL is mean Y (540)
+48=588. The formula follows:
$\hat{Y}\,(\text{predicted } Y) = \bar{Y} + b(X - \bar{X}) = 540 + (8)(36 - 30) = 588$
 Another regression equation is:
$\hat{Y} = a + bX$
33
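The TSE-to-TOEFL example as a few lines of Python; the numbers are those given on the slide, and predict_toefl is simply an illustrative helper name.

```python
# Reproducing the TSE → TOEFL example: b = r * (SD_Y / SD_X), then
# predicted Y = mean_Y + b * (X - mean_X).
mean_tse, sd_tse = 30, 4
mean_toefl, sd_toefl = 540, 40
r = .80

b = r * (sd_toefl / sd_tse)            # 8.0

def predict_toefl(tse_score):
    return mean_toefl + b * (tse_score - mean_tse)

print(b)                   # 8.0
print(predict_toefl(36))   # 588.0
```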
Standard error of estimate
 There is some overlap in the variance of the two
variables. When we square the value of r, we find
the degree of shared variance.
 Of the original 100% of the variance, with an r=.50,
we have accurately accounted for 25% of the
variance using the straight line as the basis for
prediction. The error variance is now reduced to
75%.
 In regression, standard error of estimate (SEE)
shows the dispersion of scores away from the
straight line. If all the data are tightly clustered on
the line, little error is made in prediction.
 SEE tells us how much error is likely to occur in
prediction.
34
Error variance
To compute the SEE, we need to know the error
variance, which is the sum of squares of
actual scores minus predicted scores,
divided by N − 2:
$\dfrac{\sum (Y - \hat{Y})^2}{N - 2}$
The square root of this variance is referred
to as the SEE (here 1.35):
$SEE = \sqrt{\dfrac{\sum (Y - \hat{Y})^2}{N - 2}}$; or $SEE = s_Y\sqrt{1 - r^2} = 2.96\sqrt{1 - .89^2} \approx 1.35$
(Example data: mean for X = 8, SD = 4.47; mean for Y = 10.8, SD = 2.96; r = .89)
35
Confidence interval
68% confidence interval: ±1 SEE (e.g., ±1.35):
68% of actual Y scores would fall
within ±1.35 of the predicted Y score.
95% confidence interval: ±1.96 × SEE
99% confidence interval: ±2.58 × SEE
Suppose the estimated score is 11.98; then the
95% confidence interval is between 9.33
(11.98 − 1.96 × 1.35) and 14.63
(11.98 + 1.96 × 1.35).
99% confidence interval?
8.50 (11.98 − 3.48) to 15.46 (11.98 + 3.48)
36
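The SEE and interval arithmetic from the last two slides, sketched in Python from the summary statistics given there (SD of Y = 2.96, r = .89, predicted score 11.98):

```python
# A sketch of the SEE and prediction-interval arithmetic from the slides,
# using the summary statistics given there (SD_Y = 2.96, r = .89).
import math

sd_y, r = 2.96, .89
see = sd_y * math.sqrt(1 - r**2)        # ≈ 1.35

y_hat = 11.98                           # predicted score from the regression line
for label, z in (("68%", 1.0), ("95%", 1.96), ("99%", 2.58)):
    low, high = y_hat - z * see, y_hat + z * see
    print(f"{label} interval: {low:.2f} to {high:.2f}")
```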
Estimated L2 scores predicted from class
hours
37
Goodness of fit for regression model: R2
 R², also called the coefficient of multiple
determination (its square root, R, is the multiple
correlation), is the percent of the variance in the dependent
explained uniquely or jointly by the independents.
 Adjusted R² is an adjustment for the fact that
when one has a large number of independents,
it is possible that R² will become artificially high
simply because some independents' chance
variations "explain" small parts of the variance of
the dependent.
 The greater the number of independents, the
more the researcher is expected to report the
adjusted coefficient.
38
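R² and adjusted R² computed by hand in Python; the actual and predicted values below are made up purely to show the formulas, with k the number of independent variables.

```python
# R² and adjusted R² computed by hand for a fitted model,
# with illustrative (made-up) actual and predicted values.
import numpy as np

y     = np.array([10.0, 12.0, 9.0, 14.0, 11.0, 13.0, 8.0, 15.0])
y_hat = np.array([10.5, 11.5, 9.5, 13.0, 11.0, 12.5, 9.0, 14.0])
n, k = len(y), 2                        # k = number of independent variables

ss_res = np.sum((y - y_hat)**2)
ss_tot = np.sum((y - y.mean())**2)
r2 = 1 - ss_res / ss_tot
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(f"R^2 = {r2:.3f}, adjusted R^2 = {adj_r2:.3f}")
```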
T-test
t-tests are used to assess the significance
of individual b coefficients, specifically
testing the null hypothesis that the
regression coefficient is zero.
39
F test
F test is used to test the significance of R,
which is the same as testing the
significance of R2, which is the same as
testing the significance of the regression
model as a whole.
If prob(F) < .05, then the model is
considered significantly better than would
be expected by chance and we reject the
null hypothesis of no linear relationship of
y to the independents.
40
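A minimal statsmodels sketch showing where these statistics come from: the fitted model reports a t test for each b coefficient and an overall F test. The predictor names and the simulated data are invented for illustration.

```python
# A minimal statsmodels sketch: the fitted model reports a t test for each
# b coefficient and an overall F test for the model (variable names invented).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
motivation  = rng.normal(50, 10, 100)
aptitude    = rng.normal(100, 15, 100)
proficiency = 0.4 * motivation + 0.2 * aptitude + rng.normal(0, 5, 100)

X = sm.add_constant(np.column_stack([motivation, aptitude]))
results = sm.OLS(proficiency, X).fit()

print(results.tvalues)                    # t statistic for each coefficient (H0: b = 0)
print(results.pvalues)                    # corresponding p values
print(results.fvalue, results.f_pvalue)   # overall F test of the model
```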
Multicollinearity
Multicollinearity is the intercorrelation of
independent variables. R²'s near 1 violate
the assumption of no perfect collinearity,
while high R²'s increase the standard error
of the beta coefficients and make
assessment of the unique role of each
independent difficult or impossible.
41
Tolerance or VIF
 To assess multivariate multicollinearity, one uses
tolerance or VIF, which are based on regressing each
independent on all the others.
 As a rule of thumb, if tolerance is less than .20, a
problem with multicollinearity is indicated.
 When tolerance is close to 0 there is high
multicollinearity of that variable with the other
independents and the b and beta coefficients will be
unstable.
 The more the multicollinearity, the lower the tolerance
and the larger the standard error of the regression
coefficients.
42
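Tolerance and VIF follow directly from the definition above: regress each independent on all the others, take that R², and set tolerance = 1 − R² and VIF = 1 / tolerance. A Python sketch with simulated, deliberately correlated predictors:

```python
# Tolerance and VIF from the definition: regress each independent variable on
# all the others, take R², then tolerance = 1 - R² and VIF = 1 / tolerance.
import numpy as np
import statsmodels.api as sm

def tolerance_and_vif(X):
    """X: 2-D array with one column per independent variable."""
    out = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        r2 = sm.OLS(X[:, j], sm.add_constant(others)).fit().rsquared
        tol = 1 - r2
        out.append((tol, 1 / tol))
    return out

rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
x2 = x1 * 0.9 + rng.normal(scale=0.5, size=200)   # deliberately correlated with x1
x3 = rng.normal(size=200)
for tol, vif in tolerance_and_vif(np.column_stack([x1, x2, x3])):
    print(f"tolerance = {tol:.2f}, VIF = {vif:.2f}")   # tolerance < .20 flags a problem
```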
Selecting method for predicting variables:
Forward selection
 This method starts with a model containing none
of the explanatory variables. In the first step, the
procedure considers variables one by one for
inclusion and selects the variable that results in
the largest increase in R². In the second step, the
procedure considers variables for inclusion in a
model that only contains the variable selected in
the first step. In each step, the variable with the
largest increase in R² is selected until, according
to an F-test, further additions are judged not to
improve the model.
43
Backward selection
This method starts with a model containing
all the variables and eliminates variables
one by one, at each step choosing the
variable for exclusion as that leading to the
smallest decrease in R². Again, the
procedure is repeated until, according to
an F-test, further exclusions would
represent a deterioration of the model.
44
Stepwise selection
This method is, essentially, a combination
of the previous two approaches. Starting
with no variables in the model, variables
are added as with the forward selection
method. In addition, after each inclusion
step, a backward elimination process is
carried out to remove variables that are no
longer judged to improve the model.
45
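A bare-bones sketch of forward selection in Python (backward and stepwise follow the same pattern in reverse or in combination). It assumes a pandas DataFrame of predictors and uses a partial F test as the stopping rule, which is one common way to implement the "according to an F-test" criterion on the slides.

```python
# A bare-bones forward-selection sketch: at each step add the candidate
# variable that raises R² the most, and stop when its partial F test is not
# significant.  (Assumes a pandas DataFrame X of predictors and Series y.)
import numpy as np
import pandas as pd
import statsmodels.api as sm
from scipy.stats import f as f_dist

def forward_select(X, y, alpha=0.05):
    selected, remaining = [], list(X.columns)
    while remaining:
        best = None
        for var in remaining:
            cols = selected + [var]
            r2 = sm.OLS(y, sm.add_constant(X[cols])).fit().rsquared
            if best is None or r2 > best[1]:
                best = (var, r2)
        var, r2_new = best
        r2_old = (sm.OLS(y, sm.add_constant(X[selected])).fit().rsquared
                  if selected else 0.0)
        # Partial F test for the increase in R² from adding one variable
        n, k = len(y), len(selected) + 1
        f_stat = (r2_new - r2_old) / ((1 - r2_new) / (n - k - 1))
        p = 1 - f_dist.cdf(f_stat, 1, n - k - 1)
        if p >= alpha:
            break                      # further additions do not improve the model
        selected.append(var)
        remaining.remove(var)
    return selected

# Illustrative usage with simulated data (column names are invented).
rng = np.random.default_rng(2)
X = pd.DataFrame({
    "motivation": rng.normal(size=120),
    "aptitude":   rng.normal(size=120),
    "hours":      rng.normal(size=120),
})
y = 0.6 * X["motivation"] + 0.3 * X["hours"] + rng.normal(scale=0.8, size=120)
print(forward_select(X, y))
```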
Interpretation of the results from multiple
regression
Checking the assumptions
Evaluating the model
Evaluating each of the independent
variables
46
Presenting the results of multiple
regression
It would be a good idea to look for
examples of the presentation of different
statistical analysis in the journals relevant
to your topic area. Different journals have
different requirements and expectations.
Given the severe space limitations in
journals these days, often only a brief
summary of the results is presented and
readers are encouraged to contact the
author for a copy of the full results.
47