Causality and confounding variables
• Scientists aspire to measure cause and effect
• Correlation does not imply causality. Hume's criteria: contiguity + temporal order (cause then effect) + effect present only when the cause is present
• Confounding variables (extraneous factors) may intervene and affect both the proposed cause and the proposed effect
Correlation and Regression
• Steps for making statistical predictions
– Pearson product-moment coefficient of correlation (r) – measures the strength of any linear relationship between two variables – e.g. in bivariate correlation: age and salary level
– Lies in the range -1 ≤ r ≤ +1
– -1 is perfect negative linear correlation; +1 is perfect positive linear correlation; 0 is no correlation
– Measures only the strength of the relationship, not cause and effect (a minimal computation is sketched below)
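As an aside (not part of the original slides), a minimal Python sketch of computing Pearson's r, assuming SciPy is available; the age/salary figures are invented for illustration:

```python
# Minimal sketch: Pearson's r for an invented bivariate sample
# (age and salary level are the illustrative variables named above).
import numpy as np
from scipy.stats import pearsonr

age = np.array([23, 28, 34, 41, 47, 52, 58])
salary = np.array([21, 25, 31, 38, 42, 47, 50])  # in £000s, invented

r, p = pearsonr(age, salary)  # r lies in the range -1 <= r <= +1
print(f"r = {r:.3f}")         # close to +1: strong positive linear relationship
```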
Steps for making statistical predictions continued…
• Having established a correlation (strength)
– Use the ‘coefficient of determination’ (r²) to assess what proportion (%) of the variation in one variable is explained by the Pearson r correlation
– Evaluate the statistical significance (t-scores) – i.e. set the risk level for accepting the calculated coefficients against the null hypothesis (see the sketch below)
• The selection of scatter diagrams (after the sketch below) illustrates linear correlation principles
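Likewise a hedged sketch of these two steps with invented data: r² as the proportion of variation explained, and the standard t-test for Pearson's r, t = r√(n−2)/√(1−r²), against the null hypothesis of zero correlation:

```python
# Minimal sketch: r-squared and the t-score for H0: no correlation.
import numpy as np
from scipy.stats import pearsonr, t as t_dist

x = np.array([2.0, 4.0, 5.0, 7.0, 8.0, 10.0])
y = np.array([3.1, 5.9, 7.2, 9.8, 11.1, 14.0])    # invented data

n = len(x)
r, _ = pearsonr(x, y)
r_squared = r**2                                  # proportion of variation explained
t_score = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)  # standard t-test for Pearson's r
p_value = 2 * t_dist.sf(abs(t_score), df=n - 2)   # two-tailed risk level
print(f"r^2 = {r_squared:.3f}, t = {t_score:.2f}, p = {p_value:.4f}")
```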
[Figure: A selection of scatter diagrams and associated correlation coefficients – six panels of y values plotted against x values, with r = +1, r = -1, r = +0.871, r = -0.497, r = 0 and r = +0.0037]
Now move on to prediction
• From assessing the strength and power of a linear correlation between two variables…
• …move on to describing the nature of the relationship to assist in prediction
The equation of a regression line has the form:
Y = a + bX
where Y is the dependent variable (the one we wish to predict / explain) and X is the independent variable. The value “a” is known as the intercept of the line and “b” measures the gradient of the line.
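A minimal sketch of fitting Y = a + bX in Python (not from the original slides; SciPy's linregress is assumed and the data are invented):

```python
# Minimal sketch: least-squares fit of Y = a + bX.
import numpy as np
from scipy.stats import linregress

x = np.array([22.0, 30.0, 35.0, 41.0, 48.0, 55.0])   # independent variable
y = np.array([1.5, 5.0, 8.2, 10.1, 13.6, 17.0])      # dependent variable

fit = linregress(x, y)
a, b = fit.intercept, fit.slope   # "a" is the intercept, "b" the gradient
print(f"Y = {a:.3f} + {b:.3f}X, r = {fit.rvalue:.3f}")
```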
Worked Example
• LOS (length of service) and age are correlated at r = 0.87207 from a survey of 30 employees in a firm
• r (above) and r² (0.760508) are strong – although this still leaves residuals at 24% (i.e. due to extraneous factors)
• Is this significant?
• Can we predict mean LOS at age 40?
• What is the 95% confidence interval for the additional LOS derived from one extra year of age?
Plotting the data we can see…
[Figure: scatter plot of SERVICE (length of service, 0–30 years) against AGE (10–60 years) with fitted regression line; Rsq = 0.7605]
The equation of the line linking length of service (y) and age (x) is:
Y = -8.2194 + 0.45727X, and SPSS reveals these coefficients for us.
This equation can be used to predict LOS at a selected age.
Where do the figures come from to drop into the Y = a + bX equation?
An SPSS regression printout gives us the data needed to solve the problem:

Variables Entered/Removed
Model 1 | Variables Entered: AGE | Variables Removed: . | Method: Enter
a. All requested variables entered.
b. Dependent Variable: LOS

Model Summary
Model 1 | R = .872 | R Square = .761 | Adjusted R Square = .752 | Std. Error of the Estimate = 2.63
a. Predictors: (Constant), AGE
b. Dependent Variable: LOS

Coefficients
Model 1, (Constant): B = -8.219 | Std. Error = 1.657 | t = -4.961 | Sig. = .000 | 95% Confidence Interval for B: -11.613 (Lower Bound) to -4.826 (Upper Bound)
Model 1, AGE: B = .457 | Std. Error = .048 | Standardized Coefficient (Beta) = .872 | t = 9.429 | Sig. = .000 | 95% Confidence Interval for B: .358 (Lower Bound) to .557 (Upper Bound)
a. Dependent Variable: SERVICE

Casewise Diagnostics
Case Number 2: Std. Residual = 3.385 | SERVICE = 24 | Predicted Value = 15.10 | Residual = 8.90
a. Dependent Variable: LOS
Interpretation of the SPSS output
Variables Entered/Removed
This simply tells us that ‘age’ was the independent variable and ‘service’ the dependent
variable.
Model summary
The value of the correlation coefficient (r) was 0.872 and the value of r² was 0.761.
Coefficients
The ‘unstandardized coefficients’ give us the values of a and b in the regression
equation. Thus the equation here is y = -8.219 + 0.457x
The final column ‘Sig.’ gives values less than 0.01 thus we can say that the coefficients
of the regression equation are significantly different from zero at the 1% (0.01) level (and
thus at 5% (0.05) level).
Casewise diagnostics
During the input dialogue, SPSS was asked to show any standardised residuals outside
the range -3 to + 3. The output shows that one reading, case number 2, had a large
standardised residual. This indicates that this point does not fit the general trend of the
straight line and can be regarded as an ‘outlier’ (i.e. an unusual reading).
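A hedged sketch of the same casewise check in Python (synthetic data generated to mimic the example, with an outlier injected at case 2; residuals are standardised here simply by dividing by their standard deviation, which may differ slightly from SPSS's exact method):

```python
# Minimal sketch: flag observations whose standardised residual
# falls outside the range -3 to +3 (synthetic data, outlier at case 2).
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(0)
age = rng.uniform(20, 60, size=30)
los = -8.2 + 0.46 * age + rng.normal(0, 2.0, size=30)
los[1] += 12.0                                   # inject an outlier at case 2

fit = linregress(age, los)
residuals = los - (fit.intercept + fit.slope * age)
std_resid = residuals / residuals.std(ddof=2)    # 2 df used by a and b
flagged = np.flatnonzero(np.abs(std_resid) > 3) + 1   # 1-based case numbers
print("Cases with |std residual| > 3:", flagged)
```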
The solution…
Y = a + bX
(where Y is LOS; X is age)
Y = -8.2194 + 0.45727x
Y = -8.2194 + 0.45727(40)
Y = -8.2194 + 18.29
Y = 10.07 years’ service predicted at age 40*
And … there is a 95 per cent probability that the mean additional LOS for each extra year
in age lies in the range: 0.358 to 0.557 (as supplied in the SPSS output).
* Have a glance back at the scattergram to check this visually
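For readers who want to reproduce the arithmetic, a minimal sketch using the coefficients and standard error taken from the SPSS printout above (the tiny differences from the printed interval come from rounding in the reported standard error):

```python
# Minimal sketch: prediction at age 40 and the 95% CI for the slope.
import scipy.stats as st

a, b, se_b, n = -8.2194, 0.45727, 0.048, 30      # from the SPSS output
predicted_los = a + b * 40                        # -8.2194 + 18.29 = 10.07
t_crit = st.t.ppf(0.975, df=n - 2)                # two-tailed 95%, df = 28
lower, upper = b - t_crit * se_b, b + t_crit * se_b   # approx 0.359 to 0.556
print(f"Predicted LOS at 40: {predicted_los:.2f} years; "
      f"slope CI: {lower:.3f} to {upper:.3f}")
```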
Basic Quants: A Summary
• We have introduced the modelling concept
• We have reflected on data types/displays
• We have engaged with probability theory
• We have touched on
– Significance testing of hypotheses using both parametric and non-parametric statistics
– Prediction from what is known to make an informed estimate of the variable of interest
» Work through the assignment with the booklet provided alongside; this will guide the solution of every aspect!