Causality and confounding variables
• Scientists aspire to measure cause and effect
• Correlation does not imply causality. Hume: contiguity + order (cause then effect) + effect only when cause present
• Confounding variables (extraneous factors) may intervene and affect both the proposed cause and the proposed effect.

Correlation and Regression
• Steps for making statistical predictions
– Pearson product moment coefficient of correlation (r)
– Measures the strength of any linear relationship between variables
– e.g. in bivariate correlation: age and salary level
– Lies in the range -1 ≤ r ≤ +1
– -1 perfect negative linear correlation; +1 perfect positive linear correlation; 0 no linear correlation
– Only strength of relationship, not cause-effect

Steps for making statistical predictions continued…
• Having established a correlation (strength)
– Use the ‘coefficient of determination’ (r²) to assess what proportion (%) of the variation in one variable is explained by its linear relationship with the other
– Evaluate the statistical significance (t-scores) – i.e. set the risk level (e.g. 5%) at which the calculated coefficients are accepted and the null hypothesis is rejected
• The selection of scatter diagrams (next) illustrates linear correlation principles

A selection of scatter diagrams and associated correlation coefficients
[Figure: six scatter diagrams of y values against x values, with r = +1, r = -1, r = +0.871, r = -0.497, r = 0 and r = +0.0037]

Now move on to prediction
• From assessing the strength and power of a linear correlation between two variables
• …move on to describing the nature of the relationship to assist in predicting

The equation of a regression line has the form: Y = a + bX where Y is the dependent variable (the one we wish to predict / explain) and X is the independent variable.
The value “a” is known as the intercept of the line and “b” measures the gradient of this line.

Worked Example
• Length of service (LOS) and age are correlated at r = 0.87207 in a survey of 30 employees in a firm
• r (above) and r² (0.760508) are strong – although this still leaves about 24% of the variation in LOS unexplained (e.g. due to extraneous factors)
• Is this significant?
• Can we predict mean LOS at age 40?
• What is the 95% confidence interval for the change in LOS associated with one extra year of age?

Plotting the data we can see…
[Figure: scatter plot of SERVICE (y axis) against AGE (x axis) with fitted regression line; Rsq = 0.7605]

The equation of the line linking length of service (y) and age (x) is: Y = -8.2194 + 0.45727x and SPSS reveals these coefficients for us. This equation can be used to predict LOS at a selected age.

Where do the figures come from to drop into the Y = a + bX equation? An SPSS regression printout gives us the data needed to solve the problem:

Variables Entered/Removed
Model   Variables Entered   Variables Removed   Method
1       AGE (a)             .                   Enter
a  All requested variables entered.
b  Dependent Variable: LOS

Model Summary
Model   R      R Square   Adjusted R Square   Std. Error of the Estimate
1       .872   .761       .752                2.63
a  Predictors: (Constant), AGE
b  Dependent Variable: LOS

Coefficients
                Unstandardized Coefficients   Standardized Coefficients                     95% Confidence Interval for B
Model           B         Std. Error          Beta                        t        Sig.    Lower Bound   Upper Bound
1 (Constant)    -8.219    1.657                                           -4.961   .000    -11.613       -4.826
  AGE           .457      .048                .872                        9.429    .000    .358          .557
a  Dependent Variable: SERVICE

Casewise Diagnostics
Case Number   Std. Residual   SERVICE   Predicted Value   Residual
2             3.385           24        15.10             8.90
a  Dependent Variable: LOS

Interpretation of the SPSS output

Variables Entered/Removed
This simply tells us that ‘age’ was the independent variable and ‘service’ the dependent variable.

Model summary
The value of the correlation coefficient (r) was 0.872 and the value of r² was 0.761.

Coefficients
The ‘unstandardized coefficients’ give us the values of a and b in the regression equation.
Thus the equation here is y = -8.219 + 0.457x. The final column ‘Sig.’ gives values less than 0.01, thus we can say that the coefficients of the regression equation are significantly different from zero at the 1% (0.01) level (and thus at the 5% (0.05) level).

Casewise diagnostics
During the input dialogue, SPSS was asked to show any standardised residuals outside the range -3 to +3. The output shows that one reading, case number 2, had a large standardised residual (3.385). This indicates that this point does not fit the general trend of the straight line and can be regarded as an ‘outlier’ (i.e. an unusual reading).

The solution…
Y = a + bX (where Y is LOS; X is age)
Y = -8.2194 + 0.45727x
Y = -8.2194 + 0.45727(40)
Y = -8.2194 + 18.29
Y = 10.07 years’ service predicted at age 40*

And … we can be 95 per cent confident that the mean additional LOS for each extra year in age lies in the range 0.358 to 0.557 years (as supplied in the SPSS output).

* Have a glance back at the scattergram to check this visually

Basic Quants: A Summary
• We have introduced the modelling concept
• We have reflected on data types/displays
• We have engaged with probability theory
• We have touched on
– Significance testing of hypotheses using both parametric and non-parametric statistics
– Prediction from what is known to make an informed estimate of the variable of interest
» Work through the assignment with the booklet provided alongside; this will guide the solution of every aspect!
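
The arithmetic in the worked example (r², the t-score, the prediction at age 40 and the 95% confidence intervals) can be re-checked by hand or with a short script. The sketch below is one way to do that in Python; it is not part of the SPSS workflow described in the slides. The function names are illustrative, and the critical value 2.048 (two-tailed 5% point of Student's t with 28 degrees of freedom) is an assumption taken from a standard t table, not from the printout. Because the printout rounds the standard errors, the intervals agree with SPSS only to about two decimal places.

```python
import math

# Figures quoted in the worked example and the SPSS printout
n = 30          # employees surveyed
r = 0.87207     # Pearson correlation between age and LOS
a = -8.2194     # intercept, the (Constant) row
b = 0.45727     # slope, the AGE row
se_a = 1.657    # Std. Error of the intercept
se_b = 0.048    # Std. Error of the slope (rounded in the printout)

# Assumption: two-tailed 5% critical t value for n - 2 = 28 degrees of
# freedom, read from a standard t table rather than from the slides.
T_CRIT_28 = 2.048


def coefficient_of_determination(r):
    """Proportion of variance in Y explained by the linear fit."""
    return r ** 2


def t_from_r(r, n):
    """t statistic for testing whether a simple-regression slope is zero."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


def predict_los(age):
    """Predicted mean length of service from Y = a + bX."""
    return a + b * age


def confidence_interval(estimate, std_error, t_crit=T_CRIT_28):
    """Two-sided interval: estimate plus or minus t_crit * std_error."""
    margin = t_crit * std_error
    return estimate - margin, estimate + margin


if __name__ == "__main__":
    print("r squared:", round(coefficient_of_determination(r), 4))
    print("t for slope:", round(t_from_r(r, n), 3))
    print("Predicted LOS at age 40:", round(predict_los(40), 2))
    print("95% CI for intercept:", confidence_interval(a, se_a))
    print("95% CI for slope:", confidence_interval(b, se_b))
```

The t value computed from r and n should be close to the 9.429 shown for AGE in the Coefficients table, which is a useful cross-check that the correlation and the regression slope are telling the same story.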