Applying Regression
1
The Course
• 14 (or so) lessons
– Some flexibility
• Depends how we feel
• What we get through
2
Part I: Theory of Regression
1. Models in statistics
2. Models with more than one parameter:
regression
3. Samples to populations
4. Introducing multiple regression
5. More on multiple regression
3
Part 2: Application of regression
6. Categorical predictor variables
7. Assumptions in regression analysis
8. Issues in regression analysis
9. Non-linear regression
10. Categorical and count variables
11. Moderators (interactions) in regression
12. Mediation and path analysis
Part 3: Taking Regression Further
(Kind of brief)
13. Introducing longitudinal multilevel models
4
Bonuses
Bonus lesson 1: Why is it called
regression?
Bonus lesson 2: Other types of
regression.
5
House Rules
• Jeremy must remember
– Not to talk too fast
• If you don't understand
– Ask
– Any time
• If you think I'm wrong
– Ask. (I'm not always right)
6
The Assistants
β€’ Carla Xena - [email protected]
β€’ Eugenia Suarez
Moran [email protected]
β€’ Arian [email protected]
Learning New Techniques
β€’ Best kind of data to learn a new technique
– Data that you know well, and understand
β€’ Your own data
– In computer labs (esp later on)
– Use your own data if you like
β€’ My data
– I’ll provide you with
– Simple examples, small sample sizes
β€’ Conceptually simple (even silly)
8
Computer Programs
β€’ Stata
– Mostly
β€’ I’ll explain SPSS options
β€’ You’ll like Stata more
β€’ Excel
– For calculations
– Semi-optional
β€’ GPower
9
Lesson 1: Models in statistics
Models, parsimony, error, mean,
OLS estimators
10
What is a Model?
11
What is a model?
β€’ Representation
– Of reality
– Not reality
β€’ Model aeroplane represents a real
aeroplane
– If model aeroplane = real aeroplane, it
isn’t a model
12
• Statistics is about modelling
– Representing and simplifying
• Sifting
– What is important from what is not important
• Parsimony
– In statistical models we seek parsimony
– Parsimony = simplicity
13
Parsimony in Science
β€’ A model should be:
– 1: able to explain a lot
– 2: use as few concepts as possible
β€’ More it explains
– The more you get
β€’ Fewer concepts
– The lower the price
β€’ Is it worth paying a higher price for a better
model?
14
The Mean as a Model
15
The (Arithmetic) Mean
β€’ We all know the mean
– The β€˜average’
– Learned about it at school
– Forget (didn’t know) about how clever the mean is
β€’ The mean is:
– An Ordinary Least Squares (OLS) estimator
– Best Linear Unbiased Estimator (BLUE)
16
Mean as OLS Estimator
β€’ Going back a step or two
β€’ MODEL was a representation of DATA
– We said we want a model that explains a lot
– How much does a model explain?
DATA = MODEL + ERROR
ERROR = DATA - MODEL
– We want a model with as little ERROR as possible
17
• What is error?

Data (Y)   Model (b0 = mean)   Error (e)
1.40       1.60                -0.20
1.55       1.60                -0.05
1.80       1.60                 0.20
1.62       1.60                 0.02
1.63       1.60                 0.03
18
• How can we calculate the 'amount' of error?
• Sum of errors?
• Sum of absolute errors?

ERROR = Σe_i = Σ(Y_i − Ŷ) = Σ(Y_i − b0)
      = (−0.20) + (−0.05) + 0.20 + 0.02 + 0.03
      = 0
19
β€’ Are small and large errors equivalent?
– One error of 4
– Four errors of 1
– The same?
– What happens with different data?
β€’ Y = (2, 2, 5)
– b0 = 2
– Not very representative
β€’ Y = (2, 2, 4, 4)
– b0 = any value from 2 - 4
– Indeterminate
β€’ There are an infinite number of solutions which would satisfy
our criteria for minimum error
20
• Sum of squared errors (SSE)

ERROR = Σe_i² = Σ(Y_i − Ŷ)² = Σ(Y_i − b0)²
      = (−0.20)² + (−0.05)² + 0.20² + 0.02² + 0.03²
      = 0.08
21
• Determinate
– Always gives one answer
• If we minimise SSE
– Get the mean
• Shown in graph
– SSE plotted against b0
– Min value of SSE occurs when b0 = mean
22
[Figure: SSE plotted against values of b0 from 1 to 2; SSE is at its minimum where b0 equals the mean (1.6).]
23
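As a quick check of the last few slides, here is a minimal Stata sketch (not part of the original exercises) that enters the five values from the error table and confirms that the sum of squared errors is smallest when b0 is set to the mean; the variable name y is made up for the illustration.

* Minimal sketch: SSE is minimised at the mean (data from the error table)
clear
input y
1.40
1.55
1.80
1.62
1.63
end
summarize y                          // mean = 1.60
generate dev2 = (y - r(mean))^2
quietly summarize dev2
display "SSE at the mean = " r(sum)  // 0.08
drop dev2
* any other candidate value of b0 gives a larger SSE
foreach b0 in 1.5 1.55 1.65 1.7 {
    quietly generate dev2 = (y - `b0')^2
    quietly summarize dev2
    display "SSE at b0 = `b0': " %6.4f r(sum)
    drop dev2
}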
The Mean as an OLS Estimate
24
Mean as OLS Estimate
β€’ The mean is an Ordinary Least Squares
(OLS) estimate
– As are lots of other things
β€’ This is exciting because
– OLS estimators are BLUE
– Best Linear Unbiased Estimators
– Proven with Gauss-Markov Theorem
β€’ Which we won’t worry about
25
BLUE Estimators
• Best
– Minimum variance (of all possible unbiased estimators)
– Narrower distribution than other estimators
• e.g. median, mode

Ŷ = Ȳ
26
SSE and the Standard Deviation
• Tying up a loose end

SSE = Σ(Y_i − Ŷ)²

s = √( Σ(Y_i − Ŷ)² / n )   →   √( Σ(Y_i − Ŷ)² / (n − 1) )
27
• SSE closely related to SD
• Sample standard deviation – s
– Biased estimator of population SD
• Population standard deviation – σ
– Need to know the mean to calculate SD
• Reduces N by 1
• Hence divide by N − 1, not N
– Like losing one df
28
Proof
β€’ That the mean minimises SSE
– Not that difficult
– As statistical proofs go
β€’ Available in
– Maxwell and Delaney – Designing
experiments and analysing data
– Judd and McClelland – Data Analysis: a
model comparison approach
β€’ (out of print?)
29
What’s a df?
β€’ The number of parameters free to vary
– When one is fixed
β€’ Term comes from engineering
– Movement available to structures
30
Back to the Data
• Mean has 5 (N) df
– 1st moment
• σ has N − 1 df
– Mean has been fixed
– 2nd moment
– Can think of it as the amount cases vary away from the mean
31
While we are at it …
• Skewness has N − 2 df
– 3rd moment
• Kurtosis has N − 3 df
– 4th moment
– Amount cases vary from σ
32
Parsimony and df
β€’ Number of df remaining
– Measure of parsimony
β€’ Model which contained all the data
– Has 0 df
– Not a parsimonious model
• Normal distribution
– Can be described in terms of mean and σ
• 2 parameters
– (z with 0 parameters)
33
Summary of Lesson 1
β€’ Statistics is about modelling DATA
– Models have parameters
– Fewer parameters, more parsimony, better
β€’ Models need to minimise ERROR
– Best model, least ERROR
– Depends on how we define ERROR
– If we define error as sum of squared deviations
from predicted value
– Mean is best MODEL
34
Lesson 1a
β€’ A really brief introduction to Stata
35
[Screenshot: the Stata interface – command-review window, output window, variable list, and command window.]
36
Stata Commands
β€’ Can use menus
– But commands are easy
β€’ All have similar format:
β€’ command variables , options
β€’ Stata is case sensitive
– BEDS, beds, Beds
β€’ Stata lets you shorten
– summarize sqft
– su sq
37
More Stata Commands
β€’ Open exercise 1.4.dta
– Run
β€’ summarize sqm
β€’ table beds
β€’ mean price
β€’ histogram price
– Or
β€’ su be
β€’ tab be
β€’ mean pr
β€’ hist pr
38
Lesson 2: Models with one
more parameter - regression
39
In Lesson 1 we said …
β€’ Use a model to predict and describe
data
– Mean is a simple, one parameter model
40
More Models
Slopes and Intercepts
41
More Models
β€’ The mean is OK
– As far as it goes
– It just doesn’t go very far
– Very simple prediction, uses very little
information
β€’ We often have more information than
that
– We want to use more information than that
42
House Prices
β€’ Look at house prices in one area of Los
Angeles
β€’ Predictors of house prices
β€’ Using:
– Sale price, size, number of bedrooms, size
of lot, year built …
43
House Prices
address                   listprice   beds   baths   sqft
3628 OLYMPIAD Dr          649500      4      3       2575
3673 OLYMPIAD Dr          450000      2      3       1910
3838 CHANSON Dr           489900      3      2       2856
3838 West 58TH Pl         330000      4      2       1651
3919 West 58TH Pl         349000      3      2       1466
3954 FAIRWAY Blvd         514900      3      2.25    2018
4044 OLYMPIAD Dr          649000      4      2.5     3019
4336 DON LUIS Dr          474000      2      2.5     2188
4421 West 59TH St         460000      3      2       1519
4518 WHELAN Pl            388000      2      1.5     1403
4670 West 63RD St         259500      3      2       1491
5000 ANGELES VISTA Blvd   678800      5      4       3808
46
One Parameter Model
• The mean

Ȳ = 415.69
Ŷ = b0 = Ȳ
SSE = 341683

"How much is that house worth?"
$415,689
Use 1 df to say that
47
Adding More Parameters
• We have more information than this
– We might as well use it
– Add a linear function of size (square feet) (x1)

Ŷ = b0 + b1x1
48
Alternative Expression
• Estimate of Y (expected value of Y)

Ŷ = b0 + b1x1

• Value of Y

Yi = b0 + b1xi1 + ei
49
Estimating the Model
• We can estimate this model in four different, equivalent ways
– Provides more than one way of thinking about it
1. Estimating the slope which minimises SSE
2. Examining the proportional reduction in SSE
3. Calculating the covariance
4. Looking at the efficiency of the predictions
50
Estimate the Slope to Minimise
SSE
51
Estimate the Slope
β€’ Stage 1
– Draw a scatterplot
– x-axis at mean
β€’ Not at zero
β€’ Mark errors on it
– Called β€˜residuals’
– Sum and square these to find SSE
52
[Figure: scatterplot of LAST SALE PRICE against SQFT, with the x-axis drawn at the mean, the residuals marked, and fitted values shown.]
53
[Figure: the same scatterplot, with the line at the mean and the residuals marked.]
β€’ Add another slope to the chart
– Redraw residuals
– Recalculate SSE
– Move the line around to find slope which
minimises SSE
β€’ Find the slope
55
β€’ First attempt:
56
β€’ Any straight line can be defined with
two parameters
– The location (height) of the slope
β€’ b0
– Sometimes called a
– The gradient of the slope
β€’ b1
57
• Gradient
[Figure: the line rises b1 units for each 1-unit increase in x.]
58
• Height
[Figure: the line sits b0 units above zero.]
59
β€’ Height
β€’ If we fix slope to zero
– Height becomes mean
– Hence mean is b0
β€’ Height is defined as the point that the
slope hits the y-axis
– The constant
– The y-intercept
60
• Why the constant?
– b0 × x0
– Where x0 is 1.00 for every case
• i.e. x0 is constant
– Implicit in Stata
– (And SPSS, SAS, R)
– Some packages force you to make it explicit
– (Later on we'll need to make it explicit)

beds (x1)   x0   £ (000s)
1           1     77
2           1     74
1           1     88
3           1     62
5           1     90
5           1    136
2           1     35
5           1    134
4           1    138
1           1     55
61
• Why the intercept?
– Where the regression line intercepts the y-axis
– Sometimes called the y-intercept
62
Finding the Slope
• How do we find the values of b0 and b1?
– Start by jiggling the values, to find the best estimates which minimise SSE
– Iterative approach
• Computer intensive – used to matter, doesn't really any more
• (With fast computers and sensible search algorithms – more on that later)
63
β€’ Start with
– b0=416 (mean)
– b1=0.5 (nice round number)
β€’ SSE = 365,774
– b0=300, b1=0.5, SSE=341,683
– b0=300, b1=0.6, SSE=310,240
– b0=300, b1=0.8, SSE=264,573
– b0=300, b1=1, SSE=301, 797
– b0=250, b1=1, SSE=255,366
– …..
64
β€’ Quite a long time later
– b0 = 216.357
– b1 = 1.084
– SSE = 145,636.78
β€’ Gives the position of the
– Regression line (or)
– Line of best fit
β€’ Better than guessing
β€’ Not necessarily the only method
– But it is OLS, so it is the best (it is BLUE)
65
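In practice the iteration is done for us. A hedged sketch (not from the course files), assuming the house-price data are in memory with variables called price and sqft: regress finds the same b0, b1 and minimum SSE in one step.

* Sketch: let regress do the minimisation (variable names assumed)
regress price sqft
display "b0 (constant) = " _b[_cons]
display "b1 (slope)    = " _b[sqft]
display "minimised SSE = " e(rss)    // residual sum of squares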
[Figure: actual price and predicted price plotted against SQFT, with the fitted regression line (LAST SALE PRICE and fitted values).]
66
• We now know
– A zero square metre house is worth ≈ $216,000
– Adding a square metre adds $1,080
• Told us two things
– Don't extrapolate to meaningless values of the x-axis
– Constant is not necessarily useful
• It is necessary to estimate the equation
67
Exercise 2a, 2b
68
Standardised Regression Line
β€’ One big but:
– Scale dependent
β€’ Values change
– £ to €, inflation
β€’ Scales change
– £, £000, £00?
β€’ Need to deal with this
69
• Don't express in 'raw' units
– Express in SD units
– σx1 = 183.82
– σy = 114.637
• b1 = 1.103
• We increase x1 by 1, and Ŷ increases by 1.084

1.084 = (1.084 / 114.637) SDs = 0.00945 SDs

• So we increase x1 by 1 and Ŷ increases by 0.0094 SDs
70
• Similarly, 1 unit of x1 = 1/69.017 SDs
– Increase x1 by 1 SD
– Ŷ increases by 1.103 × (69.017/1) = 76.126
• Put them both together

(b1 × σx1) / σy
71
(1.080 × 69.071) / 114.637 = 0.653

• The standardised regression line
– Change (in SDs) in Ŷ associated with a change of 1 SD in x1
• A different route to the same answer
– Standardise both variables (divide by SD)
– Find line of best fit
72
• The standardised regression line has a special name

The Correlation Coefficient (r)
(r stands for 'regression', but more on that later)

• Correlation coefficient is a standardised regression slope
– Relative change, in terms of SDs
73
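A small sketch of the last two slides, again assuming variables price and sqft are in memory: the slope times SD(x)/SD(y) reproduces the correlation, and regress with the beta option reports the same number.

* Sketch: standardised slope = b1 x SD(x) / SD(y) = r
quietly regress price sqft
scalar b1 = _b[sqft]
quietly summarize sqft
scalar sdx = r(sd)
quietly summarize price
scalar sdy = r(sd)
display "standardised slope = " b1 * sdx / sdy
correlate price sqft          // r should match (bar rounding)
regress price sqft, beta      // the beta column shows the same value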
Exercise 2c
74
Proportional Reduction in
Error
75
Proportional Reduction in Error
β€’ We might be interested in the level of
improvement of the model
– How much less error (as proportion) do we
have
– Proportional Reduction in Error (PRE)
β€’ Mean only
– Error(model 0) = 341,683
β€’ Mean + slope
– Error(model 1) = 196,046
76
PRE = (ERROR(0) − ERROR(1)) / ERROR(0)

PRE = 1 − ERROR(1) / ERROR(0)

PRE = 1 − 196046 / 341683

PRE = 0.426
77
• But we squared all the errors in the first place
– So we could take the square root

√0.426 = 0.653

• This is the correlation coefficient
• Correlation coefficient is the square root of the proportion of variance explained
78
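The PRE arithmetic can be checked directly with Stata's display calculator, using the two error figures quoted on the slides (this is just a sketch of the sums, not output from the course data).

* PRE from the SSE of the mean-only and mean + slope models
display "PRE       = " 1 - 196046/341683          // about 0.426
display "sqrt(PRE) = " sqrt(1 - 196046/341683)    // about 0.65, i.e. r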
Standardised Covariance
79
Standardised Covariance
β€’ We are still iterating
– Need a β€˜closed-form’ equation
– Equation to solve to get the parameter
estimates
β€’ Answer is a standardised covariance
– A variable has variance
– Amount of β€˜differentness’
β€’ We have used SSE so far
80
β€’ SSE varies with N
– Higher N, higher SSE
β€’ Divide by N
– Gives SSE per person (or house)
– (Actually N – 1, we have lost a df to the
mean)
β€’ Gives us the variance
β€’ Same as SD2
– We thought of SSE as a scattergram
β€’ Y plotted against X
– (repeated image follows)
81
[Figure: scatterplot of Y against X (repeated from earlier), with the x-axis at the mean.]
82
β€’ Or we could plot Y against Y
– Axes meet at the mean (415)
– Draw a square for each point
– Calculate an area for each square
– Sum the areas
β€’ Sum of areas
– SSE
β€’ Sum of areas divided by N
– Variance
83
Plot of Y against Y
[Figure: Y plotted against Y, with the axes crossing at the mean.]
84
Draw Squares
[Figure: a square is drawn for each point, with sides equal to the deviation from the mean; e.g. a point at 35 gives sides of 35 − 88.9 = −53.9 and an area of (−53.9) × (−53.9) = 2905.21.]
85
• What if we do the same procedure
– Instead of Y against Y
– Y against X
• Draw rectangles (not squares)
• Sum the area
• Divide by N − 1
• This gives us the variance of x with y
– The Covariance
– Shortened to Cov(x, y)
86
87
[Figure: rectangles drawn for two example points; e.g. (55 − 88.9) × (1 − 3) = (−33.9) × (−2) = 67.8, and (138 − 88.9) × (4 − 3) = 49.1 × 1 = 49.1.]
88
• More formally (and easily)
• We can state what we are doing as an equation
– Where Cov(x, y) is the covariance

Cov(x, y) = Σ(x − x̄)(y − ȳ) / (N − 1)

• Cov(x, y) = 5165
• What do points in different sectors do to the covariance?
89
β€’ Problem with the covariance
– Tells us about two things
– The variance of X and Y
– The covariance
β€’ Need to standardise it
– Like the slope
β€’ Two ways to standardise the covariance
– Standardise the variables first
β€’ Subtract from mean and divide by SD
– Standardise the covariance afterwards
90
β€’ First approach
– Much more computationally expensive
β€’ Too much like hard work to do by hand
– Need to standardise every value
β€’ Second approach
– Much easier
– Standardise the final value only
β€’ Need the combined variance
– Multiply two variances
– Find square root (were multiplied in first
place)
91
• Standardised covariance

= Cov(x, y) / √( Var(x) × Var(y) )

= 5165 / (69.02 × 114.64)

= 0.653
92
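The same standardisation as a quick sketch: the first line uses the covariance and the two SDs quoted on the slide; the correlate commands (assuming variables price and sqm are in memory) would give the covariance and the correlation directly.

* Standardise the covariance by the two SDs
display "r = " 5165 / (69.02 * 114.64)    // about 0.653
* Or let Stata do it (variable names assumed):
correlate price sqm, covariance           // covariance matrix
correlate price sqm                       // correlation matrix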
• The correlation coefficient
– A standardised covariance is a correlation coefficient

r = Covariance / √( variance × variance )
93
• Expanded …

r = [ Σ(x − x̄)(y − ȳ) / (N − 1) ] / √[ ( Σ(x − x̄)² / (N − 1) ) × ( Σ(y − ȳ)² / (N − 1) ) ]
94
β€’ This means …
– We now have a closed form equation to
calculate the correlation
– Which is the standardised slope
– Which we can use to calculate the
unstandardised slope
95
We know that:

r = (b1 × σx1) / σy

We know that:

b1 = (r × σy) / σx1
96
b1 = (r × σy) / σx1

b1 = (0.659 × 114.64) / 69.017

b1 = 1.080

• So the value of b1 is the same as with the iterative approach
97
β€’ The intercept
– Just while we are at it
β€’ The variables are centred at zero
– We subtracted the mean from both
variables
– Intercept is zero, because the axes cross at
the mean
98
• Add mean of y to the constant
– Adjusts for centring y
• Subtract mean of x
– But not the whole mean of x
– Need to correct it for the slope

c = ȳ − b1x̄1
c = 415.7 − 1.08 × 183.81
c = 216.35
99
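Putting the closed-form pieces together: a sketch using the summary figures quoted on these slides (r ≈ 0.653, SDs 114.64 and 69.017, means 415.7 and 183.81) to recover the unstandardised slope and the constant.

* Unstandardised slope and constant from r, the SDs and the means
scalar b1 = 0.653 * 114.64 / 69.017
scalar c  = 415.7 - b1 * 183.81
display "b1 = " b1    // about 1.08
display "c  = " c     // about 216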
Accuracy of Prediction
100
One More (Last One)
β€’ We have one more way to calculate the
correlation
– Looking at the accuracy of the prediction
β€’ Use the parameters
– b0 and b1
– To calculate a predicted value for each
case
101
Sqm     Actual Price   Predicted Price
239.2   605.0          475.8
177.4   400.0          408.8
265.3   529.5          504.1
153.4   315.0          382.7
136.2   341.0          364.0
187.5   525.0          419.7
280.5   585.0          520.5
203.3   430.0          436.8
141.1   436.0          369.4
130.3   390.0          357.7

• Plot actual price against predicted price
– From the model
102
[Figure: predicted value plotted against actual value (LAST SALE PRICE).]
103
β€’ r = 0.653
β€’ The correlation between actual and predicted
value
β€’ Seems a futile thing to do
– And at this stage, it is
– But later on, we will see why
104
Some More Formulae
• For hand calculation

r = Σxy / √( Σx² Σy² )

• Point biserial

r = ( (M_y1 − M_y0) × √(PQ) ) / sd_y
105
• Phi (φ)
– Used for 2 dichotomous variables

                 Vote P   Vote Q
Homeowner        A: 19    B: 54
Not homeowner    C: 60    D: 53

r = (BC − AD) / √( (A + B)(C + D)(A + C)(B + D) )
106
• Problem with the phi correlation
– Unless Px = Py (or Px = 1 − Py)
• Maximum (absolute) value is < 1.00
• Tetrachoric correlation can be used to correct this
• Rank (Spearman) correlation
– Used where data are ranked

r = 1 − 6Σd² / ( n(n² − 1) )
107
Summary
β€’ Mean is an OLS estimate
– OLS estimates are BLUE
β€’ Regression line
– Best prediction of outcome from predictor
– OLS estimate (like mean)
β€’ Standardised regression line
– A correlation
108
• Four ways to think about a correlation
1. Standardised regression line
2. Proportional Reduction in Error (PRE)
3. Standardised covariance
4. Accuracy of prediction
109
Regression and Correlation in
Stata
β€’ Correlation:
β€’ correlate x y
β€’ correlate x y , cov
β€’ regress y x
β€’ Or
β€’ regress price sqm
110
Post-Estimation
• Stata commands 'leave behind' something
• You can run post-estimation commands
– They mean 'from the last regression'
• Get predicted values:
– predict my_preds
• Get residuals:
– predict my_res, residuals
(residuals comes after the comma, so it's an option)
111
Graphs
β€’ Scatterplot
β€’ scatter price beds
β€’ Regression line
– lfit price beds
β€’ Both graphs
– twoway (scatter price beds)
(lfit price beds)
112
β€’ What happens if you run reg without a
predictor?
– regress price
113
Exercises
114
Lesson 3: Samples to
Populations – Standard Errors
and Statistical Significance
115
The Problem
β€’ In Social Sciences
– We investigate samples
β€’ Theoretically
– Randomly taken from a specified
population
– Every member has an equal chance of
being sampled
– Sampling one member does not alter the
chances of sampling another
β€’ Not the case in (say) physics, biology,
etc.
116
Population
β€’ But it’s the population that we are
interested in
– Not the sample
– Population statistic represented with a Greek letter
– Hat means 'estimate'
– e.g. b estimates β (written β̂); σ̂x estimates σx
117
β€’ Sample statistics (e.g. mean) estimate
population parameters
β€’ Want to know
– Likely size of the parameter
– If it is > 0
118
Sampling Distribution
β€’ We need to know the sampling
distribution of a parameter estimate
– How much does it vary from sample to
sample
β€’ If we make some assumptions
– We can know the sampling distribution of
many statistics
– Start with the mean
119
Sampling Distribution of the
Mean
β€’ Given
– Normal distribution
– Random sample
– Continuous data
β€’ Mean has a known sampling distribution
– Repeatedly sampling will give a known
distribution of means
– Centred around the true (population) mean (μ)
120
Analysis Example: Memory
β€’ Difference in memory for different
words
– 10 participants given a list of 30 words to
learn, and then tested
– Two types of word
β€’ Abstract: e.g. love, justice
β€’ Concrete: e.g. carrot, table
121
Concrete   Abstract   Diff (x)
12         4           8
11         7           4
4          6          -2
9          12         -3
8          6           2
12         10          2
9          8           1
8          5           3
12         10          2
8          4           4

x̄ = 2.1
σx = 3.11
N = 10
122
Confidence Intervals
β€’ This means
– If we know the mean in our sample
– We can estimate where the mean in the population (μ) is likely to be
β€’ Using
– The standard error (se) of the mean
– Represents the standard deviation of the
sampling distribution of the mean
123
1 SD contains
68%
Almost 2 SDs
contain 95%
124
• We know the sampling distribution of the mean
– t distributed if N < 30
– Normal with large N (>30)
• Asymptotically normal
• Know the range within which means from other samples will fall
– Therefore the likely range of μ

se(x̄) = σx / √n
125
• Two implications of equation
– Increasing N decreases SE
• But only a bit (SE halves if N is 4 times bigger)
– Decreasing SD decreases SE
• Calculate Confidence Intervals
– From standard errors
• 95% is a standard level of CI
– In 95% of samples the true mean will lie within the 95% CIs
– In large samples: 95% CI = 1.96 × SE
– In smaller samples: depends on t distribution (df = N − 1 = 9)
126
x̄ = 2.1,  σx = 3.11,  N = 10

se(x̄) = σx / √n = 3.11 / √10 = 0.98
127
95% CI = 2.26 × 0.98 = 2.22

x̄ − CI ≤ μ ≤ x̄ + CI
−0.12 ≤ μ ≤ 4.32
128
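A sketch of the same calculation in Stata, using the summary figures for the word-memory difference scores (mean 2.1, SD 3.11, N = 10); invttail(9, 0.025) supplies the 2.26 used above. With the raw differences in a variable, mean and ttest report the CI and the test directly.

* SE and 95% CI for the mean, by hand
display "se          = " 3.11 / sqrt(10)             // 0.98
display "t(9, .025)  = " invttail(9, 0.025)          // 2.26
display "lower limit = " 2.1 - invttail(9, 0.025) * 3.11/sqrt(10)
display "upper limit = " 2.1 + invttail(9, 0.025) * 3.11/sqrt(10)
* with the raw data in a variable called diff (name assumed):
* mean diff           // mean, SE and 95% CI
* ttest diff == 0     // t and p for the test against zero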
What is a CI?
β€’ (For 95% CI):
β€’ 95% chance that the true (population)
value lies within the confidence
interval? No;
β€’ 95% of samples, true mean will land
within the confidence interval?
129
Significance Test
• Probability that μ is a certain value
– Almost always 0
• Doesn't have to be though
• We want to test the hypothesis that the difference is equal to 0
– i.e. find the probability of this difference occurring in our sample IF μ = 0
– (Not the same as the probability that μ = 0)
130
• Calculate SE, and then t
– t has a known sampling distribution
– Can test probability that a certain value is included

t = x̄ / se(x̄)

t = 2.1 / 0.98 = 2.14

p = 0.061
131
Other Parameter Estimates
β€’ Same approach
– Prediction, slope, intercept, predicted
values
– At this point, prediction and slope are the
same
β€’ Won’t be later on
β€’ One predictor only
– More complicated with > 1
132
Testing the Degree of
Prediction
β€’ Prediction is correlation of Y with ΕΆ
– The correlation – when we have one IV
β€’ Use F, rather than t
β€’ Started with SSE for the mean only
– This is SStotal
– Divide this into SSresidual
– SSregression
β€’ SStot = SSreg + SSres
133
F = (SSreg / df1) / (SSres / df2)

df1 = k
df2 = N − k − 1
134
β€’ Back to the house prices
– Original SSE (SStotal) = 341683
– SSresidual = 196046
β€’ What is left after our model
– SSregression = 341683– 196046= 145636
β€’ What our model explains
135
F = (SSreg / df1) / (SSres / df2)

F = (145636 / 1) / (196046 / (25 − 1 − 1)) = 18.57

df1 = k = 1
df2 = N − k − 1
136
β€’ F = 18.6, df = 1, 25, p = 0.0002
– Can reject H0
β€’ H0: Prediction is not better than chance
– A significant effect
137
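The sums of squares and the F test come straight out of regress. A hedged sketch, with the variable names (price, sqm) assumed:

* SS decomposition and F from a fitted regression
quietly regress price sqm
display "SS regression = " e(mss)
display "SS residual   = " e(rss)
display "F             = " e(F)
display "p             = " Ftail(e(df_m), e(df_r), e(F))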
Statistical Significance:
What does a p-value (really)
mean?
138
A Quiz
β€’ Six questions, each true or false
β€’ Write down your answers (if you like)
β€’ An experiment has been done. Carried out
perfectly. All assumptions perfectly satisfied.
Absolutely no problems.
β€’ P = 0.01
– Which of the following can we say?
139
1. You have absolutely disproved the null
hypothesis (that is, there is no
difference between the population
means).
140
2. You have found the probability of the
null hypothesis being true.
141
3. You have absolutely proved your
experimental hypothesis (that there is
a difference between the population
means).
142
4. You can deduce the probability of the
experimental hypothesis being true.
143
5. You know, if you decide to reject the
null hypothesis, the probability that
you are making the wrong decision.
144
6. You have a reliable experimental
finding in the sense that if,
hypothetically, the experiment were
repeated a great number of times, you
would obtain a significant result on
99% of occasions.
145
OK, What is a p-value
• Cohen (1994)
"[a p-value] does not tell us what we want to know, and we so much want to know what we want to know that, out of desperation, we nevertheless believe it does" (p 997).
146
OK, What is a p-value
• Sorry, didn't answer the question
• It's "The probability of obtaining a result as or more extreme than the result we have in the study, given that the null hypothesis is true"
• Not probability the null hypothesis is true
147
A Bit of Notation
• Not because we like notation
– But we have to say a lot less
• Probability – P
• Null hypothesis is true – H
• Result (data) – D
• Given – |
148
What's a P Value
• P(D|H)
– Probability of the data occurring if the null hypothesis is true
• Not
• P(H|D) (what we want to know)
– Probability that the null hypothesis is true, given that we have the data
• P(H|D) ≠ P(D|H)
149
β€’ What is probability you are prime minister
– Given that you are British
– P(M|B)
– Very low
β€’ What is probability you are British
– Given you are prime minister
– P(B|M)
– Very high
• P(M|B) ≠ P(B|M)
150
β€’ There’s been a murder
– Someone murdered an instructor (perhaps
they talked too much)
β€’ The police have DNA
β€’ The police have your DNA
– They match(!)
β€’ DNA matches 1 in 1,000,000 people
β€’ What’s the probability you didn’t do the
murder, given the DNA match (H|D)
151
β€’ Police say:
– P(D|H) = 1/1,000,000
β€’ Luckily, you have Jeremy on your defence
team
β€’ We say:
– P(D|H) ≠ P(H|D)
β€’ Probability that someone matches the
DNA, who didn’t do the murder
– Incredibly high
152
Back to the Questions
β€’ Haller and Kraus (2002)
– Asked those questions of groups in
Germany
– Psychology Students
– Psychology lecturers and professors (who
didn’t teach stats)
– Psychology lecturers and professors (who
did teach stats)
153
1. You have absolutely disproved the null
hypothesis (that is, there is no difference
between the population means).
• True
– 34% of students
– 15% of professors/lecturers
– 10% of professors/lecturers teaching statistics
• False
– We have found evidence against the null hypothesis
154
2. You have found the probability of the null hypothesis being true.
– 32% of students
– 26% of professors/lecturers
– 17% of professors/lecturers teaching statistics
• False
• We don't know
155
3. You have absolutely proved your experimental hypothesis (that there is a difference between the population means).
– 20% of students
– 13% of professors/lecturers
– 10% of professors/lecturers teaching statistics
• False
156
4. You can deduce the probability of the experimental hypothesis being true.
– 59% of students
– 33% of professors/lecturers
– 33% of professors/lecturers teaching statistics
• False
157
5. You know, if you decide to reject the null hypothesis, the probability that you are making the wrong decision.
– 68% of students
– 67% of professors/lecturers
– 73% of professors/lecturers teaching statistics
• False
• Can be worked out
– P(replication)
158
6. You have a reliable experimental finding in the sense that if, hypothetically, the experiment were repeated a great number of times, you would obtain a significant result on 99% of occasions.
– 41% of students
– 49% of professors/lecturers
– 37% of professors/lecturers teaching statistics
• False
• Another tricky one
– It can be worked out
159
One Last Quiz
β€’ I carry out a study
– All assumptions perfectly satisfied
– Random sample from population
– I find p = 0.05
β€’ You replicate the study exactly
– What is probability you find p < 0.05?
160
β€’ I carry out a study
– All assumptions perfectly satisfied
– Random sample from population
– I find p = 0.01
β€’ You replicate the study exactly
– What is probability you find p < 0.05?
161
β€’ Significance testing creates boundaries
and gaps where none exist.
β€’ Significance testing means that we find
it hard to build upon knowledge
– we don’t get an accumulation of
knowledge
162
• Yates (1951)
"the emphasis given to formal tests of significance ... has resulted in ... an undue concentration of effort by mathematical statisticians on investigations of tests of significance applicable to problems which are of little or no practical importance ... and ... it has caused scientific research workers to pay undue attention to the results of the tests of significance ... and too little to the estimates of the magnitude of the effects they are investigating"
163
Testing the Slope
β€’ Same idea as with the mean
– Estimate 95% CI of slope
– Estimate significance of difference from a
value (usually 0)
β€’ Need to know the SD of the slope
– Similar to SD of the mean
164
s_y.x = √( Σ(Y − Ŷ)² / (N − k − 1) )

s_y.x = √( SSres / (N − k − 1) )

s_y.x = √( 5921 / 8 ) = 27.2
165
β€’ Similar to equation for SD of mean
β€’ Then we need standard error
- Similar (ish)
β€’ When we have standard error
– Can go on to 95% CI
– Significance of difference
166
se(b_y.x) = s_y.x / √( Σ(x − x̄)² )

se(b_y.x) = 27.2 / √26.9 = 5.24
167
• Confidence Limits
• 95% CI
– t dist with N − k − 1 df is 2.31
– CI = 5.24 × 2.31 = 12.06
• 95% confidence limits

14.8 − 12.1 ≤ β ≤ 14.8 + 12.1
2.7 ≤ β ≤ 26.9
168
• Significance of difference from zero
– i.e. probability of getting the result if β = 0
• Not the probability that β = 0

t = b / se(b) = 14.7 / 5.2 = 2.81

df = N − k − 1 = 8
p = 0.02

• This probability is (of course) the same as the value for the prediction
169
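The slope test can be checked by hand from the quantities on these slides (s_y.x = 27.2, Σ(x − x̄)² = 26.9, b = 14.7, residual df = 8). A sketch:

* SE, t, p and 95% CI for the slope, by hand
scalar seb = 27.2 / sqrt(26.9)
display "se(b) = " seb                              // about 5.24
display "t     = " 14.7 / seb                       // about 2.8
display "p     = " 2 * ttail(8, 14.7 / seb)         // about 0.02
display "95% CI: " 14.7 - invttail(8, 0.025)*seb " to " 14.7 + invttail(8, 0.025)*seb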
Testing the Standardised
Slope (Correlation)
• Correlation is bounded between −1 and +1
– Does not have a symmetrical distribution, except around 0
• Need to transform it
– Fisher z' transformation – approximately normal

z' = 0.5[ ln(1 + r) − ln(1 − r) ]

SE_z' = 1 / √(n − 3)
170
z' = 0.5[ ln(1 + 0.706) − ln(1 − 0.706) ]
z' = 0.879

SE_z' = 1 / √(n − 3) = 1 / √(10 − 3) = 0.38

• 95% CIs
– 0.879 − 1.96 × 0.38 = 0.13
– 0.879 + 1.96 × 0.38 = 1.62
171
• Transform back to correlation

r = ( e^(2z') − 1 ) / ( e^(2z') + 1 )

• 95% CIs = 0.13 to 0.92
• Very wide
– Because of small sample size
– Maybe that's why CIs are not reported?
172
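A sketch of the whole round trip in Stata, using r = 0.706 and n = 10 as on the slides (the ln/exp functions do the same job as Excel's Fisher() and Fisherinv(), mentioned on the next slide):

* Fisher z' confidence interval for a correlation, by hand
scalar z   = 0.5 * (ln(1 + 0.706) - ln(1 - 0.706))
scalar sez = 1 / sqrt(10 - 3)
scalar lo  = z - 1.96 * sez
scalar hi  = z + 1.96 * sez
display "z' = " z ", SE = " sez                          // 0.879 and 0.38
display "lower r = " (exp(2*lo) - 1) / (exp(2*lo) + 1)   // about 0.13
display "upper r = " (exp(2*hi) - 1) / (exp(2*hi) + 1)   // about 0.92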
Using Excel
β€’ Functions in excel
– Fisher() – to carry out Fisher
transformation
– Fisherinv() – to transform back to
correlation
173
The Others
β€’ Same ideas for calculation of CIs and
SEs for
– Predicted score
– Gives expected range of values given X
β€’ Same for intercept
– But we have probably had enough
174
One more tricky thing
β€’ (Don’t worry if you don’t understand)
β€’ For means, regression estimates, etc
– Estimate
β€’ 1.0000
– 95% confidence intervals
β€’ 0.0000, 2.0000
– P = 0.05000
β€’ They match
175
β€’ For correlations, odds ratios, etc
– No longer match
β€’ 95% CIs
– 0.0000, 0.50000
β€’ P-value
– 0.052000
β€’ Because of the sampling distribution of the
mean
– Does not depend on the value
β€’ The sampling distribution of a proportion
– Does depend on the value
– More certainty around 0.9 than around 0.00.
176
Lesson 4: Introducing Multiple
Regression
177
Residuals
β€’ We said
Y = b0 + b1x1
β€’ We could have said
Yi = b0 + b1xi1 + ei
β€’ We ignored the i on the Y
β€’ And we ignored the ei
– It’s called error, after all
β€’ But it isn’t just error
– Trying to tell us something
178
What Error Tells Us
β€’ Error tells us that a case has a different
score for Y than we predict
– There is something about that case
β€’ Called the residual
– What is left over, after the model
β€’ Contains information
– Something is making the residual ≠ 0
– But what?
179
[Figure: actual price and predicted price plotted against SQFT, with the fitted regression line (repeated from earlier).]
181
β€’ The residual (+ the mean) is the
expected value of Y
If all cases were equal on X
β€’ It is the value of Y, controlling for X
β€’ Other words:
– Holding constant
– Partialling
– Residualising (residualised scores)
– Conditioned on
182
β€’ Sometimes adjustment is enough on its own
– Measure performance against criteria
β€’ Teenage pregnancy rate
– Measure pregnancy and abortion rate in areas
– Control for socio-economic deprivation, religion,
rural/urban and anything else important
– See which areas have lower teenage pregnancy
and abortion rate, given same level of deprivation
β€’ Value added education tables
– Measure school performance
– Control for initial intake
183
Sqm     Price   Predicted   Residual   Adj Value (mean + resid)
239.2   605.0   475.77       129.23    544.8
177.4   400.0   408.78        -8.78    406.8
265.3   529.5   504.08        25.42    441.0
153.4   315.0   382.69       -67.69    347.9
136.2   341.0   364.05       -23.05    392.6
187.5   525.0   419.66       105.34    520.9
280.5   585.0   520.51        64.49    480.1
203.3   430.0   436.79        -6.79    408.8
141.1   436.0   369.39        66.61    482.2
130.3   390.0   357.70        32.30    447.9
184
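The adjusted-value column is just the mean of the outcome plus the residual. A sketch, assuming variables price and sqm are in memory (res and adj are made-up names):

* Residualised ('adjusted') values: mean of Y plus the residual
quietly regress price sqm
predict res, residuals
quietly summarize price
generate adj = r(mean) + res     // price, controlling for sqm
list sqm price adj in 1/10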
Control?
β€’ In experimental research
– Use experimental control
– e.g. same conditions, materials, time of
day, accurate measures, random
assignment to conditions
β€’ In non-experimental research
– Can’t use experimental control
– Use statistical control instead
185
Analysis of Residuals
β€’ What predicts differences in crime rate
– After controlling for socio-economic
deprivation
– Number of police?
– Crime prevention schemes?
– Rural/Urban proportions?
– Something else
β€’ This is (mostly) what multiple
regression is about
186
β€’ Exam performance
– Consider number of books a student read
(books)
– Number of lectures (max 20) a student
attended (attend)
β€’ Books and attend as IV, grade as
outcome
187
Books   Attend   Grade
0        9       45
1       15       57
0       10       45
2       16       51
4       10       65
4       20       88
1       11       44
4       20       87
3       15       89
0       15       59

First 10 cases
188
β€’ Use books as IV
– R=0.492, F=12.1, df=1, 28, p=0.001
– b0=52.1, b1=5.7
– (Intercept makes sense)
β€’ Use attend as IV
– R=0.482, F=11.5, df=1, 38, p=0.002
– b0=37.0, b1=1.9
– (Intercept makes less sense)
189
[Figure: scatterplot of Grade (out of 100) against Books, with the fitted regression line.]
190
[Figure: scatterplot of Grade against Attend, with the fitted regression line.]
191
Problem
β€’ Use R2 to give proportion of shared
variance
– Books = 24%
– Attend = 23%
β€’ So we have explained 24% + 23% =
47% of the variance
– NO!!!!!
192
• Look at the correlation matrix

          BOOKS   ATTEND   GRADE
BOOKS     1
ATTEND    0.44    1
GRADE     0.49    0.48     1

• Correlation of books and attend is (unsurprisingly) not zero
– Some of the variance that books shares with grade is also shared by attend
193
β€’ I have access to 2 cars
β€’ My wife has access to 2 cars
– We have access to four cars?
– No. We need to know how many of my 2
cars are shared
β€’ Similarly with regression
– But we can do this with the residuals
– Residuals are what is left after (say) books
– See if residual variance is explained by
attend
– Can use this new residual variance to
calculate SSres, SStotal and SSreg
194
β€’ Well. Almost.
– This would give us correct values for SS
– Would not be correct for slopes, etc
β€’ Because assumes that the variables
have a causal priority
– Why should attend have to take what is
left from books?
– Why should books have to take what is left
by attend?
β€’ Use OLS again; take variance they
share
195
β€’ Simultaneously estimate 2 parameters
– b1 and b2
– Y = b0 + b1x1 + b2x2
– x1 and x2 are IVs
β€’ Shared variance
β€’ Not trying to fit a line any more
– Trying to fit a plane
β€’ Can solve iteratively
– Closed form equations better
– But they are unwieldy
196
[Figure: 3D scatterplot (2 points only) of y against x1 and x2, followed by the fitted regression plane with intercept b0 and slopes b1 and b2.]
198
199
Increasing Power
β€’ What if the predictors don’t correlate?
β€’ Regression is still good
– It increases the power to detect effects
– (More on power later)
β€’ Less variance left over
β€’ When do we know the two predictors
don’t correlate?
200
(Really) Ridiculous Equations

b1 = [ Σ(y − ȳ)(x1 − x̄1) Σ(x2 − x̄2)² − Σ(y − ȳ)(x2 − x̄2) Σ(x1 − x̄1)(x2 − x̄2) ] / [ Σ(x1 − x̄1)² Σ(x2 − x̄2)² − ( Σ(x1 − x̄1)(x2 − x̄2) )² ]

b2 = [ Σ(y − ȳ)(x2 − x̄2) Σ(x1 − x̄1)² − Σ(y − ȳ)(x1 − x̄1) Σ(x1 − x̄1)(x2 − x̄2) ] / [ Σ(x2 − x̄2)² Σ(x1 − x̄1)² − ( Σ(x2 − x̄2)(x1 − x̄1) )² ]

b0 = ȳ − b1x̄1 − b2x̄2
201
β€’ The good news
– There is an easier way
β€’ The bad news
– It involves matrix algebra
β€’ The good news
– We don’t really need to know how to do it
202
β€’ We’re not programming computers
– So we usually don’t care
β€’ Very, very occasionally it helps to know
what the computer is doing
203
Back to the Good News
• We can calculate the standardised parameters as

B = Rxx⁻¹ × Rxy

• Where
– B is the vector of regression weights
– Rxx⁻¹ is the inverse of the correlation matrix of the independent (x) variables
– Rxy is the vector of correlations of the x and y variables
204
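A sketch of B = Rxx⁻¹Rxy with Stata's matrix commands, using the books/attend/grade correlations shown earlier (0.44, 0.49, 0.48); the matrix names are made up.

* Standardised estimates from the correlation matrices
matrix Rxx = (1, 0.44 \ 0.44, 1)       // correlations among the predictors
matrix Rxy = (0.49 \ 0.48)             // correlations of predictors with grade
matrix B = invsym(Rxx) * Rxy
matrix list B                          // standardised weights for books and attend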
Exercise 4.2
205
Exercises
β€’ Exercise 4.1
– Grades data in Excel
β€’ Exercise 4.2
– Repeat in Stata
β€’ Exercise 4.3
– Zero correlation
β€’ Exercise 4.4
– Repeat therapy data
β€’ Exercise 4.5
– PTSD in families.
206
Lesson 5: More on Multiple
Regression
207
Contents
β€’ More on parameter estimates
– Standard errors of coefficients
β€’ R, R2, adjusted R2
β€’ Extra bits
– Suppressors
– Decisions about control variables
– Standardized estimates > 1
– Variable entry techniques
208
More on Parameter Estimates
209
Parameter Estimates
β€’ Parameter estimates (b1, b2 … bk) were
standardised
– Because we analysed a correlation matrix
β€’ Represent the correlation of each IV
with the outcome
– When all other IVs are held constant
210
β€’ Can also be unstandardised
β€’ Unstandardised represent the unit (rather
than SD’s) change in the outcome
associated with a 1 unit change in the
IV
– When all the other variables are held
constant
β€’ Parameters have standard errors
associated with them
– As with one IV
– Hence t-test, and associated probability
can be calculated
β€’ Trickier than with one IV
211
Standard Error of Regression Coefficient
• Standardised is easier

SE_i = √[ (1 − R²_Y) / ( (n − k − 1)(1 − R²_i) ) ]

– R²_i is the value of R² when all the other predictors are used as predictors of that variable
• Note that if R²_i = 0, the equation is the same as for the single-predictor case
212
Multiple R
213
Multiple R
β€’ The degree of prediction
– R (or Multiple R)
– No longer equal to b
β€’ R2 Might be equal to the sum of squares
of B
– Only if all x’s are uncorrelated
214
In Terms of Variance
β€’ Can also think of R2 in terms of
variance explained.
– Each IV explains some variance in the
outcome
– The IVs share some of their variance
β€’ Can’t share the same variance twice
215
[Figure: Venn diagram – the total variance of Y = 1; variance in Y accounted for by x1 (r²x1y = 0.36) and by x2 (r²x2y = 0.36), with no overlap between x1 and x2.]
216
• In this model
– R² = r²yx1 + r²yx2
– R² = 0.36 + 0.36 = 0.72
– R = √0.72 = 0.85
• But
– If x1 and x2 are correlated
– No longer the case
217
[Figure: Venn diagram – as above, but x1 and x2 now overlap; some variance is shared between x1 and x2 (not equal to rx1x2).]
218
β€’ So
– We can no longer sum the r2
– Need to sum them, and subtract the
shared variance – i.e. the correlation
β€’ But
– It’s not the correlation between them
– It’s the correlation between them as a
proportion of the variance of Y
β€’ Two different ways
219
• Based on estimates

R² = b1 × ryx1 + b2 × ryx2

• If rx1x2 = 0
– rxy = bx1
– Equivalent to r²yx1 + r²yx2
220
• Based on correlations

R² = ( r²yx1 + r²yx2 − 2 × ryx1 × ryx2 × rx1x2 ) / ( 1 − r²x1x2 )

• If rx1x2 = 0
– Equivalent to r²yx1 + r²yx2
221
β€’ Can also be calculated using methods
we have seen
– Based on PRE (predicted value)
– Based on correlation with prediction
β€’ Same procedure with >2 IVs
222
Adjusted R2
β€’ R2 is on average an overestimate of
population value of R2
– Any x will not correlate 0 with Y
– Any variation away from 0 increases R
– Variation from 0 more pronounced with
lower N
β€’ Need to correct R2
– Adjusted R2
223
• Calculation of Adj. R²

Adj. R² = 1 − (1 − R²) × (N − 1) / (N − k − 1)

• 1 − R²
– Proportion of unexplained variance
– We multiply this by an adjustment
• More variables – greater adjustment
• More people – less adjustment
224
(N − 1) / (N − k − 1)

N = 20, k = 3:   (20 − 1) / (20 − 3 − 1) = 19/16 = 1.1875

N = 10, k = 8:   (10 − 1) / (10 − 8 − 1) = 9/1 = 9

N = 10, k = 3:   (10 − 1) / (10 − 3 − 1) = 9/6 = 1.5
225
Extra Bits
β€’ Some stranger things that can
happen
– Counter-intuitive
226
Suppressor variables
β€’ Can be hard to understand
– Very counter-intuitive
β€’ Definition
– A predictor which increases the size of the
parameters associated with other
predictors above the size of their
correlations
227
• An example (based on Horst, 1941)
– Success of trainee pilots
– Mechanical ability (x1), verbal ability (x2), success (y)
• Correlation matrix

          Mech   Verb   Success
Mech      1      0.5    0.3
Verb      0.5    1      0
Success   0.3    0      1
228
– Mechanical ability correlates 0.3 with
success
– Verbal ability correlates 0.0 with success
– What will the parameter estimates be?
– (Don’t look ahead until you have had a
guess)
229
β€’ Mechanical ability
– b = 0.4
– Larger than r!
β€’ Verbal ability
– b = -0.2
– Smaller than r!!
β€’ So what is happening?
– You need verbal ability to do the mechanical
ability test
– Not actually related to mechanical ability
β€’ Measure of mechanical ability is contaminated by verbal
ability
230
β€’ High mech, low verbal
– High mech
β€’ This is positive (.4)
– Low verbal
• Negative, because we are talking about standardised scores (−(−0.2) → 0.2)
β€’ Your mech is really high – you did well on the
mechanical test, without being good at the
words
β€’ High mech, high verbal
– Well, you had a head start on mech,
because of verbal, and need to be brought
down a bit
231
Another suppressor?

      x1    x2    y
x1    1     0.5   0.3
x2    0.5   1     0.2
y     0.3   0.2   1

b1 =
b2 =
232
Another suppressor?

      x1    x2    y
x1    1     0.5   0.3
x2    0.5   1     0.2
y     0.3   0.2   1

b1 = 0.26
b2 = -0.06
233
And another?

      x1     x2     y
x1    1      0.5    0.3
x2    0.5    1      -0.2
y     0.3    -0.2   1

b1 =
b2 =
234
And another?

      x1     x2     y
x1    1      0.5    0.3
x2    0.5    1      -0.2
y     0.3    -0.2   1

b1 = 0.53
b2 = -0.47
235
One more?

      x1     x2     y
x1    1      -0.5   0.3
x2    -0.5   1      0.2
y     0.3    0.2    1

b1 =
b2 =
236
One more?

      x1     x2     y
x1    1      -0.5   0.3
x2    -0.5   1      0.2
y     0.3    0.2    1

b1 = 0.53
b2 = 0.47
237
β€’ Suppression happens when two opposing
forces are happening together
– And have opposite effects
β€’ Don’t throw away your IVs,
– Just because they are uncorrelated with the
outcome
β€’ Be careful in interpretation of regression
estimates
– Really need the correlations too, to interpret what
is going on
– Cannot compare between studies with different
predictors
– Think about what you want to know
β€’ Before throwing variables into the analysis
238
What to Control For?
β€’ What is the added value of a β€˜better’
college
– In terms of salary
– More academic people go to β€˜better’
colleges
– Control for:
β€’ Ability? Social class? Mother’s education?
Parent’s income? Course? Ethnic group? …
239
β€’ Decisions about control variables
– Guided from theory
β€’ Effect of gender
– Controlling for hair length and skirt
wearing?
240
241
β€’ Do dogs make kids healthier?
– What to control for? Parent’s weight?
β€’ Yes: Obese parents are more likely to have
obese kids, kids who are thinner, relative to the
parents are thinner.
β€’ No: Dog might make parent thinner. By
controlling for parental weight, you’re
controlling for the effect of dog
242
[Figure: two path diagrams for the effect of Dog on Kid's health. Bad control variables (affected by the dog): Parent Weight, Child Asthma. Good control variables: Rural/Urban?, House/apartment?, Income.]
Standardised Estimates > 1
• Correlations are bounded
−1.00 ≤ r ≤ +1.00
– We think of standardised regression estimates as being similarly bounded
• But they are not
– Can go > 1.00, < −1.00
– R cannot, because that is a proportion of variance
246
• Three measures of ability
– Mechanical ability, verbal ability 1, verbal ability 2
– Score on science exam

          Mech   Verbal1   Verbal2   Scores
Mech      1      0.1       0.1       0.6
Verbal1   0.1    1         0.9       0.6
Verbal2   0.1    0.9       1         0.3
Scores    0.6    0.6       0.3       1

– Before reading on, what are the parameter estimates?
247
Mech   Verbal1   Verbal2
0.56   1.71      -1.29

• Mechanical
– About where we expect
• Verbal 1
– Very high
• Verbal 2
– Very low
248
β€’ What is going on
– It’s a suppressor again
– a predictor which increases the size of the
parameters associated with other
predictors above the size of their
correlations
β€’ Verbal 1 and verbal 2 are correlated so
highly
– They need to cancel each other out
249
Variable Selection
β€’ What are the appropriate predictors to
use in a model?
– Depends what you are trying to do
β€’ Multiple regression has two separate
uses
– Prediction
– Explanation
250
β€’ Prediction
– What will happen in
the future?
– Emphasis on
practical application
– Variables selected
(more) empirically
– Value free
β€’ Explanation
– Why did something
happen?
– Emphasis on
understanding
phenomena
– Variables selected
theoretically
– Not value free
251
β€’ Visiting the doctor
– Precedes suicide attempts
– Predicts suicide
β€’ Does not explain suicide
β€’ More on causality later on …
β€’ Which are appropriate variables
– To collect data on?
– To include in analysis?
– Decision needs to be based on theoretical knowledge
of the behaviour of those variables
– Statistical analysis of those variables (later)
β€’ Unless you didn’t collect the data
– Common sense (not a useful thing to say)
252
Variable Entry Techniques
β€’ Entry-wise
– All variables entered simultaneously
β€’ Hierarchical
– Variables entered in a predetermined order
β€’ Stepwise
– Variables entered according to change in
R2
– Actually a family of techniques
253
β€’ Entrywise regression
– All variables entered simultaneously
– All treated equally
β€’ Hierarchical regression
– Entered in a theoretically determined order
– Change in R2 is assessed, and tested for
significance
– e.g. sex and age
β€’ Should not be treated equally with other variables
β€’ Sex and age MUST be first (unchangeable)
– Confused with hierarchical linear modelling (MLM)
254
R-Squared Change

F = [ (SSE0 − SSE1) / (df0 − df1) ] / ( SSE1 / df1 )

• SSE0, df0
• SSE and df for the first (smaller) model
• SSE1, df1
• SSE and df for the second (larger) model
255
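A sketch of the R² change test computed from two nested regressions; the variable names (price, sqm, lotsize, listprice) are assumed for illustration. (Stata's nestreg prefix, if available in your setup, reports the same block tests; test after the larger model gives the same answer for a one-variable block.)

* F for the change in R-squared, from the residual SS of two nested models
quietly regress price sqm lotsize             // smaller model (0)
scalar sse0 = e(rss)
scalar df0  = e(df_r)
quietly regress price sqm lotsize listprice   // larger model (1)
scalar sse1 = e(rss)
scalar df1  = e(df_r)
scalar Fchange = ((sse0 - sse1) / (df0 - df1)) / (sse1 / df1)
display "F change = " Fchange
display "p        = " Ftail(df0 - df1, df1, Fchange)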
β€’ Stepwise
– Variables entered empirically
– Variable which increases R2 the most goes
first
β€’ Then the next …
– Variables which have no effect can be
removed from the equation
β€’ Example
– House prices – what’s important?
– Size, lot size, list price,
256
• Stepwise Analysis
– Data determines the order
– Model 1: listing price, R² = 0.87
– Model 2: listing price + lot size, R² = 0.89

            b      p
List        0.81   <0.001
Lot size    0.02   0.02
257
• Hierarchical analysis
– Theory determines the order
– Model 1: Lot size + House size, R² = 0.499
– Model 2: + List price, R² = 0.905
– Change in R² = 0.41, p < 0.001

              b      p
House size    0.18   0.20
Lot size      0.15   0.03
List price    0.75   <0.001
258
β€’ Which is the best model?
– Entrywise – OK
– Stepwise – excluded age
β€’ Excluded size
– MOST IMPORTANT PREDICTOR
– Hierarchical
β€’ Listing price accounted for additional variance
– Whoever decides the price has information that we
don’t
β€’ Other problems with stepwise
– F and df are wrong (cheats with df)
– Unstable results
β€’ Small changes (sampling variance) – large
differences in models
259
– Uses a lot of paper
– Don’t use a stepwise procedure to pack
your suitcase
260
Is Stepwise Always Evil?
β€’ Yes
β€’ All right, no
β€’ Research goal is entirely predictive
(technological)
– Not explanatory (scientific)
– What happens, not why
β€’ N is large
– 40 people per predictor, Cohen, Cohen, Aiken, West
(2003)
β€’ Cross validation takes place
261
β€’ Alternatives to stepwise regression
– More recently developed
– Used for genetic studies
β€’ 1000s of predictors, one outcome, small
samples
– Least Angle Regression
β€’ LARS (least angle regression)
β€’ Lasso (Least absolute shrinkage and selection
operator)
262
Entry Methods in Stata
β€’ Entrywise
– What regress does
β€’ Hierarchical
– Two ways
– Use hireg
– Add on module
β€’ net search hireg
β€’ Then install
263
Hierarchical Regression
β€’ Use (on one line)
– hireg outcome (block1var1
block1var2) (block2var1
block2var2)
β€’ Hireg reports
– Parameter estimates for the two
regressions
– R2 for each model, change in R2
264
Model   R²      p       F(df)
1:      0.022   0.136   2.256 (1, 98)
2:      0.513   0.000   50.987 (2, 97)

R² change   F(df) change     p
0.490       97.497 (1, 97)   0.000

(The p in the first table is the p value for the R²; the p in the second is the p value for the change in R².)
265
Hierarchical Regression (Cont…)
β€’ I don’t like hireg, for two reasons
– It’s different to regression
– It only works for OLS regression, not
logistic, multinomial, Poisson, etc
β€’ Alternative 2:
– Use test
– The p-value associated with the change in
R2 for a variable
β€’ Equal to the p-value for that variable.
266
Hierarchical Regression (Cont…)
• Example (using cars)
– Parameters from final model:
– hireg price () (extro)

        car |   Coef.   Std. Err.      t    P>|t|   [95% Conf. Interval]
      extro |    .463       .1296   3.57    0.001      .2004      .72626

– R² change statistics

R² change   F(df) change     p
0.128       12.773 (1, 36)   0.001

– (What is relationship between t and F?)
• We know the p-value of the R² change
– When there is one predictor in the block
– What about when there's more than one?
267
Hierarchical Regression (Cont)
β€’ test isn’t exactly what we want
– But it is the same as what we want
β€’ Advantage of test
– You can always use it
β€’ (I can always remember how it works)
268
(For SPSS)
β€’ SPSS calls them β€˜blocks’
β€’ Enter some variables, click β€˜next block’
– Enter more variables
β€’ Click on β€˜Statistics’
– Click on R-squared change
269
Stepwise Regression
β€’ Add stepwise: prefix
β€’ With
– Pr() – probability value to be removed from
equation
– Pe() – probability value to be entered into
equation
β€’ stepwise, pe(0.05) pr(0.2):
reg price sqm lotsize
originallis
270
A quick note on R2
R2 is sometimes regarded as the β€˜fit’ of a
regression model
– Bad idea
β€’ If good fit is required – maximise R2
– Leads to entering variables which do not
make theoretical sense
271
Propensity Scores
β€’ Another method of controlling for
variables
β€’ Ensure that predictors are uncorrelated
with one predictor
– Don’t need to control for them
272
x’s Uncorrelated?
β€’ Two cases when x’s are uncorrelated
β€’ Experimental design
– Predictors are uncorrelated
– We randomly assigned people to conditions
to ensure that was the case
β€’ Sample weights
– We can deliberately sample
β€’ Ensure that they are uncorrelated
273
• 20 women with college degree
• 20 women without college degree
• 20 men with college degree
• 20 men without college degree
– Or use post hoc sample weights
• Propensity weighting
– Weight to ensure that variables are uncorrelated
– Usually done to avoid having to control
– E.g. ethnic differences in PTSD symptoms
– Can incorporate many more control variables
• 100+
274
Propensity Scores
β€’ Race profiling of police stops
– Same time, place, area, etc
– www.youtube.com/watch?v=Oot0BOaQTZI
275
Critique of Multiple Regression
β€’ Goertzel (2002)
– β€œMyths of murder and multiple regression”
– Skeptical Inquirer (Paper B1)
β€’ Econometrics and regression are β€˜junk
science’
– Multiple regression models (in US)
– Used to guide social policy
276
More Guns, Less Crime
– (controlling for other factors)
β€’ Lott and Mustard: A 1% increase in gun
ownership
– 3.3% decrease in murder rates
β€’ But:
– More guns in rural Southern US
– More crime in urban North (crack cocaine
epidemic at time of data)
277
Executions Cut Crime
β€’ No difference between crimes in states
in US with or without death penalty
β€’ Ehrlich (1975) controlled all variables
that affect crime rates
– Death penalty had effect in reducing crime
rate
β€’ No statistical way to decide who’s right
278
Legalised Abortion
β€’ Donohue and Levitt (1999)
– Legalised abortion in 1970’s cut crime in 1990’s
β€’ Lott and Whitley (2001)
– β€œLegalising abortion decreased murder rates by …
0.5 to 7 per cent.”
β€’ It’s impossible to model these data
– Controlling for other historical events
– Crack cocaine (again)
279
β€’ Crime is still dropping in the US
– Despite the recession
β€’ Levitt says it’s mysterious, because the
abortion effect should be over
β€’ Some suggest Xboxes, Playstations, etc
β€’ Netflix, DVRs
– (Violent movies reduce crime).
280
Another Critique
β€’ Berk (2003)
– Regression analysis: a constructive critique (Sage)
β€’ Three cheers for regression
– As a descriptive technique
β€’ Two cheers for regression
– As an inferential technique
β€’ One cheer for regression
– As a causal analysis
281
Is Regression Useless?
β€’ Do regression carefully
– Don’t go beyond data which you have a
strong theoretical understanding of
β€’ Validate models
– Where possible, validate predictive power
of models in other areas, times, groups
β€’ Particularly important with stepwise
282
Lesson 6: Categorical
Predictors
283
Introduction
284
Introduction
β€’ So far, just looked at continuous
predictors
β€’ Also possible to use categorical
(nominal, qualitative) predictors
– e.g. Sex; Job; Religion; Region; Type (of
anything)
β€’ Usually analysed with t-test/ANOVA
285
Historical Note
β€’ But these (t-test/ANOVA) are special
cases of regression analysis
– Aspects of General Linear Models (GLMs)
β€’ So why treat them differently?
– Fisher’s fault
– Computers’ fault
β€’ Regression, as we have seen, is
computationally difficult
– Matrix inversion and multiplication
– Can’t do it, without a computer
286
β€’ In the special cases where:
β€’ You have one categorical predictor
β€’ Your IVs are uncorrelated
– It is much easier to do it by partitioning of
sums of squares
β€’ These cases
– Very rare in β€˜applied’ research
– Very common in β€˜experimental’ research
β€’ Fisher worked at Rothamsted agricultural
research station
β€’ Never have problems manipulating wheat, pigs,
cabbages, etc
287
β€’ In psychology
– Led to a split between β€˜experimental’
psychologists and β€˜correlational’
psychologists
– Experimental psychologists (until recently)
would not think in terms of continuous
variables
β€’ Still (too) common to dichotomise a
variable
– Too difficult to analyse it properly
– Equivalent to discarding 1/3 of your data
288
The Approach
289
The Approach
β€’ Recode the nominal variable
– Into one, or more, variables to represent that
variable
β€’ Names are slightly confusing
– Some texts talk of β€˜dummy coding’ to refer to all
of these techniques
– Some (most) refer to β€˜dummy coding’ to refer to
one of them
– Most have more than one name
290
β€’ If a variable has g possible categories it
is represented by g-1 variables
β€’ Simplest case:
– Smokes: Yes or No
– Variable 1 represents β€˜Yes’
– Variable 2 is redundant
β€’ If it isn’t yes, it’s no
291
The Techniques
292
β€’ We will examine two coding schemes
– Dummy coding
β€’ For two groups
β€’ For >2 groups
– Effect coding
β€’ For >2 groups
β€’ Look at analysis of change
– Equivalent to ANCOVA
– Pretest-posttest designs
293
Dummy Coding – 2 Groups
β€’ Sometimes called β€˜simple coding’
β€’ A categorical variable with two groups
β€’ One group chosen as a reference group
– The other group is represented in a variable
β€’ e.g. 2 groups: Experimental (Group 1) and
Control (Group 0)
– Control is the reference group
– Dummy variable represents experimental group
β€’ Call this variable β€˜group1’
294
• For variable 'group1'
– 1 = 'Yes', 0 = 'No'

Original Category   New Variable
Exp                 1
Con                 0
295
• Some data
• Group is x, score is y

               Control Group   Experimental Group
Experiment 1   10              10
Experiment 2   10              20
Experiment 3   10              30
296
β€’ Control Group = 0
– Intercept = Score on Y when x = 0
– Intercept = mean of control group
β€’ Experimental Group = 1
– b = change in Y when x increases 1 unit
– b = difference between experimental
group and control group
297
[Figure: group means (0–35 scale) for the Control and Experimental groups in Experiments 1–3; the gradient of the slope represents the difference between the means.]
298
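A minimal sketch with made-up scores (three per group) showing that the regression constant is the control-group mean, the slope is the difference between the group means, and the t and p match an ordinary t-test:

* Dummy-coded group as a predictor (hypothetical data)
clear
input score group1
 9 0
10 0
11 0
28 1
30 1
32 1
end
regress score group1        // _cons = control mean, b = difference in means
ttest score, by(group1)     // same difference, same t and p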
Dummy Coding – 3+ Groups
• With three groups the approach is similar
• g = 3, therefore g − 1 = 2 variables needed
• 3 Groups
– Control
– Experimental Group 1
– Experimental Group 2
299
Original Category   Gp1   Gp2
Con                 0     0
Gp1                 1     0
Gp2                 0     1

• Recoded into two variables
– Note – do not need a 3rd variable
• If we are not in group 1 or group 2 we MUST be in the control group
• A 3rd variable would add no information
• (What would happen to the determinant?)
300
• F and associated p
– Tests H0 that the three group means are equal: ȳ(g1) = ȳ(g2) = ȳ(g3)
β€’ b1 and b2 and associated p-values
– Test difference between each experimental group
and the reference group
β€’ To test difference between experimental
groups
– Need to rerun analysis (or just do ANOVA with
post-hoc tests)
301
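• A minimal Stata sketch for the three-group case, assuming an outcome y and a variable grp coded 1 = control, 2 = experimental 1, 3 = experimental 2 (all names hypothetical):
  gen gp1 = (grp == 2)
  gen gp2 = (grp == 3)
  reg y gp1 gp2
  * the model F tests whether all three group means are equal
  test gp1 = gp2     // compares the two experimental groups without rerunning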
β€’ One more complication
– Have now run multiple comparisons
– Increases α – i.e. the probability of a type I error
β€’ Need to correct for this
– Bonferroni correction
– Multiply given p-values by two/three
(depending how many comparisons were
made)
302
Effect Coding
β€’ Usually used for 3+ groups
β€’ Compares each group (except the reference
group) to the mean of all groups
– Dummy coding compares each group to the
reference group.
β€’ Example with 5 groups
– 1 group selected as reference group
β€’ Group 5
303
β€’ Each group (except reference) has a
variable
– 1 if the individual is in that group
– 0 if not
– -1 if in reference group
group   group_1   group_2   group_3   group_4
  1        1         0         0         0
  2        0         1         0         0
  3        0         0         1         0
  4        0         0         0         1
  5       -1        -1        -1        -1
304
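• A minimal Stata sketch of building the effect codes above, assuming a variable grp coded 1–5 with group 5 as the reference (names hypothetical):
  forvalues i = 1/4 {
      gen group_`i' = (grp == `i')
      replace group_`i' = -1 if grp == 5
  }
  reg y group_1 group_2 group_3 group_4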
Examples
β€’ Dummy coding and Effect Coding
β€’ Group 1 chosen as reference group
each time
• Data
  Group    Mean     SD
  1        52.40    4.60
  2        56.30    5.70
  3        60.10    5.00
  Total    56.27    5.88
305
• Dummy
  Group   dummy2   dummy3
  1         0        0
  2         1        0
  3         0        1

• Effect
  Group   effect2   effect3
  1         -1        -1
  2          1         0
  3          0         1
306
Dummy
  R = 0.543, F = 5.7, df = 2, 27, p = 0.009
  b0 = 52.4               b0 = ȳ(group 1)
  b1 = 3.9, p = 0.100     b1 = ȳ(group 2) − ȳ(group 1)
  b2 = 7.7, p = 0.002     b2 = ȳ(group 3) − ȳ(group 1)

Effect
  R = 0.543, F = 5.7, df = 2, 27, p = 0.009
  b0 = 56.27              b0 = grand mean (G)
  b1 = 0.03, p = 0.980    b1 = ȳ(group 2) − G
  b2 = 3.8, p = 0.007     b2 = ȳ(group 3) − G
307
In Stata
β€’ Use xi: prefix for dummy coding
β€’ Use xi3: module for more codings
β€’ But
– I don’t like it, I do it by hand
– I don’t understand what it’s doing
– It makes very long variables
β€’ And then I can’t use test
– BUT: If doing stepwise, you need to keep the variables
together
• Example:
  xi: reg outcome contpred i.catpred
• This has changed in Stata 11: the xi: prefix is no longer needed – just put i. in front of categorical predictors
308
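• A minimal sketch of the Stata 11+ factor-variable equivalent (same hypothetical names as above):
  reg outcome contpred i.catpred
  testparm i.catpred     // joint test of all the category terms
  margins catpred        // predicted means for each category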
xi: reg salary i.job_description

------------------------------------------------------------
      salary |      Coef.   Std. Err.       t     P>|t|
-------------+----------------------------------------------
_Ijob_desc~2 |    3100.34     2023.76     1.53    0.126
_Ijob_desc~3 |    36139.2    1228.352    29.42    0.000
       _cons |    27838.5    532.4865    52.28    0.000
------------------------------------------------------------
309
Exercise 6.1
β€’ 5 golf balls
– Which is best?
310
In SPSS
β€’ SPSS provides two equivalent procedures for
regression
– Regression
– GLM
• GLM will:
– Automatically code categorical variables
– Automatically calculate interaction terms
– Allow you to not understand
β€’ GLM won’t:
– Give standardised effects
– Give hierarchical R2 p-values
311
ANCOVA and Regression
312
β€’ Test
– (Which is a trick; but it’s designed to make
you think about it)
β€’ Use bank data (Ex 5.3)
– Compare the pay rise (difference between
salbegin and salary)
– For ethnic minority and non-minority staff
β€’ What do you find?
313
ANCOVA and Regression
β€’ Dummy coding approach has one special use
– In ANCOVA, for the analysis of change
β€’ Pre-test post-test experimental design
– Control group and (one or more) experimental
groups
– Tempting to use difference score + t-test / mixed
design ANOVA
– Inappropriate
314
β€’ Salivary cortisol levels
– Used as a measure of stress
– Not absolute level, but change in level over
day may be interesting
β€’ Test at: 9.00am, 9.00pm
β€’ Two groups
– High stress group (cancer biopsy)
β€’ Group 1
– Low stress group (no biopsy)
β€’ Group 0
315
                 AM      PM     Diff
High Stress     20.1     6.8    13.3
Low Stress      22.3    11.8    10.5
β€’ Correlation of AM and PM = 0.493
(p=0.008)
β€’ Has there been a significant difference
in the rate of change of salivary
cortisol?
– 3 different approaches
316
β€’ Approach 1 – find the differences, do a
t-test
– t = 1.31, df=26, p=0.203
β€’ Approach 2 – mixed ANOVA, look for
interaction effect
– F = 1.71, df = 1, 26, p = 0.203
– F = t2
β€’ Approach 3 – regression (ANCOVA)
based approach
317
– IVs: AM and group
– outcome: PM
– b1 (group) = 3.59, standardised b1=0.432,
p = 0.01
β€’ Why is the regression approach better?
– The other two approaches took the
difference
– Assumes that r = 1.00
– Any difference from r = 1.00 and you add
error variance
β€’ Subtracting error is the same as adding error
318
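• A minimal Stata sketch of approaches 1 and 3, assuming variables am, pm and a 0/1 variable group (1 = high stress) – all names hypothetical:
  gen diff = am - pm
  ttest diff, by(group)      // approach 1: difference score + t-test
  reg pm am i.group          // approach 3: regression (ANCOVA)
  * the coefficient on 1.group is the adjusted group difference in PM cortisol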
β€’ Using regression
– Ensures that all the variance that is
subtracted is true
– Reduces the error variance
β€’ Two effects
– Adjusts the means
β€’ Compensates for differences between groups
– Removes error variance
β€’ Data is am-pm cortisol
319
More on Change
β€’ If difference score is correlated with
either pre-test or post-test
– Subtraction fails to remove the difference
between the scores
– If two scores are uncorrelated
β€’ Difference will be correlated with both
β€’ Failure to control
– Equal SDs, r = 0
β€’ Correlation of change and pre-score =0.707
320
Even More on Change
β€’ A topic of surprising complexity
– What I said about difference scores isn’t
always true
β€’ Lord’s paradox – it depends on the precise
question you want to answer
– Collins and Horn (1993). Best methods for
the analysis of change
– Collins and Sayer (2001). New methods for
the analysis of change
– More later
321
Lesson 7: Assumptions in
Regression Analysis
322
The Assumptions
1. The distribution of residuals is normal (at
each value of the outcome).
2. The variance of the residuals for every set
of values for the predictor is equal.
β€’ violation is called heteroscedasticity.
3. The error term is additive
β€’
no interactions.
4. At every value of the outcome the expected
(mean) value of the residuals is zero
β€’
No non-linear relationships
323
5. The expected correlation between residuals,
for any two cases, is 0.
β€’
The independence assumption (lack of
autocorrelation)
6. All predictors are uncorrelated with the
error term.
7. No predictors are a perfect linear function
of other predictors (no perfect
multicollinearity)
8. The mean of the error term is zero.
324
What are we going to do …
β€’ Deal with some of these assumptions in
some detail
β€’ Deal with others in passing only
– look at them again later on
325
Assumption 1: The
Distribution of Residuals is
Normal at Every Value of the
outcome
326
Look at Normal Distributions
β€’ A normal distribution
– symmetrical, bell-shaped (so they say)
327
What can go wrong?
β€’ Skew
– non-symmetricality
– one tail longer than the other
β€’ Kurtosis
– too flat or too peaked
– kurtosed
β€’ Outliers
– Individual cases which are far from the
distribution
328
Effects on the Mean
β€’ Skew
– biases the mean, in direction of skew
β€’ Kurtosis
– mean not biased
– standard deviation is
– and hence standard errors, and
significance tests
329
Examining Univariate
Distributions
β€’ Graphs
– Histograms
– Boxplots
– P-P plots
β€’ Calculation based methods
330
Histograms
• A and B
[Figure: histograms of variables A and B]
331
• C and D
[Figure: histograms of variables C and D]
332
• E & F
[Figure: histograms of variables E and F]
333
Histograms can be tricky ….
[Figure: a set of small example histograms]
334
Boxplots
335
P-P Plots
• A & B
[Figure: P-P plots for variables A and B]
336
• C & D
[Figure: P-P plots for variables C and D]
337
• E & F
[Figure: P-P plots for variables E and F]
338
Calculation Based
β€’ Skew and Kurtosis statistics
β€’ Outlier detection statistics
339
Skew and Kurtosis Statistics
β€’ Normal distribution
– skew = 0
– kurtosis = 0
β€’ Two methods for calculation
– Fisher’s and Pearson’s
– Very similar answers
β€’ Associated standard error
– can be used for significance (t-test) of departure
from normality
– not actually very useful
• with N above about 400 the test is nearly always significant, however trivial the departure from normality
340
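• A minimal Stata sketch, assuming a variable x (hypothetical name):
  summarize x, detail
  tabstat x, statistics(skewness kurtosis)
  * note: Stata reports kurtosis so that a normal distribution scores 3, not 0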
       Skewness    Kurtosis
A       -0.12       -0.084
B        0.271       0.265
C        0.454       1.885
D        0.117      -1.081
E        2.106       5.75
F        0.171      -0.21
341
Outlier Detection
β€’ Calculate distance from mean
– z-score (number of standard deviations)
– deleted z-score
β€’ that case biased the mean, so remove it
– Look up expected distance from mean
β€’ 1% 3+ SDs
342
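• A minimal Stata sketch of the z-score approach, assuming a variable x (hypothetical name):
  egen zx = std(x)                       // (x − mean) / sd
  list if abs(zx) > 3 & !missing(zx)     // cases more than 3 SDs from the mean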
Non-Normality in Regression
343
Effects on OLS Estimates
β€’ The mean is an OLS estimate
β€’ The regression line is an OLS estimate
β€’ Lack of normality
– biases the position of the regression slope
– makes the standard errors wrong
β€’ probability values attached to statistical
significance wrong
344
Checks on Normality
β€’ Check residuals are normally distributed
– Draw histogram residuals
β€’ Use regression diagnostics
– Lots of them
– Most aren’t very interesting
345
Regression Diagnostics
β€’ Residuals
– Standardised, studentised-deleted
– look for cases > |3| (?)
β€’ Influence statistics
– Look for the effect a case has
– If we remove that case, do we get a different
answer?
– DFBeta, Standardised DFBeta
β€’ changes in b
346
– DfFit, Standardised DfFit
β€’ change in predicted value
β€’ Distances
– measures of β€˜distance’ from the centroid
– some include IV, some don’t
347
More on Residuals
β€’ Residuals are trickier than you might
have imagined
β€’ Raw residuals
– OK
β€’ Standardised residuals
– Residuals divided by SD
se = √( Σe² / (n − k − 1) )
348
Standardised / Studentised
β€’ Now we can calculate the standardised
residuals
– SPSS calls them studentised residuals
– Also called internally studentised residuals
e′i = ei / ( se √(1 − hi) )
349
Deleted Studentised Residuals
β€’ Studentised residuals do not have a
known distribution
– Cannot use them for inference
β€’ Deleted studentised residuals
– Externally studentised residuals
– Studentized (jackknifed) residuals
β€’ Distributed as t
β€’ With df = N – k – 1
350
Testing Significance
β€’ We can calculate the probability of a
residual
– Is it sampled from the same population
β€’ BUT
– Massive type I error rate
– Bonferroni correct it
β€’ Multiply p value by N
351
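• A minimal Stata sketch, assuming an outcome y and predictors x1, x2 (hypothetical names):
  reg y x1 x2
  predict rstu, rstudent                       // deleted (externally) studentised residuals
  gen praw = 2 * ttail(e(df_r), abs(rstu))     // two-tailed p for each residual
  gen pbonf = min(praw * _N, 1)                // crude Bonferroni correction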
Bivariate Normality
β€’ We didn’t just say β€œresiduals normally
distributed”
β€’ We said β€œat every value of the
outcomes”
β€’ Two variables can be normally
distributed – univariate,
– but not bivariate
352
β€’ Couple’s IQs
– male and female
[Figure: histograms of FEMALE and MALE IQ scores]
– Seem reasonably normal
353
β€’ But wait!!
[Figure: scatterplot of MALE against FEMALE IQ scores]
354
β€’ When we look at bivariate normality
– not normal – there is an outlier
β€’ So plot X against Y
β€’ OK for bivariate
– but – may be a multivariate outlier
– Need to draw graph in 3+ dimensions
– can’t draw a graph in 3 dimensions
β€’ But we can look at the residuals instead
…
355
β€’ IQ histogram of residuals
[Figure: histogram of the residuals]
356
Multivariate Outliers …
β€’ Will be explored later in the exercises
β€’ So we move on …
357
What to do about Non-Normality
β€’ Skew and Kurtosis
– Skew – much easier to deal with
– Kurtosis – less serious anyway
β€’ Transform data
– removes skew
– positive skew – log transform
– negative skew - square
358
Transformation
β€’ May need to transform IV and/or outcome
– More often outcome
β€’ time, income, symptoms (e.g. depression) all positively
skewed
– can cause non-linear effects (more later) if only
one is transformed
– alters interpretation of unstandardised parameter
– May alter meaning of variable
• Some people say that this is such a big problem that you should:
– Never transform
– May add / remove non-linear and moderator
effects
359
β€’ Change measures
– increase sensitivity at ranges
β€’ avoiding floor and ceiling effects
β€’ Outliers
– Can be tricky
– Why did the outlier occur?
β€’ Error? Delete them.
β€’ Weird person? Probably delete them
β€’ Normal person? Tricky.
360
– You are trying to model a process
β€’ is the data point β€˜outside’ the process
β€’ e.g. lottery winners, when looking at salary
β€’ yawn, when looking at reaction time
– Which is better?
β€’ A good model, which explains 99% of your
data? (because we threw outliers out)
β€’ A poor model, which explains all of it (because
we keep outliers in)
β€’ I prefer a good model
361
More on House Prices
β€’ Zillow.com tracks and predicts house
prices
– In the USA
β€’ Sometimes detects outliers
– We don’t trust this selling price
– We haven’t used it
362
Example in Stata
β€’ reg salary educ
β€’ predict res, res
β€’ hist res
β€’ gen logsalary= log(salary)
β€’ reg logsalary educ
β€’ predict logres, res
β€’ hist logres
363
[Figure: histogram of the residuals from the raw-salary model (roughly −20,000 to 80,000) and of the residuals from the log-salary model (roughly −1 to 1)]
But …
β€’ Parameter estimates change
β€’ Interpretation of parameter estimate is
different
β€’ Exercise 7.0, 7.1
366
Bootstrapping
β€’ Bootstrapping is very, very cool
β€’ And very, very clever
β€’ But very, very simple
367
Bootstrapping
• When we estimate a test statistic (F or r or t or χ²)
β€’ We rely on knowing the sampling
distribution
β€’ Which we know
– If the distributional assumptions are
satisfied
368
Estimate the Distribution
β€’ Bootstrapping lets you:
– Skip the bit about distribution
– Estimate the sampling distribution from the
data
β€’ This shouldn’t be allowed
– Hence bootstrapping
– But it is
369
How to Bootstrap
β€’ We resample, with replacement
β€’ Take our sample
– Sample 1 individual
β€’ Put that individual back, so that they can be
sampled again
– Sample another individual
β€’ Keep going until we’ve sampled as many
people as were in the sample
β€’ Analyze the data
β€’ Repeat the process B times
– Where B is a big number
370
Example
Original    B1    B2    B3
    1        1     1     2
    2        1     2     2
    3        3     3     3
    4        3     4     2
    5        3     4     4
    6        3     4     4
    7        7     8     6
    8        7     8     7
    9        9     9     9
   10        9    10     9
371
β€’ Analyze each dataset
– Sampling distribution of statistic
β€’ Gives sampling distribution
β€’ 2 approaches to CI or P
β€’ Semi-parametric
– Calculate standard error of statistic
– Call that the standard deviation
– Does not make assumption about
distribution of data
β€’ Makes assumption about sampling distribution
372
β€’ Non-parametric
– Stata calls this percentile
β€’ Count.
– If you have 1000 samples
– 25th is lower CI
– 975th is upper CI
– P-value is proportion that cross zero
β€’ Non-parametric needs more samples
373
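• A minimal Stata sketch of the non-parametric (percentile) approach, using the salary and educ variables from the earlier example:
  bootstrap _b, reps(1000): reg salary educ
  estat bootstrap, percentile     // percentile confidence intervals from the 1,000 samples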
Bootstrapping in Stata
β€’ Very easy:
– Use bootstrap: (or bs: or bstrap: ) prefix or
– (Better) use vce(bootstrap) option
β€’ By default does 50 samples
– Not enough
– Use reps()
– At least 1000
374
Example
reg salary salbegin educ, vce(bootstrap, reps(50))

             |  Observed    Bootstrap
             |     Coef.    Std. Err.       z
  -----------+--------------------------------
    salbegin |  1.672631     .0863302    19.37

• Again:
    salbegin |  1.672631     .0737315    22.69
375
More Reps
β€’ 1,000 reps
– Z = 17.31
β€’ Again
– Z = 17.59
β€’ 10,000 reps
– 17.23
– 17.02
376
β€’ Exercise 7.2, 7.3
377
Assumption 2: The variance of
the residuals for every set of
values for the predictor is
equal.
378
Heteroscedasticity
• This assumption is about heteroscedasticity of the residuals
– Hetero=different
– Scedastic = scattered
β€’ We don’t want heteroscedasticity
– we want our data to be homoscedastic
β€’ Draw a scatterplot to investigate
379
[Figure: scatterplot of MALE against FEMALE IQ scores again]
380
β€’ Only works with one IV
– need every combination of IVs
β€’ Easy to get – use predicted values
– use residuals there
β€’ Plot predicted values against residuals
β€’ A bit like turning the scatterplot to
make the line of best fit flat
381
Good – no heteroscedasticity
Predicted Value
382
Bad – heteroscedasticity
Predicted Value
383
Testing Heteroscedasticity
• White's test
1. Do the regression, save the residuals
2. Square the residuals
3. Square the IVs
4. Calculate the interactions of the IVs
   – e.g. x1·x2, x1·x3, x2·x3
384
5. Run a regression using
   – the squared residuals as the outcome
   – the IVs, squared IVs, and interactions as the IVs
6. Test statistic = N × R²
   – Distributed as χ²
   – df = k (for the second regression)
• Use education and salbegin to predict salary (employee data.sav)
   – R² = 0.113, N = 474, χ² = 53.5, df = 5, p < 0.0001
• Automatic in Stata
   – estat imtest, white
385
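• A minimal sketch of doing the White test by hand in Stata, using the same variables (salary, educ, salbegin):
  quietly reg salary educ salbegin
  predict e, resid
  gen e2 = e^2
  gen educ2 = educ^2
  gen salbegin2 = salbegin^2
  gen educXsal = educ * salbegin
  quietly reg e2 educ salbegin educ2 salbegin2 educXsal
  display "chi2 = " e(N)*e(r2) ", p = " chi2tail(5, e(N)*e(r2))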
Plot of Predicted and Residual
[Figure: residuals plotted against the linear prediction]
386
White’s Test as Test of
Interest
β€’ Possible to have a theory that predicts
heteroscedasticity
β€’ Lupien, et al, 2006
– Heteroscedasticity in relationship of
hippocampal volume and age
387
Magnitude of
Heteroscedasticity
β€’ Chop data into 5 β€œslices”
– Calculate variance of each slice
– Check ratio of smallest to largest
– Less than 5
β€’ OK
388
gen slice = 1
replace slice = 2 if pred > 30000
replace slice = 3 if pred > 60000
replace slice = 4 if pred > 90000
replace slice = 5 if pred > 120000
bysort slice: su pred

Slice 1: 3954    Slice 5: 17116
(Doesn't look too bad, thanks to skew in predictors)
389
Dealing with Heteroscedasticity
• Use Huber-White (robust) estimates
– Also called sandwich estimates
– Also called empirical estimates
• Use survey techniques
– Relatively straightforward in SAS and Stata, fiddly in SPSS
– Google: SPSS Huber-White
390
Why’s it a Sandwich?
• Conventional OLS standard errors come from:
     Var(b) = σ² (X′X)⁻¹
• Sandwich estimator:
     Var(b) = (X′X)⁻¹ (X′ΩX) (X′X)⁻¹
– the two (X′X)⁻¹ terms are the 'bread'; X′ΩX, built from the squared residuals, is the filling
391
Example
β€’ reg salary educ
– Standard errors:
– 204, 2821
β€’ reg salary educ , robust
– Standard errors:
– 267
– 3347
β€’ SEs usually go up, can go down
392
Heteroscedasticity –
Implications and Meanings
Implications
β€’ What happens as a result of
heteroscedasticity?
– Parameter estimates are correct
β€’ not biased
– Standard errors (hence p-values) are
incorrect
393
However …
β€’ If there is no skew in predicted scores
– P-values a tiny bit wrong
β€’ If skewed,
– P-values can be very wrong
β€’ Exercise 7.4
394
Robust SE Haiku
T-stat looks too good.
Use robust standard errors
significance gone
395
Meaning
β€’ What is heteroscedasticity trying to tell
us?
– Our model is wrong – it is misspecified
– Something important is happening that we
have not accounted for
β€’ e.g. amount of money given to charity
(given)
– depends on:
β€’ earnings
β€’ degree of importance person assigns to the
charity (import)
396
β€’ Do the regression analysis
– R² = 0.60, p < 0.001
β€’ seems quite good
– b0 = 0.24, p=0.97
– b1 = 0.71, p < 0.001
– b2 = 0.23, p = 0.031
β€’ White’s test
– χ² = 18.6, df = 5, p = 0.002
β€’ The plot of predicted values against
residuals …
397
[Figure: residuals plotted against the linear prediction]
β€’ Plot shows heteroscedastic relationship
398
β€’ Which means …
– the effects of the variables are not additive
– If you think that what a charity does is
important
β€’ you might give more money
β€’ how much more depends on how much money
you have
399
[Figure: 'given' plotted against 'import', with fitted values]
400
β€’ One more thing about
heteroscedasticity
– it is the equivalent of homogeneity of
variance in ANOVA/t-tests
401
β€’ Exercise 7.4, 7.5, 7.6
402
Assumption 3: The Error Term
is Additive
403
Additivity
β€’ What heteroscedasticity shows you
– effects of variables need to be additive (assume
no interaction between the variables)
β€’ Heteroscedasticity doesn’t always show it to
you
– can test for it, but hard work
– (same as homogeneity of covariance assumption
in ANCOVA)
β€’ Have to know it from your theory
β€’ A specification error
404
Additivity and Theory
β€’ Two IVs
– Alcohol has sedative effect
β€’ A bit makes you a bit tired
β€’ A lot makes you very tired
– Some painkillers have sedative effect
β€’ A bit makes you a bit tired
β€’ A lot makes you very tired
– A bit of alcohol and a bit of painkiller
doesn’t make you very tired
– Effects multiply together, don’t add
together
405
β€’ If you don’t test for it
– It’s very hard to know that it will happen
β€’ So many possible non-additive effects
– Cannot test for all of them
– Can test for obvious
β€’ In medicine
– Choose to test for salient non-additive
effects
– e.g. sex, race
β€’ More on this, when we look at
moderators
406
β€’ Exercise 7.6
β€’ Exercise 7.7
407
Assumption 4: At every value of
the outcome the expected
(mean) value of the residuals
is zero
408
Linearity
β€’ Relationships between variables should be
linear
– best represented by a straight line
β€’ Not a very common problem in social
sciences
– measures are not sufficiently accurate (much measurement
error) to make a difference
β€’ R2 too low
β€’ unlike, say, physics
409
Fuel
• Relationship between speed of travel and fuel used
[Figure: fuel used plotted against speed]
410
β€’ R2 = 0.938
– looks pretty good
– know speed, make a good prediction of
fuel
β€’ BUT
– look at the chart
– if we know speed we can make a perfect
prediction of fuel used
– R2 should be 1.00
411
Detecting Non-Linearity
β€’ Residual plot
– just like heteroscedasticity
β€’ Using this example
– very, very obvious
– usually pretty obvious
412
Residual plot
413
Linearity: A Case of Additivity
β€’ Linearity = additivity along the range of the
IV
β€’ Jeremy rides his bicycle harder
– Increase in speed depends on current speed
– Not additive, multiplicative
– MacCallum and Mar (1995). Distinguishing
between moderator and quadratic effects in
multiple regression. Psychological Bulletin.
414
Assumption 5: The expected
correlation between residuals, for
any two cases, is 0.
The independence assumption (lack of
autocorrelation)
415
Independence Assumption
β€’ Also: lack of autocorrelation
β€’ Tricky one
– often ignored
– exists for almost all tests
β€’ All cases should be independent of one
another
– knowing the value of one case should not tell you
anything about the value of other cases
416
How is it Detected?
β€’ Can be difficult
– need some clever statistics (multilevel
models)
β€’ Better off avoiding situations where it
arises
– Or handling it when it does arise
β€’ Residual Plots
417
Residual Plots
β€’ Were data collected in time order?
– If so plot ID number against the residuals
– Look for any pattern
β€’ Test for linear relationship
β€’ Non-linear relationship
β€’ Heteroscedasticity
418
[Figure: residuals plotted against participant number]
419
How does it arise?
Two main ways
β€’ time-series analyses
– When cases are time periods
β€’ weather on Tuesday and weather on Wednesday
correlated
β€’ inflation 1972, inflation 1973 are correlated
β€’ clusters of cases
– patients treated by three doctors
– children from different classes
– people assessed in groups
420
Why does it matter?
β€’ Standard errors can be wrong
– therefore significance tests can be wrong
β€’ Parameter estimates can be wrong
– really, really wrong
– from positive to negative
β€’ An example
– students do an exam (on statistics)
– choose one of three questions
β€’ IV: time
β€’ outcome: grade
421
• Result, with line of best fit
[Figure: grade plotted against time in the exam, with an upward-sloping line of best fit]
422
β€’ Result shows that
– people who spent longer in the exam,
achieve better grades
β€’ BUT …
– we haven’t considered which question
people answered
– we might have violated the independence
assumption
β€’ outcome will be autocorrelated
β€’ Look again
– with questions marked
423
β€’ Now somewhat different
[Figure: the same plot with the three questions marked – within each question, longer time goes with lower grades]
424
β€’ Now, people that spent longer got
lower grades
– questions differed in difficulty
– do a hard one, get better grade
– if you can do it, you can do it quickly
425
Dealing with NonIndependence
β€’ For time series data
– Time series analysis (another course)
– Multilevel models (hard, some another
course)
β€’ For clustered data
– Robust standard errors
– Generalized estimating equations
– Multilevel models
426
Cluster Robust Standard
Errors
β€’ Predictor: School size
β€’ Outcome: Grades
β€’ Sample:
– 20 schools
– 20 children per school
β€’ What is the N?
427
Robust Standard Errors
β€’ Sample is:
– 400 children – is it 400?
– Not really
β€’ Each child adds information
β€’ First child in a school adds lots of information
about that school
– 100th child in a school adds less information
– How much less depends on how similar the children
in the school are
– 20 schools
β€’ It’s more than 20
428
Robust SE in Stata
β€’ Very easy
• reg outcome predictor, robust cluster(clusterid)
β€’ BUT
– Only to be used where clustering is a
nuisance only
β€’ Only adjusts standard errors, not parameter
estimates
β€’ Only to be used where parameter estimates
shouldn’t be affected by clustering
429
Example of Robust SE
β€’ Effects of incentives for attendance at
adult literacy class
– Some students rewarded for attendance
– Others not rewarded
β€’ 152 classes randomly assigned to each
condition
– Scores measured at mid term and final
430
Example of Robust SE
β€’ Naïve
– reg postscore tx midscore
– Est: -.6798066 SE: .7218797
β€’ Clustered
– reg postscore tx midscore,
robust cluster(classid)
– Est: -.6798066 SE .9329929
431
Problem with Robust
Estimates
β€’ Only corrects standard error
– Does not correct estimate
β€’ Other predictors must be uncorrelated
with predictors of group membership
– Or estimates wrong
β€’ Two alternatives:
– Generalized estimating equations (gee)
– Multilevel models
432
Independence +
Heteroscedasticity
β€’ Assumption is that residuals are:
– Independently and identically distributed
β€’ i.i.d.
β€’ Same procedure used for both problems
– Really, same problem
433
β€’ Exercise 7.9, exercise 7.10
434
Assumption 6: All predictor
variables are uncorrelated
with the error term.
435
Uncorrelated with the Error
Term
β€’ A curious assumption
– by definition, the residuals are uncorrelated
with the predictors (try it and see, if you
like)
β€’ There are no other predictors that are
important
– That correlate with the error
– i.e. Have an effect
436
β€’ Problem in economics
– Demand increases supply
– Supply increases wages
– Higher wages increase demand
β€’ OLS estimates will be (badly) biased in
this case
– need a different estimation procedure
– two-stage least squares
β€’ simultaneous equation modelling
– Instrumental variables
437
Another Haiku
Supply and demand:
without a good instrument,
not identified.
438
Assumption 7: No predictors are
a perfect linear function of
other predictors
no perfect multicollinearity
439
No Perfect Multicollinearity
β€’ IVs must not be linear functions of one
another
– matrix of correlations of IVs is not positive definite
– cannot be inverted
– analysis cannot proceed
β€’ Have seen this with
– age, age start, time working (can’t have all three
in the model)
– also occurs with subscale and total in model at the
same time
440
β€’ Large amounts of collinearity
– a problem (as we shall see) sometimes
– not an assumption
β€’ Exercise 7.11
441
Assumption 8: The mean of the
error term is zero.
You will like this one.
442
Mean of the Error Term = 0
β€’ Mean of the residuals = 0
β€’ That is what the constant is for
– if the mean of the error term deviates from
zero, the constant soaks it up
Y = β0 + β1x1 + ε
Y = (β0 + 3) + β1x1 + (ε − 3)
- note, Greek letters because we are
talking about population values
443
β€’ Can do regression without the constant
– Usually a bad idea
– E.g R2 = 0.995, p < 0.001
β€’ Looks good
444
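• A minimal Stata sketch, assuming variables y and x1 (hypothetical names):
  reg y x1                 // usual model, with a constant
  reg y x1, noconstant     // forces the line through the origin – usually a bad idea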
[Figure: y plotted against x1]
445
Lesson 8: Issues in
Regression Analysis
Things that alter the
interpretation of the regression
equation
446
The Four Issues
• Causality
• Sample sizes
• Collinearity
• Measurement error
447
Causality
448
What is a Cause?
β€’ Debate about definition of cause
– some statistics (and philosophy) books try
to avoid it completely
– We are not going into depth
β€’ just going to show why it is hard
β€’ Two dimensions of cause
– Ultimate versus proximal cause
– Determinate versus probabilistic
449
Proximal versus Ultimate
β€’ Why am I here?
– I walked here because
– This is the location of the class because
– Eric Tanenbaum asked me because
– (I don’t know)
– because I was in my office when he rang
because
– I was a lecturer at Derby University
because
– I saw an advert in the paper because
450
– I exist because
– My parents met because
– My father had a job …
β€’ Proximal cause
– the direct and immediate cause of
something
β€’ Ultimate cause
– the thing that started the process off
– I fell off my bicycle because of the bump
– I fell off because I was going too fast
451
Determinate versus Probabilistic
Cause
β€’ Why did I fall off my bicycle?
– I was going too fast
– But every time I ride too fast, I don’t fall
off
– Probabilistic cause
β€’ Why did my tyre go flat?
– A nail was stuck in my tyre
– Every time a nail sticks in my tyre, the tyre
goes flat
– Deterministic cause
452
β€’ Can get into trouble by mixing them
together
– Eating deep fried Mars Bars and doing no
exercise are causes of heart disease
– β€œMy Grandad ate three deep fried Mars
Bars every day, and the most exercise he
ever got was when he walked to the shop
next door to buy one”
– (Deliberately?) confusing deterministic and
probabilistic causes
453
Criteria for Causation
β€’ Association (correlation)
• Direction of Influence (a → b)
• Isolation (not c → a and c → b)
454
Association
β€’ Correlation does not mean causation
– we all know
β€’ But
– Causation does mean correlation
β€’ Need to show that two things are related
– may be correlation
– may be regression when controlling for third (or
more) factor
455
β€’ Relationship between price and sales
– suppliers may be cunning
– when people want it more
β€’ stick the price up
            Price   Demand   Sales
  Price      1       0.6      0
  Demand    0.6       1       0.6
  Sales      0       0.6      1
– So – no relationship between price
and sales
456
– Until (or course) we control for demand
– b1 (Price) = -0.56
– b2 (Demand) = 0.94
β€’ But which variables do we enter?
457
Direction of Influence
β€’ Relationship between A and B
– three possible processes
A → B   (A causes B)
B → A   (B causes A)
C → A and C → B   (C causes A and B)
458
β€’ How do we establish the direction of
influence?
– Longitudinally?
Barometer drops → Storm
– Now if we could just get that barometer
needle to stay where it is …
β€’ Where the role of theory comes in
(more on this later)
459
Isolation
β€’ Isolate the outcome from all other
influences
– as experimenters try to do
β€’ Cannot do this
– can statistically isolate the effect
– using multiple regression
460
Role of Theory
β€’ Strong theory is crucial to making
causal statements
β€’ Fisher said: to make causal statements
β€œmake your theories elaborate.”
– don’t rely purely on statistical analysis
β€’ Need strong theory to guide analyses
– what critics of non-experimental research
don’t understand
461
β€’ S.J. Gould – a critic
– says correlate price of petrol and his age,
for the last 10 years
– find a correlation
– Ha! (He says) that doesn’t mean there is a
causal link
– Of course not! (We say).
β€’ No social scientist would do that analysis
without first thinking (very hard) about the
possible causal relations between the variables
of interest
β€’ Would control for time, prices, etc …
462
β€’ Atkinson, et al. (1996)
– relationship between college grades and
number of hours worked
– negative correlation
– Need to control for other variables –
ability, intelligence
• Gould says "Most correlations are noncausal" (1982, p. 243)
– Of course!!!!
463
I drink a lot of beer
→ 16 causal relations: laugh, bathroom, jokes (about statistics), children wake early, karaoke, curtains closed, sleeping, headache, equations (beermat), thirsty, fried breakfast, no beer, curry, chips, falling over, lose keys
→ 120 non-causal correlations among those 16 consequences
464
β€’ Abelson (1995) elaborates on this
– β€˜method of signatures’
β€’ A collection of correlations relating to
the process
– the β€˜signature’ of the process
β€’ e.g. tobacco smoking and lung cancer
– can we account for all of these findings
with any other theory?
465
1. The longer a person has smoked cigarettes, the greater the risk of cancer.
2. The more cigarettes a person smokes over a given time period, the greater the risk of cancer.
3. People who stop smoking have lower cancer rates than do those who keep smoking.
4. Smokers' cancers tend to occur in the lungs, and be of a particular type.
5. Smokers have elevated rates of other diseases.
6. People who smoke cigars or pipes, and do not usually inhale, have abnormally high rates of lip cancer.
7. Smokers of filter-tipped cigarettes have lower cancer rates than other cigarette smokers.
8. Non-smokers who live with smokers have elevated cancer rates.
(Abelson, 1995: 183-184)
466
– In addition, should be no anomalous
correlations
• If smokers had more fallen arches than non-smokers, that would not be consistent with the theory
β€’ Failure to use theory to select
appropriate variables
– specification error
– e.g. in previous example
– Predict wealth from price and sales
β€’ increase price, price increases
β€’ Increase sales, price increases
467
β€’ Sometimes these are indicators of the
process, not the process itself
– e.g. barometer – stopping the needle won’t
help
– e.g. inflation? Indicator or cause of
economic health?
468
No Causation without
Experimentation
β€’ Blatantly untrue
– I don’t doubt that the sun shining makes
us warm
β€’ Why the aversion?
– Pearl (2000) says the problem is that there is no mathematical operator for causation (in the way that "=" exists for equality)
– No one realised that you needed one
– Until you build a robot
469
AI and Causality
β€’ A robot needs to make judgements
about causality
β€’ Needs to have a mathematical
representation of causality
– Suddenly, a problem!
– Doesn’t exist
β€’ Most operators are non-directional
β€’ Causality is directional
470
Sample Sizes
β€œHow many subjects does it take
to run a regression analysis?”
471
Introduction
β€’ Social scientists don’t worry enough about the
sample size required
– β€œWhy didn’t you get a significant result?”
– β€œI didn’t have a large enough sample”
β€’ Not a common answer, but very common reason
β€’ More recently awareness of sample size is
increasing
– use too few – no point doing the research
– use too many – waste their time
472
β€’ Research funding bodies
β€’ Ethical review panels
– both become more interested in sample
size calculations
β€’ We will look at two approaches
– Rules of thumb (quite quickly)
– Power Analysis (more slowly)
473
Rules of Thumb
β€’ Lots of simple rules of thumb exist
– 10 cases per IV
– and at least 100 cases
– Green (1991) more sophisticated
β€’ To test significance of R2 – N = 50 + 8k
β€’ To test significance of slopes, N = 104 + k
β€’ Rules of thumb don’t take into account
all the information that we have
– Power analysis does
474
Power Analysis
Introducing Power Analysis
β€’ Hypothesis test
– tells us the probability of a result of that
magnitude occurring, if the null hypothesis is
correct (i.e. there is no effect in the population)
β€’ Doesn’t tell us
– the probability of that result, if the null hypothesis
is false (i.e., there actually is an effect in the
population)
475
β€’ According to Cohen (1982) all null
hypotheses are false
– everything that might have an effect, does
have an effect
β€’ it is just that the effect is often very tiny
476
Type I Errors
β€’ Type I error is false rejection of H0
β€’ Probability of making a type I error
– a – the significance value cut-off
β€’ usually 0.05 (by convention)
β€’ Always this value
β€’ Not affected by
– sample size
– type of test
477
Type II errors
β€’ Type II error is false acceptance of the
null hypothesis
– Much, much trickier
β€’ We think we have some idea
– we almost certainly don’t
β€’ Example
– I do an experiment (random sampling, all
assumptions perfectly satisfied)
– I find p = 0.05
478
– You repeat the experiment exactly
β€’ different random sample from same population
– What is probability you will find p < 0.05?
– Answer: 0.5
– Another experiment, I find p = 0.01
– Probability you find p < 0.05?
– Answer: 0.79
β€’ Very hard to work out
– not intuitive
– need to understand non-central sampling
distributions (more in a minute)
479
• Probability of a type II error = beta (β)
– same symbol as the population regression parameter (to be confusing)
• Power = 1 − β
– Probability of getting a significant result
(given that there is a significant result to
be found)
480
                                   State of the World
  Research findings          H0 true                   H0 false
                             (no effect to be found)   (effect to be found)
  H0 true (we find no        ✓ (correct)               Type II error, p = β
  effect – p > 0.05)                                   (power = 1 − β)
  H0 false (we find an       Type I error, p = α       ✓ (correct)
  effect – p < 0.05)
481
• Four parameters in power analysis
– α – prob. of Type I error
– β – prob. of Type II error (power = 1 − β)
– Effect size – size of effect in population
– N
• Know any three, can calculate the fourth
– Look at them one at a time
482
• α – probability of a Type I error
– Usually set to 0.05
– Somewhat arbitrary
  • sometimes adjusted because of circumstances
  – rarely because of power analysis
– May want to adjust it, based on power analysis
483
• β – probability of a Type II error
– Power (probability of finding a result) = 1 − β
– Standard is 80%
  • Some argue for 90%
– Implication that a Type I error is 4 times more serious than a Type II error
  • adjust the ratio with a compromise power analysis
484
• Effect size in the population
– Most problematic to determine
– Three ways:
  1. What effect size would be useful to find?
     • R² = 0.01 – no use (probably)
  2. Base it on previous research
     – what have other people found?
  3. Use Cohen's conventions
     – small R² = 0.02
     – medium R² = 0.13
     – large R² = 0.26
485
– Effect size usually measured as f²
– For R²:
     f² = R² / (1 − R²)
486
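• Worked example, using the conventions above: a 'medium' effect of R² = 0.13 gives f² = 0.13 / (1 − 0.13) ≈ 0.15; a 'large' effect of R² = 0.26 gives f² = 0.26 / 0.74 ≈ 0.35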
– For (standardised) slopes:
     f² = sri² / (1 − R²)
– Where sr² is the contribution to the variance accounted for by the variable of interest
– i.e. sr² = R² (with variable) − R² (without)
  • the change in R² in hierarchical regression
487
β€’ N – the sample size
– usually use other three parameters to
determine this
– sometimes adjust other parameters (a)
based on this
– e.g. You can have 50 participants. No
more.
488
Doing power analysis
β€’ With power analysis program
– SamplePower, Gpower (free), Nquery
– With Stata command sampsi
β€’ Which I find very confusing
β€’ But we’ll use it anyway
489
sampsi
β€’ Limited in usefulness
– A categorical, two group predictor
β€’ sampsi 0 0.5, pre(1) r01(0.5)
n1(50) sd(1)
– Find power for detecting an effect of 0.5
β€’ When there’s one other variable at baseline
β€’ Which correlates 0.5
β€’ 50 people in each group
β€’ When sd is 1.0
490
sampsi …
Method: ANCOVA
  relative efficiency = 1.143
  adjustment to sd    = 0.935
  adjusted sd1        = 0.935
Estimated power:
  power = 0.762
491
GPower
β€’ Better for regression designs
492
Underpowered Studies
β€’ Research in the social sciences is often
underpowered
– Why?
– See Paper B11 – β€œthe persistence of
underpowered studies”
495
Extra Reading
β€’ Power traditionally focuses on p values
– What about CIs?
– Paper B8 – β€œObtaining regression
coefficients that are accurate, not simply
significant”
496
β€’ Exercise 8.1
497
Collinearity
498
Collinearity as Issue and
Assumption
β€’ Collinearity (multicollinearity)
– the extent to which the predictors are
(multiply) correlated
β€’ If R2 for any IV, using other IVs = 1.00
– perfect collinearity
– variable is linear sum of other variables
– regression will not proceed
– (SPSS will arbitrarily throw out a variable)
499
β€’ R2 < 1.00, but high
– other problems may arise
β€’ Four things to look at in collinearity
– meaning
– implications
– detection
– actions
500
Meaning of Collinearity
β€’ Literally β€˜co-linearity’
– lying along the same line
β€’ Perfect collinearity
– when some IVs predict another
– Total = S1 + S2 + S3 + S4
– S1 = Total – (S2 + S3 + S4)
– rare
501
β€’ Less than perfect
– when some IVs are close to predicting
other IVs
– correlations between IVs are high (usually, but not always) → high multiple correlations
502
Implications
β€’ Effects the stability of the parameter
estimates
– and so the standard errors of the
parameter estimates
– and so the significance and CIs
β€’ Because
– shared variance, which the regression
procedure doesn’t know where to put
503
β€’ Sex differences
– due to genetics?
– due to upbringing?
– (almost) perfect collinearity
β€’ statistically impossible to tell
504
β€’ When collinearity is less than perfect
– increases variability of estimates between
samples
– estimates are unstable
– reflected in the variances, and hence
standard errors
505
Detecting Collinearity
β€’ Look at the parameter estimates
– large standardised parameter estimates
(>0.3?), which are not significant
β€’ be suspicious
β€’ Run a series of regressions
– each IV as outcome
– all other IVs as IVs
β€’ for each IV
506
β€’ Sounds like hard work?
– SPSS does it for us!
β€’ Ask for collinearity diagnostics
– Tolerance – calculated for every IV
Tolerance ο€½ 1-R
2
– Variance Inflation Factor
β€’ sq. root of amount s.e. has been increased
1
VIF ο€½
Tolerance
507
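• A minimal Stata sketch, assuming an outcome y and predictors x1–x3 (hypothetical names):
  reg y x1 x2 x3
  estat vif      // reports VIF and tolerance (1/VIF) for each predictor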
Actions
What you can do about collinearity – "no quick fix" (Fox, 1991)
1. Get new data
   • avoids the problem
   • address the question in a different way
   • e.g. find people who have been raised as the 'wrong' gender
     – such people exist, but are rare
   • Not a very useful suggestion
508
2. Collect more data
   • not different data, more data
   • collinearity increases the standard error (se)
   • se decreases as N increases
     – so get a bigger N
3. Remove / combine variables
   • If an IV correlates highly with the other IVs, it is not telling us much new
   • If you have two (or more) IVs which are very similar
     • e.g. 2 measures of depression, socio-economic status, achievement, etc.
509
   • sum them, average them, or remove one
   • Many measures
     • use principal components analysis to reduce them
4. Use stepwise regression (or some flavour of it)
   • See previous comments
   • Can be useful in a theoretical vacuum
5. Ridge regression
   • not very useful
   • behaves weirdly
510
β€’ Exercise 8.2, 8.3, 8.4
511
Measurement Error
512
What is Measurement Error
β€’ In social science, it is unlikely that we
measure any variable perfectly
– measurement error represents this
imperfection
β€’ We assume that we have a true score
– T
β€’ A measure of that score
–x
513
x = T + e
• just like a regression equation
– standardise the parameters
– the standardised parameter (β) for T is the reliability
β€’ the amount of variance in x which comes from T
β€’ but, like a regression equation
– assume that e is random and has mean of zero
– more on that later
514
Simple Effects of
Measurement Error
β€’ Lowers the measured correlation
– between two variables
β€’ Real correlation
– true scores (x* and y*)
β€’ Measured correlation
– measured scores (x and y)
515
[Path diagram: the true scores x* and y* correlate at rx*y*; each measured score (x and y) is produced from its true score plus error, through the reliabilities rxx and ryy; the measured correlation is rxy]
516
• Attenuation of correlation
     rxy = rx*y* × √(rxx · ryy)
• Attenuation-corrected correlation
     rx*y* = rxy / √(rxx · ryy)
517
• Example
     rxx = 0.7,  ryy = 0.8,  rxy = 0.3
     rx*y* = rxy / √(rxx · ryy)
     rx*y* = 0.3 / √(0.7 × 0.8) = 0.40
518
Complex Effects of
Measurement Error
β€’ Really horribly complex
β€’ Measurement error reduces correlations
– reduces the estimate of β
– reducing one estimate
β€’ increases others
– because of effects of control
– combined with effects of suppressor
variables
– exercise to examine this
519
Dealing with Measurement
Error
β€’ Attenuation correction
– very dangerous
– not recommended
β€’ Avoid in the first place
– use reliable measures
– don’t discard information
β€’ don’t categorise
β€’ Age: 10-20, 21-30, 31-40 …
520
Complications
β€’ Assume measurement error is
– additive
– linear
β€’ Additive
– e.g. weight – people may under-report / over-report at the extremes
β€’ Linear
– particularly the case when using proxy variables
521
β€’ e.g. proxy measures
– Want to know effort on childcare, count
number of children
β€’ 1st child is more effort than 19th child
– Want to know financial status, count
income
β€’ 1st £1 much greater effect on financial status
than the 1,000,000th.
522
β€’ Exercise 8.5
523
Lesson 9: Non-Linear Analysis
in Regression
524
Introduction
β€’ Non-linear effect occurs
– when the effect of one predictor
– is not consistent across the range of the IV
β€’ Assumption is violated
– expected value of residuals = 0
– no longer the case
525
Some Examples
526
A Learning Curve
[Figure: skill plotted against experience]
527
Yerkes-Dodson Law of Arousal
[Figure: performance plotted against arousal]
528
Enthusiasm Levels over a Lesson on Regression
[Figure: enthusiasm (from suicidal to enthusiastic) plotted against time, 0 to 3.5 hours]
529
β€’ Learning
– line changed direction once
β€’ Yerkes-Dodson
– line changed direction once
β€’ Enthusiasm
– line changed direction twice
530
Everything is Non-Linear
• Every relationship we look at is non-linear, for two reasons
– Exam results cannot keep increasing with
reading more books
β€’ Linear in the range we examine
– For small departures from linearity
β€’ Cannot detect the difference
β€’ Non-parsimonious solution
531
Non-Linear Transformations
532
Bending the Line
β€’ Non-linear regression is hard
– We cheat, and linearise the data
β€’ Do linear regression
Transformations
β€’ We need to transform the data
– rather than estimating a curved line
β€’ which would be very difficult
β€’ may not work with OLS
– we can take a straight line, and bend it
– or take a curved line, and straighten it
β€’ back to linear (OLS) regression
533
• We still do linear regression
– Linear in the parameters
– Y = b1x + b2x² + …
• Can do non-linear regression
– Non-linear in the parameters
– e.g. Y = b1x^(b2) + … (a parameter appears as an exponent)
• Much trickier
– Statistical theory either breaks down OR becomes harder
534
β€’ Linear transformations
– multiply by a constant
– add a constant
– change the slope and the intercept
535
[Figure: the lines y = x, y = x + 3 and y = 2x]
536
β€’ Linear transformations are no use
– alter the slope and intercept
– don’t alter the standardised parameter
estimate
β€’ Non-linear transformation
– will bend the slope
– quadratic transformation
y = x²
– one change of direction
537
– Cubic transformation
y = x² + x³
– two changes of direction
538
β€’ To estimate a non-linear regression
– we don't actually estimate anything non-linear
– we transform the x-variable to a non-linear
version
– can estimate that straight line
– represents the curve
– we don’t bend the line, we stretch the
space around the line, and make it flat
539
Detecting Non-linearity
540
Draw a Scatterplot
β€’ Draw a scatterplot of y plotted against x
– see if it looks a bit non-linear
– e.g. Education and beginning salary
β€’ from bank data
β€’ with line of best fit
541
A Real Example
β€’ Starting salary and years of education
– From employee data.sav
542
[Figure: beginning salary plotted against educational level (years), with the fitted (linear) values – in parts of the education range the expected value of the error (residual) is > 0, in others it is < 0]
543
Use Residual Plot
β€’ Scatterplot is only good for one variable
– use the residual plot (that we used for
heteroscedasticity)
β€’ Good for many variables
544
β€’ We want
– points to lie in a nice straight sausage
545
β€’ We don’t want
– a nasty bent sausage
546
• Educational level and starting salary
[Figure: residuals plotted against the linear prediction – the residuals bend rather than lying in a straight band]
547
Carrying Out Non-Linear
Regression
548
Linear Transformation
β€’ Linear transformation doesn’t change
– interpretation of slope
– standardised slope
– se, t, or p of slope
– R2
β€’ Can change
– effect of a transformation
549
β€’ Actually more complex
– with some transformations can add a
constant with no effect (e.g. quadratic)
β€’ With others does have an effect
– inverse, log
β€’ Sometimes it is necessary to add a
constant
– negative numbers have no square root
– 0 has no log
550
Education and Salary
Linear Regression
β€’ Saw previously that the assumption of
expected errors = 0 was violated
β€’ Anyway …
– R2 = 0.401, p < 0.001
– salbegin = −6290 + 1727 × educ
– Standardised
β€’ b1 (educ) = 0.633
– Both parameters make sense
551
Non-linear Effect
• Compute a new variable
– quadratic
– educ2 = educ²
• Add this variable to the equation
– R² = 0.585, p < 0.001
– salbegin = 46263 − 6542 × educ + 310 × educ²
β€’ slightly curious
– Standardised
β€’ b1 (educ) = -2.4
β€’ b2 (educ2) = 3.1
– What is going on?
552
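• A minimal Stata sketch of adding the quadratic term (bank data names educ, salbegin):
  gen educ2 = educ^2
  reg salbegin educ              // linear model
  reg salbegin educ educ2        // add the quadratic term
  * or, with factor-variable notation and no new variable:
  reg salbegin c.educ##c.educ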
β€’ Collinearity
– is what is going on
– Correlation of educ and educ2
β€’ r = 0.990
– Regression equation becomes difficult
(impossible?) to interpret
β€’ Need hierarchical regression
– what is the change in R2
– is that change significant?
– R2 (change) = 0.184, p < 0.001
553
Cubic Effect
β€’ While we are at it, let’s look at the cubic
effect
– R² (change) = 0.004, p = 0.045
– salbegin = 19138 + 103 × e − 206 × e² + 12 × e³
– Standardised:
    b1(e) = 0.04
    b2(e²) = −2.04
    b3(e³) = 2.71
554
Fourth Power
β€’ Keep going while we are ahead?
– When do we stop?
555
Interpretation
β€’ Tricky, given that parameter estimates
are a bit nonsensical
β€’ Two methods
β€’ 1: Use R2 change
– Save predicted values
β€’ or calculate predicted values to plot line of best
fit
– Save them from equation
– Plot against IV
556
[Figure: beginning salary plotted against educational level (years), with the predicted values from the fitted models]
557
• Differentiate with respect to e
• We said:
     s = 19138 + 103 × e − 206 × e² + 12 × e³
– but first we will simplify it to the quadratic
     s = 46263 − 6542 × e + 310 × e²
• dy/dx = −6542 + 310 × 2 × e
558
Education    Slope
    9         −962
   10         −342
   11          278
   12          898
   13         1518
   14         2138
   15         2758
   16         3378
   17         3998
   18         4618
   19         5238
   20         5858

1 year of education at the higher end of the scale is worth more than 1 year at the lower end of the scale. MBA versus GCSE.
559
• Differentiate the cubic
     s = 19138 + 103 × e − 206 × e² + 12 × e³
     dy/dx = 103 − 206 × 2 × e + 12 × 3 × e²
β€’ Can calculate slopes for quadratic and
cubic at different values
560
Education    Slope (Quad)    Slope (Cub)
    9           −962            −689
   10           −342            −417
   11            278             −73
   12            898             343
   13           1518             831
   14           2138            1391
   15           2758            2023
   16           3378            2727
   17           3998            3503
   18           4618            4351
   19           5238            5271
   20           5858            6263
561
A Quick Note on
Differentiation
• For y = x^p
– dy/dx = p·x^(p−1)
• For equations such as
     y = b1·x + b2·x^p
     dy/dx = b1 + b2·p·x^(p−1)
• y = 3x + 4x²
– dy/dx = 3 + 4 · 2x
562
• y = b1·x + b2·x² + b3·x³
– dy/dx = b1 + b2 · 2x + b3 · 3 · x²
• y = 4x + 5x² + 6x³
– dy/dx = 4 + 5 · 2 · x + 6 · 3 · x²
β€’ Many functions are simple to
differentiate
– Not all though
563
Splines and Knots
β€’ Estimate a different slope following an
event
– Lines are splines
– Events are knots
β€’ Event might be known
– Marriage
β€’ Might be unknown
– How many years after brain injury does
recovery start
564
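• A minimal Stata sketch using mkspline, assuming an outcome y, a predictor time and a known knot at 12 (all names and the knot location hypothetical):
  mkspline time1 12 time2 = time
  reg y time1 time2
  * coefficient on time1 = slope before the knot; on time2 = slope after it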
Lesson 10: Regression for
Counts and Categories
Dichotomous/Nominal outcomes
565
Contents
β€’ General and Generalized Linear Models
β€’ Dichotomous – logistic / probit
β€’ Counts – Poisson and negative binomial
566
GLMs and GLMs
β€’ General linear models
– Ordinary least squares regression based
models
– Identity link function
– Regression, ANOVA, correlation, etc
β€’ Generalized linear models
– More links
– More error structures
– General linear models are a subset of
generalized linear models
567
Dichotomous
β€’ Often in social sciences, we have a
dichotomous/nominal outcome
– we will look at dichotomous first, then a quick look
at multinomial
β€’ Dichotomous outcome
• e.g.
– guilty/not guilty
– pass/fail
– won/lost
– alive/dead (used in medicine)
568
Why Won’t OLS Do?
569
Example: PTSD in Veterans
β€’ How does length of deployment affect
probability of PTSD?
– Have PTSD, or don’t.
– We might be interested in severity
β€’ Army are not
β€’ If you have PTSD, you need help
– Not going back
β€’ Develop a selection procedure
– Two predictor variables
– Rank: 1 = Staff Sgt, 5 = Private
– Deployment length (months)
570
• 1st ten cases

  Rank   Months   PTSD
    5       6       0
    1      15       0
    1      12       0
    4       6       0
    1      15       1
    1       6       0
    4      16       1
    1      10       1
    3      12       0
    4      26       1
571
β€’ outcome
– PTSD (1 = Yes, 0 = No)
• Just consider rank first
– Carry out regression
– Rank as predictor, PTSD as outcome
– R2 = 0.097, F = 4.1, df = 1, 48, p = 0.028.
– b0 = 0.190
– b1 = 0.110, p=0.028
β€’ Seems OK
572
β€’ Residual plot
573
β€’ Problems 1 and 2
– strange distributions of residuals
– parameter estimates may be wrong
– standard errors will certainly be wrong
574
β€’ 2nd problem – interpretation
– I have rank 2
– Predicted = 0.190 + 0.110 × 2 = 0.41
– I have rank 8
– Predicted = 0.190 + 0.110 × 8 = 1.07
β€’ Seems OK, but
– What does it mean?
– Cannot score 0.41 or 1.07
β€’ can only score 0 or 1
β€’ Cannot be interpreted
– need a different approach
575
A Different Approach
Logistic Regression
576
Logit Transformation
β€’ In lesson 9, transformed IVs
– now transform the outcome
β€’ Need a transformation which gives us
– graduated scores (between 0 and 1)
– No upper limit
β€’ we can’t predict someone will pass twice
– No lower limit
β€’ you can’t do worse than fail
577
Step 1: Convert to Probability
β€’ First, stop talking about values
– talk about probability
– for each value of score, calculate
probability of pass
β€’ Solves the problem of graduated scales
578
The probability of no PTSD given a rank of 1 is 0.7; given a rank of 5 it is 0.2.

  Score              1     2     3     4     5
  No PTSD    N       7     5     6     4     2
             P      0.7   0.5   0.6   0.4   0.2
  PTSD       N       3     5     4     6     8
             P      0.3   0.5   0.4   0.6   0.8
579
This is better
β€’ Now a score of 0.41 has a meaning
– a 0.41 probability of pass
β€’ But a score of 1.07 has no meaning
– cannot have a probability > 1 (or < 0)
– Need another transformation
580
Step 2: Convert to Odds-Ratio
Need to remove upper limit
β€’ Convert to odds
β€’ Odds, as used by betting shops
– 5:1, 1:2
β€’ Slightly different from odds in speech
– a 1 in 2 chance
– odds are 1:1 (evens)
– 50%
581
• Odds ratio = (number of times it happened) / (number of times it didn't happen)

     odds = p(event) / p(not event) = p(event) / (1 − p(event))
582
β€’ 0.8 = 0.8/0.2 = 4
– equivalent to 4:1 (odds on)
– 4 times out of five
β€’ 0.2 = 0.2/0.8 = 0.25
– equivalent to 1:4 (4:1 against)
– 1 time out of five
583
β€’ Now we have solved the upper bound
problem
– we can interpret 1.07, 2.07, 1000000.07
β€’ But we still have the zero problem
– we cannot interpret predicted scores less
than zero
584
Step 3: The Log
• Log10 of a number (x):
     10^log10(x) = x
β€’ log(10) = 1
β€’ log(100) = 2
β€’ log(1000) = 3
585
β€’ log(1) = 0
β€’ log(0.1) = -1
β€’ log(0.00001) = -5
586
Natural Logs and e
β€’ Don’t use log10
– Use loge
β€’ Natural log, ln
β€’ Has some desirable properties, that log10
doesn’t
– For us:
– If y = ln(x) + c
– dy/dx = 1/x
– Not true for any other logarithm
587
β€’ Be careful – calculators and stats
packages are not consistent when they
use log
– Sometimes log10, sometimes loge
588
Take the natural log of the odds ratio
• Goes from −∞ to +∞
– can interpret any predicted value
589
Putting them all together
β€’ Logit transformation
– log-odds ratio
– not bounded at zero or one
590
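• A minimal Stata sketch of the transformation and its inverse:
  display ln(0.7 / (1 - 0.7))    // probability 0.7 → logit of about 0.847
  display logit(0.7)             // Stata's built-in logit() gives the same
  display invlogit(0.847)        // back from the logit to a probability of about 0.7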
  Score                  1      2      3      4      5
  No PTSD      N         7      5      6      4      2
               P        0.7    0.5    0.6    0.4    0.2
  PTSD         N         3      5      4      6      8
               P        0.3    0.5    0.4    0.6    0.8
  Odds (No PTSD)        2.33   1.00   1.50   0.67   0.25
  log(odds) No PTSD     0.85   0.00   0.41  −0.41  −1.39
591
[Figure: probability (0 to 1) plotted against the logit (−3.5 to +3.5) – an S-shaped curve; the probability gets closer to zero, but never reaches it, as the logit goes down]
592
β€’ Hooray! Problem solved, lesson over
– errrmmm… almost
β€’ Because we are now using log-odds
ratio, we can’t use OLS
– we need a new technique, called Maximum
Likelihood (ML) to estimate the parameters
593
Parameter Estimation using
ML
ML tries to find estimates of model
parameters that are most likely to give
rise to the pattern of observations in
the sample data
β€’ All gets a bit complicated
– OLS is a special case of ML
– the mean is an ML estimator
594
β€’ Don’t have closed form equations
– must be solved iteratively
– estimates parameters that are most likely
to give rise to the patterns observed in the
data
– by maximising the likelihood function (LF)
β€’ We aren’t going to worry about this
– except to note that sometimes, the
estimates do not converge
β€’ ML cannot find a solution
595
R2 in Logistic Regression
β€’ A dichotomous variable doesn’t have
variance
– If you know the mean (proportion) you
know the variance
– You can’t have R2.
β€’ There are several pseudo-R2
β€’ None are perfect
– There’s something better
596
Logistic Regression in Stata
β€’ Exercise 10.1
β€’ Two (almost) equivalent commands
– logistic ptsd rank deployment
– logit ptsd rank deployment
597
Logit
• Gives output in log-odds
• logit ptsd rank deployment

------------------------------------------------------------------------------
        pass | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
  deployment |   1.158213    .0841987   2.02   0.043      1.004404    1.335575
        rank |   1.333192    .4011279   0.96   0.339      .7392395    2.404365
------------------------------------------------------------------------------
598
Logistic
• Gives output in odds ratios
– No intercept
• logistic ptsd rank deployment

------------------------------------------------------------------------------
        pass | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
  deployment |   1.158213    .0841987   2.02   0.043      1.004404    1.335575
        rank |   1.333192    .4011279   0.96   0.339      .7392395    2.404365
------------------------------------------------------------------------------
599
β€’ SPSS produces a classification table
– And Stata produces it if you ask
– predictions of model
– based on cut-off of 0.5 (by default)
– predicted values x actual values
β€’ DO NOT USE IT!
β€’ Will this person go to prison?
– No.
– You will be right 99.9% of the time
– Doesn’t mean you have a good model
– (Gottman and Murray – Blink)
600
Classification Table(a)
                                  Predicted PASS       Percentage
Observed                            0        1           Correct
Step 1    PASS           0         18        8              69.2
                         1         12       12              50.0
          Overall Percentage                                 60.0
a. The cut value is .500
601
Model parameters
• B
  – Change in the logged odds associated with a change of 1 unit in the IV
  – just like OLS regression
  – difficult to interpret
• SE(B)
  – Standard error
  – B ± 1.96 × SE(B) gives the 95% CIs (on the log-odds scale)
602
β€’ Constant
– i.e. score = 0
– B = 1.314
– Exp(B) = eB = e1.314 = 3.720
– OR = 3.720, p = 1 – (1 / (OR + 1))
= 1 – (1 / (3.720 + 1))
– p = 0.788
603
β€’ Score 1
– Constant b = 1.314
– Score B = -0.467
– Exp(1.314 – 0.467) = Exp(0.847)
= 2.332
– OR = 2.332
– p = 1 – (1 / (2.332 + 1))
= 0.699
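The same conversions can be checked with Stata's invlogit() function, which goes straight from log-odds to probability (a minimal sketch using the estimates above):

  display invlogit(1.314)            // 0.788 – constant only (score = 0)
  display invlogit(1.314 - 0.467)    // 0.699 – score = 1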
604
Standard Errors and CIs
β€’ Symmetrical in B
– Non-symmetrical (sometimes very) in
exp(B)
605
β€’ The odds of failing the test are
multiplied by 0.63 (CIs = 0.408, 0.962
p = 0.033), for every additional point
on the aptitude test.
606
Hierarchical Logistic
Regression
β€’ In OLS regression
– Use R2 change
β€’ In logistic regression
– Use chi-square change
• The difference in the two chi-squares is itself distributed as chi-square
• with df equal to the difference in the dfs
607
Hierarchical Logistic
Regression
β€’ Model 1: Experience
β€’ Model 2: Experience + Score
β€’ Model 1:
– Chi-square =4.83, df = 1
β€’ Model 2:
– Chi-square =5.77, df = 2
608
• Difference:
  – Chi-square = 5.77 − 4.83 = 0.94
  – df = 2 − 1 = 1
• display chi2tail(1, 0.94)
• p = 0.332
• P-value from the Wald SE = 0.339
• Why the difference?
609
More on Standard Errors
• Because of Wald standard errors
  – Wald SEs are overestimated
  – which makes the p-values from the estimates too high
  – (the CIs are still correct)
610
β€’ Two estimates use slightly different
information
– P-value says β€œwhat if no effect”
– CI says β€œwhat if there is this effect”
β€’ Variance depends on the hypothesised ratio of the
number of people in the two groups
• Can calculate likelihood-ratio based p-values (see the sketch below)
  – If you can be bothered
  – Some packages provide them automatically
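A minimal sketch of a likelihood-ratio p-value in Stata, assuming the PTSD example variables from earlier (fit the model with and without the predictor of interest, then compare):

  logit ptsd deployment rank
  estimates store full
  logit ptsd deployment
  lrtest full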
611
Probit Regression
β€’ Very similar to logistic
– much more complex initial transformation
(to normal distribution)
– Very similar results to logistic (multiplied by
1.7)
β€’ Swap logistic for probit in Stata
command
  – Harder to interpret
    • the parameter doesn't correspond to anything familiar, the way a log-odds does
612
Differentiating Between Probit
and Logistic
β€’ Depends on shape of the error term
– Normal or logistic
– Graphs are very similar to each other
β€’ Could distinguish quality of fit
– Given enormous sample size
β€’ Logistic = probit x 1.7
– Actually 1.6998
β€’ Probit advantage
– Understand the distribution
β€’ Logistic advantage
– Much simpler to get back to the probability
613
[Figure: the normal (probit) and logistic curves plotted together; the two are very similar.]
614
Infinite Parameters
β€’ Non-convergence can happen because
of infinite parameters
– Insoluble model
β€’ Three kinds:
β€’ Complete separation
– The groups are completely distinct
β€’ Pass group all score more than 10
β€’ Fail group all score less than 10
615
β€’ Quasi-complete separation
– Separation with some overlap
β€’ Pass group all score 10 or more
β€’ Fail group all score 10 or less
β€’ Both cases:
– No convergence
β€’ Close to this
– Curious estimates
– Curious standard errors
616
β€’ Categorical Predictors
– Can cause separation
– Especially if correlated
β€’ Need people in every cell
                          Male                     Female
                    White    Non-White       White    Non-White
Below Poverty Line
Above Poverty Line
617
Logistic Regression and
Diagnosis
β€’ Logistic regression can be used for diagnostic
tests
– For every score
β€’ Calculate probability that result is positive
β€’ Calculate proportion of people with that score (or lower)
who have a positive result
β€’ Calculate c statistic
– Measure of discriminative power
– % of all possible cases, where the model gives a
higher probability to a correct case than to an
incorrect case
618
– Perfect c-statistic = 1.0
– Random c-statistic = 0.5
619
Sensitivity and Specificity
• Sensitivity:
  – Probability that the test says someone is positive
    • given that they really are: p(test positive | positive)
• Specificity:
  – Probability that the test says someone is negative
    • given that they really are: p(test negative | negative)
620
C-Statistic, Sensitivity and
Specificity
β€’ After logistic
– lroc
β€’ Gives c-statistic
– Better than R-squared
621
[Figure: ROC curve – sensitivity against 1 − specificity. Area under ROC curve = 0.7469.]
More Advanced Techniques
β€’ Multinomial Logistic Regression more
than two categories in outcome
– same procedure
– one category chosen as reference group
β€’ odds of being in category other than reference
β€’ Ordinal multinomial logistic regression
– For ordinal outcome variables
623
More on Odds Ratios
β€’ Odds ratios are horrid
β€’ We use them because they have nice
distributional properties
β€’ Example:
– 40% in group 1 get PTSD
– 60% in group 2 get PTSD
– What’s the odds ratio?
– How is this confusing?
624
Alternatives to Odds Ratios
β€’ Risk difference
– 20 percentage points higher
β€’ Relative risk
– Probability is 1.5 times higher
– This is what you would think an odds ratio
meant
β€’ Can we use these in regression?
– RD – maybe. Sometimes.
– RR – yes. But we need to do something
else first
625
Final Thoughts
β€’ Logistic Regression can be extended
– dummy variables
– non-linear effects
– interactions
β€’ Same issues as OLS
– collinearity
– outliers
626
β€’ Same additional options as regress
– xi:
– cluster
– robust
627
Poisson Regression
628
Counts and the Poisson
Distribution
• Von Bortkiewicz (1898)
  – Numbers of Prussian soldiers kicked to death by horses (per corps, per year)

  Deaths:       0    1    2   3   4   5
  Frequency:  109   65   22   3   1   0

[Figure: bar chart of the frequencies above.]
629
β€’ The data fitted a Poisson probability distribution
– When counts of events occur, poisson distribution is
common
– E.g. papers published by researchers, police arrests,
number of murders, ship accidents
β€’ Common approach
– Log transform and treat as normal
β€’ Problems
– Censored at 0
– Integers only allowed
– Heteroscedasticity
630
The Poisson Distribution
[Figure: Poisson probability distributions for means of 0.5, 1, 4 and 8; probability (0–0.7) on the y-axis, count (0–17) on the x-axis.]
631
  p(y | x) = exp(−μ) μ^y / y!
632
  p(y | x) = exp(−μ) μ^y / y!
(Excel has a Poisson function you can use.)
• Where:
  – y is the count
  – μ is the mean of the Poisson distribution
• In a Poisson distribution
  – The mean = the variance (hence the heteroscedasticity issue)
  – μ = σ²
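Stata can also compute these probabilities directly with the poissonp() function (a minimal sketch; Excel's POISSON.DIST function does the same job):

  * P(y = 2) when the mean is 3
  display poissonp(3, 2)    // about 0.22, as in the table below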
633
Poisson Probabilities

Score:       0     1     2     3     4     5     6     7     8     9    10
Mean  1:   0.37  0.37  0.18  0.06  0.02  0.00  0.00  0.00  0.00  0.00  0.00
Mean  2:   0.14  0.27  0.27  0.18  0.09  0.04  0.01  0.00  0.00  0.00  0.00
Mean  3:   0.05  0.15  0.22  0.22  0.17  0.10  0.05  0.02  0.01  0.00  0.00
Mean 10:   0.00  0.00  0.00  0.01  0.02  0.04  0.06  0.09  0.11  0.13  0.13
634
Issues with Estimation
β€’ Just as with logistic
– We can’t predict a mean below zero
β€’ Don’t predict the mean
– Predict the log of the mean
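In other words the model is fitted through a log link, so the predicted mean can never fall below zero – a sketch of the idea, in the notation used above:

  ln(μ) = b0 + b1×x,   so   μ = exp(b0 + b1×x), which is positive for any value of x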
635
Poisson Regression in Stata
β€’ Adult literacy study
– Number of sessions attended
– Count variable
β€’ Poisson regression
636
poisson sessions tx

------------------------------------------------------------------------------
    sessions |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          tx |  -.2359546     .06668    -3.54   0.000     -.366645   -.1052642
       _cons |   1.899973    .046225    41.10   0.000     1.809374    1.990572
------------------------------------------------------------------------------

poisson sessions tx, irr

------------------------------------------------------------------------------
    sessions |        IRR   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          tx |   .7898166    .052665    -3.54   0.000     .6930557    .9000867
------------------------------------------------------------------------------
637
But was it Poisson?
β€’ Look at predicted probabilities
– Compare with actual probabilities
• Predicted means
  – Control: exp(1.900) = 6.69
  – Intervention: exp(1.900 − 0.236) = 5.28
β€’ Get means and SDs
638
bysort tx: sum sessions

-> tx = 0
    Variable |    Obs        Mean    Std. Dev.
-------------+--------------------------------
    sessions |     70    6.685714    3.495516

-> tx = 1
    Variable |    Obs        Mean    Std. Dev.
-------------+--------------------------------
    sessions |     82    5.280488    2.709263
639
β€’ Do OK on the means
– Don’t do OK on the variances
– Variances are too high
β€’ Compare predicted probabilities with
actual probabilities
• tab sessions tx, col nofreq
β€’ Draw graphs
– Not horrible
– Except the zeroes
640
[Figure: predicted and actual probabilities of each number of sessions (1–16), for the control and intervention groups.]
Test for Goodness of Fit to
Poisson Distribution
β€’ After running Poisson
– estat gof
  Goodness-of-fit chi2   =   314.139
  Prob > chi2(150)       =   0.0000
β€’ Highly significant
– Poisson distribution doesn’t fit
642
Overdispersion
• Problem in Poisson regression
  – variance larger than the mean (often because of too many zeroes)
• Consequences
  – χ² inflation
  – Standard error deflation
• Hence p-values too low
  – Higher type I error rate
• Two solutions
  – Negative binomial regression
  – Robust standard errors
643
Robust Standard Errors
poisson sessions tx, robust

----------------------------------------------------------
             |               Robust
    sessions |      Coef.   Std. Err.      z    P>|z|
-------------+----------------------------------------------
          tx |  -.2359546   .0840648    -2.81   0.005
       _cons |   1.899973   .0622477    30.52   0.000
----------------------------------------------------------
β€’ Robust SEs are larger
644
Negative Binomial Regression
• Adds an extra dispersion parameter to account for the excess variability (including the extra zeroes)
  – Called alpha
β€’ nbreg sessions tx
β€’ OR
β€’ nbreg sessions tx, robust
645
Back to Categorical Outcomes
β€’ We said:
– Odds ratios are not good
β€’ We like relative risk instead
– What is the ratio of the risks?
β€’ What analysis technique do we know
that gives ratios of means
646
β€’ Poisson regression!
β€’ Wait. It won’t work. The distribution is
wrong.
β€’ Robust estimates!
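A hedged sketch of that idea in Stata – a Poisson model for the binary PTSD outcome with robust standard errors, so the exponentiated coefficients can be read as relative risks (variable names are taken from the earlier logistic example):

  poisson ptsd deployment rank, vce(robust) irr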
647
Poisson Regression in SPSS
β€’ SPSS 15 (and above), has added it
β€’ Under generalized linear models
648
Lesson 11: Mediation and Path
Analysis
649
Introduction
β€’ Moderator
– Level of one variable influences effect of another
variable
β€’ Mediator
– One variable influences another via a third
variable
β€’ All relationships are really mediated
– are we interested in the mediators?
– can we make the process more explicit
650
• In the bank examples: education → beginning salary
β€’ Why?
– What is the process?
– Are we making assumptions about the
process?
– Should we test those assumptions?
651
[Path diagram: education → job skills, expectations, negotiating skills, kudos for the bank → beginning salary]
652
Direct and Indirect Influences
X may affect Y in two ways
β€’ Directly – X has a direct (causal)
influence on Y
– (or maybe mediated by other variables)
β€’ Indirectly – X affects Y via a mediating
variable - M
653
β€’ e.g. how does going to the pub effect
comprehension on a Summer school
course
– on, say, regression
[Path diagram: Having fun in pub in evening → not reading books on regression → less knowledge. Is there anything in the direct path from pub to knowledge?]
654
[Path diagram: the same model, now adding fatigue as a second mediator alongside not reading books on regression. Is the direct path still needed?]
655
β€’ Mediators needed
– to cope with more sophisticated theory in
social sciences
– make explicit assumptions made about
processes
– examine direct and indirect influences
656
Detecting Mediation
657
β€œClassic Approach” 4 Steps
From Baron and Kenny (1986)
β€’ To establish that the effect of X on Y is
mediated by M
1. Show that X predicts Y
2. Show that X predicts M
3. Show that M predicts Y, controlling for X
4. If effect of X controlling for M is zero, M
is complete mediator of the relationship
β€’
(3 and 4 in same analysis)
658
Example: Book habits
Enjoy Books → Buy books → Read Books
659
Three Variables
β€’ Enjoy
– How much an individual enjoys books
β€’ Buy
– How many books an individual buys (in a
year)
β€’ Read
– How many books an individual reads (in a
year)
660
          ENJOY    BUY    READ
ENJOY      1.00   0.64    0.73
BUY        0.64   1.00    0.75
READ       0.73   0.75    1.00
661
• The Theory: enjoy → buy → read
662
β€’ Step 1
1. Show that X (enjoy) predicts Y (read)
– b1 = 0.487, p < 0.001
– standardised b1 = 0.732
– OK
663
2. Show that X (enjoy) predicts M (buy)
– b1 = 0.974, p < 0.001
– standardised b1 = 0.643
– OK
664
3. Show that M (buy) predicts Y (read),
controlling for X (enjoy)
– b1 = 0.469, p < 0.001
– standardised b1 = 0.206
– OK
665
4. If effect of X controlling for M is zero,
M is complete mediator of the
relationship
– (Same as analysis for step 3.)
– b2 = 0.287, p = 0.001
– standardised b2 = 0.431
– Hmmmm…
β€’
Significant, therefore not a complete mediator
666
[Path diagram: enjoy → read, direct effect 0.287 (step 4); enjoy → buy, 0.974 (from step 2); buy → read, 0.206 (from step 3)]
667
The Mediation Coefficient
β€’ Amount of mediation =
Step 1 – Step 4
=0.487 – 0.287
= 0.200
β€’ OR
Step 2 x Step 3
=0.974 x 0.206
= 0.200
668
SE of Mediator
[Path diagram: enjoy → buy, coefficient a (from step 2); buy → read, coefficient b (from step 3)]
• sa = se(a)
• sb = se(b)
669
• Sobel test
  – Standard error of the mediation coefficient can be calculated:

  se = √(b²·sa² + a²·sb² − sa²·sb²)

  a = 0.974, sa = 0.189
  b = 0.206, sb = 0.054
670
β€’ Indirect effect = 0.200
– se = 0.056
– t =3.52, p = 0.001
β€’ Online Sobel test:
http://quantpsy.org
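A minimal sketch of the first-order Sobel calculation in Stata's display calculator, using the a, b, sa and sb above:

  display 0.974*0.206                                    // indirect effect a×b
  display sqrt(0.206^2*0.189^2 + 0.974^2*0.054^2)        // Sobel standard error
  display (0.974*0.206)/sqrt(0.206^2*0.189^2 + 0.974^2*0.054^2)    // z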
671
Problems with the Sobel test
β€’ Recently
– Move in methodological literature away from this
conventional approach
β€’ Problems of power:
– Several tests, all of which must be significant
β€’ Type I error rate = 0.05 * 0.05 = 0.0025
β€’ Must affect power
672
β€’ Distributional Assumption
– We assume that the sampling distribution
of the coefficient is normally distributed
β€’ Standard error is standard deviation
β€’ If:
– a (x οƒž m) is normal and not zero
– b (m οƒž y) is normal and not zero
β€’ Then:
– a×b
– Is not normally distributed
β€’ Assumption is violated
– Test is incorrect
673
β€’ Solution:
– Bootstrap
β€’ Computer intensive semi-parametric procedure
β€’ Removes distributional assumption
– Bootstrapping suggested as alternative
β€’ For Stata:
β€’ www.ats.ucla.edu/stat/stata/faq/mediat
ion_cativ.htm
β€’ For SAS, SPSS:
– www.quantpsy.org
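A hedged sketch of a bootstrapped indirect effect in Stata, written for the enjoy/buy/read example (the program name and reps are assumptions, not part of the original materials):

  capture program drop indirect
  program define indirect, rclass
      regress buy enjoy
      local a = _b[enjoy]
      regress read buy enjoy
      local b = _b[buy]
      return scalar ab = `a'*`b'
  end
  bootstrap ab = r(ab), reps(1000): indirect
  estat bootstrap, percentile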
674
Cross Sectional Bias
β€’ If everything is measured at one time
– Likely to be bias
β€’ Ideally:
– Three variables, measured on three
occasions
675
[Diagram: x, m and y each measured at three time points]
676
β€’ Kind of hard work
– Collecting data on three occasions
β€’ BUT: Stationarity assumption can save
us
[Diagram: x, m and y measured at three time points, with the paths assumed stationary over time]
677
β€’ We assume that the effect from M to Y
is stable over time
– Only need two time points
β€’ Cole and Maxwell (2003)
678
Power in Mediation
β€’ Really hard to work out
β€’ Need to run simulations
β€’ Power depends on
– Size of a
– Size of b
β€’ Fritz and Mackinnon (2007)
– Table of power for different effects
679
More Information on
Mediation
β€’ Mackinnon, Fritz and Fairchild
– Annual Review of Psychology
β€’ Mackinnon
– Introduction to statistical mediation
β€’ Iacobucci
– Mediation analysis (little green book)
β€’ Mackinnon’s website (Google: mackinnon mediation)
β€’ Facebook group
– (No, really)
680
Lesson 12: Moderators in
Regression
β€œdifferent slopes for different
folks”
681
Introduction
β€’ Moderator relationships have many
different names
– interactions (from ANOVA)
– multiplicative
– non-linear (just confusing)
– non-additive
β€’ All talking about the same thing
682
A moderated relationship occurs
β€’ when the effect of one variable
depends upon the level of another
variable
683
β€’ Hang on …
– That seems very like a nonlinear relationship
– Moderator
β€’ Effect of one variable depends on level of another
– Non-linear
β€’ Effect of one variable depends on level of itself
β€’ Where there is collinearity
– Can be hard to distinguish between them
– Paper B5
– Should (usually) compare effect sizes
684
β€’ e.g. How much it hurts when I drop a
computer on my foot depends on
– x1: how much alcohol I have drunk
– x2: how high the computer was dropped
from
– but if x1 is high enough
– x2 will have no effect
685
β€’ e.g. Likelihood of injury in a car
accident
– depends on
– x1: speed of car
– x2: if I was wearing a seatbelt
– but if x1 is low enough
– x2 will have no effect
686
[Figure: injury (0–30) against speed (5–45 mph), with separate lines for seatbelt and no seatbelt.]
687
β€’ e.g. number of words (from a list) I can
remember
– depends on
– x1: type of words (abstract, e.g. β€˜justice’, or
concrete, e.g. β€˜carrot’)
– x2: Method of testing (recognition – i.e.
multiple choice, or free recall)
– but if using recognition
– x1: will not make a difference
688
β€’ We looked at three kinds of moderator
β€’ alcohol x height = pain
– continuous x continuous
β€’ speed x seatbelt = injury
– continuous x categorical
β€’ word type x test type
– categorical x categorical
β€’ We will look at them in reverse order
689
How do we know to look for
moderators?
Theoretical rationale
β€’ Often the most powerful
β€’ Many theories predict additive/linear
effects
– Fewer predict moderator effects
Presence of heteroscedasticity
β€’ Clue there may be a moderated
relationship missing
690
Two Categorical Predictors
691
Data
• 2 IVs
  – word type (concrete [e.g. carrot, table], abstract [e.g. love, justice])
  – test method (multiple choice, recall)
• 20 participants in one of four groups
  – Concrete, MC
  – Concrete, recall
  – Abstract, MC
  – Abstract, recall
β€’ 5 per group
β€’ lesson12.1-words.dta
692
                      MC      Recall    Total
Concrete   Mean      15.4      15.2      15.3
           SD         2.5       3.2       2.7
Abstract   Mean      15.6       7.0      11.3
           SD         1.5       1.6       4.8
Total      Mean      15.5      11.1      13.3
           SD         2.0       4.9       4.3
693
• Graph of means
[Figure: mean words recalled (0–20) for concrete and abstract words, under MC and recall testing; scores are similar in three cells, but much lower for abstract words with recall.]
694
Procedure for Testing
1: Convert to dummy coding
– Already done
2: Calculate interaction term
  – Multiply the dummy codes together (see the sketch below)
  – (Can also use xi: for this)
  – Call the interaction mxc
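A minimal sketch in Stata, assuming the dummy variables in lesson12.1-words.dta are named mc and concrete (the names are an assumption):

  * interaction term: product of the two dummy codes
  gen mxc = mc*concrete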
695
• Interaction term (mxc)
  – multiply the dummy coded variables together

  Concrete   MC   mxc
      0       0     0
      0       1     0
      1       0     0
      1       1     1
696
3: Carry out regression
– Hierarchical
– linear effects first
– interaction effect in next block
697
β€’ b0(intercept)= 7.0
– Mean score when MC = 0 and concrete = 0
β€’ b1 (mc) = 8.6
– When concrete is zero, effect of MC
β€’ b2 (concrete) = 8.2
– When mc is zero, effect of concrete
            Concrete   Abstract   Total
MC            15.4       15.6     15.5
Recall        15.2        7.0     11.1
Total         15.3       11.3     13.3
698
β€’ b3 (mc x con) = -8.4
– grand mean
β€’ Given other estimates, what’s the
predicted mean of concrete, MC?
– 7.0 + 8.6 + 8.2 = 23.8
β€’ What is it?
– 15.4
            Concrete   Abstract   Total
Recog         15.4       15.6     15.5
Recall        15.2        7.0     11.1
Total         15.3       11.3     13.3
699
• Have:        15.4
• Expect:      23.8
• Difference:  −8.4
700
Back to the Graph
[Figure: the graph of means again, annotated.
 Slope for concrete words: 15.2 − 15.4 = −0.2
 Slope for abstract words: 7.0 − 15.6 = −8.6
 Difference in slopes: −8.6 − (−0.2) = −8.4]
701
b associated with interaction
β€’ The difference in the slopes
OR
β€’ The change in slope, away from the
average, associated with a 1 unit
change in the moderating variable
702
• Another way to look at it
  Y = 7 + 8.6×m + 8.2×c + (−8.4)×m×c
• Examine the concrete words group (c = 1)
  – substitute values into the equation
  Y(conc) = 7 + 8.6×m + 8.2×1 + (−8.4)×m×1
  Y(conc) = 7 + 8.6×m + 8.2 − 8.4×m
  Y(conc) = 7 + 8.2 + 8.6×m − 8.4×m
  Y(conc) = 15.2 + 0.2×m
703
Categorical x Continuous
704
Note on Dichotomisation
β€’ Very common to see people dichotomise
a variable
– Makes the analysis easier
– Very bad idea
β€’ Paper B6
705
Data
A chain of 60 supermarkets
β€’ examining the relationship between
profitability, shop size, and local
competition
β€’ 2 IVs
– shop size
– comp (local competition, 0=no, 1=yes)
β€’ outcome
– profit
706
• Data, 'lesson 12.2.dta' (extract)

  Shopsize   Comp   Profit
      4        1      23
     10        1      25
      7        0      19
     10        0       9
     10        1      18
     29        1      33
     12        0      17
      6        1      20
     14        0      21
     62        0       8
707
1st Analysis
Two IVs
β€’ R2=0.367, df=2, 57, p < 0.001
• Unstandardised estimates
  – b1 (shopsize) = 0.083 (p = 0.001)
  – b2 (comp) = −5.883 (p < 0.001)
• Standardised estimates
  – b1 (shopsize) = 0.356
  – b2 (comp) = −0.448
708
β€’ Suspicions
– Presence of competition is likely to have an
effect
– Residual plot shows a little
heteroscedasticity
709
[Figure: residuals against the linear prediction (15–30); the spread of the residuals is slightly uneven across the range.]
Procedure for Testing
• Very similar to last time
  – convert 'comp' to dummy coding
    • (if it's not already)
  – Compute the interaction term
    • comp (dummy coded) × size
  – Hierarchical regression (see the sketch below)
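A minimal Stata sketch of those steps, using the variable names profit, shopsize and comp from the data above:

  * interaction term
  gen sxc = comp*shopsize
  * hierarchical: linear effects first, then add the interaction
  regress profit shopsize comp
  regress profit shopsize comp sxc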
711
Result
β€’ Estimates
– b1 (shopsize) = 0.12, SE = 0.03
– b2 (comp) = -1.67, SE 2.50
– b3 (sxc) = -0.10, SE 0.05
712
β€’ comp now non-significant
– shows importance of hierarchical
– it obviously is important
713
Interpretation
β€’ Draw graph with lines of best fit
  – graph twoway (scatter profit shopsize if comp==1) (lfit profit shopsize if comp==1) ///
                 (scatter profit shopsize if comp==0) (lfit profit shopsize if comp==0), legend(off)
714
[Figure: profit (0–40) against shopsize (0–100), with scatter points and fitted lines drawn separately for shops with and without competition.]
715
• Substitute into the equation
• Effects of size
  – (can ignore the constant)
• Y = size×0.12 + comp×(−1.67) + size×comp×(−0.09)
  – Competition present (comp = 1)
    • Y = size×0.12 + 1×(−1.67) + size×1×(−0.09)
    • Y = size×0.12 + size×(−0.09)
    • Y = size×0.03
716
• Y = size×0.12 + comp×(−1.67) + size×comp×(−0.09)
  – Competition absent (comp = 0)
    • Y = size×0.12 + 0×(−1.67) + size×0×(−0.09)
    • Y = size×0.12
717
Two Continuous Variables
718
Data
β€’ Bank Employees
– only using clerical staff
– 363 cases
– predicting starting salary
– previous experience
– age
– age x experience
– (exercise 6.3)
719
• Correlation matrix
  – only one correlation significant

              LOGSB   AGESTART   PREVEXP
  LOGSB        1.00     -0.09      0.08
  AGESTART    -0.09      1.00      0.77
  PREVEXP      0.08      0.77      1.00
720
Initial Estimates (no moderator)
β€’ (standardised)
– R2 = 0.063, p<0.001
– Age at start = -0.37, p<0.001
– Previous experience = 0.36, p<0.001
β€’ Suppressing each other
– Age and experience compensate for one
another
– Older, with no experience, bad
– Younger, with experience, good
721
The Procedure
β€’ Very similar to previous
– create multiplicative interaction term
– BUT
β€’ Center variables (subtract mean)
– Not always necessary
– Can make life easier
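A minimal sketch of centering and forming the product term in Stata (using agestart and prevexp from the example):

  summarize agestart, meanonly
  gen c_age = agestart - r(mean)
  summarize prevexp, meanonly
  gen c_exp = prevexp - r(mean)
  gen c_axe = c_age*c_exp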
722
β€’ Hierarchical regression
– two linear effects first
– moderator effect in second
723
β€’ Change in R2
– 0.085, p<0.001
β€’ Estimates (standardised)
– b1 (agestart) = -0.52
– b2 (prevexp) = 0.93
– b3 (age x exp) = -0.56
724
Interpretation 1: Pick-a-Point
β€’ Graph is tricky
– can’t have two continuous variables
– Choose specific points (pick-a-point)
β€’ Graph the line of best fit of one variable at
others
– Two ways to pick a point
  • 1: Choose high (z = +1), medium (z = 0) and low (z = −1)
  • 2: Choose 'sensible' values – age 20, 50, 80?
725
• We know:
  – Y = e×0.94 + a×(−0.53) + a×e×(−0.58)
  – Where a = agestart, and e = experience
• We can rewrite this as:
  – Y = (e×0.94) + (a×(−0.53)) + (a×e×(−0.58))
  – Take a out of the brackets
  – Y = (e×0.94) + (−0.53 + e×(−0.58))×a
• The bracketed terms are the simple intercept and simple slope – the intercept and slope for agestart
  – ω0 = (e×0.94)
  – ω1 = (−0.53 + e×(−0.58))
  – Y = ω0 + ω1×a
726
• Pick any value of e, and we know the slope for a
  – Standardised, so it's easy
• e = −1
  – ω0 = (−1 × 0.94) = −0.94
  – ω1 = (−0.53 + (−1) × (−0.58)) = 0.05
• e = 0
  – ω0 = (0 × 0.94) = 0
  – ω1 = (−0.53 + 0 × (−0.58)) = −0.53
• e = 1
  – ω0 = (1 × 0.94) = 0.94
  – ω1 = (−0.53 + 1 × (−0.58)) = −1.11
727
Graph the Three Lines
[Figure: predicted log(salary) against age (standardised, −1 to +1), with separate lines for e = −1, e = 0 and e = 1.]
728
Do This in Stata
β€’ The easy way
β€’ Create some pseudo cases
– Some fake people
– With sensible scores for the variables
– Regression equation β€˜stays behind’
β€’ Calculate predicted scores with
predict
– (Can be in a new dataset)
729
β€’ Then draw graph
β€’ drop if _n < 364
• graph twoway (lfit pred agestart if prevexp==0, lcolor(red)) ///
               (lfit pred agestart if prevexp==85, lcolor(black)) ///
               (lfit pred agestart if prevexp==170, lcolor(green)), legend(off)
731
[Figure: predicted log(salary) (about 9.4 to 10) against agestart (20 to 50), one fitted line per chosen value of prevexp.]
β€’ (Also works in SPSS; in SAS, use Proc
Score)
733
Interpretation 2: P-Values and CIs
β€’ Second way
– Newer, rarely done
β€’ Calculate CIs of the slope
– At any point
β€’ Calculate p-value
– At any point
β€’ Give ranges of significance
734
What do you need?
β€’ The variance and covariance of the
estimates
– SPSS doesn’t provide estimates for
intercept
– Need to do it manually
β€’ In options, exclude intercept
– Create intercept – c = 1
– Use it in the regression
735
• Enter the information into the web page:
  www.people.ku.edu/~preacher/interact/mlr2.htm
• Get the results
• Calculations are in Bauer and Curran (in press: Multivariate Behavioral Research)
  – Paper B13
736
[Figure: "MLR 2-Way Interaction Plot" – Y (about 4.0 to 4.5) against X (−1.0 to 1.0), with separate lines for CVz1(1), CVz1(2) and CVz1(3).]
737
Areas of Significance
[Figure: the simple slope for experience (−4 to 4) with confidence bands; where the bands exclude zero the slope is significant.]
738
β€’ 2 complications
– 1: Constant differed
– 2: outcome was logged, hence non-linear
β€’ effect of 1 unit depends on where the unit is
– See paper A2
739
Finally …
740
Unlimited Moderators
β€’ Moderator effects are not limited to
– 2 variables
– linear effects
741
Three Interacting Variables
β€’ Age, Sex, Exp
β€’ Block 1
– Age, Sex, Exp
β€’ Block 2
– Age x Sex, Age x Exp, Sex x Exp
β€’ Block 3
– Age x Sex x Exp
742
β€’ Results
– All two way interactions significant
– Three way not significant
– Effect of Age depends on sex
– Effect of experience depends on sex
– Size of the age x experience interaction
does not depend on sex (phew!)
743
Moderated Non-Linear
Relationships
β€’ Enter non-linear effect
β€’ Enter non-linear effect x moderator
– if significant indicates degree of nonlinearity differs by moderator
744
745
Lesson 13: Longitudinal Models
746
Advantages of Longitudinal
Data
β€’ You get more data from the same
number of people
β€’ You can test causal relationships
– Although you can’t rule them out
β€’ You can examine change
β€’ You can control for individual
differences
747
Disadvantage of Longitudinal
Data
β€’ It’s much harder to analyze
748
Longitudinal Research
β€’ For comparing
repeated measures
– Clusters are people
β€’ Data are usually
short and fat
ID   V1   V2   V3   V4
 1    2    3    4    7
 2    3    6    8    4
 3    2    5    7    5
749
Converting Data
β€’ Change data to tall
and thin
β€’ Use reshape in
stata
β€’ Use Data,
Restructure in
SPSS
β€’ Clusters are ID
ID   V   X
 1   1   2
 1   2   3
 1   3   4
 1   4   7
 2   1   3
 2   2   6
 2   3   8
 2   4   4
 3   1   2
 3   2   5
 3   3   7
 3   4   5
750
Predict Salary Change
β€’ Use exercise5.3-bank salary.dta
– Compare beginning salary and salary
– Would normally use paired samples t-test
β€’ Difference = $17,403, 95% CIs
$16,427.407, $18,379.555
751
Predict Salary Change
β€’ Don’t take the difference in salary and
salbegin
– Why not?
β€’ reg salary agestart salbegin
β€’ Est: -207.8, 95% CIs -267.4, -148.2
752
Restructure the Data
β€’ gen id = _n
β€’ rename salbegin sal1
β€’ rename salary sal2
β€’ reshape long sal, i(id) j(t)
β€’ replace t = t-1
753
Restructure the Data
β€’ Do it again
– With data tall and thin
β€’ Do a regression
– What do we find?
ID   Time     Cash
 1    0     $18,750
 1    1     $21,450
 2    0     $12,000
 2    1     $21,900
 3    0     $13,200
 3    1     $45,000
754
Results
β€’ We have violated the independence
assumption
β€’ We have the wrong answer
β€’ Simplest way to solve it:
β€’ regress sal t, cluster(id)
β€’ Assumes that ID is just an irritant
– Rather inflexible
755
However …
β€’ That has one advantage
– Missing data doesn’t mean that we exclude
the case
– If data are missing at random (or missing
completely at random) estimates will be
unbiased
756
β€’ If everyone has
– Score at time 1
– Score at time 2
β€’ Analysis is easy
β€’ If half the people have
– Score at time 1
– Score at time 2
β€’ Analysis is easy
β€’ But what if some have
– Time 1
– Time 2
– Time 1 and 2
757
• If data are
  – Missing at random (MAR)
  – Missing completely at random (MCAR)
  – (Crappy names)
• No problem

  T1:  10  15   7   6
  T2:   8   2   5   8
Interesting …
β€’ That wasn’t very interesting
– What is more interesting is when:
β€’ We have missing data
– Which we won’t talk about more (much)
β€’ We have multiple measurements of the same
people
– Which we will talk about
759
Modelling Change
β€’ Can plot and assess trajectories over
time
β€’ How do people change?
β€’ What predicts the rate of change?
760
Plotting Individuals
[Figure: salary at T1 and T2 for Person 1.]
761
Plotting Individuals
[Figure: salary at T1 and T2 for Persons 1, 2 and 3, each with their own line.]
762
[Figure: a trellis of individual salary trajectories – salary (0 to 150,000) against t (0 to 1), one small panel per id ("Graphs by id") – followed by a single plot overlaying all the individual trajectories.]
Estimation
β€’ Each individual has an intercept
– Sampled from the population of intercepts
β€’ Each individual has a slope
– Sampled from the population of slopes
β€’ Can we estimate the average of each
– And a measure of their variance?
β€’ Yes! With multilevel models
765
Multilevel Models
β€’ Can do all kinds of clever things
– We won’t worry about most of them
β€’ Used when
– Level 1 units (measures)
– Are nested within
– Level 2 units (people)
• Same person measured twice
  – violates independence
766
Levels
β€’ In regression
– Everything is at one level
β€’ In multilevel models
– We have multiple levels
β€’ Hierarchical levels (hence hierarchical linear
models)
β€’ Random effects (random effects models)
β€’ Mixed effects (mixed models)
767
Levels
β€’ Level 1 units
– First level of measurement
– Are clustered within
β€’ Level 2 units
– Second level of measurement
768
Some Equations
β€’ (This is very hard. It’s not important).
β€’ In regression
  yi = b0 + b1xi1 + ei
β€’ If x is time
– And we have one person
– Reference time with I
– And call it T
769
• Single person equation
  yi = b0 + b1×Ti + ei
• But what if we have lots of people?
  – We've used i for time
  – We'll use j for people
  yij = b0 + b1×Tij + eij
β€’ But everything is fixed
– We want to have some random effects
770
β€’ Let’s make intercepts random
– Everyone has their own intercept
  yij = b0j + b1×Tij + eij
β€’ Look! We added a little j
– Now it’s a multilevel model
β€’ And we need an equation for the
intercept
771
β€’ Equation for each person’s intercept
  b0j = γ00 + μ0j
β€’ Your intercept (b0j) is equal to:
– Mean intercept
    • γ00 (gamma)
  – Plus a residual (for that individual)
    • μ0j (mu)
– This is level 2 model
β€’ Level 2 residuals
β€’ i.i.d, etc
772
• So now we have
  yij = b0j + b1×Tij + eij
  b0j = γ00 + μ0j
• Or
  yij = (γ00 + μ0j) + b1×Tij + eij
773
Make Time Random
β€’ Value of the time parameter can vary
– Amongst people
– Everyone can have a different effect of
time
  yij = b0j + b1j×Tij + eij
774
• Time is random
  – Everyone has a slope parameter
  btj = γt0 + μtj
• So:
  yij = b0j + b1j×Tij + eij
  b0j = γ00 + μ0j
  b1j = γ10 + μ1j
775
• Or, combining the levels:
  yij = b0j + b1j×Tij + eij
  b0j = γ00 + μ0j
  b1j = γ10 + μ1j
  yij = (γ00 + μ0j) + (γ10 + μ1j)×Tij + eij
776
Time Invariant Covariates
β€’ Can be added at level 2
– We can predict person’s intercept
β€’ Starting point
– And rate of change
777
Employee Data
β€’ Level 1:
– Pay measures
β€’ Two of them
– Clustered within
β€’ Level 2:
– People
– Level 2 measures: age, sex, job, etc
778
Regression with Time
β€’ Do a regression analysis
– On one person
β€’ Time is the predictor
β€’ Get a regression line for that person
779
Fixed vs Random Effects
β€’ Fixed effects
– Effect is the same across all clusters
(people)
– Variation is only measurement error
β€’ Random effects
– Effect varies across people
– Additional parameter in the model
β€’ Less parsimonious
780
Fixed vs Random Effects
β€’ If an effect has variance
– It might have covariance
– With any other effects which also have
variance
β€’ More parameters
– Less parsimony
781
Covariates
β€’ Two kinds of covariates
– Time invariant
β€’ Fixed for a person
– Age when study started
– Sex
– Time variant
β€’ Can change over time
– Time
– Marital Status
782
Time Invariant
β€’ Look at effect of age
– Add age to the fixed effects
– Is that significant?
– Are random effects (still?) significant
783
Multilevel Models in Stata
β€’ Use xtmixed
– Stata 10 added xtmelogit, xtmepoisson
β€’ Continuous variables are hard enough
β€’ In SPSS, continuous variables only
784
Multilevel Models in Stata
β€’ xtmixed sal t
β€’ Does regression
β€’ Need to tell it about the people:
β€’ xtmixed sal t ||id:
785
------------------------------------------------------------------------------
         sal |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           t |   17403.48   496.7321    35.04   0.000      16429.9    18377.06
       _cons |   17016.09   610.6691    27.86   0.000      15819.2    18212.98
------------------------------------------------------------------------------

------------------------------------------------------------------------------
  Random-effects Parameters |   Estimate   Std. Err.     [95% Conf. Interval]
-----------------------------+------------------------------------------------
id: Identity                 |
                   sd(_cons) |   10875.87    449.592      10029.44    11793.73
-----------------------------+------------------------------------------------
                sd(Residual) |   7647.093   248.6285      7174.992    8150.258
------------------------------------------------------------------------------
LR test vs. linear regression: chibar2(01) = 280.88    Prob >= chibar2 = 0.0000
786
The same output, annotated:
  • the coefficient on t (17403.48) is the average slope
  • _cons (17016.09) is the average intercept
  • sd(_cons) (10875.87) is the SD of the individual intercepts
  • sd(Residual) (7647.093) is the residual SD
787
β€’ Gives random intercepts only
– Let’s look at them
β€’ predict rand_int
β€’ xtline rand_int , overlay
t(t) i(id) legend(off)
788
[Figure: predicted salary trajectories (0–80,000) against t, one parallel line per person – random intercepts, common slope.]
Random Slopes
β€’ Everyone has the same slope
– Maybe that’s not true
– Make slopes random
β€’ xtmixed sal t ||id: t
790
------------------------------------------------------------------------------
         sal |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           t |   17403.48   496.7273    35.04   0.000     16429.91    18377.05
       _cons |   17016.09   361.5138    47.07   0.000     16307.53    17724.64
------------------------------------------------------------------------------

------------------------------------------------------------------------------
  Random-effects Parameters |   Estimate   Std. Err.     [95% Conf. Interval]
-----------------------------+------------------------------------------------
id: Independent              |
                       sd(t) |   10814.52   351.6071      10146.88    11526.09
                   sd(_cons) |   7870.712   255.9014      7384.801    8388.595
-----------------------------+------------------------------------------------
                sd(Residual) |    .709687   .3436148      .2747476    1.833158
------------------------------------------------------------------------------
• predict rand_int_slope, fitted
• xtline rand_int_slope, overlay t(t) i(id) legend(off)
792
[Figure: fitted salary trajectories against t, now with each person having their own intercept and slope.]
Structure of the Covariances
β€’ We have been forcing slope and
intercepts to be uncorrelated
β€’ Let’s correlate them
β€’ xtmixed sal t ||id: t ,
cov(un)
794
------------------------------------------------------------------------------
         sal |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           t |   17403.48   496.7316    35.04   0.000     16429.91    18377.06
       _cons |   17016.09   361.5085    47.07   0.000     16307.54    17724.63
------------------------------------------------------------------------------

------------------------------------------------------------------------------
  Random-effects Parameters |   Estimate   Std. Err.     [95% Conf. Interval]
-----------------------------+------------------------------------------------
id: Unstructured             |
                       sd(t) |   10103.18   11670.32      1050.081    97205.97
                   sd(_cons) |   7382.783   7985.867      886.1058    61511.25
               corr(t,_cons) |   .8550486   3.491476            -1           1
-----------------------------+------------------------------------------------
                sd(Residual) |   2727.788   21601.34      .0004956    1.50e+10
------------------------------------------------------------------------------
Predicting Change
β€’ Does another variable moderate the
effect of time
– This means that the effect of time varies
– As a function of the slope
β€’ xi: xtmixed sal i.t*agestart
||id: t
796
------------------------------------------------------------------------------
         sal |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       _It_1 |   25483.16   1640.544    15.53   0.000     22267.75    28698.57
    agestart |  -5.157022   30.85168    -0.17   0.867      -65.6252    55.31115
 _ItXagest_1 |  -212.5138   41.25203    -5.15   0.000     -293.3663   -131.6613
       _cons |   17205.18   1226.935    14.02   0.000     14800.43    19609.93
------------------------------------------------------------------------------

------------------------------------------------------------------------------
  Random-effects Parameters |   Estimate   Std. Err.     [95% Conf. Interval]
-----------------------------+------------------------------------------------
id: Unstructured             |
                       sd(t) |   9871.434   9785.792      1414.372    68896.47
                   sd(_cons) |   7437.632   6495.359      1342.987    41190.55
               corr(t,_cons) |   .8622535   2.921366            -1           1
-----------------------------+------------------------------------------------
                sd(Residual) |   2619.971    18422.9      .0027095    2.53e+09
------------------------------------------------------------------------------
Exercises
β€’ 13.1, 13.2
798
Fixed Effects Models
β€’ A second way of looking at longitudinal
data
β€’ Multilevel (mixed) models
– Assume that intercepts are random
β€’ Fixed effects models
– Assume they are fixed
– If they are fixed they can correlate
β€’ With all other predictors
799
Fixed Effects Models
β€’ Allowing intercepts to correlate
– Has the effect of controlling for ALL time
invariant predictors
– Even those you didn’t measure
– Each person is their own control
800
Fixed Effects Models
β€’ Regression asks:
– Are people who are higher on x also higher
on y?
β€’ Fixed effects asks:
– When a person is higher on x are they also
higher on y
– Effects are within people, not between
people
801
Fixed Effects in Stata
β€’ Make data long, then
– xtreg sal t agestart, i(id)
802
------------------------------------------------------------------------------
         sal |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           t |   17403.48   496.7319    35.04   0.000     16427.41    18379.56
       _cons |   17016.09   351.2425    48.45   0.000      16325.9    17706.28
-------------+----------------------------------------------------------------
     sigma_u |  12145.928
     sigma_e |  7647.0911
         rho |  .71612838   (fraction of variance due to u_i)
------------------------------------------------------------------------------
F test that all u_i=0:   F(473, 473) = 5.05    Prob > F = 0.0000
Fixed Effects Regression
β€’ Can we look at the effect of time
invariant predictors?
β€’ xtreg sal t agestart, i(id) fe
β€’ Why not?
804
Interactions
β€’ But we can look at interactions of time
invariant predictors
β€’ xi: xtreg sal i.t*agestart,
i(id) fe
805
------------------------------------------------------------------------------
         sal |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       _It_1 |   25483.16   1640.538    15.53   0.000     22259.48    28706.84
    agestart |  (dropped)
 _ItXagest_1 |  -212.5138   41.25188    -5.15   0.000     -293.5743   -131.4533
       _cons |   17009.25   342.8103    49.62   0.000     16335.62    17682.88
-------------+----------------------------------------------------------------
     sigma_u |   12087.77
     sigma_e |  7455.6318
         rho |  .72441115   (fraction of variance due to u_i)
------------------------------------------------------------------------------
Exercises
β€’ 13.3, 13.4
807
Bonus Lesson 1: Why
Regression?
A little aside, where we look at
why regression has such a curious
name.
808
Regression
The or an act of regression; reversion;
return towards the mean; return to an
earlier stage of development, as in an
adult’s or an adolescent’s behaving like
a child
(From Latin gradi, to go)
• So why give that name to a statistical technique which is about prediction and explanation?
809
β€’ Francis Galton
– Charles Darwin’s cousin
– Studying heritability
β€’ Tall fathers have shorter sons
β€’ Short fathers have taller sons
– β€˜Filial regression toward mediocrity’
– Regression to the mean
810
β€’ Galton thought this was biological fact
– Evolutionary basis?
β€’ Then did the analysis backward
– Tall sons have shorter fathers
– Short sons have taller fathers
β€’ Regression to the mean
– Not biological fact, statistical artefact
811
Other Examples
β€’ Secrist (1933): The Triumph of Mediocrity in
Business
β€’ Second albums often tend to not be as good
as first
β€’ Sequel to a film is not as good as the first
one
β€’ Sports Illustrated Cover Jinx
β€’ Parents think that punishing bad behaviour
works, but rewarding good behaviour doesn’t
812
β€’ Accident reduction schemes
– Always reduce accidents
β€’ Poor radiologists improve after training
β€’ Any treatment for a cold will work
– Or for most illnesses
β€’ Deaths due to methadone in Utah
– High last year
– Must take action!
813
Pair Link Diagram
β€’ An alternative to a scatterplot
[Figure: pair-link diagram – each case's score on x joined by a line to its score on y.]
814
r = 1.00
[Figure: pair-link diagram for r = 1.00.]
815
r = 0.00
[Figure: pair-link diagram for r = 0.00.]
816
From Regression to
Correlation
β€’ Where do we predict an individual’s
score on y will be, based on their score
on x?
– Depends on the correlation
β€’ r = 1.00 – we know exactly where they
will be
β€’ r = 0.00 – we have no idea
β€’ r = 0.50 – we have some idea
817
r = 1.00
[Figure: a score starts here on x and will end up exactly here on y.]
818
r = 0.00
[Figure: a score starts here on x and could end up anywhere on y.]
819
r = 0.50
[Figure: a score starts here on x and will probably end up somewhere around here on y.]
820
Galton Squeeze Diagram
β€’ Don’t show individuals
– Show groups of individuals, from the same
(or similar) starting point
– Shows regression to the mean
821
r = 0.00
[Galton squeeze diagram: groups starting at different points on x all end at the mean of y.]
822
r = 0.50
[Galton squeeze diagram for r = 0.50.]
823
r = 1.00
[Galton squeeze diagram for r = 1.00.]
824
[Figure: a group starting 1 unit from the mean on x ends up r units from the mean on y.]
β€’ Correlation is amount of regression that
doesn’t occur
825
• No regression: r = 1.00
[Figure]
826
• Some regression: r = 0.50
[Figure]
827
• Lots (maximum) regression: r = 0.00
[Figure]
828
Formula
  ẑy = rxy × zx
829
Conclusion
β€’ Regression towards mean is statistical necessity
regression = perfection – correlation
β€’ Very non-intuitive
β€’ Interest in regression and correlation
– From examining the extent of regression towards
mean
– By Pearson – worked with Galton
– Stuck with curious name
β€’ See also Paper B3
830
β€’ Correcting for regression to the mean
– Possible
– Makes lots of tricky assumptions
β€’ To appear to do well in your job / life
– Do something after someone has failed
– You probably can’t do worse
– Hospital / school / department / class /
study / experiment
β€’ If it fails, volunteer to do it
831
Bonus Lesson 2: Other Kinds
of Regression
832
Introduction
β€’ We’ve covered a few kinds of
regression
– There are many more, for specific types of
outcomes
833
Beta Regression
β€’ Used when the outcome variable is beta
distributed
β€’ Rates and proportions
– Bounded by zero and 1
– Uniform, or strange shaped distributions
834
Cox Proportional Hazards
Regression
β€’ Type of survival model
β€’ Used for time to an event
– When the event might not occur
β€’ Developed for medical research
835
Cox Proportional Hazards
Regression
• E.g. how long does it take for a car to break down?
  – If the car crashes and is scrapped first, we'll never observe the breakdown
  – but we still want to use the information it gives us
  – Discarding the censored data point would lead to bias
836
Competing Risks Survival
β€’ Time to multiple events
– Several things are trying to kill you
– Which one succeeds
837
Data Mining Techniques
β€’ Avoid problems with stepwise
regression
– Used as alternatives to logistic
β€’ Boosted regression (Stata command:
boost)
– Semi-parametric alternative to logistic
regression
β€’ Least Angle Regression (LARS)
838
β€’ Classification trees
839
Seemingly Unrelated
Regression
β€’ Used for multiple outcomes
– With correlated error terms
β€’ Some say it should be seemingly related
β€’ Stata command: sureg
840
Instrumental Variables
Regression
β€’ Used for mediator models
– Although economists don’t call them that.
β€’ Use ivregress in Stata
841
Quantile Regression
β€’ Why do we always try to predict the
mean?
β€’ What about predicting the median? The
25th percentile?
β€’ That’s quantile regression
842
Non-Parametric Regression
β€’ (Might also include LARS/Lasso/Boost)
β€’ Don’t force any functional form on the
relationship
β€’ LO(W)ESS – locally weighted scatterplot
smoothing
– Will find any relationship
843
Robust Regression
β€’ (Careful, not sandwich estimators)
β€’ Trimmed of outliers, estimated with
bootstrap
β€’ Lots of publications by Wilcox
844
Censored Regression
β€’ Tobit regression
– When a measure is censored
– E.g. unemployed people work 0 hours.
845