Transcript

Linear regression
Brian Healy, PhD
BIO203

Previous classes
– Hypothesis testing (parametric and nonparametric)
– Correlation

What are we doing today?
– Linear regression: continuous outcome with a continuous, dichotomous, or categorical predictor
– Equation: E(Y | X = x) = β0 + β1*x
– Interpretation of the coefficients
– Connection between regression and correlation, the t-test, and ANOVA

Big picture
Linear regression is the most commonly used statistical technique. It allows the comparison of dichotomous, categorical, and continuous predictors with a continuous outcome. Extensions of linear regression allow
– Dichotomous outcomes: logistic regression
– Survival analysis: Cox proportional hazards regression
– Repeated measures
Amazingly, many of the analyses we have learned can be completed using linear regression.

Example
Yesterday, we investigated the association between age and BPF using a correlation coefficient. Can we fit a line to this data?
[Figure: scatter plot of BPF (0.75–0.95) vs. age (20–60)]

Quick math review
As you remember from high school math, the basic equation of a line is y = mx + b, where m is the slope and b is the y-intercept. One definition of m is that for every one-unit increase in x, there is an m-unit increase in y. One definition of b is the value of y when x is equal to zero.
[Figure: plot of the line y = 1.5x + 4]

Picture
Look at the data in this picture. Does there seem to be a correlation (linear relationship) in the data? Is the data perfectly linear? Could we fit a line to this data?
[Figure: scatter plot of roughly linear data]

What is linear regression?
Linear regression tries to find the best line (curve) to fit the data. The method of finding the best line (curve) is least squares, which minimizes the sum of the squared vertical distances from the line to each of the points.
[Figure: scatter plot with the fitted line y = 1.5x + 4]

How do we find the best line?
Let's look at three candidate lines. Which do you think is the best? What is a way to determine the best line to use?
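One concrete way to compare candidate lines is to compute each line's sum of squared vertical distances to the data. A minimal sketch (the data points and the two alternative candidate lines here are made up for illustration; only y = 1.5x + 4 comes from the slides):

```python
# Sketch: comparing candidate lines by their sum of squared
# vertical distances (residuals). The (x, y) data are made up.
xs = [1, 2, 4, 5, 7, 8, 10]
ys = [6.1, 7.2, 9.8, 11.9, 14.6, 15.8, 19.3]

def sum_sq_resid(slope, intercept):
    """Sum of squared vertical distances from the line to each point."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))

# Three candidate lines; least squares picks the one minimizing this sum.
candidates = {"y = 1.5x + 4": (1.5, 4),
              "y = 1.0x + 6": (1.0, 6),
              "y = 2.0x + 2": (2.0, 2)}
best = min(candidates, key=lambda name: sum_sq_resid(*candidates[name]))
print(best)  # among these three, y = 1.5x + 4 fits this data best
```

This is exactly the criterion the next slides formalize: the least squares line is the one whose sum of squared residuals is smallest over all possible slopes and intercepts.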
Residuals
The actual observations, yi, may be slightly off the population line because of variability in the population. The equation is

  y_i = \beta_0 + \beta_1 x_i + e_i

where ei is the deviation from the population line (see picture). This is called the residual. This is the distance from the line for patient 1, e1.

Least squares
The method employed to find the best line is called least squares. This method finds the values of the β's that minimize the squared vertical distance from the line to each of the points. This is the same as minimizing the sum of the ei²:

  \sum_{i=1}^n e_i^2 = \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2

Estimates of regression coefficients
Once we have solved the least squares equation, we obtain estimates for the β's, which we refer to as β̂0 and β̂1:

  \hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}

The final least squares equation is

  \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1

where ŷ (y-hat) is the mean value of y for a value of x1.

Assumptions of linear regression
– Linearity: linear relationship between outcome and predictors. E(Y | X = x) = β0 + β1*x1 + β2*x2² is still a linear regression equation because each of the β's is to the first power.
– Normality of the residuals: the residuals, ei, are normally distributed, N(0, σ²).
– Homoscedasticity of the residuals: the residuals, ei, have the same variance.
– Independence: all of the data points are independent. Correlated data points can be taken into account using multivariate and longitudinal data methods.

Linearity assumption
One of the assumptions of linear regression is that the relationship between the predictors and the outcome is linear. We call this the population regression line:

  E(Y \mid X = x) = \mu_{y|x} = \beta_0 + \beta_1 x

This equation says that the mean of y given a specific value of x is defined by the coefficients. The coefficients act exactly like the slope and y-intercept from the simple equation of a line from before.

Normality and homoscedasticity assumption
Two other assumptions of linear regression are related to the ei's:
– Normality: the distribution of the residuals is normal.
– Homoscedasticity: the variance of y given x is the same for all values of x. The distribution of y-values at each value of x is normal with the same variance.

Example
Here is a regression equation for the comparison of age and BPF:

  BPF_i = \beta_0 + \beta_1 \, age_i + e_i, \qquad E(BPF \mid age) = \beta_0 + \beta_1 \, age

[Figure: scatter plot of BPF vs. age]

Results
The estimated regression equation is

  \widehat{BPF} = 0.957 - 0.0029 \cdot age

[Figure: scatter plot of BPF vs. age with fitted values]

. regress bpf age

              Coef.       Std. Err.    t       P>|t|    [95% Conf. Interval]
  age        -.0028799    .0007845    -3.67    0.001    -.0044895   -.0012704
  _cons       .957443     .035037     27.33    0.000     .885553     1.029333

  Number of obs = 29; F(1, 27) = 13.48; Prob > F = 0.0010; R-squared = 0.3330; Adj R-squared = 0.3083; Root MSE = .04061

(The age row is the estimated slope; the _cons row is the estimated intercept.)

Interpretation of regression coefficients
The final regression equation is BPF-hat = 0.957 − 0.0029*age. The coefficients mean
– the estimate of the mean BPF for a patient with an age of 0 is 0.957 (β̂0)
– an increase of one year in age leads to an estimated decrease of 0.0029 in mean BPF (β̂1)

Unanswered questions
Is the estimate of β1 (β̂1) significantly different from zero? In other words, is there a significant relationship between the predictor and the outcome? Have the assumptions of regression been met?

Estimate of variance for the β̂'s
In order to determine if there is a significant association, we need an estimate of the variance of β̂0 and β̂1:

  \hat{se}(\hat{\beta}_0) = s_{y|x} \sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^n (x_i - \bar{x})^2}}, \qquad \hat{se}(\hat{\beta}_1) = \frac{s_{y|x}}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2}}

where s_{y|x} is the residual standard deviation of y after accounting for x (the standard deviation from regression, or root mean square error).

Test statistic
For both regression coefficients, we use a t-statistic to test any specific hypothesis. Each has n − 2 degrees of freedom (the sample size minus the number of parameters estimated). What is the usual null hypothesis for β1?
  t = \frac{\hat{\beta}_0 - \beta_0}{\hat{se}(\hat{\beta}_0)}, \qquad t = \frac{\hat{\beta}_1 - \beta_1}{\hat{se}(\hat{\beta}_1)}

Hypothesis test
1) H0: β1 = 0
2) Continuous outcome, continuous predictor
3) Linear regression
4) Test statistic: t = −3.67 (27 dof)
5) p-value = 0.0011
6) Since the p-value is less than 0.05, we reject the null hypothesis
7) We conclude that there is a significant association between age and BPF

. regress bpf age
(Output as shown above: the age row gives the estimated slope, its t = −3.67, and the p-value for the slope, 0.001; the _cons row gives the estimated intercept.)

Comparison to correlation
In this example, we found a relationship between age and BPF. We also investigated this relationship using correlation. We get the same p-value!! Our conclusion is exactly the same!! There are other relationships we will see later.

  Method              p-value
  Correlation         0.0010
  Linear regression   0.0010

Confidence interval for β1
As we have done previously, we can construct a confidence interval for the regression coefficients. Since we are using a t-distribution, we do not automatically use 1.96.
Rather, we use the cut-off from the t-distribution:

  \left( \hat{\beta}_1 - t_{\alpha/2,\,dof} \cdot \hat{se}(\hat{\beta}_1),\ \hat{\beta}_1 + t_{\alpha/2,\,dof} \cdot \hat{se}(\hat{\beta}_1) \right)

The interpretation of the confidence interval is the same as we have seen previously.

Intercept
STATA also provides a test statistic and p-value for the estimate of the intercept. This is for H0: β0 = 0, which is often not a hypothesis of interest because it corresponds to testing whether the BPF is equal to zero at age 0. Since BPF can't be 0 at age 0, this test is not really of interest. We can center covariates to make this test important.

Prediction
Beyond determining if there is a significant association, linear regression can also be used to make predictions. Using the regression equation, we can predict the BPF for patients with specific age values.
– Ex. A patient with age = 40:

  \widehat{BPF} = 0.957 - 0.0029 \cdot 40 = 0.841

The expected BPF for a patient of age 40 based on our experiment is 0.841.

Extrapolation
Can we predict the BPF for a patient with age 80? What assumption would we be making?
[Figure: scatter plot of BPF vs. age with fitted values]

Confidence interval for prediction
We can place a confidence interval around our predicted mean value. This corresponds to the plausible values for the mean BPF at a specific age. To calculate a confidence interval for the predicted mean value, we need an estimate of the variability in the predicted mean:

  \hat{se}(\hat{y}) = s_{y|x} \sqrt{\frac{1}{n} + \frac{(x - \bar{x})^2}{\sum_{i=1}^n (x_i - \bar{x})^2}}

Confidence interval
Note that the standard error equation has a different magnitude based on the x value.
In particular, the magnitude is smallest when x equals the mean of x. Since the test statistic is based on the t-distribution, our confidence interval is

  \left( \hat{y} - t_{\alpha/2,\,df} \cdot \hat{se}(\hat{y}),\ \hat{y} + t_{\alpha/2,\,df} \cdot \hat{se}(\hat{y}) \right)

This confidence interval is rarely used for hypothesis testing.
[Figure: fitted line with confidence band, BPF vs. age]

Prediction interval
A confidence interval for a mean provides information regarding the accuracy of an estimated mean value for a sample size. Often, we are interested in how accurate our prediction would be for a single observation, not the mean of a group of observations. This is called a prediction interval. What would you estimate as the value for a single new observation? Do you think a prediction interval is narrower or wider?

Prediction interval
A confidence interval is always tighter than a prediction interval. The variability in the prediction of a single observation contains two types of variability:
– Variability of the estimate of the mean (confidence interval)
– Variability around the estimate of the mean (residual variability)

  \tilde{se}(\tilde{y}) = \sqrt{s_{y|x}^2 + \hat{se}(\hat{y})^2}, \qquad \left( \tilde{y} - t_{\alpha/2,\,df} \cdot \tilde{se}(\tilde{y}),\ \tilde{y} + t_{\alpha/2,\,df} \cdot \tilde{se}(\tilde{y}) \right)

[Figure: fitted line with prediction band, BPF vs. age]

Conclusions
The prediction interval is always wider than the confidence interval.
– It is common to find significant differences between groups but not be able to predict very accurately.
– To predict accurately for a single patient, we need limited overlap of the distributions. The benefit of an increased sample size decreasing the standard error does not help.

Model checking
How good is our model? Although we have found a relationship between age and BPF, linear regression also allows us to assess how well our model fits the data. R² = coefficient of determination = proportion of the variance in the outcome explained by the model. When we have only one predictor, it is the proportion of the variance in y explained by x:

  R^2 = \frac{s_y^2 - s_{y|x}^2}{s_y^2}

R²
What if all of the variability in y was explained by x? What would R² equal?
– What does this tell you about the correlation between x and y?
– What if the correlation between x and y is negative?
What if none of the variability in y is explained by x?
– What would R² equal?
– What is the correlation between x and y in this case?

r vs. R²
R² = (Pearson's correlation coefficient)² = r². Since r is between −1 and 1, R² is always less than or equal to |r|.
– r = 0.1, R² = 0.01
– r = 0.5, R² = 0.25
For our example:

  Method   Estimate
  r        −0.577
  R²        0.333

Evaluation of model
Linear regression required several assumptions:
– Linearity
– Homoscedasticity
– Normality
– Independence (usually from study design)
We must determine if the model assumptions were reasonable, or a different model may have been needed. Statistical research has investigated relaxing each of these assumptions.

Scatter plot
A good first step in any regression is to look at the x vs. y scatter plot. This allows us to see:
– Are there any outliers?
– Is the relationship between x and y approximately linear?
– Is the variance in the data approximately constant for all values of x?

Tests for the assumptions
There are several different ways to test the assumptions of linear regression: graphical and statistical. Many of the tests use the residuals, which are the distances from the fitted line to the outcomes:

  \hat{e}_i = y_i - \hat{y}_i = y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i

Residual plot
If the assumptions of linear regression are met, we will observe a random scatter of points.
[Figure: residuals vs. fitted values, random scatter]

Investigating linearity
Scatter plot of predictor vs. outcome. What do you notice here?
One way to handle this is to transform the predictor to include a quadratic or other term.
[Figure: non-linear relationship between x and y]

Aging
Research has shown that the decrease in BPF in normal people is pretty slow up until age 65, and then there is a steeper drop.
[Figure: scatter plot of BPF vs. age, 40–80]

Fitted line
What if we fit a line for this? Note how the majority of the values are above the fitted line in the middle and below the fitted line on the two ends.
[Figure: BPF vs. age with straight fitted line]

The residual plot shows a non-random scatter because the relationship is not really linear.
[Figure: residuals vs. fitted values, curved pattern]

What can we do?
If the relationship between x and y is not linear, we can try a transformation of the values. Possible transformations:
– Add a quadratic term
– Fit a spline. This is when there is one slope for a certain part of the curve and a different slope for the rest of the curve.

Adding a quadratic term
[Figure: BPF vs. age with quadratic fit]

Residual plot
[Figure: residuals vs. fitted values after adding the quadratic term]

Checking linearity
A plot of residuals vs. the predictor is also used to detect departures from linearity. These plots allow you to investigate each predictor separately, so they become important in multiple regression. If linearity holds, we anticipate a random scatter of the residuals on both types of residual plot.

Homoscedasticity
The second assumption is equal variance across the values of the predictor. The top plot shows the assumption is met, while the bottom plot shows that there is a greater amount of variance for larger fitted values.

Example
[Figure: expression level vs. lipid number]
In this example, we can fit a linear regression model assuming that there is a linear increase in expression with lipid number, but here is the residuals plot from this analysis. What is wrong?
[Figure: residuals vs. fitted values for the expression model, fanning out at larger fitted values]

Transform the y-value
Clearly, the residuals showed that we did not have equal variance. What if we log-transform our y-value?
[Figure: log expression level vs. lipid number]

New regression equation
By transforming the outcome variable we have changed our regression equation:
– Original: Expression_i = β0 + β1*lipid_i + e_i
– New: ln(Expression_i) = β0 + β1*lipid_i + e_i
What is the interpretation of β1 from the new regression model?
– For every one-unit increase in lipid number, there is a β1-unit increase in ln(Expression) on average
– The interpretation has changed due to the transformation

Residual plot
On the log scale, the assumption of equal variance appears much more reasonable.
[Figure: residuals vs. fitted values on the log scale]

Checking homoscedasticity
If we do not appear to have equal variance, a transformation of the outcome variable can be used. The most common are the log transformation and the square root transformation. Other approaches involving weighted least squares can also be used if a transformation does not work.

Normality
Regression requires that the residuals are normally distributed. To test if the residuals are normal:
– Histogram of residuals
– Normal probability plot
[Figure: histogram of residuals]
Several statistical tests for normality of residuals are also available.

What if normality does not hold?
Transformations of the outcome can often help, or we can change to another type of regression that does not require normality of the residuals:
– Logistic regression
– Poisson regression

Outliers
Investigating the residuals also provides information regarding outliers. If a value is extreme in the vertical direction, the residual will be extreme as well (you will see this in lab). If a value is extreme in the horizontal direction, this value can have too much influence (leverage); this is beyond the scope of this class.

Example
Another measure of disease burden in MS is the T2 lesion volume in the brain. Over the course of the disease, patients accumulate brain lesions that they do not recover from; this is a measure of the disease burden in the brain. Is there a significant linear relationship between T2 lesion volume and age?
[Figure: lesion volume (0–30) vs. age (20–60)]

Linear model
Our initial linear model:
– LV_i = β0 + β1*age_i + e_i
– What is the interpretation of β1?
– What is the interpretation of β0?
Using STATA, we get the following regression equation:

  \widehat{LV}_i = 3.70 + 0.062 \cdot age_i

– Is there a significant relationship between age and lesion volume?

Hypothesis test
1) H0: β1 = 0
2) Continuous outcome, continuous predictor
3) Linear regression
4) Test statistic: t = 0.99 (102 dof)
5) p-value = 0.32
6) Since the p-value is more than 0.05, we fail to reject the null hypothesis
7) We conclude that there is no significant association between age and lesion volume

. regress lv_entry age

              Coef.       Std. Err.    t       P>|t|    [95% Conf. Interval]
  age         .0623605    .0628706     0.99    0.324    -.0623429    .187064
  _cons       3.699857    2.742369     1.35    0.180    -1.739618    9.139333

  Number of obs = 104; F(1, 102) = 0.98; Prob > F = 0.3236; R-squared = 0.0096; Adj R-squared = -0.0002; Root MSE = 5.8081

(The age row is the estimated coefficient; its P>|t| column is the p-value.)

Residual plot
[Figure: residuals vs. fitted values for the lesion volume model]

Linear model
Our new linear model:
– ln(LV_i) = β0 + β1*age_i + e_i
– What is the interpretation of β1?
– What is the interpretation of β0?
Using STATA, we get the following regression equation:

  \widehat{\ln(LV_i)} = 1.36 + 0.0034 \cdot age_i

– Is there a significant relationship between age and lesion volume?

Hypothesis test
1) H0: β1 = 0
2) Continuous outcome, continuous predictor
3) Linear regression
4) Test statistic: t = 0.38 (102 dof)
5) p-value = 0.71
6) Since the p-value is more than 0.05, we fail to reject the null hypothesis
7) We conclude that there is no significant association between age and lesion volume

. regress lnlv age

              Coef.       Std. Err.    t       P>|t|    [95% Conf. Interval]
  age         .0034291    .0090613     0.38    0.706    -.014544     .0214022
  _cons       1.355875    .3952489     3.43    0.001     .5719006    2.139849

  Number of obs = 104; F(1, 102) = 0.14; Prob > F = 0.7059; R-squared = 0.0014; Adj R-squared = -0.0084; Root MSE = .8371

[Figure: residuals vs. fitted values for the log-transformed model]

Histograms of residuals
[Figure: histograms of residuals for the untransformed and transformed values]

Conclusions for model checking
Checking model assumptions for linear regression is needed to ensure inferences are correct. If you have the wrong model, your inference will be wrong as well. The majority of model checking is based on the residuals. If the model fit is bad, you should use a different model.

Dichotomous predictors

Linear regression with a dichotomous predictor
Linear regression can also be used for dichotomous predictors, like sex. To do this, we use an indicator variable, which equals 1 for male and 0 for female.
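A quick sketch of what indicator coding does (the numbers here are made up, not the course's BPF dataset): fitting least squares with a 0/1 predictor simply recovers the two group means.

```python
# Sketch: least squares with a 0/1 indicator predictor reproduces
# the group means. The data below are made up for illustration.
sex = [0, 0, 0, 0, 1, 1, 1]                 # indicator: 1 = male, 0 = female
bpf = [0.80, 0.82, 0.85, 0.81, 0.86, 0.88, 0.84]

n = len(sex)
xbar = sum(sex) / n
ybar = sum(bpf) / n
# Closed-form least squares estimates (same formulas as for a continuous x)
b1 = (sum((x - xbar) * (y - ybar) for x, y in zip(sex, bpf))
      / sum((x - xbar) ** 2 for x in sex))
b0 = ybar - b1 * xbar

mean_female = sum(y for x, y in zip(sex, bpf) if x == 0) / sex.count(0)
mean_male = sum(y for x, y in zip(sex, bpf) if x == 1) / sex.count(1)
print(round(b0, 6), round(b1, 6))
# b0 matches the female mean; b0 + b1 matches the male mean
```

This is the algebraic fact the next slides state: the intercept is the mean of the sex = 0 group, and the slope is the difference in means between the groups.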
The resulting regression equation for BPF is

  E(BPF \mid sex) = \beta_0 + \beta_1 \, sex, \qquad BPF_i = \beta_0 + \beta_1 \, sex_i + e_i

[Figure: scatter plot of BPF vs. sex (0 = female, 1 = male)]

Graph
The regression equation can be rewritten as

  BPF_{female} = \beta_0 + e_i, \qquad BPF_{male} = \beta_0 + \beta_1 + e_i

The meaning of the coefficients in this case is:
– β0 is the mean BPF when sex = 0, in the female group
– β0 + β1 is the mean BPF when sex = 1, in the male group
What is the interpretation of β1?
– For a one-unit increase in sex, there is a β1 increase in the mean BPF
– The difference in mean BPF between the males and the females

Interpretation of results
The final regression equation is

  \widehat{BPF} = 0.823 + 0.037 \cdot sex

The meaning of the coefficients in this case is:
– 0.823 is the estimate of the mean BPF in the female group
– 0.037 is the estimate of the mean increase in BPF between the females and the males
– What is the estimated mean BPF in the males?
How could we test if the difference between the groups is statistically significant?

Hypothesis test
1) H0: There is no difference based on gender (β1 = 0)
2) Continuous outcome, dichotomous predictor
3) Linear regression
4) Test statistic: t = 1.82 (27 dof)
5) p-value = 0.079
6) Since the p-value is more than 0.05, we fail to reject the null hypothesis
7) We conclude that there is no significant difference in the mean BPF in males compared to females

. regress bpf sex

              Coef.       Std. Err.    t       P>|t|    [95% Conf. Interval]
  sex         .0371364    .0203586     1.82    0.079    -.004636     .0789087
  _cons       .8228636    .0100022    82.27    0.000     .8023407    .8433865

  Number of obs = 29; F(1, 27) = 3.33; Prob > F = 0.0792; R-squared = 0.1097; Adj R-squared = 0.0767; Root MSE = .04691

(The sex row is the estimated difference between the groups; its P>|t| column is the p-value for the difference.)

Residual plot
[Figure: residuals vs. fitted values for the BPF-by-sex model]

T-test
As hopefully you remember, you could have tested this same null hypothesis using a two-sample t-test. Linear regression makes an equal variance assumption, so let's use the same assumption for our t-test.

Hypothesis test
1) H0: There is no difference based on gender
2) Continuous outcome, dichotomous predictor
3) t-test
4) Test statistic: t = −1.82 (27 dof)
5) p-value = 0.079
6) Since the p-value is more than 0.05, we fail to reject the null hypothesis
7) We conclude that there is no significant difference in the mean BPF in males compared to females

. ttest bpf, by(sex)
Two-sample t test with equal variances

  Group      Obs    Mean        Std. Err.   Std. Dev.   [95% Conf. Interval]
  0          22     .8228636    .0096717    .0453645    .8027502    .8429771
  1           7     .86         .0196457    .0519775    .8119288    .9080712
  combined   29     .8318276    .0090667    .0488255    .8132553    .8503998
  diff              -.0371364   .0203586                -.0789087   .004636

  diff = mean(0) - mean(1)                    t = -1.8241
  Ho: diff = 0                                degrees of freedom = 27
  Ha: diff < 0            Ha: diff != 0             Ha: diff > 0
  Pr(T < t) = 0.0396      Pr(|T| > |t|) = 0.0792    Pr(T > t) = 0.9604

Amazing!!!
We get the same result using both approaches!! Linear regression has the advantages of:
– Allowing multiple predictors (tomorrow)
– Accommodating continuous predictors (relationship to correlation)
– Accommodating categorical predictors (tomorrow)
It is a very flexible approach.

Conclusion
Indicator variables can be used to represent dichotomous variables in a regression equation. The interpretation of the coefficient for an indicator variable is the same as for a continuous variable: it provides a group comparison. Tomorrow we will see how to use regression to match ANOVA results.
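The regression/t-test equivalence above can be sketched numerically. This is a minimal pure-Python illustration with made-up data (not the course's BPF dataset): the t-statistic for the slope of a 0/1 predictor equals the equal-variance two-sample t-statistic.

```python
import math

# Sketch: with a 0/1 predictor, the regression t for the slope matches
# the equal-variance two-sample t. Made-up data for illustration.
group = [0, 0, 0, 0, 0, 1, 1, 1]
y = [0.80, 0.82, 0.85, 0.81, 0.83, 0.86, 0.88, 0.84]

# Regression t-statistic for the slope, with n - 2 degrees of freedom
n = len(y)
xbar = sum(group) / n
ybar = sum(y) / n
sxx = sum((x - xbar) ** 2 for x in group)
b1 = sum((x - xbar) * (yy - ybar) for x, yy in zip(group, y)) / sxx
b0 = ybar - b1 * xbar
sse = sum((yy - (b0 + b1 * x)) ** 2 for x, yy in zip(group, y))
s_yx = math.sqrt(sse / (n - 2))            # root mean square error
t_reg = b1 / (s_yx / math.sqrt(sxx))       # t for H0: beta1 = 0

# Equal-variance two-sample t-test on the same data
g0 = [yy for x, yy in zip(group, y) if x == 0]
g1 = [yy for x, yy in zip(group, y) if x == 1]
def sample_var(v):
    m = sum(v) / len(v)
    return sum((a - m) ** 2 for a in v) / (len(v) - 1)
sp2 = ((len(g0) - 1) * sample_var(g0) + (len(g1) - 1) * sample_var(g1)) / (n - 2)
t_two = ((sum(g1) / len(g1) - sum(g0) / len(g0))
         / math.sqrt(sp2 * (1 / len(g0) + 1 / len(g1))))

print(abs(t_reg - t_two) < 1e-9)  # True: the two tests agree
```

Up to sign convention (which group is subtracted from which), the two statistics are identical, so the p-values agree exactly, just as the lecture's Stata output shows (t = 1.82 vs. t = −1.8241, p = 0.079 in both).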