Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Lasso (statistics) wikipedia , lookup
Forecasting wikipedia , lookup
Time series wikipedia , lookup
Interaction (statistics) wikipedia , lookup
Choice modelling wikipedia , lookup
Instrumental variables estimation wikipedia , lookup
Linear regression wikipedia , lookup
Biostat 200 Lecture 10 1 Simple linear regression • Population regression equation μy|x = α + x • α and are constants and are called the coefficients of the equation • α is the y-intercept and which is the mean value of Y when X=0, which is μy|0 • The slope is the change in the mean value of y that corresponds to a one-unit increase in x • E.g. X=3 vs. X=2 μy|3 - μy|2 = Pagano and Gauvreau, Chapter 18 (α + *3 ) – (α + *2) = 2 Simple linear regression • The linear regression equation is y = α + x + ε • The error, ε, is the distance a sample value y has from the population regression line y = α + x + ε μy|x = α + x so y- μy|x = ε Pagano and Gauvreau, Chapter 18 3 Simple linear regression • Assumptions of linear regression – X’s are measured without error • Violations of this cause the coefficients to attenuate toward zero – For each value of x, the y’s are normally distributed with mean μy|x and standard deviation σy|x – μy|x = α + βx – Homoscedasticity – the standard deviation of y at each value of X is constant; σy|x the same for all values of X • The opposite of homoscedasticity is heteroscedasticity • This is similar to the equal variance issue that we saw in ttests and ANOVA – All the yi ‘s are independent (i.e. you couldn’t guess the y value for one person (or observation) based on the outcome of another) • Note that we do not need the X’s to be normally distributed, just the Y’s at each value of X Pagano and Gauvreau, Chapter 18 4 Simple linear regression • The regression line equation is yˆ ˆ ˆx • The “best” line is the one that finds the α and β that minimize the sum of the squared residuals Σei2 (hence the name “least squares”) • We are minimizing the sum of the squares of the residuals n n 2 2 ˆ e ( y y ) i i i i 1 i 1 n [ yi (ˆ ˆxi )]2 i 1 Pagano and Gauvreau, Chapter 18 5 Simple linear regression example: Regression of age on FEV FEV= α̂ + β̂ age regress yvar xvar . regress fev age Source | SS df MS -------------+-----------------------------Model | 280.919154 1 280.919154 Residual | 210.000679 652 .322086931 -------------+-----------------------------Total | 490.919833 653 .751791475 Number of obs F( 1, 652) Prob > F R-squared Adj R-squared Root MSE = = = = = = 654 872.18 0.0000 0.5722 0.5716 .56753 -----------------------------------------------------------------------------fev | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------age | .222041 .0075185 29.53 0.000 .2072777 .2368043 _cons | .4316481 .0778954 5.54 0.000 .278692 .5846042 ------------------------------------------------------------------------------ β̂ ̂ = Coef for age α̂ = _cons (short for constant) 6 model sum of squares MSS i 1 ( yˆi y )2 n regress fev age Source | SS df MS -------------+-----------------------------Model | 280.919154 1 280.919154 Residual | 210.000679 652 .322086931 -------------+-----------------------------Total | 490.919833 653 .751791475 Number of obs F( 1, 652) Prob > F R-squared Adj R-squared Root MSE = = = = = = 654 872.18 0.0000 0.5722 0.5716 .56753 =.75652 -----------------------------------------------------------------------------fev | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------age | .222041 .0075185 29.53 0.000 .2072777 .2368043 _cons | .4316481 .0778954 5.54 0.000 .278692 .5846042 ------------------------------------------------------------------------------ residual sum of squares RSS i 1 ( yi yˆi )2 n total sum of squares TSS MSS RSS i 1 ( yi y ) 2 n 7 Pagano and Gauvreau, Chapter 18 Inference for regression coefficients • We can use these to test the null hypothesis H0: = 0 ˆ 0 • The test statistic for this is t ˆ ˆ se ( ) • And it follows the t distribution with n-2 degrees of freedom under the null hypothesis • 95% confidence intervals for ( β̂ - tn-2,.025se(β̂) , β̂ + tn-2,.025se(β̂) ) 8 Inference for predicted values • We might want to estimate the mean value of y at a particular value of x • E.g. what is the mean FEV for children who are 10 years old? ŷ = .432 + .222*x = .432 + .222*10 = 2.643 liters 9 Inference for predicted values • We can construct a 95% confidence interval for the estimated mean • ( ŷ - tn-2,.025se(ŷ) , ŷ + tn-2,.025se(ŷ) ) where 2 seˆ( yˆ ) s y|x 1 (x x) n n ( xi x )2 i 1 2 ˆ ( y y ) i1 i n where s y|x n2 RSS n2 • Note what happens to the terms in the square root when n is large 10 • Stata will calculate the fitted regression values and the standard errors – regress fev age – predict fev_pred, xb -> predicted mean values (ŷ) – predict fev_predse, stdp -> se of ŷ values New variable names that I made up 11 . list fev age fev_pred fev_predse 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15. +-----------------------------------+ | fev age fev_pred fev_pr~e | |-----------------------------------| | 1.708 9 2.430017 .0232702 | | 1.724 8 2.207976 .0265199 | | 1.72 7 1.985935 .0312756 | | 1.558 9 2.430017 .0232702 | | 1.895 9 2.430017 .0232702 | |-----------------------------------| | 2.336 8 2.207976 .0265199 | | 1.919 6 1.763894 .0369605 | | 1.415 6 1.763894 .0369605 | | 1.987 8 2.207976 .0265199 | | 1.942 9 2.430017 .0232702 | |-----------------------------------| | 1.602 6 1.763894 .0369605 | | 1.735 8 2.207976 .0265199 | | 2.193 8 2.207976 .0265199 | | 2.118 8 2.207976 .0265199 | | 2.258 8 2.207976 .0265199 | 336. | 3.147 337. | 2.52 338. | 2.292 13 10 10 3.318181 2.652058 2.652058 .0320131 | .0221981 | .0221981 | 12 1 2 3 4 5 6 95% CI for the predicted means for each age 0 5 10 age 15 Note that the Cis get wider as you get farther from x̅ ; but here n is large so the CI is still 20 very narrow twoway (scatter fev age) (lfitci fev age, ciplot(rline) blcolor(black)), legend(off) title(95% CI for the predicted means for each age ) 13 1 2 3 4 5 95% CI for the predicted means for each age n=10 5 10 15 20 age The 95% confidence intervals get much wider with a small sample size 14 Prediction intervals • The intervals we just made were for means of y at particular values of x • What if we want to predict the FEV value for an individual child at age 10? • Same thing – plug into the regression equation: ŷ =.432 + .222*10 = 2.643 liters • But the standard error of ỹ is not the same as the standard error of ŷ 15 Prediction intervals seˆ( ~ y ) s y|x 1 ( x x )2 1 n n ( xi x ) 2 i 1 s 2y|x s 2y|x n s 2y|x ( x x ) 2 2 ( x x ) i1 i n • This differs from the se(ŷ) only by the extra variance of y in the formula • But it makes a big difference • There is much more uncertainty in predicting a future value versus predicting a mean •Stata will calculate these using predict fev_predse_ind, stdf f is for forecast 16 . list fev age fev_pred fev_predse fev_pred_ind 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15. +----------------------------------------------+ | fev age fev_pred fev~edse fev~ndse | |----------------------------------------------| | 1.708 9 2.430017 .0232702 .5680039 | | 1.724 8 2.207976 .0265199 .5681463 | | 1.72 7 1.985935 .0312756 .5683882 | | 1.558 9 2.430017 .0232702 .5680039 | | 1.895 9 2.430017 .0232702 .5680039 | |----------------------------------------------| | 2.336 8 2.207976 .0265199 .5681463 | | 1.919 6 1.763894 .0369605 .5687293 | | 1.415 6 1.763894 .0369605 .5687293 | | 1.987 8 2.207976 .0265199 .5681463 | | 1.942 9 2.430017 .0232702 .5680039 | |----------------------------------------------| | 1.602 6 1.763894 .0369605 .5687293 | | 1.735 8 2.207976 .0265199 .5681463 | | 2.193 8 2.207976 .0265199 .5681463 | | 2.118 8 2.207976 .0265199 .5681463 | | 2.258 8 2.207976 .0265199 .5681463 | 336. | 3.147 337. | 2.52 338. | 2.292 13 10 10 3.318181 2.652058 2.652058 .0320131 .0221981 .0221981 .5684292 | .567961 | .567961 | 17 4 6 95% prediction interval and CI 0 2 Note the width of the confidence intervals for the means at each x versus the width of the prediction intervals 0 5 10 age 15 20 twoway (scatter fev age) (lfitci fev age, ciplot(rline) blcolor(black) ) (lfitci fev age, stdf ciplot(rline) blcolor(red) ), legend(off) title(95% prediction interval and 18 CI ) 0 2 4 6 95% prediction interval and CI n=10 5 10 15 20 age The intervals are wider farther from x̅, but that is only apparent for small n because most of the width is due to the added sy|x 19 Model fit • A summary of the model fit is the coefficient of determination, R2 R2 s 2y s 2y|x s 2y • R2 represents the portion of the variability that is removed by performing the regression on X • R2 is calculated from the regression with MSS/TSS • The F statistic compares the model fit to the residual variance • When there is only one independent variable in the model, the F statistic is equal to the square of the tstat for 20 MSS i 1 ( yˆi y )2 n MSS Fstatistic( MSS df, RSS df) MSS df RSS RSS df regress fev age Source | SS df MS -------------+-----------------------------Model | 280.919154 1 280.919154 Residual | 210.000679 652 .322086931 -------------+-----------------------------Total | 490.919833 653 .751791475 Number of obs F( 1, 652) Prob > F R-squared Adj R-squared Root MSE = = = = = = 654 872.18 0.0000 0.5722 0.5716 .56753 =.75652 -----------------------------------------------------------------------------fev | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------age | .222041 .0075185 29.53 0.000 .2072777 .2368043 _cons | .4316481 .0778954 5.54 0.000 .278692 .5846042 ------------------------------------------------------------------------------ RSS i 1 ( yi yˆi )2 n R2 TSS i 1 ( yi y )2 n TSS RSS MSS TSS TSS 21 Pagano and Gauvreau, Chapter 18 Model fit -- Residuals • Residuals are the difference between the observed y values and the regression line for each value of x • yi-ŷi • If all the points lie along a straight line, the residuals are all 0 • If there is a lot of variability at each level of x, the residuals are large • The sum of the squared residuals is what was minimized in the least squares method of fitting the line 22 1 2 3 4 5 6 FEV versus age 0 5 10 age 15 20 23 Residuals • We examine the residuals using scatter plots • We plot the fitted values ŷi on the x-axis and the residuals yi-ŷi on the y-axis • We use the fitted values because they have the effect of the independent variable removed • To calculate the residuals and the fitted values Stata: regress fev age predict fev_res, r predict fev_pred, xb *** the residuals *** the fitted values 24 2 1 0 -1 -2 Residuals Fitted values versus residuals for regression of FEV on age 1 2 3 Linear prediction 4 scatter fev_res fev_pred, title(Fitted values versus residuals for regression of FEV on age) 5 25 • This plot shows that as the fitted value of FEV increases, the spread of the residuals increase – this suggests heteroscedasticity • We had a hint of this when looking at the box plots of FEV by age groups in the previous lecture 26 1 2 3 fev 4 5 6 FEV by age 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 graph box fev, over(age) title(FEV by age) 27 Transformations • One way to deal with this is to transform either x or y or both • A common transformation is the log transformation • Log transformations bring large values closer to the rest of the data 28 Log function refresher • Log10 – – – – – Log10(x) = y means that x=10y So if x=1000 log10(x) = 3 because 1000=103 Log10(103) = 2.01 because 103=102.01 Log10(1)=0 because 100 =1 Log10(0)=-∞ because 10-∞ =0 • Loge or ln – – – – – e is a constant approximately equal to 2.718281828 ln(1) = 0 because e0 =1 ln(e) = 1 because e1 =e ln(103) = 4.63 because 103=e4.63 Ln(0)=-∞ because e-∞ =0 29 Log transformations Value 0 0.001 0.05 1 5 10 50 103 Ln -∞ -6.91 -3.00 0.00 1.61 2.30 3.91 4.63 Log10 -∞ -3.00 -1.30 0.00 0.70 1.00 1.70 2.01 • Be careful of log(0) or ln(0) • Be sure you know which log base your computer program is using • In Stata use log10() and ln() (log() will give you ln() 30 • Let’s try transforming FEV to ln(FEV) . gen fev_ln=log(fev) . summ fev fev_ln Variable | Obs Mean Std. Dev. Min Max -------------+-------------------------------------------------------fev | 654 2.63678 .8670591 .791 5.793 fev_ln | 654 .915437 .3332652 -.2344573 1.75665 • Run the regression of ln(FEV) on age and examine the residuals regress fev_ln age predict fevln_pred, xb predict fevln_res, r scatter fevln_res fevln_pred, title(Fitted values versus residuals for regression of lnFEV on age) 31 32 33 Interpretation of regression coefficients for transformed y value • Now the regression equation is: ln(FEV) = ̂ + ̂ age = 0.051 + 0.087 age • So a one year change in age corresponds to a .087 change in ln(FEV) • The change is on a multiplicative scale, so if you exponentiate, you get a percent change in y • e0.087 = 1.09 – so a one year change in age corresponds to a 9% increase in FEV 34 • Note that heteroscedasticity does not bias your estimates of the parameters, it only reduces the precision of your estimates • There are methods to correct the standard errors for heteroscedasticity other than transformations 35 Now using height • Residual plots also allow you to look at the linearity of your data • Construct a scatter plot of FEV by height • Run a regression of FEV on height • Construct a plot of the residuals vs. the fitted values 36 37 38 39 Residuals using ht2 as the independent variable 0 -1 -2 Residuals 1 2 Residual plot for regression of FEV on ht squared 1 2 3 Linear prediction Regression equation FEV=+ *ht2 + 4 5 40 Residuals using ln(ht) as the dependent variable -1 -.5 Residuals 0 .5 Fitted values versus residuals for regression of lnFEV on ht 0 .5 1 1.5 Linear prediction Regression equation lnFEV=+ *ht + 41 Categorical independent variables • We previously noted that the independent variable (the X variable) does not need to be normally distributed • In fact, this variable can be categorical • Dichotomous variables in regression models are coded as 1 to represent the level of interest and 0 to represent the comparison group. These 0-1 variables are called indicator or dummy variables. • The regression model is the same • The interpretation of ̂ is the change in y that corresponds to being in the group of interest vs. not 42 Categorical independent variables • • • • • Example sex: female xsex=1, for male xsex =0 Regression of FEV and sex fêv = ̂ + ̂ xsex For male: fêvmale = ̂ For female: fêvfemale = ̂ + ̂ So fêvfemale - fêvmale = ̂ + ̂ - ̂ = ̂ 43 • Using the FEV data, run the regression with FEV as the dependent variable and sex as the independent variable • What is the estimate for beta? How is it interpreted? • What is the estimate for alpha? How is it interpreted? • What hypothesis is tested where it says P>|t|? • What is the result of this test? • How much of the variance in FEV is explained by sex? 44 45 Categorical independent variable • Remember that the regression equation is μy|x = α + x • The only variables x can take are 0 and 1 • μy|0 = α μy|1 = α + • So the estimated mean FEV for males is ̂ and the estimated mean FEV for females is ̂ + ̂ • When we conduct the hypothesis test of the null hypothesis =0 what are we testing? • What other test have we learned that tests the same thing? Run that test. 46 47 Categorical independent variables • In general, you need k-1 dummy or indicator variables (0-1) for a categorical variable with k levels • One level is chosen as the reference value • Indicator variables are set to one for each category for only one of the dummy variables, they are set to 0 otherwise 48 Categorical independent variables • E.g. Alcohol = None, Moderate, Hazardous • If Alcohol=non is set as reference category, dummy variables look like: xModerate xHazardous None 0 0 Moderate 1 0 Hazardous 0 1 49 Categorical independent variables • Then the regression equation is: y = + 1 xmoderate + 2 xHazardous + ε • For Alcohol consumption=None ŷ = ̂ +v ̂10+ ̂20 = ̂ • For Alcohol consumption=Moderate ŷ = ̂ + ̂11 + ̂20 = ̂ + ̂1 • For Alcohol consumption=Hazardous ŷ = ̂ + ̂10 + ̂21 = ̂ + ̂2 50 • You actually don’t have to make the dummy variables yourself (when I was a girl we did have to do) • All you have to do is tell Stata that a variable is categorical using i. before a variable name • Run the regression equation for the regression of BMI regressed on race group (using the class data set) regress bmi i.auditc_cat 51 52 • What is the estimated mean BMI for alcohol consumption = None? • What is the estimated mean BMI for alcohol consumption = Hazardous? • What do the estimated betas signify? • What other test looks at the same thing? Run that test. 53 54 • A new Stata trick allows you to specify the reference group with the prefix b# where # is the number value of the group that you want to be the reference group. • Try out regress bmi b2.auditc_cat • Now the reference category is auditc_cat=2 which is the hazardous alcohol group • Interpret that parameter estimates • Note if other output is changed 55 56 Multiple regression • Additional explanatory variables might add to our understanding of a dependent variable • We can posit the population equation μy|x1,x2,...,xq = α + 1x1 + 2x2 + ... + qxq • α is the mean of y when all the explanatory variables are 0 • i is the change in the mean value of y the corresponds to a 1 unit change in xi when all the other explanatory variables are held constant 57 • Because there is natural variation in the response variable, the model we fit is y = α + 1x1 + 2x2 + ... + qxq + • Assumptions – x1,x2,...,xq are measured without error – The distribution of y is normal with mean μy|x1,x2,...,xq and standard deviation σy|x1,x2,...,xq – The population regression model holds – For any set of values of the explanatory variables, x1,x2,...,xq , σy|x1,x2,...,xq is constant – homoscedasticity – The y outcomes are independent 58 Multiple regression – Least Squares • We estimate the regression line ŷ = α̂ + β̂1x1 + β̂2x2 + ... + β̂qxq using the method of least squares to minimize n n 2 ˆ e ( y y ) i i i 1 2 i i 1 n [ yi (ˆ ˆ1 x1i ˆ2 x2i ... ˆq xqi )]2 i 1 59 Multiple regression • For one predictor variable – the regression model represents a straight line through a cloud of points -- in 2 dimensions • With 2 explanatory variables, the model is a plane in 3 dimensional space (one for each variable) • etc. • In Stata we just add explanatory variables to the regress statement • Try regress fev age ht 60 61 • We can test hypotheses about individual slopes • The null hypothesis is H0: i = i0 assuming that the values of the other explanatory variables are held constant ˆ • The test statistic t seˆ( ˆ ) follows a t distribution with n-q-1 degrees of freedom i i0 i 62 . regress fev age ht Source | SS df MS -------------+-----------------------------Model | 376.244941 2 188.122471 Residual | 114.674892 651 .176151908 -------------+-----------------------------Total | 490.919833 653 .751791475 Number of obs F( 2, 651) Prob > F R-squared Adj R-squared Root MSE = 654 = 1067.96 = 0.0000 = 0.7664 = 0.7657 = .4197 -----------------------------------------------------------------------------fev | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------age | .0542807 .0091061 5.96 0.000 .0363998 .0721616 ht | .1097118 .0047162 23.26 0.000 .100451 .1189726 _cons | -4.610466 .2242706 -20.56 0.000 -5.050847 -4.170085 ------------------------------------------------------------------------------ •Now the F-test has 2 degrees of freedom in the numerator because there are 2 explanatory variables •R2 will always increase as you add more variables into the model •The Adj R-squared accounts for the addition of variables and is comparable across models with different numbers of parameters •Note that the beta for age decreased 63 Examine the residuals… 0 -1 -2 Residuals 1 2 Residuals versus fitted for regression of FEV on age and ht 1 2 3 Linear prediction 4 5 64 0 -.2 -.4 -.6 Residuals .2 .4 Residuals versus fitted for regn of LN FEV on age and ht 0 .5 1 1.5 Linear prediction 65 For next time • Read Pagano and Gauvreau – Pagano and Gauvreau Chapters 18-19 (review) – Pagano and Gauvreau Chapter 20