Topic 2 – Simple Linear Regression
KKNR Chapters 4 – 7
1
Overview
- Regression Models; Scatter Plots
- Estimation and Inference in SLR
- SAS GPLOT Procedure
- SAS REG Procedure
- ANOVA Table & Coefficient of Determination (R²)
2
Simple Linear Regression Model

We take n pairs of observations
(X₁, Y₁), (X₂, Y₂), ..., (Xₙ, Yₙ)

The goal is to find a model that best fits the data.

Model will be linear in terms of the parameters
(betas). These won’t appear in exponents or
anything unusual.

Allowed to be nonlinear in terms of predictor
variables (we may transform these somewhat freely).
We may also transform the response.
3
Simple Linear Regression Model (2)

Some sample models
Yᵢ = β₀ + β₁·Xᵢ + εᵢ
Yᵢ = β₀ + β₁·log(Xᵢ) + εᵢ
log(Yᵢ) = β₀ + β₁·Xᵢ + εᵢ
Notice the betas always function in the same
way, and the analysis will always proceed in
the same way too (after we make whatever
transformations we might need).
4
Simple Linear Regression Model (3)

Key question: How do you decide on the
“best” form for the model?

Always view a scatter plot (use PROC
GPLOT in SAS). Curvature in the plot will
help you determine the need for a
transformation on either X or Y.

Always consider residual plots. Some
patterns in these plots will also indicate the
need for transformation (more on this later).
5
Scatter Plot Approach

If you can look at a scatter plot and the data
“look linear”, then likely no transformation is
necessary. Try not to look for things that are
not there.

If you see curvature, then some
transformation may be appropriate:

Use scientific theory & experience

Try transformations you think may work – look
at scatter plots of the transformed data to
assess whether they do work.
6
Finding the “Best” Model

There is no “absolute” strategy.

Some common mistakes (why are these bad?):

Try several different methods and simply take
the one for which you get the best results
(e.g. highest R2)

Over-fit the model by including lots of extra
terms (e.g. squares, cubes, etc.) in the hope of
getting the curve to go through all of the data
points (note that this would be MLR)
7
Collaborative Learning Activity
CLG #2.1-2.3
First, make sure you read enough
to understand the dataset we will
be considering. Then, please try
to answer these questions related
to scatter plots.
8
Scatter Plot Examples (1)
[Scatter plot: average SAT score (about 800–1200) vs. statewide expenditures (about 3–10), with estimated regression line]
9
Scatter Plot Examples (2)
[Scatter plot: average SAT score vs. percentage of eligible students taking the SAT (0–90), with estimated regression line]
10
Scatter Plot Examples (3)
[Scatter plot: average SAT score vs. percentage of eligible students taking the SAT, with nonparametric smooth]
11
Scatter Plot Examples (4)
[Scatter plot: average SAT score vs. log-transformed percentage of eligible students taking the SAT (about 1–5), with nonparametric smooth]
12
Comments on GPLOT

Utilize SYMBOL, AXIS, and TITLE
statements to make your plots look nice.

ORFONT provides a good symbol set. You
can also manipulate the COLOR of symbols
in order to help the viewer differentiate groups.

Be careful to remember that SAS reuses
these statements, so you will need to
redefine them as necessary.
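For example, a minimal sketch (the dataset name sat and the variables score and expend are borrowed from the PROC REG example on slide 48; your labels and colors will differ):

* Plotting symbol, axes, and title are defined before calling GPLOT;
symbol1 value=dot color=blue interpol=none;
axis1 label=('Statewide Expenditures');
axis2 label=(angle=90 'Average SAT Score');
title 'SAT Score vs. Statewide Expenditures';

proc gplot data=sat;
  plot score*expend / haxis=axis1 vaxis=axis2;
run;
quit;

* SAS keeps reusing SYMBOL/AXIS/TITLE statements, so reset them before the next plot;
goptions reset=(symbol axis);
title;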
13
SAS ORFONT
14
Fitting the SLR Model
•
Once we decide on the form of our model,
we need to estimate the parameters that
yield the “best” fit.
•
The arithmetic involved is accomplished with
a computer, but it is useful to have some
understanding of how the estimates are
calculated.
15
The SLR Model

Whatever the transformations may be, our model is in
the form of a straight line:
Y = β₀ + β₁·X + ε

Epsilon represents the inherent variation (or error) in
the model.

Model involves two other parameters (unknown, but
fixed in value):

slope (change in y for a one unit change in x)

intercept (value of y for x = 0; usually not
particularly interested in this)
16
Observations

An observation Y at a particular X is a
random variable. So you can think of each
observation as having been drawn from a
normal distribution centered at β₀ + β₁·X and
having standard deviation σ.

Be careful to remember that these
parameters (represented by Greek letters)
are fixed – but can never be known exactly.
We can only estimate them.
17
Graphical Representation
18
Model Assumptions

We make three assumptions on the error
term in our model. A simple statement of
these assumptions is that εᵢ ~ iid N(0, σ²).
The assumptions on the errors apply to both
regression and ANOVA and we will be
assuming these throughout the course.

For regression, we also make a 4th
assumption that our model (in this case
linear relationship between X and Y) is
appropriate.
19
Assumptions on Errors

Constant Variance (Homoscedasticity) – the
variance associated to the error is the same for
ANY value of X.

Normality – the errors follow a normal
distribution with a mean of zero.

Independence – the errors (and hence also
the responses) are statistically independent
of each other.
20
Checking Your Assumptions

Reminder: We can never know the exact
values of the errors because we can never
know the true regression equation.

We can (and will) estimate the errors by the
residuals. The residuals can then be used
(mostly in graphical analyses) to assess the
assumptions – giving us some idea of
whether the assumptions of our model are
satisfied. More on this later...
21
Estimation of the “Best” Line

We want to obtain estimates of the
parameters β₀ and β₁ (remember, we can
never know them exactly).

Notation: Generally, I will use lower case
English letters to represent estimates for
parameters. You may also see hat-notation.
For example, if θ is a parameter, θ̂ would be
its estimated value from data.

Our estimates will be denoted (b₀, b₁). The
residuals will be denoted by eᵢ.
22
Parameter Estimates



Key Point: Parameters are fixed, but their
estimates are random variables. If we take a
different sample, we’ll get a different estimate.
Thus all of the estimates we compute (b₀, b₁, eᵢ)
will have associated standard errors that we
may also estimate.
The method of least squares is used to obtain
both parameter estimates and standard errors.
This method is desirable because the estimates
are unbiased, minimum variance estimates.
23
Least-squares Method

The least squares method obtains the
estimated regression line that minimizes the
sum of the squared residuals (also called
the SSE or sum of squares error).

SSE = Σ eᵢ² = Σ (Yᵢ - Ŷᵢ)²

Another way to think of this is that the least
squares estimates allow us to explain as
much of the variation in Y as we possibly
can using X as a predictor. The SSR (sums
of squares due to our model) is maximized.
24
Least Squares Estimates

The estimates have formulas in terms of the
data:

b₁ = Σ(Xᵢ - X̄)(Yᵢ - Ȳ) / Σ(Xᵢ - X̄)²    (sums over i = 1, ..., n)

b₀ = Ȳ - b₁·X̄

eᵢ = Yᵢ - Ŷᵢ = Yᵢ - (b₀ + b₁·Xᵢ)

s² = σ̂² = SSE / (n – 2)
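If you want to see these formulas in action, here is a minimal SAS/IML sketch (assuming the sat dataset with predictor expend and response score from slide 48; PROC REG does all of this for you):

proc iml;
  use sat;
  read all var {expend} into x;
  read all var {score} into y;
  close sat;
  /* least-squares slope and intercept from the formulas above */
  b1 = sum((x - mean(x)) # (y - mean(y))) / sum((x - mean(x)) ## 2);
  b0 = mean(y) - b1 * mean(x);
  /* residuals and the error-variance estimate s^2 = SSE/(n-2) */
  e  = y - (b0 + b1 # x);
  s2 = sum(e ## 2) / (nrow(y) - 2);
  print b1 b0 s2;
quit;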
25
Least Squares Estimates (2)

It is not important to memorize these formulas.
I won’t ask you to calculate a parameter
estimate by hand from the data. We have
computers for this.

What is important is to understand that,
because the Yi are random variables, and
because all of these estimates depend on the
Yi, the estimates themselves will also be
random variables. Thus we may estimate their
standard errors, develop confidence intervals,
and draw statistical inferences.
26
Inference about the Slope
•
A non-zero slope implies a linear
association between the predictor and
response.
•
In some experimental cases the
relationship may be causal as well.
•
Thus statistical inference for the slope is
quite important.
27
Inference About the Slope

The first thing to remember is that b1 is a
random variable. In order to do inference, we
must first consider that...

b₁ = Σ(Xᵢ - X̄)(Yᵢ - Ȳ) / Σ(Xᵢ - X̄)²

is normally distributed (why?)

The standard error associated to b₁ is

s(b₁) = √( MSE / Σ(Xᵢ - X̄)² ) = √( MSE / SSX )
28
Slope Inference (2)

For testing H₀: β₁ = k, the statistic

T = (b₁ - k) / s(b₁)
has a t-distribution with n – 2 degrees of
freedom when the null hypothesis is true.

A two-sided confidence interval for the slope
will be:
b₁ ± t_{n-2, 1-α/2} · s(b₁)
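For example, using the SAT output shown later (slide 51): b₁ = -20.89 and s(b₁) = 7.33, with t_{48, 0.975} ≈ 2.01, so the 95% CI is -20.89 ± 2.01(7.33) ≈ (-35.6, -6.2), matching the confidence limits SAS reports for expend.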
29
Slope Inference (3)


Want to determine: Does X help explain Y
through a linear model?

If we reject the null hypothesis H₀: β₁ = 0,
then we may conclude that there is a
linear association between X and Y


Must have assumptions satisfied.
Key point: Failing to reject does not necessarily
allow us to conclude that X is unimportant

Maybe we need a bigger sample to give better power
30
Slope Inference (4)

Another Key Point: Violations of the model
assumptions may invalidate the significance
test.

In particular, if there is a nonlinear
association or some type of dependence
issue, the SLR model should not be used.

See page 65 for some pictures illustrating
this.
31
Experimental Control

In some situations you have experimental control
over your predictor variable. Thus you have some
control over the SE for the slope:

s(b₁) = √( MSE / SSX )

Making SSX large will decrease the SE of your
estimate for the slope. Do this by spreading your
chosen X values further apart.

Increasing n (and hence increasing degrees of
freedom) may also help to decrease the SE.
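For a quick sense of scale: if the chosen X values are spread twice as far from their mean, SSX increases by a factor of four, so s(b₁) = √(MSE/SSX) is roughly cut in half (assuming MSE stays about the same).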
32
Inference About the Intercept

Hypothesis tests and confidence intervals may
be constructed similarly to inference for the
slope. See page 63.

Key point: Unless the observed predictor is
often in the neighborhood of zero, we have no
reason to be interested in the intercept.

In fact, the intercept will usually just be an
artifact of the model. And if the scope of the
model does not include zero, there is no reason
to even worry whether the value of b0 makes
sense.
33
Further Inference

Confidence Intervals for the
Mean Response

Prediction Intervals
34
The Predicted Value

The line describes the mean population
response for each value of the predictor. If
an association exists, then the mean
response depends on the value of X.

The predicted value of Y at a given X = x0 is
Ŷ_{x₀} = b₀ + b₁·x₀

Reminder: Notation may differ some from
the text – I try to keep our notation as simple
as possible.
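As a quick check, using the parameter estimates on slide 51 and the Louisiana row on slide 52 (x₀ = 4.761): Ŷ = 1089.29 - 20.89(4.761) ≈ 989.8, matching the Predicted Value column.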
35
C.I. for the Mean Response or
Prediction Interval?

If you are trying to predict for a group


C.I. for the Mean Response

Example: Trying to predict the average blood pressure for all
40 year olds

Interval is usually narrower
If you are trying to predict for a single observation

Prediction Interval

Example: Trying to predict the blood pressure for a single 40
year old

Interval is usually much wider because of individual variation

Some 40 year olds will have much higher or lower B.P.s
36
*Calculations of SE for Mean
Response

First step is to write in terms of estimates. We also
need to use a small trick to avoid worrying about a
covariance between the two parameter estimates.

Var(Ŷ_{x₀}) = Var(b₀ + b₁·x₀)
            = Var(Ȳ - b₁·x̄ + b₁·x₀)
            = Var(Ȳ + (x₀ - x̄)·b₁)
37
*Calculations of SE for Mean
Response (2)

We know the variances for Y-bar and b1. And it
turns out that, even though Y-bar is used in the
calculation of b1, the two are still independent. So
the variance of the sum is the sum of the variances:

Var(Ŷ_{x₀}) = Var(Ȳ) + (x₀ - x̄)²·Var(b₁)
            = σ²/n + (x₀ - x̄)²·σ²/SSX
            = σ²·[ 1/n + (x₀ - x̄)²/SSX ]
38
SE for the Mean Response

The mean response is a random variable
since b0 and b1 are random. Hence it will
have a standard error.

s(Ŷ_{x₀}) = √( MSE·[ 1/n + (x₀ - x̄)²/SSX ] )

It is good to have some understanding of
how this works, so we will look very briefly at
the calculations (you should just try to follow
them, but not worry about memorizing them)
39
Confidence Intervals for Mean
Response

You sometimes want to get confidence
intervals for the mean response. Because
we have estimated both the mean and
variance, the t-distribution applies (n – 2
degrees of freedom). The CI for a given
value of X is:

Ŷ_{x₀} ± t_{n-2, 1-α/2} · s(Ŷ_{x₀})
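For example, from the Louisiana row on slide 52: Ŷ = 989.8 and s(Ŷ) = 12.96, so the 95% CI for the mean response is 989.8 ± 2.01(12.96) ≈ (963.7, 1015.9), matching the 95% CL Mean columns.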
40
SE for the Prediction Interval

So the prediction variance is the sum of
these two components:
Var(pred_{x₀}) = Var(Ŷ_{x₀}) + σ²
               = σ²·[ 1/n + (x₀ - x̄)²/SSX ] + σ²

s(pred_{x₀}) = √( MSE·[ 1 + 1/n + (x₀ - x̄)²/SSX ] )
41
Prediction Intervals (1)

Now consider predicting a new observation
at X = x0. Our point estimate for this would
just be the point on the regression line, Ŷ_{x₀}.

Our prediction interval will be of the same
form as for the mean response, but with a
different standard error:

Ŷ_{x₀} ± t_{n-2, 1-α/2} · s(pred_{x₀})
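Again using the Louisiana row on slide 52: s(pred) = √(MSE + s(Ŷ)²) = √(4887.2 + 12.96²) ≈ 71.1, so the 95% prediction interval is 989.8 ± 2.01(71.1) ≈ (846.9, 1132.7), matching the 95% CL Predict columns.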
42
Why the difference for the
Prediction Interval?

The key to understanding the standard error for
prediction is to understand the random components
involved.

(1) The regular variance associated to getting a
predicted value (same as the CI mean response
error)

(2) The individual error for a single observation

It basically gives us an extra σ² piece

Think of the NORMAL DISTRIBUTION centered
around the regression line (see slide 18)
43
Multiple Confidence Intervals

We did intervals for 40 year olds, both group and
individual, what about 30? 35?

Getting multiple CI’s presents a similar problem to
multiple hypothesis tests. We would expect one
errant CI for every 20 CI’s that we obtain at 95%
confidence. Thus some adjustment may need to be
made.

Bonferroni is too conservative here, because these
CI’s are actually dependent and it is possible to
take advantage of this.
44
Confidence Bands

The solution is to change our critical value. Instead
of using T, we use a critical value related to the F
distribution:
W = √( 2·F_{2, n-2, 1-α} )

This allows us to produce CI’s for the mean
response at any and all possible values for the
predictor variable. Hence we may also use this to
draw confidence bands around the regression line.

Useful trick: For significance level 0.05, the value of
W is approximately 0.6 more than the value of T.
This is slightly conservative, but simplifies
computation.
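For example (the F critical value here is approximate): with n = 50 and α = 0.05, t_{48, 0.975} ≈ 2.01 while F_{2, 48, 0.95} ≈ 3.19, so W = √(2 × 3.19) ≈ 2.53. The shortcut T + 0.6 ≈ 2.61 is a little larger, i.e. slightly conservative, as noted above.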
45
Interpolation vs. Extrapolation

Interpolation (x0 within the domain of the
observed X’s) is generally ok if the
assumptions are satisfied.

Extrapolation (x0 outside the domain of the
observed X’s) is usually a bad idea.

No assurance that linearity continues outside
the observed domain.

Example – Height regressed on age in
children.
46
Key Concepts

The standard error formulas for the slope,
the regression line, and prediction are
related – your goal should be to understand
these relationships.

You should also be able to construct CI’s
and do hypothesis tests.



Point Estimates
Critical Values (know how to look these up)
Standard Errors (generally would not be
asked to compute these, just use them)
47
SAS Review
proc reg data=sat;
  model score=expend / clb clm cli;   * clb = CIs for the betas, clm = CIs for the mean response, cli = prediction intervals;
  id state expend;                    * carry state and expend into the output statistics;
  output out=fit r=res p=pred;        * save residuals and predicted values for later diagnostics;
run;
48
Collaborative Learning Activity
Please complete problems 2.4
(constructing CI’s) and 2.5
(interpreting regression output)
on the handout.
49
Output: ANOVA Table
Source    DF    Sum of Squares    Mean Square    F Value    Pr > F
Model      1             39722          39722       8.13    0.0064
Error     48            234586         4887.2
Total     49            274308

Root MSE            69.90851    R-Square    0.1448
Dependent Mean     965.92000    Adj R-Sq    0.1270
Coeff Var            7.23751
50
Output: Parameter Estimates
Parameter Estimates

Variable     DF    Parameter Estimate    Standard Error    t Value    Pr > |t|
Intercept     1            1089.29372          44.38995      24.54      <.0001
expend        1             -20.89217           7.32821      -2.85      0.0064

Variable     DF    95% Confidence Limits
Intercept     1    1000.04174   1178.54569
expend        1     -35.62652     -6.15782
51
Output: Output Statistics
Output Statistics

                        Dependent   Predicted    Std Error
Obs  state     expend    Variable       Value    Mean Predict     95% CL Mean      95% CL Predict      Residual
  9  Louisian   4.761        1021       989.8        12.9637    963.8     1016     846.9     1133       31.1739
 10  Minnesot   6            1085       963.9         9.9109    944.0    983.9     822.0     1106      121.0593
 11  Missouri   5.383        1045       976.8        10.6015    955.5    998.1     834.7     1119       68.1689
 12  Nebraska   5.935        1050       965.3         9.8890    945.4    985.2     823.3     1107       84.7013
52
ANOVA Table
ANOVA stands for analysis of variance.
We use an ANOVA table in regression to
organize our estimates of different
components of variation. It is important to
understand how this works for SLR since
we will use ANOVA tables for MLR and
ANOVA procedures as well.
53
ANOVA Table


Table consists of variance estimates used to
assess the following two questions:

Is there a linear association between the
response and predictor(s)?

How “strong” is that linear association?
We need to start by understanding the
different components of variation for a single
data point.
54
Components of Variation
55
Combining Over All Data

We might look at the total deviation as
follows:

Yᵢ - Ȳ = (Ŷᵢ - Ȳ) + (Yᵢ - Ŷᵢ)

But we cannot simply add deviations across
data points. Why? Options?
56
Combining Over All Data (2)

Squared deviations are chosen because it
turns out that they can be used to estimate
variances. It also (conveniently) turns out
that:

Σ(Yᵢ - Ȳ)² = Σ(Ŷᵢ - Ȳ)² + Σ(Yᵢ - Ŷᵢ)²
  SSTOT    =    SSR     +    SSE
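Quick check with the SAT ANOVA table on slide 50: 39722 + 234586 = 274308, i.e. SSR + SSE = SSTOT.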
57
Sums of Squares

The total sums of squares (SST) represents
the total available variation that could be
explained by the predictor (that not already
explained by Ybar).

We break this into the two components:

Model/Regression sums of squares (SSR) is
the part that is explained by the predictor.

Error sums of squares (SSE) is the part that
is still left unexplained.
58
Degrees of Freedom

Each SS has an associated degrees of freedom.

For simple linear regression,



DFT = n – 1

DFR = 1

DFE = n – 2
Always have DFT = DFR + DFE
In general, you lose one degree of freedom for each
parameter you estimate. Since we estimate Ybar
before we start, dfTOT is n – 1.
59
Degrees of Freedom (2)

It is important to understand how DF are
assigned with the models that we will be
discussing. Some key principles:

DF Total is always 1 less than the number of
observations.

You should next determine DF for the model.
For regression, each continuous variable
requires a slope estimate and takes 1 DF.

Lastly, the error DF is determined by
subtraction (avoid memorization of formulas).
60
Mean Squares

A mean square is a SS divided by its
associated degrees of freedom.

These are the actual variance estimates:
s_Y² = MST = SST / (n – 1)    (population variance)

s² = MSE = SSE / (n – 2)    (error variance)

MSR = SSR / 1    (estimates σ² under the null hyp. H₀: β₁ = 0)
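From the SAT ANOVA table (slide 50): MST = 274308/49 ≈ 5598.1, MSE = 234586/48 = 4887.2, and MSR = 39722/1 = 39722; the last two match the Mean Square column.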
61
F-tests


Because both MSR and MSE estimate σ²
under the null hypothesis, we may utilize
their ratio in order to test whether there is a
linear association.

If the null hypothesis is true, the statistic

F = MSR / MSE

will have an F distribution with 1 and n – 2 DF.

Note: To get your DF for the F-test, simply use
the DF for the associated mean squares.
62
Relationship of F to T

For SLR the F test is identical to the t-test
for the slope as in fact:

F = ( b₁ / s(b₁) )² = T²

Additionally, you will find that in terms of
critical values, F_{1, v} = t_v².
63
Example
Source    DF    Sum of Squares    Mean Square    F Value    Pr > F
Model      1             39722          39722       8.13    0.0064
Error     48            234586         4887.2
Total     49            274308

Root MSE            69.90851    R-Square    0.1448
Dependent Mean     965.92000    Adj R-Sq    0.1270
Coeff Var            7.23751

Parameter Estimates

Variable     DF    Parameter Estimate    Standard Error    t Value    Pr > |t|    95% Confidence Limits
Intercept     1            1089.29372          44.38995      24.54      <.0001    1000.04174   1178.54569
expend        1             -20.89217           7.32821      -2.85      0.0064     -35.62652     -6.15782
64
Example (2)

For SLR, the F-test in ANOVA Table is
exactly the same as the test for zero slope.


Note 8.13 = (-2.85)².
Caution: When we get into multiple
regression, if the F-test has a small p-value
this is a good start, but not the end! In
multiple regression, the F-test may be thought
of as a test for “model significance”. But it
doesn’t tell us which variable(s) are important
and which are not.
65
Other Statistics from REG

R-square and Adjusted R-square help us to
assess the “strength” of the linear
relationship.

The coefficient of variation is calculated
using
CV = 100·√MSE / Ȳ
It measures the variation as a percentage of
the mean.
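Check against the SAT output (slide 50): CV = 100 × 69.909 / 965.92 ≈ 7.24, the Coeff Var value.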
66
Coefficient of Determination
The coefficient of determination
(R2) gives us some idea as to
the strength of the regression
relationship.
67
Coefficient of Determination (R2)

Reflects the variation in Y that is explained
by the regression relationship as a
percentage of the total:

R² = SSR / SSTOT = 1 - SSE / SSTOT

With perfect linear association, SSE will be
zero and R2 will be 1.

If no linear association, SSE will be the
same as SST and R2 will be 0.
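For the SAT data (slide 50): R² = 39722 / 274308 ≈ 0.145, which is the R-Square value of 0.1448.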
68
Common Misconceptions

Steeper slope means bigger R2. This is not true.
In fact R2 has nothing to do with the magnitude of
the slope for our regression line.

The larger the value of R2, the better the model.
This is also not true. R2 says nothing about
appropriateness of model (see page 98).

R2 could be 0 but there could be a non-linear
association between X & Y

R2 could be near 1 while a curvilinear model
would be more appropriate (scatterplots will
generally reveal this)
69
The Correlation Coefficient (r)

Takes the sign of the slope and, for SLR, is
simply the square root of R2.

Dimensionless – ranges between -1 and 1.

Symmetric – interchanging X & Y will not
change the correlation between them.
70
SAT Example: Interpret R2

Simple interpretation: about 14.5% of the variation
in SAT scores is explained by the
expenditures (R² = 0.1448).

Reality Check: (1) though significant, this
is not a very strong relationship and (2) the
slope parameter is negative, suggesting that
increasing expenditures is associated with a
decrease in the average score!
72
Adjusted R2

Uses the mean squares to adjust (penalize)
for the number of parameters in the model:

R²ₐ = 1 - [ SSE / (n – p) ] / [ SST / (n – 1) ] = 1 - MSE / MSTOT
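For the SAT data: R²ₐ = 1 - 4887.2 / (274308/49) ≈ 1 - 4887.2/5598.1 ≈ 0.127, the Adj R-Sq value on slide 50.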

We’ll discuss this more in multiple regression
as it really isn’t important for SLR.
73
Regression Diagnostics
Check Your Assumptions!!!
74
Regression Diagnostics

Assumptions

Correct Model (linearity)

Independent Observations

Normally Distributed Errors

Constant Variance

Checking these generally involves PLOTS of the
residuals and predicted values.

Residual = observed – predicted
eᵢ = Yᵢ - Ŷᵢ
75
Regression Diagnostics (2)

Key Point: Most assumption checks may be
done visually by looking at various plots.
They may also be done using statistical
tests. Looking at plots is generally
easier!

So the general approach is to check the plots,
and if you still have questions then perhaps
consider the statistical tests.
76
Checking Normality



Histogram or Box-plot of Residuals

Is the histogram bell-shaped?

Is the box-plot symmetric?
Normal Probability / QQ plot

This is the method we would normally use.
Ordered residuals are plotted against
cumulative normal probabilities and the result
should be approximately linear.

PROC UNIVARIATE: QQPLOT statement
Shapiro-Wilk or Kolmogorov-Smirnov Test
77
SAS Code: QQPlot
proc univariate data=fit noprint;
  var res;
  title 'Normal Probability Plot';
  qqplot res / normal(l=1 mu=est sigma=est);   * reference line uses the estimated mean and SD;
run;
78
Constancy of Variance


Plot the residuals against fitted (predicted)
values

Check to see if size of residual is somehow
associated with predicted value.

Megaphone shapes are indicative of a
violation.
Bartlett’s or Levene’s Test

Statistical tests are generally sensitive to
violations of normality and cannot be used if
the normality assumption is not met.
79
Plot: Residuals vs. Fitted Values
proc gplot data=fit;
  plot res*pred / vref=0;   * vref=0 adds a horizontal reference line at zero;
run;
quit;
80
Checking Independence

This is the hardest assumption to check.

One check on this assumption is to simply think
of how the data are collected. Ask the question:
Is there anything in the collection of data that
could lead to dependent responses?

Plot the residuals over time (if applicable). Is
there a “drift” or other pattern as trials proceed?

Durbin-Watson Test
81
Other Issues

Linearity Assumption: A nonlinear pattern
in the residuals vs. predicted values plot
suggests that we need to revise our
assumption of a linear parametric
relationship between X and Y.

Outliers: These will show up in rather
obvious ways on the various plots.
82
When the assumptions are violated...

Discarding data is almost always the wrong
thing to do. Some things you can do are...


Consider transformations of the data.

Transformations of the response variable
[e.g. Log(Y)] often help with normality and/or
constancy of variance issues.

Transformations of the predictor variable(s)
may solve nonlinearity issues
Lastly, we may consider other more complex
models.
83
When outliers are present...

Some formal tests exist to classify outliers
(we’ll talk about them later)

Investigate – don’t eliminate

Lacking a very good reason (e.g. experimenter
made error in recording the data) you should
never be throwing an outlier away.

One good thing to do is to try to figure out how
much effect the outlier has on your various
estimates (we’ll also learn how to do this later)
84
Collaborative Learning Activity
Please discuss problem #2.6 from
the handout.
85
Questions?
86
Upcoming in Topic 3...
Multiple Regression Analysis
Related Reading: Chapter 8
87