Stat 5100 – Linear Regression and Time Series
Dr. Corcoran, Spring 2013

IV. Modeling a Mean: Simple Linear Regression

We have talked about inference for a single mean, for comparing two means, and for comparing several means. What if the mean of one variable depends on the value of another continuous-type variable? In the case of a linear trend (i.e., the value of one outcome tends to increase or decrease linearly with an increase in another variable), we can fit a model that describes that trend. The tool most often used for this kind of analysis is called a linear regression model.

Sampling Pairs of Points

In this setting, we have n pairs of points (X_1, Y_1), (X_2, Y_2), ..., (X_n, Y_n). In other words, we are measuring two variables for each subject, sampling from a population as demonstrated in the schematic below. In this case we are often interested in how the variables correlate or covary. In other words, considering the X variable as the explanatory or independent variable and the Y variable as the outcome or dependent variable, the question is: how does the average E(Y) depend on the value of X?

[Schematic: Population, with parameters E(X), σ_X² and E(Y), σ_Y², yields the sample (X_1, Y_1), (X_2, Y_2), ..., (X_n, Y_n), summarized by the sample statistics X̄, s_X² and Ȳ, s_Y².]

Exploratory Analysis

As indicated on the previous slide, an important first step in exploring the distributions of paired observations is to use summary statistics and univariate charts (e.g., histograms, boxplots) to understand their marginal or individual distributions. Since we're interested in how the variables relate to each other, a key subsequent step is to construct a two-dimensional scatterplot, treating Y as a function of X. Here, each point in the plot corresponds to one observation in the sample, and is determined by the corresponding (X, Y) coordinate of that observation. (A small plotting sketch follows the next two slides.)

[Figure: plotting a point (X_i, Y_i) at horizontal coordinate X_i and vertical coordinate Y_i.]

Patterns of Association

A plot helps us quickly identify relationships between variables that can inform how we model the relationship. For example, what do you observe from the following plot of the percentage of physically active adults in each U.S. state versus average annual temperature? (Source: http://www.economist.com/node/21016233.)

[Figure: scatterplot of % physically active adults by state versus average annual temperature.]

Linear versus Nonlinear Associations

For now, as we focus on simple regression models, we will look at examples where the relationship between the two variables of interest is linear. However, in applied research settings one might observe a wide variety of patterns. What is the relationship between X and Y in the scatterplot to the right?

[Figure: scatterplot showing a nonlinear relationship between X and Y.]

Later, when we discuss multivariable models, we will see how such nonlinear relationships can be accommodated using a regression approach.
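The exploratory steps described above are easy to carry out in software. The sketch below is not part of the original slides: it uses simulated data (the variable names and values are purely illustrative) to show how a scatterplot reveals a roughly linear pattern in one case and a curved pattern in the other.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(5100)                  # seed chosen arbitrarily
x = rng.uniform(0, 10, size=50)                    # explanatory variable X
y_linear = 2.0 + 1.5 * x + rng.normal(0, 2, size=50)               # linear trend plus noise
y_nonlinear = 1.0 + 0.8 * (x - 5) ** 2 + rng.normal(0, 2, size=50)  # curved trend plus noise

fig, axes = plt.subplots(1, 2, figsize=(9, 4))
axes[0].scatter(x, y_linear)
axes[0].set_title("Roughly linear association")
axes[1].scatter(x, y_nonlinear)
axes[1].set_title("Nonlinear association")
for ax in axes:
    ax.set_xlabel("X")
    ax.set_ylabel("Y")
plt.tight_layout()
plt.show()
```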
Functional versus Statistical Relationships

It's also important to distinguish between a purely functional association between two variables and what we might term a statistical association. A functional relationship is one that is deterministic: a given value of X yields the exact same value of Y whenever an experiment is repeated. An example of this is the distance Y travelled by an object in free fall over time T, which is given by

Y = \frac{1}{2} g T^2,

where g is the acceleration due to gravity at or near sea level. On the other hand, a statistical relationship is one for which a given value of X yields different values of Y for repetitions of the same experiment. That is, Y is random, and its distribution may depend upon X. In the linear regression setting, we assume that E(Y) depends linearly upon X.

Example IV.A

Engineers were interested in the relationship between the distribution of salt on roadways and the salt concentration in adjacent waterways. They gathered data at 20 locations, measuring the roadway area at each site along with the salt concentration in the nearby river. The data are shown on the following slide (they are also posted on the course website in the file "salt.txt"). We would like to know whether or not greater roadway area is associated with higher average salt concentration. Why is it natural to designate the explanatory variable and response variable in this way?

Example IV.A (cont'd)

Obs   Salt Concentration   Roadway Area
 1          3.8                0.19
 2          5.9                0.15
 3         14.1                0.57
 4         10.4                0.40
 5         14.6                0.70
 6         14.5                0.67
 7         15.1                0.63
 8         11.9                0.47
 9         15.5                0.75
10          9.3                0.60
11         15.6                0.78
12         20.8                0.81
13         14.6                0.78
14         16.6                0.69
15         25.6                1.30
16         20.9                1.05
17         29.9                1.52
18         19.6                1.06
19         31.3                1.74
20         32.7                1.62

Example IV.A (cont'd)

The scatterplot for these data is given below:

[Figure: scatterplot of salt concentration (vertical axis, roughly 0 to 35) versus roadway area (horizontal axis, roughly 0 to 2).]

Example IV.A (cont'd)

What sort of relationship do you observe between the area of a given road and the amount of salt found in nearby waterways? Based on what we observe, we would like to answer a few questions. These might include:
• What is the observed average increase in salt concentration for incrementally larger roadway area?
• Is this average increase statistically significant? That is to say, is the observed correlation real, or can it be attributed to chance?

Example IV.A (cont'd)

One way of answering these sorts of questions is to fit a model. That is, since we observe a somewhat linear association (average salt concentration appears to increase linearly with increased road area), we fit such a line to the data. Knowing that a line is determined by a slope and an intercept, the question is: how do we select the "best" line? A statistical solution to this problem is the so-called least-squares fit, or linear regression of salt concentration on road area. The following slide shows this regression line overlaid on the scatterplot of salt versus area.

Example IV.A (cont'd)

[Figure: scatterplot of salt concentration versus roadway area with the least-squares regression line overlaid.]
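The plot above can be reproduced with a few lines of code. This is a minimal sketch, assuming the data are typed in from the table in Example IV.A (rather than read from "salt.txt", whose exact file layout is not shown here); numpy.polyfit is used as a convenient shortcut for the least-squares line, whose formulas are developed in the slides that follow.

```python
import numpy as np
import matplotlib.pyplot as plt

# Roadway area (X) and salt concentration (Y) from Example IV.A
area = np.array([0.19, 0.15, 0.57, 0.40, 0.70, 0.67, 0.63, 0.47, 0.75, 0.60,
                 0.78, 0.81, 0.78, 0.69, 1.30, 1.05, 1.52, 1.06, 1.74, 1.62])
salt = np.array([3.8, 5.9, 14.1, 10.4, 14.6, 14.5, 15.1, 11.9, 15.5, 9.3,
                 15.6, 20.8, 14.6, 16.6, 25.6, 20.9, 29.9, 19.6, 31.3, 32.7])

b1, b0 = np.polyfit(area, salt, deg=1)      # least-squares slope and intercept
grid = np.linspace(area.min(), area.max(), 100)

plt.scatter(area, salt)
plt.plot(grid, b0 + b1 * grid)              # fitted regression line
plt.xlabel("ROADWAY AREA")
plt.ylabel("SALT CONCENTRATION")
plt.show()
```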
The Model

As noted earlier, in general we sample pairs of points (X_1, Y_1), (X_2, Y_2), ..., (X_n, Y_n), where X is referred to as the explanatory variable and Y as the response variable. Note that this does not imply that X necessarily causes Y, although that is possible. X and Y may simply be associated, without any causative effect. We typically want to explain changes in the average of Y due to a difference in X. If X and Y appear to be linearly associated, where Y on average increases or decreases linearly with an increase in X, then we may posit the linear model

Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i,

where the intercept β_0 and slope β_1 determine the line, and ε_i models the variability around the line.

The Error Term

The scatterplot in Example IV.A is a typical representation of variables that are linearly associated: the points form a sort of "cloud". That is to say, the points do not lie on a straight line, indicating that even though Y tends to increase or decrease linearly with X, a given value of X will not necessarily result in exactly the same value of Y. That is, the relationship is not deterministic. The error term ε_i in the model accounts for this variability around the line β_0 + β_1 X_i. Note that β_0 + β_1 X_i is fixed, not random. We further typically assume that ε_i ~ N(0, σ²). In other words, given the value X_i, we have E(Y_i) = β_0 + β_1 X_i and Var(Y_i) = σ². Therefore, Y_i ~ N(β_0 + β_1 X_i, σ²).

The Model Parameters

What do the terms in the linear model mean?
• The intercept β_0 represents the average of Y for X = 0. Although the intercept is mathematically necessary in order to specify the form of the line in the model, it seldom has practical meaning.
• The slope β_1 is generally the focus of inference: it represents the change in the average of Y for every one-unit increase in X. Since we are interested in how Y changes with X, a nonzero slope indicates that Y and X are linearly associated.
• The variance term σ² represents the variability of the data around the line.

The Experiment

As always, our object is to infer something about the underlying model parameters by sampling from the population and then analyzing the data. Having posited the regression model, we can think of the sampling in this way:

[Schematic: Population, where Y_i = β_0 + β_1 X_i + ε_i with ε_i ~ N(0, σ²), yields the sample (X_1, Y_1), (X_2, Y_2), ..., (X_n, Y_n), from which we estimate β_0, β_1, and σ².]

Example IV.B

Suppose that the purity of a chemical solution Y is related to the amount of catalyst X through a linear regression model with β_0 = 123.0, β_1 = –2.16, and an error standard deviation of σ = 4.1. What is the expected value of the purity when the catalyst level is 20? How much does the average purity change when the catalyst amount is increased by 10? What is the probability that the purity is less than 60 when the catalyst level is 25?

In practical research settings, we do not know the actual parameter values. As indicated on the schematic two slides previous, we sample from the population with the posited regression model and then estimate the regression parameters from the data.
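One way to carry out the computations in Example IV.B is with the normal distribution functions in scipy, using the model Y ~ N(β_0 + β_1 X, σ²) stated above. This is a minimal sketch, not part of the original slides.

```python
from scipy.stats import norm

beta0, beta1, sigma = 123.0, -2.16, 4.1     # parameter values from Example IV.B

# Expected purity at catalyst level X = 20: E(Y) = beta0 + beta1 * X
mean_at_20 = beta0 + beta1 * 20

# Change in average purity when the catalyst amount increases by 10 units
change_for_10 = beta1 * 10

# P(Y < 60) when X = 25, using Y ~ N(beta0 + beta1 * 25, sigma^2)
mean_at_25 = beta0 + beta1 * 25
prob_below_60 = norm.cdf(60, loc=mean_at_25, scale=sigma)

print(mean_at_20, change_for_10, prob_below_60)
```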
Example IV.C

Write out a linear model for the experiment described in Example IV.A. Clearly interpret each of the parameters of the model.

How do we estimate the model parameters? That is to say, what is the "best" line, based on the data? In statistical applications, we choose the line that achieves the minimum squared distance between itself and the collective observed data points. Note that the distance from a given value Y_i to its associated point on the line is given by Y_i – (β_0 + β_1 X_i). We call this the residual. It turns out that we compute estimates of the slope, intercept, and variance that minimize the sum of the squared residuals. The resulting estimates of the slope and intercept are given by

b_1 = \frac{\sum_{i=1}^{n} X_i Y_i - n \bar{X} \bar{Y}}{\sum_{i=1}^{n} X_i^2 - n \bar{X}^2}, \qquad b_0 = \bar{Y} - b_1 \bar{X}.

Derivation of Parameter Estimates

One way of thinking about how b_0 and b_1 are derived is to consider direct minimization of the sum of squared residuals:

Q = \sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} \left[ Y_i - (\beta_0 + \beta_1 X_i) \right]^2.

We sometimes refer to Q as the objective function. How can we minimize this function with respect to β_0 and β_1? (Some discussion of this is contained in Section 1.6 of the text, although the technical details are not that important.)

Example IV.D

Using the data given in Example IV.A, fit the model specified in Example IV.C. The necessary summary statistics are given below:

\bar{X} = 0.824, \qquad \sum_{i=1}^{n} X_i^2 = 17.2502, \qquad \bar{Y} = 17.135, \qquad n = 20, \qquad \sum_{i=1}^{n} X_i Y_i = 346.793.

What does the estimated intercept represent in the model fit? Interpret the estimated slope: what does it say about the observed relationship between road area and average salt concentration? (A short computational sketch based on these summary statistics appears below, after the discussion of estimating σ².)

Interpreting the Model Fit

Note that once we have obtained our estimates of the intercept and slope, the fitted value for Y_i is given by

\hat{Y}_i = b_0 + b_1 X_i.

There are two ways of viewing such a fitted value:
• The fitted value is our predicted Y_i for the given X_i.
• The fitted value is our estimate of the average Y_i for the given X_i.

Example IV.E

Based on the model fit in Example IV.D, what is the predicted salt concentration when the adjacent road area is 0.75? What is our estimated average salt concentration when the adjacent road area is 0.75? What is the predicted salt concentration when the road area is 2.0? Why should we be cautious about this last prediction?

Estimating σ²

The last of the three parameters that we need to estimate is the model variance, which represents the variability of the Y_i's around the regression line. Note that the estimated residuals based upon the model fit are given by

e_i = Y_i - \hat{Y}_i = Y_i - (b_0 + b_1 X_i), \qquad i = 1, \ldots, n.

Therefore, our estimate of the model variance σ² is the observed "average" squared residual, also called the mean square error (MSE):

s^2 = \frac{\mathrm{SSE}}{n-2} = \frac{1}{n-2} \sum_{i=1}^{n} e_i^2 = \frac{1}{n-2} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2.
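The least-squares arithmetic for Example IV.D can be carried out directly from the summary statistics above; a minimal sketch follows, along with the fitted values asked for in Example IV.E.

```python
# Least-squares estimates for Example IV.D, computed from the summary statistics
# (a sketch, not part of the original slides).
n = 20
xbar, ybar = 0.824, 17.135
sum_x2, sum_xy = 17.2502, 346.793

b1 = (sum_xy - n * xbar * ybar) / (sum_x2 - n * xbar**2)   # estimated slope
b0 = ybar - b1 * xbar                                      # estimated intercept
print("b1 =", round(b1, 3), " b0 =", round(b0, 3))

# Fitted values for Example IV.E (note that X = 2.0 lies beyond the observed
# range of roadway areas, so that prediction is an extrapolation)
for x_new in (0.75, 2.0):
    print("area", x_new, "-> predicted salt concentration", round(b0 + b1 * x_new, 2))
```

The fitted values tabulated in Example IV.F below can be obtained in the same way, as b_0 + b_1 X_i for each observed X_i.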
Example IV.F

The table below shows both the observed salt concentration and the predicted salt concentration (using the fitted line) for the observations in Example IV.A.

           Salt Concentration                   Salt Concentration
Obs #   Observed   Predicted      Obs #   Observed   Predicted
  1        3.8        6.01          11      15.6       16.36
  2        5.9        5.31          12      20.8       16.89
  3       14.1       12.68          13      14.6       16.36
  4       10.4        9.70          14      16.6       14.78
  5       14.6       14.96          15      25.6       25.49
  6       14.5       14.43          16      20.9       21.10
  7       15.1       13.73          17      29.9       29.35
  8       11.9       10.92          18      19.6       21.28
  9       15.5       15.84          19      31.3       33.21
 10        9.3       13.20          20      32.7       31.10

Example IV.F (cont'd)

Based on the fitted values, we can see that the residual for the first observation is –2.21, for the second observation the residual is 0.59, and so forth. The average squared residual therefore is given by

s^2 = \frac{1}{20-2} \sum_{i=1}^{20} e_i^2 = \frac{(-2.21)^2 + (0.59)^2 + \cdots + (1.60)^2}{18} = 3.206.

Inference for the Slope β_1

Remember, the fundamental question in a linear regression analysis is whether the dependent and independent variables are linearly associated. As always, there are two aspects to the analysis:
• Do we observe a slope that is different from zero? That is, does the average of the outcome variable depend on the value of the explanatory variable?
• Do the data provide evidence that the slope is significantly different from zero? That is, can we infer from our data that the relationship we observe holds for the underlying population?

To address the second issue, we need to know something about the distribution of our estimated slope.

Distribution of the Estimated Slope

Not surprisingly, it turns out that the estimated slope b_1 is approximately normally distributed (provided that the sample is random and, in most cases, that the sample size is sufficiently large). The mean of the distribution of b_1 is β_1. The estimated standard error is given by

s.e.(b_1) = s\{b_1\} = \frac{s}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2}},

where s² is the model MSE (our estimate of the model variance σ²). Since we need to rely on the estimated standard error (i.e., σ is unknown), we use the t(n–2) distribution to obtain a confidence interval and hypothesis test for β_1.

Confidence Interval and Hypothesis Test for β_1

A (1 – α)100% confidence interval for β_1 is therefore given by

b_1 \pm t(1 - \alpha/2;\, n-2)\, s\{b_1\}.

We also would like to test the null hypothesis H_0: β_1 = 0 versus the alternative hypothesis H_A: β_1 ≠ 0. A test statistic for assessing the evidence against H_0 is given by

t = \frac{b_1 - 0}{s\{b_1\}}.

Under H_0, this test statistic approximately follows the t(n–2) distribution. The p-value is therefore given by 2P{t(n–2) ≥ |t|}. Note that we can conceivably test against any specific value of β_1, although 0 is generally the value of interest.

Example IV.G

Give a 95% confidence interval for the slope parameter in the model of Example IV.C, based upon the observed data given in Example IV.A. Interpret this confidence interval. State the null and alternative hypotheses for testing a linear association between road area and average salt concentration. Explain these hypotheses. Carry out a test of the null hypothesis of no linear association. What is the p-value for this test? Is there evidence of a relationship between road area and average salt concentration?
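A minimal sketch of the interval and test in Example IV.G, using the summary statistics from Example IV.D together with the MSE from Example IV.F; the slope and its standard error follow the formulas above.

```python
from scipy import stats

# Summary quantities for the roadway data (Examples IV.A, IV.D, IV.F)
n = 20
xbar, ybar = 0.824, 17.135
sum_x2, sum_xy = 17.2502, 346.793
mse = 3.206                                    # s^2 from Example IV.F (cont'd)

sxx = sum_x2 - n * xbar**2                     # sum of (Xi - Xbar)^2
b1 = (sum_xy - n * xbar * ybar) / sxx          # estimated slope
se_b1 = (mse / sxx) ** 0.5                     # s{b1}

t_crit = stats.t.ppf(0.975, df=n - 2)          # multiplier for a 95% CI
ci = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)

t_stat = (b1 - 0) / se_b1                      # test of H0: beta1 = 0
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)

print(ci, t_stat, p_value)
```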
Measuring the Strength of Association

Note that the slope is one measure of the linear association between two continuous variables: it tells you how much the average of the outcome variable changes with respect to a one-unit increase in the explanatory variable. However, the estimated slope tells you nothing about the variability of the points about the line.

Correlation is a measure of the strength of association between two variables that reflects the degree of variability around the fitted line. It is another popular summary statistic for illustrating the degree to which variables are linearly associated.

The slope itself does not always reflect the strength of association. For example, in the two plots below we observe two data sets with approximately the same estimated slope. However, the association in the first case looks much stronger, as the cloud of points more tightly clusters about the regression line.

[Figure: two scatterplots with similar fitted slopes but very different scatter about the fitted lines.]

The Correlation Coefficient

The correlation ρ is another population parameter that we can estimate from the data. We typically use r to denote our estimate of ρ. The so-called correlation coefficient r has several important features:
• r has a range of –1 to 1. It is an index, and has no units.
• The closer r is to 1, the stronger the positive linear association (r = 1 indicates perfect positive correlation).
• The closer r is to –1, the stronger the negative linear association (r = –1 indicates perfect negative correlation).
• An r close to zero indicates weak linear association. If r = 0, there is no linear association.
• r measures linear association only. Two variables can be strongly related in a nonlinear way and nevertheless yield an r close to 0.

Example IV.H

Plots illustrating various values of r:

[Figure: a panel of scatterplots illustrating various values of r.]

Computing r

Our estimated correlation coefficient for two variables X and Y is given by

r_{XY} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2 \sum_{i=1}^{n} (Y_i - \bar{Y})^2}} = \frac{\sum_{i=1}^{n} X_i Y_i - n \bar{X} \bar{Y}}{\sqrt{\left( \sum_{i=1}^{n} X_i^2 - n \bar{X}^2 \right) \left( \sum_{i=1}^{n} Y_i^2 - n \bar{Y}^2 \right)}}.

Example IV.I

Given the five summary statistics in Example IV.D, and that

\sum_{i=1}^{n} Y_i^2 = 7060.03,

what is the correlation coefficient between salt concentration and roadway area?

Inference for r

To carry out a test of H_0: ρ = 0 versus the alternative hypothesis H_A: ρ ≠ 0, we can use the statistic

t = \frac{r \sqrt{n-2}}{\sqrt{1 - r^2}},

which approximately follows a t(n–2) distribution. In fact, it turns out that this statistic is algebraically equivalent to the t statistic for testing that the regression slope is equal to zero. The p-value for this test is given by 2P{t(n–2) ≥ |t|}.

Example IV.J

Carry out a test of the null hypothesis that the salt concentration and road area are not correlated, versus the alternative hypothesis that they are correlated. What is the p-value of this test? Interpret this result in words.
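A minimal sketch of the computations for Examples IV.I and IV.J, using the summary statistics given above; the resulting t statistic should match the slope test from the previous sketch, as noted on the Inference for r slide.

```python
from scipy import stats

# Summary statistics from Examples IV.D and IV.I (roadway data)
n = 20
xbar, ybar = 0.824, 17.135
sum_x2, sum_y2, sum_xy = 17.2502, 7060.03, 346.793

num = sum_xy - n * xbar * ybar
den = ((sum_x2 - n * xbar**2) * (sum_y2 - n * ybar**2)) ** 0.5
r = num / den                                      # estimated correlation coefficient

t_stat = r * (n - 2) ** 0.5 / (1 - r**2) ** 0.5    # test of H0: rho = 0
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)

print(r, t_stat, p_value)
```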
Inference for Means and Predictions

In addition to inferences about the slope, we may also want to construct tests and confidence intervals for the regression line itself. We will talk about inference for:
(1) the average of Y given a corresponding value of X, and
(2) a predicted value of Y given a corresponding value of X.

Inference for a Mean

Suppose we want to estimate the average of Y for a given value of X, denoted by X_h. Our estimated average is

\hat{Y}_h = b_0 + b_1 X_h.

This estimate has a standard error given by

s\{\hat{Y}_h\} = s \sqrt{\frac{1}{n} + \frac{(X_h - \bar{X})^2}{\sum (X_i - \bar{X})^2}},

where s² is the regression MSE (our estimate of the error variance σ²).

Inference for a Mean, continued

If E{Y_h} represents the actual mean of Y at the value X_h, then the statistic

\frac{\hat{Y}_h - E\{Y_h\}}{s\{\hat{Y}_h\}}

follows a t(n–2) distribution. A (1 – α) confidence interval for the mean of Y is therefore given by

\hat{Y}_h \pm t(1 - \alpha/2;\, n-2)\, s\{\hat{Y}_h\}.

Example IV.K

For the roadway data, compute and interpret a 95% confidence interval for the average salt concentration when the corresponding roadway area is 1.0 m².

Inference for a Prediction

As opposed to estimating a mean, suppose instead that we want to make a prediction for a single additional observation. Again, as with the mean, our estimated predicted value for a given X_h is computed as

\hat{Y}_h = b_0 + b_1 X_h.

However, in this case the estimated prediction has a standard error given by

s\{\mathrm{pred}\} = s \sqrt{1 + \frac{1}{n} + \frac{(X_h - \bar{X})^2}{\sum (X_i - \bar{X})^2}}.

Note the difference between this standard error and the one given for an estimated mean. The extra variability arises since here we are estimating a value for a single observation as opposed to an average over many observations.

Inference for a Prediction, continued

If Y_{h(new)} represents a randomly sampled value of Y for a corresponding X_h, then the statistic

\frac{Y_{h(new)} - \hat{Y}_h}{s\{\mathrm{pred}\}}

follows a t(n–2) distribution. A (1 – α) confidence interval for a predicted Y_{h(new)} is therefore given by

\hat{Y}_h \pm t(1 - \alpha/2;\, n-2)\, s\{\mathrm{pred}\}.

Example IV.L

For the roadway data, compute and interpret a 95% confidence interval for the predicted salt concentration when the corresponding roadway area is 1.0 m².

Confidence and Prediction Bands

Researchers often find it useful to construct a confidence interval for the regression line over the entire range of X-values. We can accomplish this by computing the confidence intervals presented on the previous slides, either for the means or for the predictions (depending on the investigative focus). In practice this is generally done using computer software.
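A minimal sketch of the interval computations in Examples IV.K and IV.L, based on the summary statistics and MSE used in the earlier sketches; the two standard errors follow the formulas above.

```python
from scipy import stats

# Summary quantities for the roadway data
n = 20
xbar, ybar = 0.824, 17.135
sum_x2, sum_xy = 17.2502, 346.793
mse = 3.206                                    # s^2 from Example IV.F (cont'd)
s = mse ** 0.5

sxx = sum_x2 - n * xbar**2
b1 = (sum_xy - n * xbar * ybar) / sxx
b0 = ybar - b1 * xbar

xh = 1.0                                       # roadway area of interest
yhat_h = b0 + b1 * xh                          # estimated mean / predicted value

se_mean = s * (1 / n + (xh - xbar) ** 2 / sxx) ** 0.5        # s{Yhat_h}
se_pred = s * (1 + 1 / n + (xh - xbar) ** 2 / sxx) ** 0.5    # s{pred}

t_crit = stats.t.ppf(0.975, df=n - 2)
ci_mean = (yhat_h - t_crit * se_mean, yhat_h + t_crit * se_mean)
pi_new = (yhat_h - t_crit * se_pred, yhat_h + t_crit * se_pred)

print(yhat_h, ci_mean, pi_new)
```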
Example IV.M

The plot on the following slide illustrates confidence and prediction bands for the roadway data. Note the relative widths of the intervals delimited by the two sets of bounds. How do you explain the wider intervals for the prediction bands?

[Figure: scatterplot of the roadway data with the fitted regression line, confidence bands for the mean, and wider prediction bands.]

Analysis of Variance (ANOVA) for Regression

Important information about a regression analysis is generally displayed in an ANOVA table. The underlying principle is that the variation of the Y (or outcome) variable arises from two sources:

Total Variation in Y = Variation due to Regression + Unexplained (Residual) Variation

Sources of Variation

In more mathematical terms, this relationship can be expressed as

\sum_{i=1}^{n} (Y_i - \bar{Y})^2 = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2 + \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2,

where

\sum_{i=1}^{n} (Y_i - \bar{Y})^2 = \mathrm{SSTO}, \qquad \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2 = \mathrm{SSR}, \qquad \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \mathrm{SSE}.

ANOVA F Test for Regression Coefficients

It turns out that the ANOVA approach provides us with a useful way of testing coefficients and comparing models in a variety of settings (particularly for multiple regression with several variables). For the simple linear regression model, the ANOVA F statistic for testing H_0: β_1 = 0 versus H_A: β_1 ≠ 0 is given by

F = \frac{\mathrm{MSR}}{\mathrm{MSE}},

where MSR is the mean square due to regression, MSR = SSR/df(Regression), and MSE is the mean squared error s², MSE = SSE/df(Error). There are generally n–1 df associated with SSTO. As we've discussed previously, in the simple model there are n–2 df for SSE, leaving 1 df for SSR.

ANOVA F Test for β_1 in the Simple Model

For the simple regression model, relatively large values of F provide evidence against the null H_0: β_1 = 0, and values of F close to 1.0 indicate little or no evidence against the null. The p-value for this F test is determined by computing the upper-tail probability for the observed statistic with respect to the F(1, n–2) distribution. Note that in the simple case, it turns out that the ANOVA F test and the t test for the slope (discussed earlier) are identical. That is,

F = \frac{\mathrm{MSR}}{\mathrm{MSE}} = \left( \frac{b_1 - 0}{s\{b_1\}} \right)^2 = t^2.

ANOVA Table

All of this is summarized in a table, typically in this familiar form:

Source       Degrees of Freedom   Sum of Squares   Mean Square   F-statistic   p-value
Regression   1                    SSR              MSR           MSR/MSE
Error        n–2                  SSE              MSE
Total        n–1                  SSTO

Example IV.N

The ANOVA table for the roadway data is partially completed below. Can you fill in the missing information? (A computational sketch appears below, following the list of model assumptions.)

Source       Degrees of Freedom   Sum of Squares   Mean Square   F-statistic   p-value
Regression                        1130.15
Error        18
Total                             1187.87

Example IV.N (cont'd)

Based on the results of the ANOVA procedure, what are your conclusions regarding the association between roadway area and salt concentration?

Checking Model Assumptions

What are some of the underlying assumptions we have discussed with respect to the simple regression model?
1.
2.
3.
4.
5.
6.
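A minimal sketch for completing the ANOVA table in Example IV.N from the two sums of squares given there; the p-value uses the upper tail of the F(1, n–2) distribution as described above.

```python
from scipy import stats

# Partially completed ANOVA table for the roadway data (Example IV.N)
n = 20
ssto = 1187.87
ssr = 1130.15
sse = ssto - ssr                        # residual (error) sum of squares

msr = ssr / 1                           # 1 df for regression in the simple model
mse = sse / (n - 2)                     # n - 2 = 18 df for error
f_stat = msr / mse
p_value = stats.f.sf(f_stat, 1, n - 2)  # upper-tail F(1, n-2) probability

print(sse, msr, mse, f_stat, p_value)
```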
Residuals and Standardized Residuals

Examining the observed residuals can provide key diagnostic information about whether model assumptions are violated. Recall that the residual e_i for the ith subject is given by

e_i = Y_i - \hat{Y}_i, \qquad i = 1, \ldots, n.

Since the actual variance of the residuals is σ², the estimated variance is given by the MSE. It turns out that computing the actual standard deviation of the residuals is a little more complex than simply taking (MSE)^{1/2}, but this estimate is not too far off. We therefore define what is referred to as the semistandardized or semistudentized residual as

e_i^* = \frac{e_i}{\sqrt{\mathrm{MSE}}}.

Exploration of Residuals

A regression analysis is generally accompanied by an examination of the residuals or standardized residuals, to assess
• the linearity of the relationship between X and Y,
• the normality of the residuals,
• the constancy of the residual variance across the range of X,
• the independence of the residuals,
• the effects of potential outliers (both with respect to X and Y), and
• whether any additional explanatory factors may have been omitted.

Diagnostics for Linearity

A scatterplot is one of the best ways to assess the nature of the X-Y relationship, but a plot of the residuals (versus either the predictor variable X or the fitted values) can also reveal patterns that could indicate nonlinearity. Note the nonlinear pattern in the plots below:

[Figure: scatterplot and corresponding residual plot, both exhibiting a clear nonlinear pattern.]

Example IV.O

Is there anything in the residual plot (below) for the roadway data to indicate nonlinearity?

[Figure: residual plot for the roadway data.]

Evaluating Non-constant Variance

Residual plots can also be very useful in assessing whether the variance remains constant across the range of X. The plots below illustrate a classic pattern where this assumption is not met:

[Figure: scatterplot and residual plot in which the spread of the residuals increases with X.]

In examining the residual plot in Example IV.O, is there any evidence of non-constant variance for the roadway data?

Evaluating Dependence Between Residuals

This can sometimes be tricky, but dependence most often manifests itself with respect to the sequence, or temporal ordering, of the measurements. Where an investigator knows the order in which observations were sampled, he or she ought to plot the residuals versus the sampling sequence to ensure there is no systematic correlation between contiguous observations. Note that information about sampling order may not always be available.

Example IV.P

Consider the data plotted below. How do the plots look in terms of linearity and variance?

[Figure: scatterplot and residual plot for the Example IV.P data.]

Example IV.P (continued)

The plot below shows the relationship between the residuals and the order of measurements for the data plotted on the previous slide. What do you observe?

[Figure: residuals versus measurement order for the Example IV.P data.]
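A minimal sketch of the residual diagnostics described above, applied to the roadway data (re-entering the arrays from Example IV.A). The residuals-versus-order panel assumes the rows of the data file are listed in sampling order, which is not stated in the slides.

```python
import numpy as np
import matplotlib.pyplot as plt

area = np.array([0.19, 0.15, 0.57, 0.40, 0.70, 0.67, 0.63, 0.47, 0.75, 0.60,
                 0.78, 0.81, 0.78, 0.69, 1.30, 1.05, 1.52, 1.06, 1.74, 1.62])
salt = np.array([3.8, 5.9, 14.1, 10.4, 14.6, 14.5, 15.1, 11.9, 15.5, 9.3,
                 15.6, 20.8, 14.6, 16.6, 25.6, 20.9, 29.9, 19.6, 31.3, 32.7])

b1, b0 = np.polyfit(area, salt, deg=1)
fitted = b0 + b1 * area
resid = salt - fitted
mse = np.sum(resid**2) / (len(salt) - 2)
semistud = resid / np.sqrt(mse)                 # semistudentized residuals

fig, axes = plt.subplots(1, 2, figsize=(9, 4))
axes[0].scatter(fitted, semistud)               # check linearity / constant variance
axes[0].axhline(0)
axes[0].set_xlabel("Fitted values")
axes[0].set_ylabel("Semistudentized residual")
axes[1].plot(np.arange(1, len(salt) + 1), semistud, marker="o")  # check order effects
axes[1].axhline(0)
axes[1].set_xlabel("Observation order")
plt.tight_layout()
plt.show()
```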
Outliers

In addition to initial univariate exploratory analyses, residual plots can be useful for identifying outliers. Note that outliers with respect to the distribution of Y or of X in a regression setting can potentially influence the model fit in dramatically different ways. In some cases, outliers may not have any appreciable effect on the analysis.

Simply identifying outliers is no reason to throw them out; such observations must be examined individually to (hopefully) explain why they have relatively extreme values. An outlier may exist because of miscoding, incorrect sampling, or even just sheer randomness.

Example IV.Q

Note the outlier below with respect to the distribution of the Y variable. What effect (if any) does this observation have on the model fit?

[Figure: scatterplot with fitted regression line; one point with an extreme Y value is labeled OUTLIER.]

Example IV.Q (continued)

A plot of the residuals for the data on the previous slide clearly identifies the outlier. Interestingly, the observation appears to be exerting very little influence on the model fit.

[Figure: residual plot with the same observation labeled OUTLIER.]

Example IV.R

In the plot below, the outlier is extreme in particular with respect to the distribution of the X variable. These kinds of outliers can be particularly problematic in terms of their influence on the model fit.

[Figure: scatterplot with fitted regression line; one point with an extreme X value is labeled OUTLIER.]

Normality of Residuals

A conventional univariate analysis (i.e., with summary statistics, boxplots, etc.) can be useful in examining the distribution of the residuals. The so-called normal probability plot (also known as a normal-quantile or Q-Q plot) is also useful for assessing the normality of residuals. A Q-Q plot for a given sample is constructed by plotting the empirical standardized quantiles for the data against the quantiles that would be expected if the data arose from a normal distribution.

Example IV.S

A Q-Q plot for the residuals from the roadway data model is shown on the following slide. Note that if the plotted data are at least approximately normally distributed, then the points should roughly follow a straight line. What is your interpretation of this plot?

Example IV.S (continued)

[Figure: normal Q-Q plot of the residuals from the roadway data model.]

Example IV.T

The following two slides illustrate examples of Q-Q plots for non-normal data. What is the nature of the deviation from normality in each case?

Example IV.T (continued)

[Figure: Q-Q plot for one set of non-normal data.]

Example IV.T (continued)

[Figure: Q-Q plot for a second set of non-normal data.]

Variable Transformations

Problems with nonlinearity, non-constant variance, or non-normality can frequently be fixed with a simple transformation. Logarithmic and power transformations are the most widely applied. The following example illustrates the utility of this approach.
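Before turning to the Concord data, here is a minimal sketch (using simulated data, not the course dataset) of how one might compare residual behavior on the raw, log, and power-transformed scales; the simulated variables are purely illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

# Simulated example (not the Concord data): a positively skewed outcome whose
# spread grows with X, a situation where a log or power transformation can help.
rng = np.random.default_rng(1981)
x = rng.uniform(10, 100, size=80)                          # e.g., a continuous predictor
y = np.exp(0.02 * x + rng.normal(0, 0.4, size=80)) * 50    # skewed, variance grows with x

fits = {
    "raw Y": (y, np.polyfit(x, y, 1)),
    "log(Y)": (np.log(y), np.polyfit(x, np.log(y), 1)),
    "Y^0.3": (y**0.3, np.polyfit(x, y**0.3, 1)),
}

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, (label, (yy, (slope, intercept))) in zip(axes, fits.items()):
    resid = yy - (intercept + slope * x)
    ax.scatter(intercept + slope * x, resid)               # residuals versus fitted values
    ax.axhline(0)
    ax.set_title(f"Residuals, {label} scale")
plt.tight_layout()
plt.show()
```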
Example IV.U

The data for this example come from a study of water use and household income in Concord, NH, during the summer of 1981 (the dataset is posted as "concord.txt" on the course website). The following three slides contain a scatterplot with the fitted regression line, along with two residual plots. What potential problems, if any, do you observe with respect to model assumptions?

Example IV.U (continued)

\hat{Y} = 1201.1 + 47.5 X

[Figure: scatterplot of water use versus household income with the fitted regression line.]

Example IV.U (continued)

[Figure: residual plot for the untransformed fit.]

Example IV.U (continued)

[Figure: a second residual plot for the untransformed fit.]

Example IV.U (continued)

In this case, because of the positive skew of the water use distribution, as well as the increasing variance of the residuals, it would be useful to explore a log transformation or a transformation using a power less than 1. The following six slides illustrate alternative fits for these data, first using a log transformation and second using a transformation with a power of 0.3 for water use. What are your conclusions? How do you interpret the fitted coefficients in each case?

Example IV.U (continued)

\log(\hat{Y}) = 7.016 + 0.022 X

[Figure: scatterplot and fitted line on the log scale.]

Example IV.U (continued)

[Figure: residual plot for the log-scale fit.]

Example IV.U (continued)

[Figure: a second residual plot for the log-scale fit.]

Example IV.U (continued)

\hat{Y}^{0.30} = 8.316 + 0.063 X

[Figure: scatterplot and fitted line on the power-transformed (0.3) scale.]

Example IV.U (continued)

[Figure: residual plot for the power-transformed fit.]

Example IV.U (continued)

[Figure: a second residual plot for the power-transformed fit.]

Additional Notes on Diagnostics

• Assessing the normality of residuals can be a bit tricky under certain circumstances. For example, residuals may actually be normally distributed, but plots (such as boxplots or Q-Q plots) can appear non-normal because of (i) randomness (especially with a small sample size), or (ii) the exclusion of one or more additional key variables. It is usually a good idea to check other assumptions first, such as linearity and non-constant variance, before checking normality.
• Even where the outcome variable isn't exactly normally distributed, substantive conclusions based on a regression model fit may still be fundamentally correct given a relatively large sample size. This is in some sense due to the fact that we are estimating an average, meaning that the Central Limit Theorem applies to the distribution of the fitted mean.
• We have not illustrated it here with an example, but to check the possibility that other variables are additionally associated with Y, we generally begin simply by constructing additional scatterplots.
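As a closing illustration of the normality check mentioned in the first bullet above, here is a minimal sketch of a normal Q-Q plot for the roadway-data residuals (the kind of plot shown in Example IV.S), using scipy's probplot; the data arrays are re-entered from Example IV.A.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

area = np.array([0.19, 0.15, 0.57, 0.40, 0.70, 0.67, 0.63, 0.47, 0.75, 0.60,
                 0.78, 0.81, 0.78, 0.69, 1.30, 1.05, 1.52, 1.06, 1.74, 1.62])
salt = np.array([3.8, 5.9, 14.1, 10.4, 14.6, 14.5, 15.1, 11.9, 15.5, 9.3,
                 15.6, 20.8, 14.6, 16.6, 25.6, 20.9, 29.9, 19.6, 31.3, 32.7])

b1, b0 = np.polyfit(area, salt, deg=1)
resid = salt - (b0 + b1 * area)

# Normal Q-Q plot of the residuals: points falling roughly on a straight line
# suggest the normality assumption is reasonable.
stats.probplot(resid, dist="norm", plot=plt)
plt.show()
```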