Linear Regression Models for Data

Correlation says “There seems to be a linear association between these two variables,” but it doesn’t tell us what that association is. We can say more about the linear relationship between two quantitative variables with a linear model. The linear model is just an equation that best describes the linear relationship between the two variables involved.

What Do Models Do?

A model simplifies reality to help us understand underlying patterns and relationships. Models require numbers, called parameters, to specify them. Choosing values for the parameters helps us mold the model to fit a particular situation.

A scatterplot of one variable against another may suggest a line that we can use to predict future values.

The model that we use is of the form ŷ = mx + b. The ŷ’s are called predicted values and are predicted from the x’s, the predictors. The value m is the slope:
– larger-magnitude m’s indicate steeper slopes,
– negative slopes show a negative association,
– and a slope of zero gives a horizontal line.

How Big Can Predicted Values Get?

Since r cannot be bigger than 1 in absolute value, each predicted y tends to be closer to its mean (in standard deviations) than its corresponding x was. This property of the linear model is called regression to the mean; the line is called the regression line.

Residuals

No model is perfect, so there will be differences between observed and predicted values. The difference between an observed value and its associated predicted value is called the residual. The residual at each data value tells us how far off the model’s prediction is at that point: for a given x, it is the difference between the observed y-value and the y-value predicted by the regression line.
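A minimal numeric sketch of fitting this model, using made-up data and the standard formulas that relate the slope to the correlation (m = r·s_y/s_x and b = ȳ − m·x̄, which give the regression line discussed below):

```python
# Fit the linear model y-hat = m*x + b.
# The data below are made up for illustration.
import statistics

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]

x_bar, y_bar = statistics.mean(x), statistics.mean(y)
s_x, s_y = statistics.stdev(x), statistics.stdev(y)

# Correlation r: average product of the standardized values.
n = len(x)
r = sum(((xi - x_bar) / s_x) * ((yi - y_bar) / s_y)
        for xi, yi in zip(x, y)) / (n - 1)

# Slope and intercept of the regression line:
#   m = r * (s_y / s_x),  b = y_bar - m * x_bar
m = r * (s_y / s_x)
b = y_bar - m * x_bar
print(f"m = {m:.2f}, b = {b:.2f}")  # prints "m = 1.96, b = 0.14"

# Regression to the mean: because |r| <= 1, an x that is one standard
# deviation above x_bar predicts a y only r standard deviations above y_bar.
```

Note how the slope formula ties the regression line directly to the correlation r: a weaker correlation pulls every prediction closer to the mean of y.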
The linear model assumes that the relationship between the two variables is a perfect straight line. The residuals are the part of the data that hasn’t been modeled. So,

Data = Model + Residual, or (equivalently) Residual = Data – Model.

When we ask how well the model fits, we are really asking how much of the data is still in the residuals.

Residuals also help us to see whether the model makes sense. We determine this by plotting the residuals in the hope of finding… nothing.

R² – The Variation Accounted For

The variation in the residuals is the key to assessing how well the model fits. If the correlation were –1 or +1, the model would predict perfectly: the residuals would all be zero and have no variation. We can’t do any better than that! At the other extreme, if the correlation were 0, the model would simply predict the mean y-value for each x-value. Those residuals would have the same variation as the original data. But neither extreme is likely. So…

The squared correlation, r², gives the fraction of the data’s variance accounted for by the model, and 1 – r² is the fraction of the original variance left in the residuals. All regression analyses include this statistic, although by tradition it is written R² (pronounced “R-squared”). An R² of 0 means that none of the variance in the data is in the model; all of it is still in the residuals.

How Big Should R² Be?

R² is always between 0% and 100%. What makes a “good” R² value depends on the kind of data you are analyzing and on what you want to do with it. The standard deviation of the residuals can give us more information about the usefulness of the regression by telling us how much scatter there is around the line.

Reporting R²

Along with the slope and intercept for a regression, you should always report R² so that readers can judge for themselves how successful the regression is at fitting the data.
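Residuals and R² follow directly from the definitions above. A short sketch, reusing the made-up data from the earlier example together with the least squares coefficients for those points (m = 1.96, b = 0.14):

```python
# Residual = Data - Model, and R^2 = 1 - (residual variation / total variation).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]
m, b = 1.96, 0.14  # least squares slope and intercept for these data

predicted = [m * xi + b for xi in x]
residuals = [yi - yhat for yi, yhat in zip(y, predicted)]  # Data - Model

y_bar = sum(y) / len(y)
ss_res = sum(e ** 2 for e in residuals)        # variation left in the residuals
ss_tot = sum((yi - y_bar) ** 2 for yi in y)    # total variation of y
r_squared = 1 - ss_res / ss_tot                # equals r^2 for a least squares fit
print(f"R^2 = {r_squared:.3f}")  # prints "R^2 = 0.998"
```

Here about 99.8% of the variance in y is accounted for by the model, so almost nothing is left in the residuals; real data rarely fit this well.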
Statistics is about variation, and R² measures the success of the regression model in terms of the fraction of the variation of y accounted for by the regression.

Least Squares

Our regression line has a special property: the variation of its residuals is the smallest it can be for any straight-line model for these data. No other line has this property. Speaking mathematically, we say that this line minimizes the sum of the squared residuals, which is why it is often called the least squares regression line.

Assumptions and Conditions

Linearity Assumption:
– The linear model assumes that the relationship between the variables is linear.
– A scatterplot will let you check that the assumption is reasonable.
– The straight enough condition is satisfied if the scatterplot looks reasonably straight.

It’s a good idea to check linearity again after computing the regression, when we can examine the residuals. You should also check for outliers, which could change the regression. If the data seem to clump or cluster in the scatterplot, that could be a sign of trouble worth looking into further.

If the scatterplot is not straight enough, stop here. You can’t use a linear model for any two variables, even if they are related; they must have a linear association or the model won’t mean a thing. Some nonlinear relationships can be saved by re-expressing the data to make the scatterplot more linear.

Reality Check: Is the Regression Reasonable?

Statistics don’t come out of nowhere; they are based on data. The results of a statistical analysis should reinforce your common sense, not fly in its face. If the results are surprising, then either you’ve learned something new about the world or your analysis is wrong. When you perform a regression, think about the coefficients and ask yourself whether they make sense.

What Can Go Wrong?
Don’t fit a straight line to a nonlinear relationship. Beware extraordinary points (y-values that stand off from the linear pattern, or extreme x-values). Don’t extrapolate beyond the data: the linear model may no longer hold outside the range of the data. And don’t infer that x causes y just because there is a good linear model for their relationship; association is not causation.

Key Concepts

If two quantitative variables appear to be linearly related, we can find the regression line that best fits the data. Once we find the regression equation, we can make predictions for y based on x-values in the range of the data. R² gives the fraction of the variability of y accounted for by the least squares linear regression on x. Examine the residuals to check whether a linear model is appropriate.
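The key concepts above can be tied together in one short sketch (same made-up data and least squares coefficients as in the earlier examples): verify that no nearby line has a smaller sum of squared residuals, and predict only for x-values inside the range of the data.

```python
# Sketch: the least squares property, and refusing to extrapolate.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]
m, b = 1.96, 0.14  # least squares coefficients for these data

def sse(slope, intercept):
    """Sum of squared residuals for the line y-hat = slope*x + intercept."""
    return sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))

# Least squares property: perturbing the fitted line always increases
# the sum of squared residuals.
best = sse(m, b)
assert all(sse(m + dm, b + db) > best
           for dm in (-0.1, 0.1) for db in (-0.1, 0.1))

def predict(x_new):
    """Predict y from x, refusing to extrapolate beyond the observed x-range."""
    if not min(x) <= x_new <= max(x):
        raise ValueError(f"x = {x_new} is outside [{min(x)}, {max(x)}]")
    return m * x_new + b

print(predict(2.5))   # within the data range: fine
# predict(10.0) would raise ValueError -- the model may not hold out there
```

The range check is a deliberately blunt guard; in practice you would also examine a residual plot before trusting any prediction from the line.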