Presentation 13
Regression Analysis
Regression




In Chapter 15, we looked at associations between two categorical
variables. We will now focus on relationships between two continuous
variables.
Regression is used to describe a relationship between a quantitative
response variable and one or more quantitative predictor variables. In
this class we discuss simple linear regression, which describes a
linear relationship between a single response Y and a single predictor X.
The approach to this problem is to determine an equation by which the
average value of a particular random variable (Y) can be estimated based
on the values of the other variable (X). This problem is called
regression.
Example: Is there a relationship between X = the concentration of iron
in the diet and Y = the concentration of iron in the blood? If we can determine a
relationship (i.e. an equation) between the two variables, then we
might use this equation
1. To find the mean concentration of iron in the blood for individuals with
a specific concentration of iron in their diet, e.g. for X = 80 ppm.
2. To predict someone's concentration of iron in the blood based on the
concentration of iron in his or her diet.
Some terms and notation

Y = the response variable (dependent variable), which is of primary
interest.

X = the predictor variable (explanatory, or independent variable).

We want to find an equation for E(Y) in terms of X. We will call this
function the regression line of Y on X. This equation is of the form
E(Y) = β0 + β1x
where
 E(Y) is actually E(Y|X=x), the expected value of Y for
individuals in the population with the same particular value of X.
 β0 is the intercept of the straight line (i.e. the value of E(Y) for x = 0).
 β1 is the slope of this line. (When do we have β1 = 0?)
Once we know the slope and the intercept, then, for a given value of X
we can obtain the expected value of Y. However, we cannot know the
values of β0 and β1 (they are population parameters).
Our goal is to estimate the parameters of the regression line using the
observed data (x1,y1), …, (xn,yn).
What is the first thing we need to check?
 We first need to determine if it is appropriate to use the linear
regression model. One way to check this is to plot the observed data
pairs (x1,y1), …, (xn,yn).
 A plot such as this is called a scatterplot, and it can be obtained in
Minitab by clicking on Graph/Scatterplot.
The first plot indicates that there is a linear relationship between
these two variables and it is reasonable to proceed with simple
(linear) regression analysis. On the other hand, the data in the
second plot demonstrate a nonlinear relationship between the X
and the Y variables.
[Two scatterplots of Y vs X: the first shows a roughly linear pattern, the second a clearly curved pattern.]
Two assumptions about deviations from the
regression line

Furthermore, in order to make
statistical inferences about the
population, we need to make two
assumptions about how the y values vary from the population
regression line:
1. The general size of the deviation of
the y values from the line is the
same for all values of x (constant
variance assumption).
2. For any specific value of x, the
distribution of y values is normal.
Simple Regression Model for a Population

The model we are going to use is
y = Mean + Deviation
1. Mean: in the population this is the line E(Y) = β0 + β1x if the
relationship is linear.
2. Individual's deviation = y − mean, which is what is left
unexplained after accounting for the mean y value at that
individual's x value.
Putting all the assumptions together (linear relation between X and
Y, constant variance and normality) we have that:
yi = β0 + β1xi + εi = E(yi) + εi,
where the εi are assumed to follow a normal distribution with mean 0
and standard deviation σ (i.e. the same s.d. for all i's).
Regression Line in the Sample

Once we decide that the relationship between X and Y is linear, using
our data set we estimate the parameters of the regression
equation, β0 and β1. But how do we estimate the regression line?
Which line is "optimal"?
Method of Least Squares

We will use the method of least squares to obtain the estimates of
β0 and β1 (i.e. to specify the line).
The idea behind this method is to choose the line that comes as
close as possible to all the data points simultaneously.
The estimates of these parameters are denoted by b0 and b1
respectively, and the estimated regression line is
ŷ = b0 + b1x
where ŷ (y-hat) is the estimated value of y for X = x,
b0 = the sample intercept of the linear regression line, and
b1 = the sample slope of the linear regression line.
Method of LS - Deviations from the
Regression Line

The distance between an actual data point yi and the estimated
regression line is called a residual (or error). Thus, for an
observation yi in the sample, the residual is
ei = yi − ŷi = yi − (b0 + b1xi),
where xi is the value of the explanatory variable for the observation.

Therefore, we have a residual for each data point, and they are
denoted e1, e2, …, en.
The method of least squares finds the values of b0 and b1
minimizing the sum of the squared residuals,
SSE = Σ ei² = Σ [yi − (b0 + b1xi)]²,
where the sums run over i = 1, …, n.
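The least-squares estimates can be computed directly from the standard closed-form solution b1 = Sxy/Sxx and b0 = ȳ − b1·x̄. The sketch below illustrates this on a small made-up data set (the x and y values are hypothetical, not from these notes):

```python
# Least-squares estimates of slope and intercept, computed from scratch.
def least_squares(x, y):
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx          # sample slope
    b0 = ybar - b1 * xbar   # sample intercept
    return b0, b1

# Hypothetical data, for illustration only
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.1, 9.8]
b0, b1 = least_squares(x, y)

# SSE: the quantity the least-squares line minimizes
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
print(b0, b1, sse)
```

Any other choice of intercept and slope on this data set would give a larger sum of squared residuals.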
Example

The regression line on the previous plot is the dotted line.
Using the regression equation we can predict the average
response value (ŷ) when the predictor variable
assumes some value x. For example, in the dad's - son's
height problem, using the data, the estimated intercept and slope
turned out to be b0 = 3.41 and b1 = .97. We can estimate the
average height of a man whose dad is 70 in. tall by
ŷ = b0 + b1x = 3.41 + (.97)(70) = 71.31 in.
Notes…

For x = 0, ŷ = b0.
The estimated slope, b1, tells us how much of an increase
(or decrease, if negative) there is in ŷ when the x
variable increases by one unit.
You CANNOT use a regression line to predict the
response for observations that fall outside your predictor
range.
Standard deviation for regression

We can estimate the population standard deviation of y, σ, with
s = √( SSE / (n − 2) ) = √( Σ (yi − ŷi)² / (n − 2) ).
This is called the standard deviation for regression, and it
roughly measures the average deviation of the y values from the mean
(the regression line).
This is a useful statistic for describing individual variation in a
regression problem. A small s indicates that individual data points fall
close to the line; thus, it provides information about how accurately
the regression equation might predict y values.
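A quick sketch of the formula above. The SSE and sample size here are hypothetical, chosen only so the answer comes out round:

```python
import math

def regression_sd(sse, n):
    # s = sqrt(SSE / (n - 2)): standard deviation for regression
    return math.sqrt(sse / (n - 2))

# Hypothetical values: sqrt(1250 / 50) = 5
s = regression_sd(1250, 52)
print(s)
```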
Example - Height and Weight
Data:
x = height (in inches)
y = weight (in pounds)
of n = 43 male students.
Standard deviation
s = 24.00 (pounds):
This roughly measures, for any
given height, the general size
of the deviations of
individual weights from the
mean weight for that height.
Correlation


Correlation, r, between two quantitative variables is a number that
indicates the strength and the direction of a straight-line relationship.
Some properties of r are:
1. It is always between -1 and 1.
2. The magnitude of the correlation indicates the strength of the
relationship. A correlation of either -1 or +1 indicates that there is a
perfect linear relationship. A correlation of zero means no linear
relationship.
3. The sign of the correlation indicates the direction of the relationship.
A positive correlation indicates that when one variable increases the
other is likely to increase as well, and a negative correlation
indicates that when one variable increases the other is likely to
decrease.
4. Thus, the sign of r is the same as the sign of b1!
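These properties can be checked numerically. The sketch below computes r from scratch on a small hypothetical data set (the values are made up for illustration) and confirms that r lies in [-1, 1] and has the same sign as the least-squares slope b1:

```python
import math

# Hypothetical data, for illustration only
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.1, 9.8]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
syy = sum((yi - ybar) ** 2 for yi in y)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

r = sxy / math.sqrt(sxx * syy)   # correlation
b1 = sxy / sxx                   # least-squares slope

print(round(r, 4))
```

Here r is close to +1, matching the nearly perfect positive linear pattern in the data.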
Correlation Examples
[Four scatterplots of Y vs X:
R = -.056, No Relationship
R = .486, Weak Positive Relationship
R = -.921, Strong Negative Relationship
R = 1, Perfect Positive Relationship]
Proportion of Variation Explained by x

Squared correlation, r², is between 0 and 1 and indicates the
proportion of variation in the response explained by x:
r² = (SSTO − SSE) / SSTO
SSTO = sum of squares total = sum of squared differences between
the observed y values and ȳ (the sample mean of the y's).
SSE = sum of squared errors (residuals) = sum of squared
differences between the observed y values and the predicted values based on
the least squares line.
Example
Is there a relationship between X = the concentration of iron in the diet and Y = iron
in the blood?

Iron Diet   Iron Blood
99.02       31.67
73.29       18.71
95.73       23.71
66.49       23.23
59.14       20.79
98.91       25.91
76.40       22.45
…           …

Fitted line: Y = 5.95 + 0.194*X
[Scatterplot of IronBlood vs IronDiet with the fitted line.]
Regression Summary:
b0 = 5.95
b1 = 0.194   r = .839   r-sq = .703
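With the fitted line Y = 5.95 + 0.194*X from the summary above, we can answer the question posed at the start of the chapter: the estimated mean blood iron concentration for a diet concentration of X = 80.

```python
def blood_iron(diet_iron):
    # Fitted regression line from the summary above
    return 5.95 + 0.194 * diet_iron

# Estimated mean blood iron for X = 80
print(round(blood_iron(80), 2))
```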
Explanation of Terms
b1 = the sample slope. For every unit increase in X we expect Y to
increase by b1.
Example: For every increase of 1mg of iron in the diet we expect
blood iron to increase by 0.194 mg.
r = the correlation, which varies between -1 and 1. A correlation of -1 means
a perfect negative linear relationship, a correlation of +1 means a perfect
positive linear relationship, and a correlation of zero means no linear relationship.
Example: Our correlation of 0.839 indicates a strong positive
relationship.
r-sq = the percent of variation in the response variable that is
explained by the predictor.
Example: Our r-sq of .703 means that 70.3% of the individual
variation in blood iron concentration can be explained by iron in the
diet.
Example: Driver Age and Maximum
Legibility Distance of Highway Signs
A study to examine the relationship between age and the maximum
distance at which drivers can read a newly designed sign.
Average Distance = 577 − 3.01 × Age
Example: Age and Distance Cont.
SSE = 69334
SSTO = 193667
s = √( SSE / (n − 2) ) = √( 69334 / 28 ) = 49.76
r² = (SSTO − SSE) / SSTO = (193667 − 69334) / 193667 = .642
s = 49.76 and R-sq = 64.2%. Thus, the average distance from the regression
line is about 50 feet, and 64.2% of the variation in sign reading distances
is explained by age.
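The two computations above can be verified directly (n = 30 is inferred from the df of n − 2 = 28 used in the notes):

```python
import math

sse, ssto, n = 69334, 193667, 30  # n inferred from df = n - 2 = 28
s = math.sqrt(sse / (n - 2))       # standard deviation for regression
r_sq = (ssto - sse) / ssto         # proportion of variation explained

print(round(s, 2), round(r_sq, 3))
```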
Inference About Linear Regression
Relationship

Inference about a linear relationship can be evaluated through
inference about the slope, β1.
We will see how to create a CI for β1 and how to test whether or
not β1 is 0.
As in any other type of CI or hypothesis test, we will need a sample
estimate of the parameter of interest, β1, and the standard error of
this estimate.
These quantities are b1 and se(b1), and you do not need to know
their formulas or how to calculate them from the data. You will
need to know how to get them from the Minitab output and how
to use them.
The results of the CI or hypothesis test analysis are meaningful
only if the assumptions of the regression model are valid. We will
see how we can check them using different plots towards the end
of this chapter.
CI for Slope

A confidence interval for a population slope β1 is
b1 ± t* × s.e.(b1)
where the multiplier t* is the value in a t-distribution with
degrees of freedom df = n − 2 such that the area between −t*
and t* equals the desired confidence level. (Found from Table A.2.)

Interpretation: This CI gives a range for the expected increase in
y for a one-unit increase in x.
Testing For Significance

How do we test if there is a significant relationship between 2
quantitative variables?
Perform a test of the slope!
Ho: β1 = 0 (No relationship)
Ha: β1 ≠ 0 (There is a relationship)
Remember the test statistic formula:
t = (Sample Est. − Null Value) / Null Std. Error = (b1 − 0) / se(b1)
The test statistic has a t distribution with n − 2 df if the null hypothesis is
true. Thus, p-value = 2P(T(df = n−2) > |t|).
If the p-value is less than the significance level (usually .05), then reject the null
hypothesis. Conclude there IS a linear relationship between the two
variables, and say whether it is positive or negative depending on the
sign of b1.
Example: Age and Distance (cont)
95% CI for the Slope:
b1 ± t* × s.e.(b1) = −3.01 ± 2.05 × 0.4243
= −3.01 ± 0.87 = −3.88 to −2.14 feet
With 95% confidence, we can estimate that in the population of drivers represented
by this sample, the mean sign-reading distance decreases somewhere between
2.14 and 3.88 feet for each one-year increase in age.
If we consider the test H0: β1 = 0 vs Ha: β1 ≠ 0, we have
t = (b1 − 0) / s.e.(b1) = (−3.0068 − 0) / 0.4243 = −7.09, and p-value ≈ 0.000
The p-value suggests that the probability that the observed slope could be as far from
0 or farther, if there is no linear relationship in the population, is virtually 0. The
relationship in the sample is significant and represents a real relationship in the
population.
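The CI and test statistic above follow directly from b1 and se(b1). The sketch below reproduces the numbers in this example (b1 = −3.0068, se(b1) = 0.4243, t* = 2.05 for df = 28):

```python
b1, se_b1, t_star = -3.0068, 0.4243, 2.05  # values from the example

lo = b1 - t_star * se_b1   # lower CI endpoint
hi = b1 + t_star * se_b1   # upper CI endpoint
t_stat = (b1 - 0) / se_b1  # test statistic for H0: beta1 = 0

print(round(lo, 2), round(hi, 2), round(t_stat, 2))
```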
Prediction and Confidence Intervals

A 95% prediction interval estimates the value of y for an individual
with a particular value of x. This interval can be interpreted in two
equivalent ways:
1. It estimates the (central) 95% of the values of y for members of the
population with the specified value of x.
2. With probability .95, the response of a randomly selected individual
from the population with the specified value of x falls into the 95%
prediction interval.
 A 95% confidence interval for the mean estimates the mean value
of the response variable y, E(Y), for (all) individuals with a particular value
of x.
You do not need to know the formulas for these intervals, just how to
get them from the Minitab output and how to interpret them.

For a given x, which interval is wider, the PI or the CI?
Example: Age and Distance (cont)
Probability is 0.95 that a randomly selected …
21-year-old will read the sign at somewhere between 407 and 620 feet.
30-year-old will read the sign at somewhere between 381 and 592 feet.
45-year-old will read the sign at somewhere between 338 and 545 feet.
With 95% confidence, we can estimate that the mean reading distance of ...
21-year-old is somewhere between 482 and 546 feet.
30-year-old is somewhere between 460 and 513 feet.
45-year-old is somewhere between 422 and 461 feet.
How to Check Conditions for
Simple Linear Regression
1. The relationship must be linear. If a scatterplot of X and
Y shows an obviously curved relationship, then this assumption is
violated.
2. There should not be any extreme outliers. Check the scatterplot
of X and Y for extreme outlying values.
3. Constant variance: the standard deviation of the values of y from
the fitted line is the same regardless of the x value. Check this
with a scatterplot of residuals versus x. What should it look like?
4. The residuals are normally distributed. Check this with a
histogram of the residuals. What should it look like? - This
condition can be relaxed if the sample size is large.
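Conditions 3 and 4 are both checked with the residuals. A minimal sketch of computing them (the data and fitted line here are hypothetical); in practice you would plot these values against x for condition 3 and as a histogram for condition 4:

```python
# Hypothetical data and fitted line, for illustration only
x = [1, 2, 3, 4]
y = [3.1, 4.9, 7.2, 8.8]
b0, b1 = 1.0, 2.0  # assumed fitted intercept and slope

# Residual: observed y minus fitted value
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
print([round(e, 1) for e in residuals])
```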
Detailed Example:
Suppose we are interested in the relationship between high
school GPA and the amount of sleep a student gets. For 100
students we record their GPA and average hours of sleep.
A. Fit a simple linear regression line to the data.
B. Check the conditions for a hypothesis test and CI for the slope.
C. Test to see if there is a significant relationship between the 2
variables.
D. Construct and interpret a 95% CI for the slope.
E. Suppose a student gets 10 hours of sleep. What would their
expected GPA be? Is this a good estimate? Explain in terms of r-sq.
F. Suppose a student gets 18 hours of sleep. Can we predict the
GPA of this student using the regression equation?
A. MINITAB: Fitted Line Plot
GPA = 2.50472 + 0.0615188 Sleep (Hours)
S = 0.293792   R-Sq = 8.0%   R-Sq(adj) = 7.0%
From the output: b0 = 2.50, b1 = 0.0615, S = 0.29379, R = 0.282, R-Sq = 8.0%.
[Fitted line plot of GPA versus Sleep (Hours).]
B. Check Conditions
1. From the scatterplot, the relationship seems reasonably linear.
2. From the scatterplot, it doesn't seem like there are any extreme
outliers.
3. The variance seems constant along X.
4. The residuals are approximately normal.
[Histogram of the residuals (RESI1), roughly bell-shaped.]
C. Hypothesis Test for Slope:
Regression Analysis: GPA versus Sleep (Hours)
The regression equation is
GPA = 2.50 + 0.0615 Sleep (Hours)

Predictor   Coef      SE Coef   T       P
Constant    2.5047    0.1708    14.67   0.000
Sleep       0.06152   0.02114   2.91    0.004

S = 0.2938   R-Sq = 8.0%   R-Sq(adj) = 7.0%

Based on the p-value of .004 we can REJECT the null
hypothesis. Conclude there is a significant positive
relationship between GPA and sleep.
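The T column in the output is just Coef / SE Coef, as in the test-statistic formula. Checking this for the Sleep row:

```python
coef, se_coef = 0.06152, 0.02114  # Sleep row of the Minitab output
t = coef / se_coef                # should match the T column
print(round(t, 2))
```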
D. 95% CI for Slope:
b1 ± t* SE(b1) = .0615 ± 1.99 * .02114
CI = (.0194,.104)
We are 95% confident that the true population
slope is between .0194 and .104. We are 95%
confident that for each additional hour of sleep
the expected GPA will increase between .0194
and .104 units.
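The CI arithmetic above can be checked directly:

```python
b1, se_b1, t_star = 0.0615, 0.02114, 1.99  # values from the output; t* as used in the notes

margin = t_star * se_b1
lo, hi = b1 - margin, b1 + margin
print(round(lo, 4), round(hi, 3))
```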
E. The fitted equation is: Y = 2.50 + 0.0615*X
The predicted (expected) GPA for 10 hours of sleep is:
2.5 + 0.0615*10 = 3.115
For someone who gets 10 hours of sleep, we expect a
GPA of 3.115.
This will NOT be a very good prediction because the r-squared value
is only .08. Sleeping hours explain only 8% of the variation in
GPA. Most of the variation in GPA is unaccounted for.
F. We cannot use the regression equation to predict the GPA of a
student who sleeps 18 hours per day because 18 hours is not in the
range of the values of X in the data set.
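Parts E and F together suggest a small guard in any prediction helper: predict only inside the observed range of X. A sketch, where the range 3.5 to 11.5 hours is an assumption read off the fitted line plot (use the actual min and max of the observed data):

```python
def predict_gpa(hours, x_min=3.5, x_max=11.5):
    # x_min/x_max: assumed observed range of Sleep (Hours), read off the plot
    if not (x_min <= hours <= x_max):
        raise ValueError("outside the observed X range: do not extrapolate")
    return 2.50 + 0.0615 * hours  # fitted equation from the notes

print(predict_gpa(10))   # inside the range: OK
# predict_gpa(18) would raise ValueError (part F)
```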
Exercise 14.47: Height and Foot Length.
a. There is a linear relationship with a
positive slope, and there is an
obvious outlier in the data.
b. With the outlier omitted from the data set, the Minitab regression output is:
The regression equation is
height = ____ + ____ foot

Predictor   Coef     SE Coef   T      P
Constant    30.150   6.541     4.61   0.000
foot        1.4952   0.2351    6.36   0.000

S = 2.029   R-Sq = 57.4%   R-Sq(adj) = 56.0%

What is the regression equation?
What is r?
What is se(b1)?
What is the test statistic for testing the hypothesis that the slope is zero?
Verify the value.
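As a sketch of the "verify the value" prompt: the T statistic for foot should equal Coef / SE Coef, and since R-Sq = 57.4% with a positive slope, r is the positive square root of 0.574.

```python
import math

coef, se_coef = 1.4952, 0.2351  # foot row of the output
t = coef / se_coef              # should match the T column
r = math.sqrt(0.574)            # positive root, since the slope is positive

print(round(t, 2), round(r, 2))
```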
Exercise 14.34
d. The regression line doesn't provide particularly accurate
predictions of height based on foot length. Notice the standard
deviation from the regression line is given in the output as s =
2.029 inches. This is roughly the average difference between
actual heights and predicted heights determined from the line.
e. The residual plot shows that a linear equation is probably
appropriate, there are no outliers, and it's reasonable to make the
constant variance assumption (although it may be that there is
less variation among residuals for small foot lengths than for large
foot lengths).