Chapter 3: LSRL

Bivariate data
• x-variable: the independent or explanatory variable
• y-variable: the dependent or response variable
• Use x to predict y

ŷ = a + bx
• ŷ (y-hat) is the predicted y. Be sure to put the hat on the y.
• b is the slope: the amount by which y increases when x increases by 1 unit.
• a is the y-intercept: the height of the line when x = 0. In some situations, the y-intercept has no meaning.

Least Squares Regression Line (LSRL)
• The line that gives the best fit to the data set
• The line that minimizes the sum of the squares of the deviations from the line

Example: the points (0,0), (3,10), and (6,2) with the line ŷ = .5x + 4
(0,0): ŷ = .5(0) + 4 = 4, so y - ŷ = 0 - 4 = -4
(3,10): ŷ = .5(3) + 4 = 5.5, so y - ŷ = 10 - 5.5 = 4.5
(6,2): ŷ = .5(6) + 4 = 7, so y - ŷ = 2 - 7 = -5
Sum of the squares = 61.25
What is the sum of the deviations from the line? Will it always be zero?

Use a calculator to find the line of best fit for the same three points: ŷ = (1/3)x + 3
Find y - ŷ: the deviations are -3 at (0,0), 6 at (3,10), and -3 at (6,2).
Sum of the squares = 54
The line that minimizes the sum of the squares of the deviations from the line is the LSRL.

Interpretations
Slope: For each unit increase in x, there is an approximate increase/decrease of b in y.
Correlation coefficient: There is a [direction], [strength], [type] association between x and y.

The ages (in months) and heights (in inches) of seven children are given.
x (age):    16  24  42  60  75  102  120
y (height): 24  30  35  40  48   56   60
Find the LSRL. Interpret the slope and correlation coefficient in the context of the problem.
Correlation coefficient: There is a strong, positive, linear association between the age and height of children.
Slope: For an increase in age of one month, there is an approximate increase of .34 inches in the heights of children.

Using the same data:
Predict the height of a child who is 4.5 years old (54 months).
Predict the height of someone who is 20 years old (240 months).
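The two prediction questions above can be checked without a calculator. The following is a minimal sketch in plain Python, assuming the standard least squares formulas for slope and intercept; the function name `lsrl` and the variable names are illustrative choices, not part of the original slides.

```python
# Ages (months) and heights (inches) of the seven children in the example.
ages = [16, 24, 42, 60, 75, 102, 120]
heights = [24, 30, 35, 40, 48, 56, 60]


def lsrl(x, y):
    """Return (a, b) for the least squares line y-hat = a + b*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Slope: sum of cross-deviations divided by sum of squared x-deviations.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sxy / sxx
    # Intercept: the line passes through (x-bar, y-bar).
    a = y_bar - b * x_bar
    return a, b


a, b = lsrl(ages, heights)

# 4.5 years = 54 months (inside the data range);
# 20 years = 240 months (far outside it, so this is extrapolation).
height_54 = a + b * 54
height_240 = a + b * 240

print(f"y-hat = {a:.2f} + {b:.3f}x")
print(f"predicted height at 54 months:  {height_54:.1f} in")
print(f"predicted height at 240 months: {height_240:.1f} in")
```

The 54-month prediction lands among the observed heights, while the 240-month prediction comes out at more than 100 inches, which is not a plausible adult height.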
Extrapolation
• The LSRL should not be used to predict y for values of x outside the data set.
• It is unknown whether the pattern observed in the scatterplot continues outside this range.

The ages (in months) and heights (in inches) of seven children are given.
x (age):    16  24  42  60  75  102  120
y (height): 24  30  35  40  48   56   60
Calculate x̄ & ȳ. Plot the point (x̄, ȳ) on the LSRL. Will this point always be on the LSRL?

The correlation coefficient and the LSRL are both non-resistant measures.

Formulas (on chart)
$\hat{y} = b_0 + b_1 x$
$b_1 = \dfrac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$
$b_0 = \bar{y} - b_1 \bar{x}$
$b_1 = r \dfrac{s_y}{s_x}$

The following statistics are found for the variables posted speed limit and the average number of accidents.
$\bar{x} = 40,\ s_x = 11.6,\ \bar{y} = 18,\ s_y = 8.4,\ r = .9981$
Find the LSRL & predict the number of accidents for a posted speed limit of 50 mph.
$b_1 = .9981 \cdot \dfrac{8.4}{11.6} \approx .723$ and $b_0 = 18 - .723(40) = -10.92$, so
$\hat{y} = .723x - 10.92$
$\hat{y} = .723(50) - 10.92 = 25.23$ accidents

Correlation

Suppose we found the age and weight of a sample of 10 adults. Create a scatterplot of the data below. Is there any relationship between the age and weight of these adults?
Age:  24   30   41   28   50   46   49   35   20   39
Wt:  256  124  320  185  158  129  103  196  110  130

Suppose we found the height and weight of a sample of 10 adults. Create a scatterplot of the data below. Is there any relationship between the height and weight of these adults? Is it positive or negative? Weak or strong?
Ht:   74   65   77   72   68   60   62   73   61   64
Wt:  256  124  320  185  158  129  103  196  110  130

The closer the points in a scatterplot are to a straight line, the stronger the relationship. The farther away from a straight line, the weaker the relationship.

Identify as having a positive association, a negative association, or no association.
1. Heights of mothers & heights of their adult daughters (+)
2. Age of a car in years and its current value (-)
3. Weight of a person and calories consumed (+)
4. Height of a person and the person's birth month (NO)
5. Number of hours spent in safety training and the number of accidents that occur (-)

Correlation Coefficient (r)
• A quantitative assessment of the strength & direction of the linear relationship between bivariate, quantitative data
• Pearson's sample correlation is used most
• parameter: ρ (rho)
• statistic: r
$r = \dfrac{1}{n-1} \sum \left(\dfrac{x_i - \bar{x}}{s_x}\right)\left(\dfrac{y_i - \bar{y}}{s_y}\right)$

Speed limit (mph):             55  50  45  40  30  20
Avg. # of accidents (weekly):  28  25  21  17  11   6
Calculate r. Interpret r in context.
There is a strong, positive, linear relationship between speed limit and average number of accidents per week.

Properties of r (correlation coefficient)
• legitimate values of r are [-1, 1]
[Number line from -1 to 1 with marks at -1, -.8, -.5, 0, .5, .8, 1: strong correlation between .8 and 1 in absolute value, moderate between .5 and .8, weak between 0 and .5, no correlation at 0]
• value of r does not depend on the unit of measurement for either variable
x (in mm):  12  15  21  32  26  19  24
y:           4   7  10  14   9   8  12
Find r. Change x to cm & find r. The correlations are the same.
• value of r does not depend on which of the two variables is labeled x
x:  12  15  21  32  26  19  24
y:   4   7  10  14   9   8  12
Switch x & y & find r. The correlations are the same.
• value of r is non-resistant
x:  12  15  21  32  26  19  24
y:   4   7  10  14   9   8  22
Find r. Outliers affect the correlation coefficient.
• value of r is a measure of the extent to which x & y are linearly related
A value of r close to zero does not rule out a strong relationship between x and y; it only rules out a strong linear one.
[Figure: scatterplot with r = 0 but a definite relationship]

Minister data: r = .9999 (Data on Elmo)
So does an increase in ministers cause an increase in consumption of rum?
Correlation does not imply causation.

Residuals, Residual Plots, & Influential points

Residuals (error)
• The vertical deviation between the observations & the LSRL
• the sum of the residuals is always zero
• error = observed - expected
residual = y - ŷ

Residual plot
• A scatterplot of the (x, residual) pairs.
• Residuals can be graphed against other statistics besides x
• Purpose is to tell if a linear association exists between the x & y variables
• If no pattern exists between the points in the residual plot, then the association is linear.
[Figure: two residual plots against x, one labeled Linear, one labeled Not linear]

One measure of the success of knee surgery is post-surgical range of motion for the knee joint following a knee dislocation.
Age   Range of Motion
35    154
24    142
40    137
31    133
28    122
25    126
26    135
16    135
14    108
20    120
21    127
30    122
Is there a linear relationship between age & range of motion? Sketch a residual plot.
[Figure: residual plot, residuals against age]
Since there is no pattern in the residual plot, there is a linear relationship between age and range of motion.

Using the same data, plot the residuals against the y-hats. How does this residual plot compare to the previous one?
[Figure: residual plots, residuals against x and against ŷ]
Residual plots are the same no matter if plotted against x or y-hat.

Coefficient of determination (r²)
• gives the proportion of variation in y that can be attributed to an approximate linear relationship between x & y
• remains the same no matter which variable is labeled x

Let's examine r² with the knee surgery data (age in L1, range of motion in L2).
Suppose you were going to predict a future y but you didn't know the x-value. Your best guess would be the overall mean of the existing y's.
Now, find the sum of the squared residuals (errors) using the mean of y. L3 = (L2 - 130.0833)^2. Do 1VARSTAT on L3 to find the sum.
SSEy = 1564.917

Now suppose you were going to predict a future y but you DO know the x-value. Your best guess would be the point on the LSRL for that x-value (y-hat).
Find the LSRL & store in Y1. In L3 = Y1(L1) to calculate the predicted y for each x-value.
Now, find the sum of the squared residuals (errors) using the LSRL. In L4 = (L2 - L3)^2. Do 1VARSTAT on L4 to find the sum.
SSEŷ = 1085.735

Using the mean of y: SSEy = 1564.917
Using the LSRL: SSEŷ = 1085.735
By what percent did the sum of the squared error go down when you went from just an "overall mean" model to the "regression on x" model?
$\dfrac{SSE_y - SSE_{\hat{y}}}{SSE_y} = \dfrac{1564.91667 - 1085.735}{1564.91667} = .3062$
This is r²: the amount of the variation in the y-values that is explained by the x-values.

How well does age predict the range of motion after knee surgery?
Approximately 30.6% of the variation in range of motion after knee surgery can be explained by the linear regression of age and range of motion.

Interpretation of r²
Approximately r²% of the variation in y can be explained by the LSRL of x & y.

Computer-generated regression analysis of knee surgery data:
Predictor   Coef     Stdev    T      P
Constant    107.58   11.12    9.67   0.000
Age         0.8710   0.4146   2.10   0.062
s = 10.42   R-sq = 30.6%   R-sq(adj) = 23.7%
What is the equation of the LSRL? Find the slope & y-intercept.
What are the correlation coefficient and the coefficient of determination?
NEVER use adjusted r²! Be sure to convert r² to decimal before taking the square root!
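The calculator steps for r² can also be reproduced in a few lines of code. This is a minimal sketch in plain Python, assuming the knee surgery data as listed in the slides; the names `sse_mean`, `sse_lsrl`, and `r_squared` are illustrative stand-ins for SSEy, SSEŷ, and r².

```python
# Knee surgery data: age (L1) and range of motion (L2).
age = [35, 24, 40, 31, 28, 25, 26, 16, 14, 20, 21, 30]
motion = [154, 142, 137, 133, 122, 126, 135, 135, 108, 120, 127, 122]

n = len(age)
x_bar = sum(age) / n
y_bar = sum(motion) / n

# "Overall mean" model: every prediction is y-bar.
sse_mean = sum((y - y_bar) ** 2 for y in motion)

# "Regression on x" model: fit the LSRL, then predict each y from its x.
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(age, motion))
sxx = sum((x - x_bar) ** 2 for x in age)
slope = sxy / sxx
intercept = y_bar - slope * x_bar
sse_lsrl = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(age, motion))

# Proportional reduction in squared error when moving to the LSRL.
r_squared = (sse_mean - sse_lsrl) / sse_mean

print(f"y-hat = {intercept:.2f} + {slope:.4f}x")
print(f"SSE using the mean of y: {sse_mean:.3f}")
print(f"SSE using the LSRL:      {sse_lsrl:.3f}")
print(f"r^2 = {r_squared:.4f}")
```

The two sums of squares correspond to the 1VARSTAT results on L3 and L4, and the fitted slope and intercept correspond to the Coef column of the regression output.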
ŷ = 107.58 + .8710x
r² = .306 (R-sq = 30.6%), so r = √.306 = .5532 (positive, because the slope is positive)

Outlier
• In a regression setting, an outlier is a data point with a large residual

Influential point
• A point that influences where the LSRL is located
• If removed, it will significantly change the slope of the LSRL

One factor in the development of tennis elbow is the impact-induced vibration of the racket and arm at ball contact.
Racket   Resonance (Hz)   Acceleration (m/sec/sec)
1        105              36.0
2        106              35.0
3        110              34.5
4        111              36.8
5        112              37.0
6        113              34.0
7        113              34.2
8        114              33.8
9        114              35.0
10       119              35.0
11       120              33.6
12       121              34.2
13       126              36.2
14       189              30.0
Sketch a scatterplot of these data. Calculate the LSRL & correlation coefficient. Does there appear to be an influential point? If so, remove it and then calculate the new LSRL & correlation coefficient.

Which of these measures are resistant?
• LSRL
• Correlation coefficient
• Coefficient of determination
NONE: all are affected by outliers.
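The racket exercise can be worked through in code by fitting the line twice, once with all 14 rackets and once without racket 14 (189 Hz), which sits far from the rest in the x-direction. This is a minimal sketch in plain Python; treating racket 14 as the candidate influential point is an assumption based on the data table, and the helper name `fit` is illustrative.

```python
from math import sqrt

# Racket data: resonance (Hz) and acceleration (m/sec/sec), rackets 1-14.
resonance = [105, 106, 110, 111, 112, 113, 113, 114, 114, 119, 120, 121, 126, 189]
acceleration = [36.0, 35.0, 34.5, 36.8, 37.0, 34.0, 34.2, 33.8, 35.0, 35.0,
                33.6, 34.2, 36.2, 30.0]


def fit(x, y):
    """Return (intercept, slope, r) for the least squares line of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    r = sxy / sqrt(sxx * syy)
    return intercept, slope, r


# Fit with every racket, then again with racket 14 (the last row) removed.
a_all, b_all, r_all = fit(resonance, acceleration)
a_trim, b_trim, r_trim = fit(resonance[:-1], acceleration[:-1])

print(f"all 14 rackets:    y-hat = {a_all:.2f} + ({b_all:.4f})x, r = {r_all:.3f}")
print(f"without racket 14: y-hat = {a_trim:.2f} + ({b_trim:.4f})x, r = {r_trim:.3f}")
```

With racket 14 removed, the slope is roughly halved and r moves from about -0.78 to about -0.17, which illustrates why neither the LSRL nor the correlation coefficient is resistant.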