Review Linear Regression t-tests
Source: Devore
Name_____________________________ Class period_____

For questions #1-3, complete the following.
a) Make a scatterplot and check for any obvious outliers.
b) Find the equation of the least squares regression line.
c) Calculate r and r² and explain what they mean.
d) Sketch the residual plot and determine if a linear equation is appropriate.
e) Interpret the slope and y-intercept in the regression equation.
f) Predict the y value for the given x value.
g) Test at the specified significance level if β is different from zero.
h) Find s_b.
i) Construct the specified confidence interval for β.

1. The accompanying data on fish survival and ammonia concentration is taken from the paper “Effects of Ammonia on Growth and Survival of Rainbow Trout in Intensive Static Water Culture”. Let x = ammonia exposure (mg/L) and y = percent survival.

x    y
10   85
10   92
20   85
20   96
25   87
27   80
27   90
31   59
50   62

f) let x = 30
g) use α = .05
i) 95% confidence interval

2. The paper “Root Dentine Transparency: Age Determination of Human Teeth Using Computerized Densitometric Analysis” (Amer. J. Phys. Anthr. (1991):25-30) reported on an investigation of methods for age determination based on tooth characteristics, with y = age (years) and x = percentage of root with transparent dentine. The accompanying data is for anterior teeth:

x    y
15   23
19   52
31   65
39   55
41   32
44   60
47   78
48   59
55   61
65   60

f) let x = 25
g) use α = .02
i) 98% confidence interval

3. The heights (in inches) are listed for boys who were measured at 5 years of age and again at 18 years of age.

Age 5    43   41   42   45   46
Age 18   68   66   66   70   71

f) let x = 44 inches
g) use α = .05
i) 95% confidence interval

4. a) What does a correlation coefficient of 1 mean? of 2 mean? of 0 mean?
b) Describe what finding the least squares regression line means.

5. a) The equation of a least squares regression line is ŷ = 2.4 + 3.7x. What is the residual for the point (3, 12)? What does this tell us about the actual point?
b) What is the actual y value for x = 5 if the residual is 1.9?

6. A study concerning the relationship between the average ozone level y (in parts per million) and the population of a city x (in millions) resulted in the following MINITAB output.

Predictor   Coef     Stdev   t-ratio   P
Constant    8.892    2.395   3.71      0.002
popn        16.650   1.910   8.72      0.000

s = 5.454   R-sq = 84.4%   R-sq(adj) = 83.3%

Analysis of Variance
SOURCE       DF   SS       MS       F       P
Regression   1    2260.5   2260.5   76.00   0.000
Error        14   416.4    29.7
Total        15   2676.9

a) What is the equation of the least squares regression line?
b) Estimate the mean ozone level for a city having a population of 1 million people.
c) Determine the proportion of the observed variation in ozone levels that can be attributed to the population of a city.
d) What is the correlation coefficient?
f) Is the regression equation useful in predicting the average ozone level using the population of the city? Support your answer.

7. What is the difference between an influential point and an outlier?

8. Suppose a simple linear regression model is appropriate for describing the relationship between x = the angle formed by the point of an elf’s ear and y = the elf’s IQ rating. The true regression equation is believed to be y = 163 − .62x and σ = 5. What proportion of elves with a 20˚ ear point angle have an IQ above 159? (Elves are very smart!) (pjs)

Multiple Choice Practice

9. A bivariate set of data relates the amount of annual salary raise and previous performance rating. The least squares regression equation is ŷ = 1400 + 2000x, where ŷ is the estimated raise and x is the performance rating. Which of the following statements is not correct? (from AMSCO’s AP Stat pg 70 #2)
A) For each increase of one point in performance rating, the raise will increase on average by $2,000.
B) A rating of 0 will yield a predicted raise of $1,400.
C) The correlation coefficient of the data is positive.
D) All of the above are false.

10.
Which of the following is not a consideration in determining the goodness of fit of a model? (from AMSCO’s AP Stat pg 265 #13)
A) the value of r²
B) the slope of the residual plot
C) the existence of influential points
D) the existence of pattern in the residual plot

11. Suppose a data set is transformed using (x, y) → (x, ln y) and a least squares linear regression procedure is performed on the transformed data. If the residual plot of this regression shows a curved pattern, which of the following is an appropriate conclusion? (from AMSCO’s AP Stat pg 305 #12)
A) A quadratic model should be used with the original data.
B) A square root transformation should be applied to the transformed data.
C) The correlation of the set of transformed data is 0.
D) The exponential transformation is not appropriate.
E) None of these is appropriate.

12. The monthly cost of a long distance phone call depends on many factors. A least squares regression relating cost to time on line determines an equation of cost = 4.70 + 0.15(time in minutes). Which is not correct? (from AMSCO’s AP Stat pg 305 #14)
A) Long distance service is predicted to cost 15 cents per minute.
B) The monthly fixed cost of long distance service is, on average, $4.70.
C) Adding another phone will raise cost 0.15.
D) Five dollars will cover the long distance charges if your only call is for two minutes.

13. A value in a bivariate set of data is an influential point if: (from AMSCO’s AP Stat pg 305 #15)
A) it has a large residual.
B) it is an outlier with respect to the values of the explanatory variables.
C) its removal creates a substantial change in r.
D) its removal changes the sign of r.
E) None of these.

For each data set below (14-17), find a) the best transformed linear equation (unless quadratic), b) the equation in y = form, and c) make the prediction for the given x value. If the scatterplot appears quadratic, simply use the quadratic regression function in your calculator. (pjs)

14.
x   .1   .2    .3    .6    .8    1.5   2
y   1    1.3   1.5   1.8   1.9   2.2   2.3
predict y when x = 1

15.
x   .2   1     2     3     4   6
y   12   9.1   5.9   4.8   6   15
predict y when x = 5

16.
x   2    7     12   16    21    30   40
y   11   6.7   4    2.6   1.5   .6   .2
predict y when x = 10

17.
x   1.8   3     4   6    8    7
y   .7    5.8   6   10   35   21
predict y when x = 2

Answers

1. b) ŷ = −.778x + 100.791
c) r = −.727 suggests a strong, negative, linear correlation between ammonia exposure and survival. r² = .528 states that 52.8% of the total variation in survival percentage is explained by the amount of ammonia exposure.
d) The residual plot has no pattern and verifies a linear relationship.
[Residual plot: residuals vs. ammonia exposure]
e) slope: −.778 means that for each additional mg/L of ammonia exposure, the predicted percent survival decreases by about .778 percentage points. y-intercept: 100.791 means that if the ammonia exposure is zero, about 100% of the trout are predicted to survive.
f) 77.46
g) conditions:
* The scatterplot appears linear. Residual plot appears random.
* Independence is reasonable since the survival of one fish should not depend on another.
* The distribution of residuals is close enough to normal to continue. (Histogram must have scales and labeled axes.)
* The spread about the line is nearly constant to ensure equal variance of y at each x value.
Ho: β = 0   Ha: β ≠ 0
t = −2.80   df = 7   critical values: t = −2.365, 2.365
p-value = .0265   se = 9.487
Reject Ho since p-value < α. We have enough evidence to claim that the slope of the linear regression equation is not zero, and the equation appears to be useful in predicting survival using ammonia exposure. The percent of rainbow trout surviving decreases as the amount of ammonia exposure increases.
h) .278
i) (−1.435, −.121)

2. b) ŷ = .555x + 32.081
c) r = .535 suggests a weak positive linear relationship between age and dentine transparency. r² = .286 states that 28.6% of the total variation in age is associated with the variation in the amount of root with transparent dentine.
d) The residual plot has no pattern and verifies a linear relationship.
[Residual plot: residuals vs. % of root]
e) slope: .555 means that for each additional percentage point of root with transparent dentine, the predicted age increases by .555 years. y-intercept: 32.081 means that if there is no root with transparent dentine, the predicted age of the person is about 32 years.
f) 45.954
g) conditions:
* The scatterplot appears linear. Residual plot appears random.
* Independence is reasonable since the age of one person based on teeth doesn’t affect another’s age.
* The spread about the line is nearly constant to ensure about equal variability of y at each x value. No fanning effect noticed.
* The distribution of residuals is not so far from normal to completely thwart our normal condition. Axes must have labels and scale. (window set with x-scale = 4)
Ho: β = 0   Ha: β ≠ 0
t = 1.790   df = 8   critical values: t = −2.896, 2.896
p-value = .111   se = 14.299
Fail to reject Ho since p-value > α. We do not have enough evidence to claim the slope of the linear regression equation is not zero. The linear regression line does not appear to be a useful way to predict age based on percentage of root with transparent dentine.
h) .310
i) (−.343, 1.453) Note: zero is part of the interval.

3. b) ŷ = 1.081x + 21.267
c) r = .983 suggests a strong positive linear correlation between age 5 height and age 18 height. r² = .967 states that 96.7% of the total variation in height at age 18 is associated with the variation of height at age 5.
d) The residual plot has no pattern and verifies a linear relationship, although there are so few points that a pattern may be present. (I just tried to save you the time of entering in more data.)
e) slope: 1.081 means that for each additional inch of height at age 5, the predicted height at age 18 increases by 1.081 inches. y-intercept: 21.267 is how tall a child will be at 18 years if they are 0 inches tall at 5 years. This doesn’t really make sense in this problem.
Note that most babies are about 21 inches long at birth.
f) 68.8 inches
g) conditions:
* The scatterplot is linear. Residual plot appears random. (Not really, so the conclusion may be suspect.)
* Independence is reasonable since the height of one child should not depend on another.
* The spread about the line is nearly constant to ensure about equal variability of y at each x value. No fanning effect.
* The distribution of residuals is close enough to normal to continue. Axes must have labels and scale. (window: x-scale = .2)
Ho: β = 0   Ha: β ≠ 0
t = 9.378   df = 3   critical values: t = −3.182, 3.182
p-value = .00257   se = .4782
Reject Ho since p-value < α. We have enough evidence to claim that the slope of the linear regression equation is not zero. The linear regression equation appears to be useful in predicting male height at 18 years given male height at 5 years. The height of a male at age 18 years increases as their height at age 5 years increases. Of course, the equation becomes useless after that point (extrapolation).
h) .1153
i) (.714, 1.448)

4. a) A correlation coefficient of 1 or −1 means the points form a straight line. A correlation coefficient of 2 makes no sense since r is always between −1 and 1 (inclusive). A correlation coefficient of zero would indicate no linear relationship between the variables.
b) To find the least squares regression line, one would calculate the deviation score from each point to the regression line (residual), square it, and then find the sum of all the squares. Any line drawn between the points would give a different sum. The line of best fit would be the one with the smallest (least) sum of the squared deviation (residual) scores.

5. a) −1.5; the point is below the linear regression line.
b) 22.8

6. a) ŷ = 8.89 + 16.6x
b) 25.54 (from 8.892 + 16.650(1))
c) .844
d) .919
f) Yes. Since the p-value ≈ 0, we would reject Ho, which supports the usefulness of the regression equation.

7.
An influential point is horizontally separated from the bulk of the data and strongly influences the slope of the line and the correlation coefficient. An outlier is separated from the bulk of the data vertically.

8. 0.0465
9. D
10. B
11. D
12. C
13. B and maybe C

14. a) ŷ = 2.01 + .437(ln x)
b) ŷ = 2.01 + .437(ln x)
c) 2.01

15. a) nope, it’s quadratic
b) ŷ = 1.0268x² − 5.93x + 13.511
c) 9.53

16. a) ln ŷ = −.106x + 2.635
b) ŷ = 13.942(.9)^x
c) 4.861

17. a&b) I tried taking the ln of both x and y for a power function but it didn’t help much, so I’m thinking a piecewise function as follows: if x < 6 then ŷ = 2.0475x − 1.951; if x ≥ 6 then ŷ = 12.5x − 65.5. I used the point (6, 10) in both equations. So ŷ = 2.144 when x = 2.

Nonlinear Notes
Exponential: (x, y) → (x, ln y)
Logarithmic: (x, y) → (ln x, y)
Power: (x, y) → (ln x, ln y)

Formulas for Inference with Slope
Test statistic: t = b / s_b
Confidence Interval: b ± t* · s_b
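
To see the slope formulas at work, here is a minimal Python sketch that applies them to the height data from problem 3. The function name `slope_inference` and its dictionary keys are made up for illustration, and the critical value t* = 3.182 is simply taken from the answer key rather than computed. It uses only the standard library, so it finds s_b by hand as s / sqrt(Sxx), where s is the standard error about the line.

```python
import math

def slope_inference(xs, ys, t_star):
    """Least squares fit plus t statistic and confidence interval for the slope.

    t_star is the critical t value for n - 2 degrees of freedom; it is
    supplied by the caller (for example from a t table), not computed here.
    """
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n

    # Sums of squares and cross products about the means
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

    b = sxy / sxx            # slope
    a = y_bar - b * x_bar    # y-intercept

    # Standard error about the line: sqrt(SSE / (n - 2))
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    s = math.sqrt(sse / (n - 2))

    s_b = s / math.sqrt(sxx)   # standard error of the slope
    t = b / s_b                # test statistic for Ho: beta = 0
    ci = (b - t_star * s_b, b + t_star * s_b)

    return {"b": b, "a": a, "s": s, "s_b": s_b, "t": t, "df": n - 2, "ci": ci}

# Problem 3: height at age 5 (x) and at age 18 (y), in inches
age5 = [43, 41, 42, 45, 46]
age18 = [68, 66, 66, 70, 71]
result = slope_inference(age5, age18, t_star=3.182)

for key, value in result.items():
    print(key, value)
```

The printed values should land close to the answer key for problem 3 (slope near 1.081, s_b near .1153, t near 9.378). Swapping in the data and t* from problems 1 or 2 gives a quick check on those answers as well.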