Chapter 13
Simple Linear Regression and Correlation: Inferential Methods

Suppose we were to investigate the relationship between y = first-year college grade point average and x = high school grade point average. Is the first-year college grade point average determined solely by the high school grade point average? No: the first-year college grade point average and the high school grade point average do NOT have a deterministic relationship.

A relationship in which the value of y is completely determined by the value of an independent variable x is a deterministic relationship. A description of the relationship between two variables that are not deterministically related can be given by a probabilistic model.

The equation for an additive probabilistic model is:

$y = \text{deterministic function of } x + \text{random deviation} = f(x) + e$

where e is an "error" variable.

The simple linear regression model assumes that there is a line with y-intercept α and slope β, called the population regression line. When a value of the independent variable x is fixed and an observation on the dependent variable y is made,

$y = \alpha + \beta x + e$

Without the random deviation e in the equation, all observed (x, y) points would fall exactly on the population regression line.

[Figure: the population regression line $y = \alpha + \beta x$ (slope β), with the deviations e1 and e2 of two observations at x1 and x2.]

Basic Assumptions of the Simple Linear Regression Model
1. The distribution of e at any particular x value has mean value 0; that is, μ_e = 0.
2. The standard deviation of e is the same for any particular value of x. This standard deviation is denoted by σ.
3. The distribution of e at any particular value of x is normal.
4. The random deviations e1, e2, . . ., en associated with different observations are independent of one another.

Let's look at the heights and weights of a population of adult women.
How much would an adult female weigh if she were 5 feet tall? Weights of women that are 5 feet tall will vary; in other words, there is a distribution of weights for adult females who are 5 feet tall. Are some of these weights more likely than others? What would this distribution look like? This distribution is normally distributed.

What would you expect for other heights? Where would you expect the population regression line to be? We want the standard deviations of all these normal distributions to be the same.

[Figure: weight versus height (60 to 68 inches), showing the distribution of weights at each height.]

Basic Assumptions of the Simple Linear Regression Model Revisited
1. The distribution of e at any particular x value has mean value 0; that is, μ_e = 0.
2. The standard deviation of e is the same for any particular value of x. This standard deviation is denoted by σ.
3. The distribution of e at any particular value of x is normal.
4. The random deviations e1, e2, . . ., en associated with different observations are independent of one another.

Remember that the variable e is a measure of the extent to which individual y values deviate from the population regression line. It follows that the distribution of y at any particular value of x is normal and that, for any particular x value, the standard deviation of y equals the standard deviation of e.

We use $\hat{y} = a + bx$ to estimate the true population regression line.

b = point estimate of β: $b = \dfrac{S_{xy}}{S_{xx}}$

a = point estimate of α: $a = \bar{y} - b\bar{x}$

where

$S_{xy} = \sum xy - \dfrac{(\sum x)(\sum y)}{n}$ and $S_{xx} = \sum x^2 - \dfrac{(\sum x)^2}{n}$

Medical researchers have noted that adolescent females are much more likely to deliver low-birth-weight babies than are adult females. Because low-birth-weight babies have higher mortality rates, a number of studies have examined the relationship between birth weight and mother's age for babies born to young mothers.

The following data are on x = maternal age (in years) and y = birth weight of baby (in grams).
x (age, years):      15    17    18    15    16    19    17    16    18    19
y (birth weight, g): 2289  3393  3271  2648  2897  3327  2970  2535  3138  3573

Sketch a scatterplot of these data.

[Figure: scatterplot of baby's weight (g) versus mother's age (yrs).]

The scatterplot shows a linear pattern, and the spread in the y values appears to be similar across the range of x values. This supports the appropriateness of the simple linear regression model.

Birth Weight Continued . . .

The least-squares line for these data is

$\hat{y} = -1163.45 + 245.15x$

The estimated mean birth weight increases by approximately 245.15 grams for each 1-year increase in the mother's age.

What is the point estimate for the mean weight of babies born to 18-year-old mothers?

$\hat{y} = -1163.45 + 245.15(18) = 3249.25$ grams

This is the point estimate for the mean weight of all babies born to 18-year-old mothers. It is also the prediction of the weight of a single baby born to a mother 18 years of age.

The statistic for estimating the variance σ² is

$s_e^2 = \dfrac{SSResid}{n - 2}$ where $SSResid = \sum (y - \hat{y})^2$

The estimate for the standard deviation σ is

$s_e = \sqrt{s_e^2}$

The subscript e reminds us that we are estimating the variance of the "errors" or residuals.

Why n - 2? Since we must estimate both α and β in the regression line, we reduce the sample size n by 2. Note that the degrees of freedom associated with estimating σ² or σ in simple linear regression is df = n - 2.

Recall that the coefficient of determination, r², is the proportion of observed y variation that is attributed to the model relationship.
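The least-squares calculations for the birth weight data can be sketched in a few lines of Python. This is an illustration rather than part of the original slides: the function name fit_line and the variable names are our own, and the code uses the deviation forms of Sxy and Sxx, which are algebraically equal to the computational formulas.

```python
import math

# Maternal age (years) and birth weight (grams) from the birth weight example.
ages = [15, 17, 18, 15, 16, 19, 17, 16, 18, 19]
weights = [2289, 3393, 3271, 2648, 2897, 3327, 2970, 2535, 3138, 3573]


def fit_line(x, y):
    """Return (a, b, s_e) for the least-squares line y-hat = a + b*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Deviation forms of Sxx and Sxy.
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = s_xy / s_xx              # point estimate of the slope
    a = y_bar - b * x_bar        # point estimate of the intercept
    # Residual sum of squares and the estimate of sigma (df = n - 2).
    ss_resid = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    s_e = math.sqrt(ss_resid / (n - 2))
    return a, b, s_e


a, b, s_e = fit_line(ages, weights)
print(f"y-hat = {a:.2f} + {b:.2f} x")
print(f"estimate at x = 18: {a + b * 18:.2f} g")
print(f"s_e = {s_e:.3f}")
```

The same function can be applied to any paired data set with at least three observations and more than one distinct x value.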
For the birth weight data (x = maternal age in years, y = birth weight in grams):

s_e = 205.308
r² = .78

For a particular mother's age, the typical deviation of a baby's weight from the weight predicted by the line is approximately 205 grams. Approximately 78% of the observed variability in the weight of babies can be explained by this model.

Properties of the Sampling Distribution of b

When the four basic assumptions of the simple linear regression model are satisfied, the following statements are true:
1. The mean value of b is β. That is, μ_b = β.
2. The standard deviation of the statistic b is $\sigma_b = \dfrac{\sigma}{\sqrt{S_{xx}}}$
3. The statistic b has a normal distribution (a consequence of the model assumption that the random deviation e is normally distributed).

Since β is almost always unknown, it must be estimated from independently selected observations. The slope b of the least-squares line gives a point estimate for β.

Since σ is usually unknown, the estimated standard deviation of the statistic b is

$s_b = \dfrac{s_e}{\sqrt{S_{xx}}}$

Confidence Interval for β

When the four basic assumptions of the simple linear regression model are satisfied, a confidence interval for β, the slope of the population regression line, has the form

$b \pm (t \text{ critical value}) \cdot s_b$

where the t critical value is based on df = n - 2.

Is cardiovascular fitness (as measured by time to exhaustion from running on a treadmill) related to an athlete's performance in a 20-km ski race? The following data on x = treadmill time to exhaustion (in minutes) and y = 20-km ski time (in minutes) were taken from the article "Physiological Characteristics and Performance of Top U.S. Biathletes" (Medicine and Science in Sports and Exercise, 1995):

x:  7.7   8.4   8.7   9.0   9.6   9.6   10.0  10.2  10.4  11.0  11.7
y:  71.0  71.4  65.0  68.7  64.4  69.4  63.0  64.6  66.9  62.6  61.7

Sketch a scatterplot of the data.

[Figure: scatterplot of ski time (min) versus treadmill time (min).]

The plot shows a linear pattern, and the vertical spread of points does not appear to be changing over the range of x values in the sample. If we assume that the distribution of errors at any given x value is approximately normal, then the simple linear regression model seems appropriate.

Biathletes Continued . . .

Find a 95% confidence interval for the slope of the true regression line.

$-2.3335 \pm (2.26)(.591) = (-3.671, -.999)$

We are 95% confident that the true average decrease in ski time associated with a 1-minute increase in treadmill exhaustion time is between 1 minute and 3.7 minutes.

Biathletes Continued . . .

Partial Minitab Output

The regression equation is
Ski time = 88.8 - 2.33 treadmill time

Predictor    Coef      StDev    T       P
Constant     88.796    5.750    15.44   0.000
Treadmill    -2.3335   0.5911   -3.95   0.003

S = 2.188    R-Sq = 63.4%    R-Sq (adj) = 59.3%

Analysis of Variance
Source           DF    SS        MS       F       P
Regression       1     74.630    74.630   15.58   0.003
Residual Error   9     43.097    4.789
Total            10    117.727

Reading the output:
• The regression equation is the equation of the estimated regression line.
• The Coef column gives the estimated y-intercept a (Constant) and the estimated slope b (Treadmill).
• The StDev entry for Treadmill is s_b, the estimated standard deviation of b.
• S is s_e, and R-Sq is 100 × r². R-Sq (adjusted) is not used in simple linear regression.
• In the Analysis of Variance table, the Residual Error SS is SSResid, the Total SS is SSTo, and the Residual Error MS is $s_e^2 = SSResid/(n - 2)$.

Summary of Hypothesis Tests Concerning β

Null hypothesis: H0: β = hypothesized value

Test statistic: $t = \dfrac{b - \text{hypothesized value}}{s_b}$

The test is based on df = n - 2.

Alternative hypothesis            P-value
Ha: β > hypothesized value        area to the right of t under the appropriate t curve
Ha: β < hypothesized value        area to the left of t under the appropriate t curve
Ha: β ≠ hypothesized value        2(area to the right of t) if t is positive, or 2(area to the left of t) if t is negative

Often the hypothesized value is zero; this is called the model utility test for simple linear regression.

Summary of Hypothesis Tests Concerning β Continued . . .
Assumptions: For this test to be appropriate, the four basic assumptions of the simple linear regression model must be met:
1. The distribution of e at any particular x value has a mean of 0 (μ_e = 0).
2. The standard deviation of e is σ, which does not depend on x.
3. The distribution of e at any particular x value is normal.
4. The random deviations e1, e2, …, en associated with different observations are independent of one another.

What is the slope of a horizontal line? Suppose the least-squares line is horizontal; would height be useful in predicting weight? A slope of zero means that there is NO linear relationship between x and y!

[Figure: weight versus height (60 to 68 inches) with a horizontal least-squares line.]

The Model Utility Test for Simple Linear Regression

The model utility test for simple linear regression is the test of

H0: β = 0
Ha: β ≠ 0

The null hypothesis specifies that there is no useful linear relationship between x and y.

Test statistic: $t = \dfrac{b}{s_b}$

Biathletes Revisited . . .

x = treadmill exhaustion time, y = ski time

Even though the scatterplot indicates a linear relationship between ski time and treadmill time, let's perform the model utility test.

H0: β = 0
Ha: β ≠ 0
where β is the slope of the population regression line between treadmill time and ski time.

α = .05

$t = \dfrac{-2.3335}{0.5911} = -3.95$

df = 9, P-value = .003

Since the P-value < α, we reject H0. There is sufficient evidence of a linear relationship between treadmill time and ski time.
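The test statistic for the biathlete data can be reproduced directly from the eleven (x, y) pairs. The sketch below is illustrative only and is not the source of the Minitab values; the variable names are our own, and the critical value 2.26 (df = 9, 95% confidence) is copied from a t table rather than computed, since the Python standard library has no t distribution. The interval it prints agrees with the 95% interval for the slope quoted for these data up to rounding.

```python
import math

# Treadmill time to exhaustion (min) and 20-km ski time (min).
treadmill = [7.7, 8.4, 8.7, 9.0, 9.6, 9.6, 10.0, 10.2, 10.4, 11.0, 11.7]
ski = [71.0, 71.4, 65.0, 68.7, 64.4, 69.4, 63.0, 64.6, 66.9, 62.6, 61.7]

n = len(treadmill)
x_bar = sum(treadmill) / n
y_bar = sum(ski) / n
s_xx = sum((x - x_bar) ** 2 for x in treadmill)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(treadmill, ski))

b = s_xy / s_xx                       # estimated slope
a = y_bar - b * x_bar                 # estimated intercept
ss_resid = sum((y - (a + b * x)) ** 2 for x, y in zip(treadmill, ski))
s_e = math.sqrt(ss_resid / (n - 2))   # estimate of sigma
s_b = s_e / math.sqrt(s_xx)           # estimated standard deviation of b

# Model utility test: H0: beta = 0, so t = (b - 0) / s_b with df = n - 2.
t_stat = b / s_b

# Assumption: two-sided 95% critical value for df = 9, taken from a t table.
T_CRIT_95_DF9 = 2.26
ci = (b - T_CRIT_95_DF9 * s_b, b + T_CRIT_95_DF9 * s_b)

print(f"b = {b:.4f}, s_b = {s_b:.4f}, t = {t_stat:.2f}")
print(f"reject H0 at the .05 level: {abs(t_stat) > T_CRIT_95_DF9}")
print(f"95% CI for the slope: ({ci[0]:.3f}, {ci[1]:.3f})")
```

Comparing |t| with the critical value gives the same decision as comparing the two-sided P-value with .05.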
Partial Minitab output for the biathlete data, showing where the model utility test appears:

The regression equation is
Ski time = 88.8 - 2.33 treadmill time

Predictor    Coef      StDev    T       P
Constant     88.796    5.750    15.44   0.000
Treadmill    -2.3335   0.5911   -3.95   0.003

S = 2.188    R-Sq = 63.4%    R-Sq (adj) = 59.3%

Analysis of Variance
Source           DF    SS        MS       F       P
Regression       1     74.630    74.630   15.58   0.003
Residual Error   9     43.097    4.789
Total            10    117.727

In the Treadmill row, T is the t test statistic (-2.3335 ÷ 0.5911 = -3.95) and P is the P-value. Statistical software usually performs the model utility test with H0: β = 0 versus Ha: β ≠ 0.

Checking Model Adequacy

The simple linear regression model is

$y = \alpha + \beta x + e$

where e represents the random deviation of an observed y value from the population regression line α + βx. The assumptions for simple linear regression are based on this random deviation e.

If we knew the deviations e1, e2, …, en, we could examine them for any inconsistencies with the model assumptions. However, we do not know the deviations e1, e2, …, en because the population regression line is unknown. Therefore, we must estimate these deviations using the residuals from the estimated line. Thus, we use the residuals to check our assumptions.

Residual Analysis

• Standardize the residuals to look at their magnitudes:

$\text{standardized residual} = \dfrac{\text{residual}}{\text{estimated standard deviation of residual}}$

Most statistical software will perform this calculation. It is tedious to do by hand.

• Create a residual plot (from Chapter 5) or a standardized residual plot (which is a plot of the (x, standardized residual) pairs).

Any observation with a large positive or negative residual should be examined carefully for any error in recording data, nonstandard experimental condition, or atypical experimental unit.

A desirable plot is one that exhibits no particular pattern (such as curvature or much greater spread in one part of the plot than the other) and that has no point that is far removed from all the others.
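Although the calculation is tedious by hand, standardized residuals are easy to compute in code. The slides do not give the formula for the estimated standard deviation of a residual; the sketch below assumes the standard simple-regression result $s_e\sqrt{1 - 1/n - (x_i - \bar{x})^2/S_{xx}}$, and the function name standardized_residuals is our own. It is applied to the biathlete data as an illustration.

```python
import math

# Biathlete data: treadmill time (min) and 20-km ski time (min).
treadmill = [7.7, 8.4, 8.7, 9.0, 9.6, 9.6, 10.0, 10.2, 10.4, 11.0, 11.7]
ski = [71.0, 71.4, 65.0, 68.7, 64.4, 69.4, 63.0, 64.6, 66.9, 62.6, 61.7]


def standardized_residuals(x, y):
    """Return (residuals, standardized residuals) for the least-squares line."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    s_e = math.sqrt(sum(r ** 2 for r in resid) / (n - 2))
    std_resid = []
    for xi, r in zip(x, resid):
        # Assumed formula: estimated sd of the i-th residual.
        sd_r = s_e * math.sqrt(1 - 1 / n - (xi - x_bar) ** 2 / s_xx)
        std_resid.append(r / sd_r)
    return resid, std_resid


resid, std_resid = standardized_residuals(treadmill, ski)
for xi, r, sr in zip(treadmill, resid, std_resid):
    print(f"x = {xi:5.1f}   residual = {r:6.2f}   standardized = {sr:5.2f}")
```

Plotting the (x, standardized residual) pairs from this output gives a standardized residual plot.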
A Look at Standardized Residual Plots

[Figure: a series of standardized residual plots, described below.]

• This is a desirable plot in that it exhibits no pattern and has no point that lies far away from the other points.
• This plot exhibits a curved pattern, which indicates that the fitted model should be changed to incorporate the curvature.
• In this plot, the standard deviation of the residuals increases as the x values increase. While a straight-line model might still be appropriate, the best-fit line should be found using weighted least squares. Consult your local statistician!
• Both of these plots contain points far away from the others. These points can have substantial effects on estimates of α and β as well as other quantities.

Biathletes Revisited . . .

r = residuals, sr = standardized residuals (from Minitab)

x:   7.7   8.4   8.7    9.0   9.6    9.6   10.0   10.2   10.4  11.0   11.7
y:   71.0  71.4  65.0   68.7  64.4   69.4  63.0   64.6   66.9  62.6   61.7
r:   0.17  2.21  -3.49  0.91  -1.99  3.01  -2.46  -0.39  2.37  -0.53  0.21
sr:  0.10  1.13  -1.74  0.44  -0.96  1.44  -1.18  -0.19  1.16  -0.27  0.12

Let's look at a normal probability plot of the standardized residuals.

[Figure: scatterplot of ski time (min) versus treadmill time (min), and a normal probability plot of standardized residual versus normal score.]

The normal probability plot of the standardized residuals is quite straight. There is no reason to doubt the plausibility that the random deviations e are normally distributed.

Biathletes Continued . . .

Sketch a residual plot. Sketch a standardized residual plot.

[Figure: residuals versus treadmill time, and standardized residuals versus treadmill time.]

Notice that these two plots have similar appearances. The standardized residual plot does not show evidence of any pattern or of increasing spread. Remember that residuals can also be plotted against y.

Optional Topics
Inferences Based on the Estimated Regression Line and Inference about the Population Correlation Coefficient

Properties of the Sampling Distribution of a + bx for a Fixed Value of x

Let x* denote a particular value of the independent variable x. When the four basic assumptions of the simple linear regression model are satisfied, the sampling distribution of the statistic a + bx* has the following properties:
1) The mean value of a + bx* is α + βx*, so a + bx* is an unbiased statistic for estimating the mean y value when x = x*.
2) The standard deviation of the statistic a + bx*, denoted by σ_{a+bx*}, is given by

$\sigma_{a+bx^*} = \sigma \sqrt{\dfrac{1}{n} + \dfrac{(x^* - \bar{x})^2}{S_{xx}}}$

3) The distribution of a + bx* is normal.

The farther x* is from the center x̄, the larger σ_{a+bx*} is. Since σ is unknown, σ_{a+bx*} can be estimated by s_{a+bx*}, which substitutes s_e in place of σ.

Confidence Interval for a Mean y Value

When the basic assumptions of the simple linear regression model are met, a confidence interval for α + βx*, the mean y value when x has value x*, is

$a + bx^* \pm (t \text{ critical value}) \cdot s_{a+bx^*}$

where the t critical value is based on df = n - 2.

Because s_{a+bx*} is larger the farther x* is from x̄, the confidence interval becomes wider as x* moves away from the center of the data.

Physical characteristics of sharks are of interest to surfers and scuba divers as well as to marine researchers. The data on x = length (in feet) and y = jaw width (in inches) for 44 sharks were found in various articles appearing in the magazines Skin Diver and Scuba News. (These data are found on page 778 of the text.)
Because it is difficult to measure jaw width in living sharks, researchers would like to determine whether it is possible to estimate jaw width from body length, which is more easily measured.

[Figure: scatterplot of jaw width versus length for the 44 sharks.]

This scatterplot of the data shows a linear pattern and is consistent with use of the simple linear regression model.

Jaws Continued . . .

The regression equation is
Jaw Width = 0.69 + 0.963 Length

Predictor    Coef      StDev     T       P
Constant     0.688     1.299     0.53    0.599
Length       0.96345   0.08228   11.71   0.000

S = 1.376    R-Sq = 76.6%    R-Sq (adj) = 76.0%

The model utility test confirms the usefulness of this model. The simple linear regression model explains 76.6% of the variability in jaw width.

Let's use the data to compute a 90% confidence interval for the mean jaw width of 15-foot-long sharks.

The point estimate is

$a + b(15) = .688 + .96345(15) = 15.140$ in.

The estimated standard deviation of a + b(15) is

$s_{a+b(15)} = 1.376 \sqrt{\dfrac{1}{44} + \dfrac{(15 - 15.586)^2}{279.8718}} = .213$

The 90% confidence interval is

$a + b(15) \pm (t \text{ critical value}) \cdot s_{a+b(15)} = 15.140 \pm (1.68)(.213) = (14.782, 15.498)$

Based on these sample data, we can be 90% confident that the mean jaw width for sharks of length 15 feet is between 14.782 and 15.498 inches.

Prediction Interval for a Single y Value

When the basic assumptions of the simple linear regression model are met, a prediction interval for y*, a single y observation made when x = x*, has the form

$a + bx^* \pm (t \text{ critical value}) \cdot \sqrt{s_e^2 + s_{a+bx^*}^2}$

where the t critical value is based on df = n - 2.

The prediction interval and the confidence interval are centered at exactly the same place, a + bx*. The prediction interval is wider than the confidence interval because of the addition of s_e² under the square-root symbol.
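Both interval formulas need only a handful of summary statistics, so they can be sketched from the shark values quoted in the Minitab output and in the standard-deviation calculation (n = 44, x̄ = 15.586, Sxx = 279.8718, s_e = 1.376). The sketch below is illustrative; the function name intervals_at is our own, and the critical value 1.68 (df = 42, 90% confidence) is copied from the worked example rather than computed.

```python
import math

# Summary statistics for the shark data, as quoted in the worked example.
N = 44
X_BAR = 15.586
S_XX = 279.8718
A, B = 0.688, 0.96345      # estimated intercept and slope
S_E = 1.376                # estimate of sigma

# Assumption: 90% two-sided t critical value for df = 42, from a t table.
T_CRIT = 1.68


def intervals_at(x_star):
    """Return (point estimate, CI for the mean y, PI for a single y) at x_star."""
    y_hat = A + B * x_star
    # Estimated standard deviation of a + b*x_star.
    s_fit = S_E * math.sqrt(1 / N + (x_star - X_BAR) ** 2 / S_XX)
    # The prediction interval adds s_e^2 under the square root.
    s_pred = math.sqrt(S_E ** 2 + s_fit ** 2)
    ci = (y_hat - T_CRIT * s_fit, y_hat + T_CRIT * s_fit)
    pi = (y_hat - T_CRIT * s_pred, y_hat + T_CRIT * s_pred)
    return y_hat, ci, pi


y_hat, ci, pi = intervals_at(15)
print(f"point estimate: {y_hat:.3f}")
print(f"90% CI for the mean jaw width: ({ci[0]:.3f}, {ci[1]:.3f})")
print(f"90% PI for a single shark:     ({pi[0]:.3f}, {pi[1]:.3f})")
```

Calling intervals_at with lengths farther from 15.586 shows both intervals widening as x* moves away from the center of the data.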
Suppose that we were interested in predicting the jaw width of a single shark of length 15 feet.

$a + b(15) = .688 + .96345(15) = 15.140$

$s_e^2 = (1.376)^2 = 1.8934$

$s_{a+b(15)}^2 = (.213)^2 = .0454$

The 90% prediction interval is

$a + b(15) \pm (t \text{ critical value}) \cdot \sqrt{s_e^2 + s_{a+b(15)}^2} = 15.140 \pm (1.68)\sqrt{1.9388} = (12.801, 17.479)$

We can be 90% confident that an individual shark of length 15 feet will have a jaw width between 12.801 and 17.479 inches. Notice that this interval is much wider than the confidence interval for the mean jaw width.

[Figure: regression plot from Minitab showing the confidence interval and the prediction interval for the shark data.]

Notice that the prediction interval is substantially wider than the confidence interval. Also notice that the confidence interval is very narrow close to x̄, but widens the farther it is from the mean.

A Test for Independence in a Bivariate Normal Population

Null hypothesis: H0: ρ = 0

Test statistic: $t = \dfrac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}}$

The test is based on df = n - 2.

ρ (the Greek letter "rho") is the population correlation coefficient. It assesses the extent of any linear relationship in the population, and it must be between -1 and 1.

Many investigators are interested in whether ANY relationship exists between x and y. That is, are x and y independent of each other? However, ρ = 0 is NOT equivalent to x and y being independent except in the case of a bivariate normal population.

A bivariate normal population is one where, for any fixed x value, the distribution of associated y values is normal, and for any fixed y value, the distribution of x values is normal. An example would be the height x and weight y of American adult males.

Alternative hypothesis            P-value
Ha: ρ > 0 (positive dependence)   area to the right of t
Ha: ρ < 0 (negative dependence)   area to the left of t
Ha: ρ ≠ 0 (dependence)            2(area to the right of t) if t is positive, or 2(area to the left of t) if t is negative

Assumptions: r is the correlation coefficient for a random sample from a bivariate normal population.
One way to check that the population is plausibly bivariate normal is to look at individual normal probability plots of the x and y variables.

The relationship between sleep duration and the level of the hormone leptin (a hormone related to energy intake and energy expenditure) in the blood was investigated. Average nightly sleep (x, in hours) and blood leptin level (y) were recorded for each person in a sample of 716 participants in the Wisconsin Sleep Cohort Study. The sample correlation coefficient was r = 0.11. Does this support the claim that short sleep duration is associated with reduced leptin? Use α = .01.

State the hypotheses.

H0: ρ = 0
Ha: ρ > 0
where ρ = the correlation between average nightly sleep and blood leptin level for the population of adult Americans.

To verify the assumptions, we would look at normal probability plots of the x values and of the y values. However, the data are not available, so we will assume that a bivariate normal population is reasonable. We will also assume that it is reasonable to regard the sample of participants as representative of the population of adult Americans.

Sleepless Nights Continued . . .

Test statistic: $t = \dfrac{.11}{\sqrt{\dfrac{1 - (.11)^2}{714}}} = 2.96$

P-value = .0015, df = 714, α = .01

Since the P-value < .01, we reject H0. There is evidence to suggest that there is a positive association (perhaps a weak one, since r = .11) between sleep duration and blood leptin level.

Note: the test of no linear relationship (H0: β = 0) can also be used to test for independence in a bivariate normal population.
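The test statistic for the sleep study needs only r and n, so it can be checked with a short Python sketch. This is an illustration, not part of the original analysis: the function name correlation_t is our own, and because the standard library has no t distribution, the upper-tail P-value is approximated with the standard normal curve, which is reasonable only because df = 714 is large.

```python
import math
from statistics import NormalDist


def correlation_t(r, n):
    """Return the t statistic for H0: rho = 0, based on df = n - 2."""
    return r / math.sqrt((1 - r ** 2) / (n - 2))


r, n = 0.11, 716
t_stat = correlation_t(r, n)

# Assumption: with df = 714 the t curve is close to the standard normal curve,
# so the area to the right of t is approximated with NormalDist.
p_value = 1 - NormalDist().cdf(t_stat)

ALPHA = 0.01
print(f"t = {t_stat:.2f}, df = {n - 2}")
print(f"approximate upper-tail P-value = {p_value:.4f}")
print(f"reject H0 at alpha = {ALPHA}: {p_value < ALPHA}")
```

For small samples the normal approximation is not appropriate, and the P-value should come from a t table or statistical software.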