Chapter 13
Simple Linear Regression and Correlation: Inferential Methods

Suppose we were to investigate the relationship between y = first-year college grade point average and x = high school grade point average. Is first-year college grade point average determined solely by high school grade point average? No: first-year college grade point average and high school grade point average do NOT have a deterministic relationship.

A deterministic relationship is one in which the value of y is completely determined by the value of an independent variable x. A relationship between two variables that are not deterministically related can be described by a probabilistic model. The equation of the additive probabilistic model is:

y = deterministic function of x + random deviation = f(x) + e

where e is an "error" variable.

The simple linear regression model assumes that there is a line with y-intercept $\alpha$ and slope $\beta$, called the population regression line. When a value of the independent variable x is fixed and an observation on the dependent variable y is made,

$$y = \alpha + \beta x + e$$

Without the random deviation e in the equation, all observed (x, y) points would fall exactly on the population regression line.

[Figure: population regression line $y = \alpha + \beta x$ (slope $\beta$), with the deviations e1 and e2 of two observed points at x1 and x2.]

Basic Assumptions of the Simple Linear Regression Model
1. The distribution of e at any particular x value has mean value 0; that is, $\mu_e = 0$.
2. The standard deviation of e is the same for any particular value of x. This standard deviation is denoted by $\sigma$.
3. The distribution of e at any particular value of x is normal.
4. The random deviations e1, e2, . . ., en associated with different observations are independent of one another.

Let's look at the heights and weights of a population of adult women.
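As a concrete picture of the four assumptions, the sketch below simulates the model $y = \alpha + \beta x + e$ in Python, taking x to be height in inches. The function name `simulate_y` and the values chosen for $\alpha$, $\beta$, and $\sigma$ are hypothetical and purely illustrative; they are not estimates from any data in this chapter.

```python
import random

def simulate_y(alpha, beta, sigma, xs, rng):
    """Draw one y per x from y = alpha + beta*x + e, with e ~ Normal(0, sigma)."""
    return [alpha + beta * x + rng.gauss(0.0, sigma) for x in xs]

# Hypothetical parameter values, chosen only for illustration.
ALPHA, BETA, SIGMA = -200.0, 5.0, 10.0

rng = random.Random(13)            # fixed seed so the run is repeatable
xs = [60, 62, 64, 66, 68] * 2000   # 2000 simulated women at each height
ys = simulate_y(ALPHA, BETA, SIGMA, xs, rng)

# Recover the deviations e = y - (alpha + beta*x).
deviations = [y - (ALPHA + BETA * x) for x, y in zip(xs, ys)]
mean_dev = sum(deviations) / len(deviations)
sd_dev = (sum((d - mean_dev) ** 2 for d in deviations) / (len(deviations) - 1)) ** 0.5

print(f"mean of e: {mean_dev:.3f} (assumption 1 says this should be near 0)")
print(f"std dev of e: {sd_dev:.3f} (should be near sigma = {SIGMA})")
```

Because every deviation is drawn independently from the same normal distribution, whatever the value of x, the simulated data satisfy all four assumptions by construction.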
How much would an adult female weigh if she were 5 feet tall? Weights of women who are 5 feet tall will vary. In other words, there is a distribution of weights for adult females who are 5 feet tall. Are some of these weights more likely than others? What would this distribution look like? This distribution is normally distributed.

What would you expect for other heights? Where would you expect the population regression line to be? We want the standard deviations of all these normal distributions to be the same.

[Figure: Weight plotted against Height (60 to 68 inches), with a normal distribution of weights drawn at each height and the population regression line passing through their means.]

Basic Assumptions of the Simple Linear Regression Model Revisited
1. The distribution of e at any particular x value has mean value 0; that is, $\mu_e = 0$.
2. The standard deviation of e is the same for any particular value of x. This standard deviation is denoted by $\sigma$.
3. The distribution of e at any particular value of x is normal.
4. The random deviations e1, e2, . . ., en associated with different observations are independent of one another.

Remember that the variable e is a measure of the extent to which individual y values deviate from the population regression line. It follows that the distribution of y at any particular value of x is normal and that, for any particular x value, the standard deviation of y equals the standard deviation of e.

We use $\hat{y} = a + bx$ to estimate the true population regression line.

b = point estimate of $\beta$ = $\dfrac{S_{xy}}{S_{xx}}$

a = point estimate of $\alpha$ = $\bar{y} - b\bar{x}$

where

$$S_{xy} = \sum xy - \frac{\left(\sum x\right)\left(\sum y\right)}{n} \qquad \text{and} \qquad S_{xx} = \sum x^2 - \frac{\left(\sum x\right)^2}{n}$$

Let x* denote a specific value of the predictor variable x. Then a + bx* has two different interpretations:
1. It is a point estimate of the mean y value when x = x*.
2. It is a point prediction of an individual y value to be observed when x = x*.

Medical researchers have noted that adolescent females are much more likely to deliver low-birth-weight babies than are adult females.
Because low-birth-weight babies have higher mortality rates, a number of studies have examined the relationship between birth weight and mother's age for babies born to young mothers. The following data are on x = maternal age (in years) and y = birth weight of baby (in grams).

x    15    17    18    15    16    19    17    16    18    19
y  2289  3393  3271  2648  2897  3327  2970  2535  3138  3573

Sketch a scatterplot of these data.

[Figure: scatterplot of Baby's Weight (g) against Mother's Age (yrs).]

The scatterplot shows a linear pattern and the spread in the y values appears to be similar across the range of x values. This supports the appropriateness of the simple linear regression model.

Birth Weight Continued . . .
Summary statistics computed from the sample data are:

n = 10   $\sum x = 170$   $\sum y = 30{,}041$   $\sum x^2 = 2910$   $\sum xy = 515{,}600$   $\sum y^2 = 91{,}785{,}351$

Using these summary statistics:

$$S_{xy} = 515{,}600 - \frac{(170)(30{,}041)}{10} = 4903.0 \qquad S_{xx} = 2910 - \frac{(170)^2}{10} = 20.0$$

$$b = \frac{4903.0}{20.0} = 245.15 \qquad a = 3004.1 - (245.15)(17.0) = -1163.45$$

The estimated regression line is:

$$\hat{y} = -1163.45 + 245.15x$$

The slope indicates an increase of approximately 245.15 grams in babies' weight for each increase of 1 year in the mother's age.

Birth Weight Continued . . .
What is the point estimate for the mean weight of babies born to 18-year-old mothers?

$$\hat{y} = -1163.45 + 245.15(18) = 3249.25 \text{ grams}$$

This is the point estimate for the mean weight of all babies born to 18-year-old mothers. This is also the prediction of the weight of a single baby born to a mother 18 years of age.

The statistic for estimating the variance $\sigma^2$ is

$$s_e^2 = \frac{SSResid}{n - 2} \qquad \text{where} \qquad SSResid = \sum (y - \hat{y})^2$$

Why n - 2?
Since we must estimate both $\alpha$ and $\beta$ in the regression line, we reduce the sample size n by 2. Note that the degrees of freedom associated with estimating $\sigma^2$ or $\sigma$ in simple linear regression is df = n - 2.

The estimate for the standard deviation $\sigma$ is

$$s_e = \sqrt{s_e^2}$$

The subscript e reminds us that we are estimating the variance of the "errors" or residuals.

Recall that the coefficient of determination, $r^2$, is the proportion of observed y variation that is attributed to the model relationship.

Birth Weight Revisited . . .
For the data on x = maternal age (in years) and y = birth weight of baby (in grams), find SSResid and SSTo. Use these to compute $s_e$ and $r^2$.

$$SSResid = \sum (y - \hat{y})^2 = 337{,}212.45 \qquad SSTo = \sum (y - \bar{y})^2 = 1{,}539{,}182.9$$

$$s_e = \sqrt{\frac{337{,}212.45}{8}} = 205.31 \qquad r^2 = 1 - \frac{337{,}212.45}{1{,}539{,}182.9} = .781$$

Approximately 78% of the observed variability in the weights of babies can be explained by this model. For a particular mother's age, the typical deviation of babies' weights from the regression line is approximately 205 grams.

Properties of the Sampling Distribution of b
Since $\beta$ is almost always unknown, it must be estimated from independently selected observations. The slope b of the least-squares line gives a point estimate for $\beta$.

When the four basic assumptions of the simple linear regression model are satisfied, the following statements are true:
1. The mean value of b is $\beta$. That is, $\mu_b = \beta$.
2. The standard deviation of the statistic b is $\sigma_b = \dfrac{\sigma}{\sqrt{S_{xx}}}$.
3. The statistic b has a normal distribution (a consequence of the model assumption that the random deviation e is normally distributed).

Since $\sigma$ is usually unknown, the estimated standard deviation of the statistic b is

$$s_b = \frac{s_e}{\sqrt{S_{xx}}}$$
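The birth-weight calculations can be checked directly from the raw data. The following is a minimal Python sketch, not part of the original slides; the helper name `fit_line` and the variable names are assumptions made for illustration. It applies the $S_{xy}$ and $S_{xx}$ formulas and should reproduce b = 245.15 and a = -1163.45.

```python
import math

ages = [15, 17, 18, 15, 16, 19, 17, 16, 18, 19]          # x, years
weights = [2289, 3393, 3271, 2648, 2897, 3327, 2970,
           2535, 3138, 3573]                              # y, grams

def fit_line(x, y):
    """Least-squares intercept a, slope b, and Sxx from the summary-sum formulas."""
    n = len(x)
    sxy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
    sxx = sum(xi * xi for xi in x) - sum(x) ** 2 / n
    b = sxy / sxx
    a = sum(y) / n - b * sum(x) / n
    return a, b, sxx

n = len(ages)
a, b, sxx = fit_line(ages, weights)
y_bar = sum(weights) / n

ss_resid = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(ages, weights))
ss_to = sum((yi - y_bar) ** 2 for yi in weights)

s_e = math.sqrt(ss_resid / (n - 2))   # estimate of sigma, df = n - 2
r_sq = 1 - ss_resid / ss_to           # coefficient of determination
s_b = s_e / math.sqrt(sxx)            # estimated standard deviation of b

print(f"y-hat = {a:.2f} + {b:.2f} x")
print(f"prediction at x = 18: {a + b * 18:.2f} grams")
print(f"SSResid = {ss_resid:.2f}, SSTo = {ss_to:.1f}")
print(f"s_e = {s_e:.2f}, r^2 = {r_sq:.3f}, s_b = {s_b:.2f}")
```

Computing SSResid from the individual residuals, rather than from a shortcut formula, keeps the code close to the definition $\sum (y - \hat{y})^2$.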
Confidence Interval for $\beta$
When the four basic assumptions of the simple linear regression model are satisfied, a confidence interval for $\beta$, the slope of the population regression line, has the form

$$b \pm (t \text{ critical value}) \cdot s_b$$

where the t critical value is based on df = n - 2.

Is cardiovascular fitness (as measured by time to exhaustion from running on a treadmill) related to an athlete's performance in a 20-km ski race? The following data on x = treadmill time to exhaustion (in minutes) and y = 20-km ski time (in minutes) were taken from the article "Physiological Characteristics and Performance of Top U.S. Biathletes" (Medicine and Science in Sports and Exercise, 1995):

x   7.7   8.4   8.7   9.0   9.6   9.6  10.0  10.2  10.4  11.0  11.7
y  71.0  71.4  65.0  68.7  64.4  69.4  63.0  64.6  66.9  62.6  61.7

Sketch a scatterplot for the data.

[Figure: scatterplot of Ski Time (min) against Treadmill Time (min).]

The plot shows a linear pattern, and the vertical spread of points does not appear to be changing over the range of x values in the sample. If we assume that the distribution of errors at any given x value is approximately normal, then the simple linear regression model seems appropriate.

Biathletes Continued . . .
Find a 95% confidence interval for the slope of the true regression line.

$$-2.3335 \pm (2.26)(.591) = (-3.669, \; -.998)$$

We are 95% confident that the true average decrease in ski time associated with a 1 minute increase in treadmill exhaustion time is between 1 minute and 3.7 minutes.

Biathletes Continued . . .
Partial Minitab Output

The regression equation is
Ski time = 88.8 - 2.33 treadmill time

Predictor      Coef    StDev       T       P
Constant     88.796    5.750   15.44   0.000
Treadmill   -2.3335   0.5911   -3.95   0.003

S = 2.188   R-Sq = 63.4%   R-Sq (adj) = 59.3%

Analysis of Variance
Source           DF        SS       MS       F       P
Regression        1    74.630   74.630   15.58   0.003
Residual Error    9    43.097    4.789
Total            10   117.727

Reading the output:
"Ski time = 88.8 - 2.33 treadmill time" is the equation of the estimated regression line.
88.796 is the estimated y-intercept a, and -2.3335 is the estimated slope b.
0.5911 is $s_b$, the estimated standard deviation of b.
S = 2.188 is $s_e$, and R-Sq = 63.4% is $100 \times r^2$. R-Sq (adjusted) is not used in simple linear regression.
In the Analysis of Variance table, 43.097 is SSResid, 117.727 is SSTo, and 4.789 is $s_e^2 = SSResid/(n - 2)$.

Summary of Hypothesis Tests Concerning $\beta$
Null hypothesis: H0: $\beta$ = hypothesized value

Test statistic:

$$t = \frac{b - \text{hypothesized value}}{s_b}$$

The test is based on df = n - 2.

Alternative hypothesis and P-value:
Ha: $\beta$ > hypothesized value    P-value = area to the right of t under the appropriate t curve
Ha: $\beta$ < hypothesized value    P-value = area to the left of t under the appropriate t curve
Ha: $\beta$ ≠ hypothesized value    P-value = 2(area to the right of t) if t is positive, or 2(area to the left of t) if t is negative

Often the hypothesized value is zero. This is called the model utility test for simple linear regression.

Summary of Hypothesis Tests Concerning $\beta$ Continued . . .
Assumptions: For this test to be appropriate, the four basic assumptions of the simple linear regression model must be met:
1. The distribution of e at any particular x value has a mean of 0 ($\mu_e = 0$).
2. The standard deviation of e is $\sigma$, which does not depend on x.
3. The distribution of e at any particular x value is normal.
4. The random deviations e1, e2, …, en associated with different observations are independent of one another.

What is the slope of a horizontal line? Suppose the least-squares line is horizontal. Would height be useful in predicting weight?

[Figure: Weight plotted against Height (60 to 68 inches) with a horizontal least-squares line.]

A slope of zero means that there is NO linear relationship between x and y!

The Model Utility Test for Simple Linear Regression
The model utility test for simple linear regression is the test of

H0: $\beta$ = 0
Ha: $\beta$ ≠ 0

The null hypothesis specifies that there is no useful linear relationship between x and y.

Test statistic:

$$t = \frac{b}{s_b}$$

Biathletes Revisited . . .
x = treadmill exhaustion time, y = ski time (the same 11 biathletes as before)

Even though the scatterplot indicates a linear relationship between ski time and treadmill time, let's perform the model utility test.

H0: $\beta$ = 0
Ha: $\beta$ ≠ 0
where $\beta$ is the slope of the population regression line between treadmill time and ski time.

Significance level: $\alpha$ = .05

$$t = \frac{-2.3335}{0.5911} = -3.95 \qquad \text{df} = 9 \qquad P\text{-value} = .003$$

Since the P-value < $\alpha$, we reject H0. There is sufficient evidence of a linear relationship between treadmill time and ski time.

[Figure: scatterplot of Ski Time (min) against Treadmill Time (min).]

Biathletes Revisited . . .
Partial Minitab Output

The regression equation is
Ski time = 88.8 - 2.33 treadmill time

Predictor      Coef    StDev       T       P
Constant     88.796    5.750   15.44   0.000
Treadmill   -2.3335   0.5911   -3.95   0.003

In the Treadmill row, T is the t test statistic (Coef ÷ StDev: -2.3335 ÷ 0.5911 = -3.95) and P is the P-value. Statistical software usually performs the model utility test with H0: $\beta$ = 0 versus Ha: $\beta$ ≠ 0.

Checking Model Adequacy
The simple linear regression model is

$$y = \alpha + \beta x + e$$

where e represents the random deviation of an observed y value from the population regression line $\alpha + \beta x$. The assumptions for simple linear regression are based on this random deviation e.

If we knew the deviations e1, e2, …, en, we could examine them for any inconsistencies with model assumptions. However, we do not know the deviations e1, e2, …, en because the population regression line is unknown. Therefore, we must estimate the deviations using the residuals from the estimated line. Thus, we use the residuals to check our assumptions.
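The confidence interval for the slope and the model utility test for the biathlete data can be reproduced with the Python standard library alone. This is a sketch under stated assumptions: the helper names `slope_and_se` and `t_upper_tail` are invented for this example, the t critical value 2.26 is taken from the slides rather than computed, and the P-value comes from a simple numerical integration of the t density instead of a statistics package.

```python
import math

treadmill = [7.7, 8.4, 8.7, 9.0, 9.6, 9.6, 10.0, 10.2, 10.4, 11.0, 11.7]
ski = [71.0, 71.4, 65.0, 68.7, 64.4, 69.4, 63.0, 64.6, 66.9, 62.6, 61.7]

def slope_and_se(x, y):
    """Return the least-squares slope b and its estimated standard deviation s_b."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = y_bar - b * x_bar
    ss_resid = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    s_e = math.sqrt(ss_resid / (n - 2))
    return b, s_e / math.sqrt(sxx)

def t_upper_tail(t, df, steps=2000):
    """P(T > t) for t >= 0, integrating the t density on [0, t] by Simpson's rule."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    def pdf(u):
        return math.exp(log_c - (df + 1) / 2 * math.log1p(u * u / df))
    h = t / steps                       # steps must be even for Simpson's rule
    total = pdf(0.0) + pdf(t)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    return 0.5 - total * h / 3

df = len(treadmill) - 2
b, s_b = slope_and_se(treadmill, ski)

T_CRIT = 2.26                           # 95% confidence, df = 9 (value from the slides)
ci = (b - T_CRIT * s_b, b + T_CRIT * s_b)

t_stat = b / s_b                        # model utility test: H0 beta = 0
p_value = 2 * t_upper_tail(abs(t_stat), df)

print(f"b = {b:.4f}, s_b = {s_b:.4f}")
print(f"95% CI for beta: ({ci[0]:.3f}, {ci[1]:.3f})")
print(f"t = {t_stat:.2f}, df = {df}, two-sided P-value = {p_value:.3f}")
```

The slope, its standard deviation, and the t statistic should agree with the Coef, StDev, and T entries of the Treadmill row in the Minitab output.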
Residual Analysis
• Standardize the residuals to look at their magnitudes:

standardized residual = residual / (estimated standard deviation of residual)

Most statistical software will perform this calculation. It is tedious to do by hand.

• Create a residual plot (from Chapter 5) or a standardized residual plot (which is a plot of the (x, standardized residual) pairs).

A desirable plot is one that exhibits no particular pattern (such as curvature or much greater spread in one part of the plot than the other) and that has no point that is far removed from all the others.

Any observation with a large positive or negative standardized residual should be examined carefully for any error in recording data, nonstandard experimental condition, or atypical experimental unit.

A Look at Standardized Residual Plots
[Figure: several standardized residual plots, described below.]

This is a desirable plot in that it exhibits no pattern and has no point that lies far away from the other points.

This plot exhibits a curved pattern, which indicates that the fitted model should be changed to incorporate the curvature.

In this plot, the standard deviation of the residuals increases as the x-values increase. While a straight-line model might still be appropriate, the best-fit line should be found using weighted least-squares. Consult your local statistician!

Both of these plots contain points far away from the others. These points can have substantial effects on estimates of $\alpha$ and $\beta$ as well as other quantities.

Biathletes Revisited . . .
r = residuals
sr = standardized residuals (from Minitab)

x     7.7    8.4    8.7    9.0    9.6    9.6   10.0   10.2   10.4   11.0   11.7
y    71.0   71.4   65.0   68.7   64.4   69.4   63.0   64.6   66.9   62.6   61.7
r    0.17   2.21  -3.49   0.91  -1.99   3.01  -2.46  -0.39   2.37  -0.53   0.21
sr   0.10   1.13  -1.74   0.44  -0.96   1.44  -1.18  -0.19   1.16  -0.27   0.12

Let's look at a normal probability plot of the standardized residuals.

[Figure: scatterplot of Ski Time (min) against Treadmill Time (min), and normal probability plot of Standardized Residual against Normal Score.]

The normal probability plot of the standardized residuals is quite straight. There is no reason to doubt the plausibility that the random deviations e are normally distributed.
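The slides leave the "estimated standard deviation of residual" to software. As a hedged sketch of what that calculation typically looks like, the Python below assumes the usual simple-regression formula $s_e\sqrt{1 - 1/n - (x_i - \bar{x})^2/S_{xx}}$ for the estimated standard deviation of the i-th residual. That formula is not stated in the slides, and the variable names are chosen for this example.

```python
import math

treadmill = [7.7, 8.4, 8.7, 9.0, 9.6, 9.6, 10.0, 10.2, 10.4, 11.0, 11.7]
ski = [71.0, 71.4, 65.0, 68.7, 64.4, 69.4, 63.0, 64.6, 66.9, 62.6, 61.7]

n = len(treadmill)
x_bar, y_bar = sum(treadmill) / n, sum(ski) / n
sxx = sum((x - x_bar) ** 2 for x in treadmill)
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(treadmill, ski)) / sxx
a = y_bar - b * x_bar

# Residual = observed y minus the value predicted by the estimated line.
residuals = [y - (a + b * x) for x, y in zip(treadmill, ski)]
s_e = math.sqrt(sum(r * r for r in residuals) / (n - 2))

# Assumed formula: sd of i-th residual = s_e * sqrt(1 - 1/n - (x_i - x_bar)^2 / Sxx).
std_residuals = [
    r / (s_e * math.sqrt(1 - 1 / n - (x - x_bar) ** 2 / sxx))
    for x, r in zip(treadmill, residuals)
]

for x, r, sr in zip(treadmill, residuals, std_residuals):
    print(f"x = {x:5.1f}   residual = {r:6.2f}   standardized = {sr:6.2f}")
```

Residuals at x values far from $\bar{x}$ have a smaller estimated standard deviation, which is why dividing every residual by $s_e$ alone would not give the same standardized values.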
Biathletes Continued . . .
Using the residuals r and the standardized residuals sr (from Minitab) for the biathlete data, sketch a residual plot and a standardized residual plot.

[Figure: Residuals plotted against Treadmill Time, and Standardized Residuals plotted against Treadmill Time.]

Notice that these two plots have similar appearances. The standardized residual plot does not show evidence of any pattern or of increasing spread. Remember that residuals can also be plotted against y.

Optional Topics
Inferences Based on the Estimated Regression Line and Inference about the Population Correlation Coefficient

Properties of the Sampling Distribution of a + bx for a Fixed Value of x
Let x* denote a particular value of the independent variable x. When the four basic assumptions of the simple linear regression model are satisfied, the sampling distribution of the statistic a + bx* has the following properties:
1) The mean value of a + bx* is $\alpha + \beta x^*$, so a + bx* is an unbiased statistic for estimating the mean y value when x = x*.
2) The standard deviation of the statistic a + bx*, denoted by $\sigma_{a+bx^*}$, is given by

$$\sigma_{a+bx^*} = \sigma \sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}}}$$

3) The distribution of a + bx* is normal.

The farther x* is from the center $\bar{x}$, the larger $\sigma_{a+bx^*}$ is. Since $\sigma$ is unknown, $\sigma_{a+bx^*}$ can be estimated by $s_{a+bx^*}$, which substitutes $s_e$ in place of $\sigma$.
Confidence Interval for a Mean y Value
When the basic assumptions of the simple linear regression model are met, a confidence interval for $\alpha + \beta x^*$, the mean y value when x has value x*, is

$$a + bx^* \pm (t \text{ critical value}) \cdot s_{a+bx^*}$$

where the t critical value is based on df = n - 2.

Because $s_{a+bx^*}$ is larger the farther x* is from $\bar{x}$, the confidence interval becomes wider as x* moves away from the center of the data.

Physical characteristics of sharks are of interest to surfers and scuba divers as well as to marine researchers. The data on x = length (in feet) and y = jaw width (in inches) for 44 sharks were found in various articles appearing in the magazines Skin Diver and Scuba News. (These data are found on page 778 of the text.)

Because it is difficult to measure jaw width in living sharks, researchers would like to determine whether it is possible to estimate jaw width from body length, which is more easily measured.

[Figure: scatterplot of jaw width against length for the 44 sharks.]

This scatterplot of the data shows a linear pattern and is consistent with use of the simple linear regression model.

Jaws Continued . . .
The regression equation is
Jaw Width = 0.69 + 0.963 Length

Predictor      Coef     StDev       T       P
Constant      0.688     1.299    0.53   0.599
Length      0.96345   0.08228   11.71   0.000

S = 1.376   R-Sq = 76.6%   R-Sq (adj) = 76.0%

The model utility test confirms the usefulness of this model. The simple linear regression model explains 76.6% of the variability in jaw width.

Let's use the data to compute a 90% confidence interval for the mean jaw width for 15-foot-long sharks.

The point estimate is

$$a + b(15) = .688 + .96345(15) = 15.140 \text{ in.}$$

The estimated standard deviation of a + b(15) is

$$s_{a+b(15)} = 1.376 \sqrt{\frac{1}{44} + \frac{(15 - 15.586)^2}{279.8718}} = .213$$
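The point estimate and its estimated standard deviation can be checked from the summary values alone, since the 44 raw observations are not reproduced here. The sketch below is illustrative Python: the function name `se_mean_response` is an assumption, the inputs are the numbers quoted in the slides, and the t critical value 1.68 is the one the slides use for this example. It also evaluates the standard deviation at a second, hypothetical length of 25 feet to show how it grows away from $\bar{x}$.

```python
import math

def se_mean_response(s_e, n, x_star, x_bar, sxx):
    """Estimated standard deviation of a + b*x_star."""
    return s_e * math.sqrt(1 / n + (x_star - x_bar) ** 2 / sxx)

# Summary values quoted in the slides for the shark data.
A, B = 0.688, 0.96345                  # estimated intercept and slope
S_E, N = 1.376, 44                     # s_e and sample size
X_BAR, SXX = 15.586, 279.8718          # mean length and Sxx
T_CRIT = 1.68                          # 90% confidence, df = 42 (value from the slides)

x_star = 15
estimate = A + B * x_star
se = se_mean_response(S_E, N, x_star, X_BAR, SXX)
ci = (estimate - T_CRIT * se, estimate + T_CRIT * se)

# Hypothetical second length, far from x-bar, to show the widening.
se_far = se_mean_response(S_E, N, 25, X_BAR, SXX)

print(f"point estimate at x* = 15: {estimate:.3f} in.")
print(f"estimated standard deviation: {se:.3f}")
print(f"90% CI for the mean jaw width: ({ci[0]:.3f}, {ci[1]:.3f})")
print(f"estimated standard deviation at x* = 25: {se_far:.3f}")
```

Only the $(x^* - \bar{x})^2 / S_{xx}$ term changes with x*, so the interval is narrowest at $x^* = \bar{x}$.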
The 90% confidence interval is

$$a + b(15) \pm (t \text{ critical value}) \cdot s_{a+b(15)} = 15.140 \pm (1.68)(.213) = (14.782, \; 15.498)$$

Based on these sample data, we can be 90% confident that the mean jaw width for sharks of length 15 feet is between 14.782 and 15.498 inches.

Prediction Interval for a Single y Value
When the basic assumptions of the simple linear regression model are met, a prediction interval for y*, a single y observation made when x = x*, has the form

$$a + bx^* \pm (t \text{ critical value}) \cdot \sqrt{s_e^2 + s_{a+bx^*}^2}$$

where the t critical value is based on df = n - 2.

The prediction interval and the confidence interval are centered at exactly the same place, a + bx*. The prediction interval is wider than the confidence interval due to the addition of $s_e^2$ under the square-root symbol.

Jaws Revisited . . .
Suppose that we were interested in predicting the jaw width of a single shark of length 15 feet.

$$a + b(15) = .688 + .96345(15) = 15.140$$

$$s_e^2 = (1.376)^2 = 1.8934 \qquad s_{a+b(15)}^2 = (.213)^2 = .0454$$

The 90% prediction interval is

$$a + b(15) \pm (t \text{ critical value}) \cdot \sqrt{s_e^2 + s_{a+b(15)}^2} = 15.140 \pm (1.68)\sqrt{1.9388} = (12.801, \; 17.479)$$

We can be 90% confident that an individual shark of length 15 feet will have a jaw width between 12.801 and 17.479 inches. Notice that this interval is much wider than the confidence interval for the mean jaw width.

Below is a Regression Plot from Minitab showing the confidence interval and the prediction interval for the shark data.

[Figure: Minitab regression plot for the shark data with confidence interval and prediction interval bands.]

Notice that the prediction interval is substantially wider than the confidence interval. Also notice that the confidence interval is very narrow close to $\bar{x}$, but widens the farther it is from the mean.

A Test for Independence in a Bivariate Normal Population
Many investigators are interested in whether ANY relationship exists between x and y. That is, are x and y independent of each other?

$\rho$ (the Greek letter "rho") is the population correlation coefficient. It assesses the extent of any linear relationship in the population. $\rho$ must be between -1 and 1. However, $\rho$ = 0 is NOT equivalent to x and y being independent except in the case of a bivariate normal population.

A bivariate normal population is one where, for any fixed x value, the distribution of associated y values is normal, and for any fixed y value, the distribution of x values is normal. An example would be the height x and weight y of American adult males.

Null hypothesis: H0: $\rho$ = 0

Test statistic:

$$t = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}}$$

The test is based on df = n - 2.

Alternative hypothesis and P-value:
Ha: $\rho$ > 0 (positive dependence)    P-value = area to the right of t
Ha: $\rho$ < 0 (negative dependence)    P-value = area to the left of t
Ha: $\rho$ ≠ 0 (dependence)    P-value = 2(area to the right of t) if t is positive, or 2(area to the left of t) if t is negative

Assumptions: r is the correlation coefficient for a random sample from a bivariate normal population. One way to check that the population is plausibly bivariate normal is to construct individual normal probability plots of the x and y variables.

The relationship between sleep duration and the level of the hormone leptin (a hormone related to energy intake and energy expenditure) in the blood was investigated. Average nightly sleep (x, in hours) and blood leptin level (y) were recorded for each person in a sample of 716 participants in the Wisconsin Sleep Cohort Study. The sample correlation coefficient was r = 0.11. Does this support the claim that short sleep duration is associated with reduced leptin? Use $\alpha$ = .01.

State the hypotheses.

H0: $\rho$ = 0
Ha: $\rho$ > 0
where $\rho$ = the correlation between average nightly sleep and blood leptin level for the population of adult Americans.

To verify the assumptions, we would look at normal probability plots of the x values and of the y values. However, the data are not available, so we will assume that a bivariate normal population is reasonable. We will also assume that it is reasonable to regard the sample of participants as representative of the population of adult Americans.

Sleepless Nights Continued . . .
Test statistic:

$$t = \frac{.11}{\sqrt{\dfrac{1 - (.11)^2}{714}}} = 2.96$$

P-value = .0015   df = 714   $\alpha$ = .01

Since the P-value < .01, we reject H0. There is evidence to suggest that there is a positive association (perhaps a weak one since r = .11) between sleep duration and blood leptin level.

Note: the test of the hypothesis of no linear relationship (H0: $\beta$ = 0) can also be used to test for independence in a bivariate normal population.
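The test statistic for the sleep and leptin example needs only r and n, so it is easy to check. The Python sketch below is an illustration rather than the original analysis: `correlation_t` and `t_upper_tail` are names invented here, and the upper-tail area is found by numerically integrating the t density, so the P-value may differ slightly in the last digit from a value read from a table or from software.

```python
import math

def correlation_t(r, n):
    """t statistic for H0: rho = 0, based on df = n - 2."""
    return r / math.sqrt((1 - r * r) / (n - 2))

def t_upper_tail(t, df, steps=2000):
    """P(T > t) for t >= 0, integrating the t density on [0, t] by Simpson's rule."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    def pdf(u):
        return math.exp(log_c - (df + 1) / 2 * math.log1p(u * u / df))
    h = t / steps                       # steps must be even for Simpson's rule
    total = pdf(0.0) + pdf(t)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    return 0.5 - total * h / 3

R, N = 0.11, 716                        # sample correlation and sample size
ALPHA = 0.01                            # significance level from the example

t_stat = correlation_t(R, N)
p_value = t_upper_tail(t_stat, N - 2)   # Ha: rho > 0, so use the upper tail
reject_h0 = p_value < ALPHA

print(f"t = {t_stat:.2f}, df = {N - 2}")
print(f"one-sided P-value = {p_value:.4f}")
print("reject H0" if reject_h0 else "fail to reject H0")
```

With 714 degrees of freedom the t curve is very close to the standard normal curve, which is why a correlation as small as .11 can still be statistically significant in a sample this large.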