Chapter 16
Understanding Relationships – Numerical Data Part 2
Created by Kathy Fritz

The Simple Linear Regression Model

You might convert x = temperature in degrees centigrade to y = temperature in degrees Fahrenheit using $y = \frac{9}{5}x + 32$. Suppose you want to convert 20°C into Fahrenheit: $y = \frac{9}{5}(20) + 32 = 68$, so 20°C = 68°F.

[Figure: graph of temperature in Fahrenheit (vertical axis) against temperature in centigrade (horizontal axis).]

This is a deterministic relationship. The value of the independent variable (centigrade temperature) is all that is needed to determine the value of the dependent variable (Fahrenheit temperature).

Now suppose we were to investigate the relationship between y = first-year college grade point average and x = high school grade point average. Is the first-year college grade point average determined solely by the high school grade point average? Explain.

The first-year college grade point average and the high school grade point average do NOT have a deterministic relationship.

A description of the relationship between two variables that are not deterministically related can be given by a probabilistic model. The equation for a probabilistic model is:

$$y = \text{deterministic function of } x + \text{random deviation} = f(x) + e$$

where e is an "error" variable.

The simple linear regression model assumes that there is a line with y-intercept $\alpha$ and slope $\beta$, called the population regression line. When a value of the independent variable x is fixed and an observation on the dependent variable y is made,

$$y = \alpha + \beta x + e$$

Without the random deviation e in the equation, all observed (x, y) points would fall exactly on the population regression line.

[Figure: population regression line (slope $\beta$, y-intercept $\alpha$) with observations at $x_1$ and $x_2$ and their deviations $e_1$ and $e_2$ from the line.]

Basic Assumptions of the Simple Linear Regression Model

Before you actually observe a value of y for any particular value of x, you are uncertain about the value of e (the random deviation from the regression line).
It could be positive, negative, or even 0. The linear regression model makes some assumptions about the distribution of e at any particular x value in the population.

1. The distribution of e at any particular value of x is normal.
2. The distribution of e at any particular x value has mean value 0. That is, $\mu_e = 0$.
3. The standard deviation of e is the same for any particular value of x. This standard deviation is denoted by $\sigma_e$.
4. The random deviations $e_1, e_2, \ldots, e_n$ associated with different observations are independent of one another.

On assumption 2: because e can be negative or positive, the values of e at any particular x value average out to zero in the population. Thus, $\mu_e = 0$.

[Figure: distributions of e at the x values $x_1$, $x_2$, and $x_3$.]

The mean of the y values at a fixed value x* is $\alpha + \beta x^*$. Thus the population regression line passes through the means of the y values. The slope $\beta$ is the mean or expected change in y associated with a 1-unit increase in x.

Just as there is variability in the values of e at any particular value of x, there is also variability in the y values. $\sigma_e$ is the same for any particular value of x, and the standard deviation of y for any fixed value x* is also $\sigma_e$.

[Figure: population regression line with the distributions of y at $x_1$, $x_2$, and $x_3$ centered at $\alpha + \beta x_1$, $\alpha + \beta x_2$, and $\alpha + \beta x_3$.]

Another look at $\sigma_e$

The smaller $\sigma_e$, the closer the points are to the regression line. The larger $\sigma_e$, the farther the points are from the regression line.
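The assumptions above can be made concrete with a small simulation. The following is a minimal sketch, not part of the original slides: the values of `ALPHA`, `BETA`, and `SIGMA_E` are hypothetical and chosen only for illustration. At each fixed x*, the simulated y values should average close to $\alpha + \beta x^*$ and have a standard deviation close to $\sigma_e$.

```python
import random
import statistics

# Hypothetical population values, chosen only for illustration.
ALPHA = 10.0    # y-intercept of the population regression line
BETA = 2.0      # slope of the population regression line
SIGMA_E = 1.5   # standard deviation of e, identical at every x


def simulate_y(x_star, n, rng):
    """Draw n observations y = ALPHA + BETA * x_star + e, with e ~ Normal(0, SIGMA_E)."""
    return [ALPHA + BETA * x_star + rng.gauss(0.0, SIGMA_E) for _ in range(n)]


rng = random.Random(1)  # fixed seed so the run is repeatable
summary = {}
for x_star in (1.0, 2.0, 3.0):
    ys = simulate_y(x_star, 5000, rng)
    mean_y = statistics.mean(ys)
    sd_y = statistics.stdev(ys)
    summary[x_star] = (mean_y, sd_y)
    line_value = ALPHA + BETA * x_star
    print(f"x* = {x_star}: mean y = {mean_y:.2f} (line gives {line_value:.2f}), sd of y = {sd_y:.2f}")
```

Increasing `SIGMA_E` in this sketch spreads the simulated points farther from the line, which is the behavior described in "Another look at $\sigma_e$".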
The estimates of the slope and the y-intercept of the population regression line are the slope and y-intercept, respectively, of the least squares line, $\hat{y} = a + bx$.

$$b = \text{estimate of } \beta = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}$$

$$a = \text{estimate of } \alpha = \bar{y} - b\bar{x}$$

The values of a and b are usually obtained using statistical software or a graphing calculator.

Let x* denote a specified value of the independent variable x. Then a + bx* has two different interpretations:
1. It is a point estimate of the mean y value when x = x*.
2. It is a point prediction of an individual y value to be observed when x = x*.

Medical researchers have noted that adolescent females are much more likely to deliver low-birth-weight babies than are adult females. Because low-birth-weight babies have higher mortality rates, a number of studies have examined the relationship between birth weight and mother's age for babies born to young mothers.

The following data are on x = maternal age (in years) and y = birth weight of baby (in grams).

x: 15 17 18 15 16 19 17 16 18 19
y: 2289 3393 3271 2648 2897 3327 2970 2535 3138 3573

Sketch a scatterplot of these data.

[Figure: scatterplot of baby's weight (g) against mother's age (yrs).]

The scatterplot shows a linear pattern and the spread in the y values appears to be similar across the range of x values. This supports the appropriateness of the simple linear regression model.

Birth Weight Continued . . .

The least squares line for these data is

$$\hat{y} = -1163.45 + 245.15x$$

On average, birth weight increases by approximately 245.15 grams for each increase of 1 year in the mother's age.

What is the point estimate for the mean weight of babies born to 18-year-old mothers?

$$\hat{y} = -1163.45 + 245.15(18) = 3249.25 \text{ grams}$$

This is the point estimate for the mean weight of all babies born to 18-year-old mothers. This is also the prediction of the weight of a single baby born to a mother 18 years of age.

Beware of the danger of extrapolation. That is, be careful when trying to make an estimate or prediction for any x value much outside the range of the observed x values in the data.

The statistic for estimating the variance $\sigma_e^2$ is

$$s_e^2 = \frac{SSResid}{n - 2}$$

where

$$SSResid = \sum (y - \hat{y})^2$$

The subscript "e" is a reminder that you are estimating the variance of the "errors" or residuals.

The estimate of $\sigma_e$ is the estimated standard deviation

$$s_e = \sqrt{s_e^2}$$

The value of $s_e$, the estimated standard deviation about the population regression line, is interpreted as the typical amount by which an observation deviates from the population regression line.

Note that the number of degrees of freedom associated with estimating $\sigma_e^2$ or $\sigma_e$ in simple linear regression is df = n - 2.

Recall, the coefficient of determination, $r^2$, is the proportion of variability in y that can be explained by the approximate linear relationship between x and y.

How do we know if the estimated regression equation will be a useful model for predicting y values from x? The residual plot and the values of $s_e$ and $r^2$ can be used to determine the estimated regression equation's usefulness.

Wildlife biologists monitor the ecological health of the Rocky Mountain elk. Weighing elk directly is difficult and expensive in terms of equipment, manpower, and time. Biologists found that they could reliably estimate the weight of an elk by measuring the chest girth and then using linear regression to estimate the weight. They measured the chest girth and weight of 19 Rocky Mountain elk.

There appears to be a strong positive linear relationship between the chest girth and weight of elk. Partial Minitab regression output is given in the continuation of this problem.
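Before moving on with the elk data, the formulas for a, b, $s_e$, and $r^2$ can be checked against the maternal age and birth weight data given earlier. The following is a minimal sketch using only the Python standard library; the function name `least_squares` is a hypothetical helper, and $r^2$ is computed as $1 - SSResid/SSTo$, with SSTo the total sum of squares of y, which is a standard identity supplied here rather than taken from the slides.

```python
import math

# Data from the birth weight example: x = maternal age (years), y = birth weight (grams).
ages = [15, 17, 18, 15, 16, 19, 17, 16, 18, 19]
weights = [2289, 3393, 3271, 2648, 2897, 3327, 2970, 2535, 3138, 3573]


def least_squares(x, y):
    """Return (a, b): intercept and slope of the least squares line y-hat = a + b x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return a, b


a, b = least_squares(ages, weights)
n = len(ages)

# SSResid is the sum of squared residuals; s_e uses df = n - 2.
ss_resid = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(ages, weights))
s_e = math.sqrt(ss_resid / (n - 2))

# r^2 = 1 - SSResid / SSTo (assumed identity, see the note above).
y_bar = sum(weights) / n
ss_total = sum((yi - y_bar) ** 2 for yi in weights)
r_squared = 1 - ss_resid / ss_total

y_hat_18 = a + b * 18  # point estimate of the mean, and point prediction, at x* = 18

print(f"y-hat = {a:.2f} + {b:.2f} x")
print(f"estimate at x* = 18: {y_hat_18:.2f} grams")
print(f"s_e = {s_e:.1f} grams, r^2 = {r_squared:.3f}")
```

The fitted intercept, slope, and the estimate at x* = 18 reproduce the values -1163.45, 245.15, and 3249.25 grams quoted in the birth weight example.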
Minitab output for the elk data:

The regression equation is
Weight = -136 + 2.81 Girth

Predictor   Coef      SE Coef   T       P
Constant    -135.51   35.75     -3.79   0.001
Girth       2.8063    0.2686    10.45   0.000

S = 23.6626   R-Sq = 86.5%   R-Sq(adj) = 85.7%

The first line of the output is the estimated regression equation. Approximately 86.5% of the observed variation in elk weight can be attributed to the linear relationship between weight and chest girth. The magnitude of a typical deviation from the least squares line is about 23.6626 kg, which is relatively small in comparison to the y values (shown in the scatterplot).

Inferences Concerning the Slope of the Population Regression Line

Properties of the Sampling Distribution of b

When the four basic assumptions of the simple linear regression model are satisfied, the following statements are true:

1. The mean value of b is $\beta$. That is, $\mu_b = \beta$, so the sampling distribution of b is centered at the value of $\beta$.
2. The standard deviation of the statistic b is
$$\sigma_b = \frac{\sigma_e}{\sqrt{\sum (x - \bar{x})^2}}$$
3. The statistic b has a normal distribution (a consequence of the model assumption that the random deviation e is normally distributed).

Since $\beta$ is almost always unknown, it must be estimated from independently selected observations. The slope b of the least squares line gives a point estimate for $\beta$.

Since $\sigma_b$ is usually unknown (it depends on $\sigma_e$), the estimated standard deviation of the statistic b is

$$s_b = \frac{s_e}{\sqrt{\sum (x - \bar{x})^2}}$$

When the four basic assumptions of the simple linear regression model are satisfied, the probability distribution of the standardized variable

$$t = \frac{b - \beta}{s_b}$$

is the t distribution with df = n - 2.

Confidence Interval for $\beta$

When the four basic assumptions of the simple linear regression model are satisfied, a confidence interval for $\beta$, the slope of the population regression line, has the form

$$b \pm (t \text{ critical value}) s_b$$

where the t critical value is based on df = n - 2.

The dedicated work of conservationists for over 100 years has brought the bison in Yellowstone National Park from near extinction to a herd of over 3000 animals.
It is important to monitor and manage the size of the bison population. Researchers have studied a number of environmental factors to better understand the relationship between bison reproduction and the environment. One factor thought to influence reproduction is stress due to accumulated snow, which makes foraging more difficult for the pregnant bison. Data from 1981-1997 on y = spring calf ratio (SCR) and x = previous fall snow-water equivalent (SWE) are shown on page 750. The researchers were interested in estimating the mean change in spring calf ratio associated with each additional cm in snow-water equivalent.

Bison Population Problem Continued . . .

Step 1 (Estimate): The value of $\beta$, the mean change in spring calf ratio for each additional 1 cm of snow-water equivalent, will be estimated.

Step 2 (Method): Because the answers to the four key questions are estimation, sample data, two numerical variables in a regression setting, and one sample, a confidence interval for $\beta$, the slope of the population regression line, will be considered. A 95% confidence level will be used.

Step 3 (Check):
• You will need to assume that these 17 years are representative of yearly circumstances at Yellowstone and that each year's reproduction and snowfall are independent of previous years.
• A scatterplot of the data looks linear and the spread does not seem different for different values of x.
• Because the boxplot of the residuals is approximately symmetrical and there are no outliers, it is reasonable to think that the distribution of e is approximately normal.

Step 4 (Calculate): JMP regression output is shown here:

Linear Fit
SCR = 0.2606561 - 0.0136639*SWE

Summary of Fit
RSquare                  0.257644
RSquare Adj              0.208153
Root Mean Square Error   0.033513
Mean of Response         0.209412
Observations             17

df = 17 - 2 = 15
The t critical value for a 95% confidence level and df = 15 is 2.13.
Parameter Estimates
Term        Estimate    Std Error   t Ratio   Prob>|t|
Intercept   0.260656    0.023885    10.91     <.0001*
SWE         -0.013664   0.005989    -2.28     0.0375*

The SWE row gives the slope b and its estimated standard deviation $s_b$.

$$b \pm (t \text{ critical value}) s_b = -0.0137 \pm (2.13)(0.005989) = (-0.0265, -0.0009)$$

Step 5 (Communicate Results):
Confidence interval: You can be 95% confident that the true average change in spring calf ratio associated with an increase of 1 cm in the snow-water equivalent is between -0.0265 and -0.0009.
Confidence level: The method used to construct this interval estimate is successful in capturing the actual value of the slope of the population regression line about 95% of the time.

Summary of Hypothesis Tests Concerning $\beta$

Appropriate when the four basic assumptions of the simple linear regression model are reasonable:
1. The distribution of e at any particular x value has a mean of 0 ($\mu_e = 0$).
2. The standard deviation of e is $\sigma_e$, which does not depend on x.
3. The distribution of e at any particular x value is normal.
4. The random deviations $e_1, e_2, \ldots, e_n$ associated with different observations are independent of one another.

When these conditions are met, the following test statistic can be used:

$$t = \frac{b - \beta_0}{s_b}$$

where $\beta_0$ is the hypothesized value from the null hypothesis.

Form of the null hypothesis: $H_0: \beta = \beta_0$

When the assumptions of the simple linear regression model are reasonable and the null hypothesis is true, the t test statistic has a t distribution with df = n - 2.

Associated P-value, by form of the alternative hypothesis:
$H_a: \beta > \beta_0$: the P-value is the area to the right of t under the appropriate t curve.
$H_a: \beta < \beta_0$: the P-value is the area to the left of t under the appropriate t curve.
$H_a: \beta \neq \beta_0$: the P-value is 2(area to the right of t) if t is positive, or 2(area to the left of t) if t is negative.

Inference for a population slope generally focuses on two questions:
(1) What are plausible values for the population slope? This question can be addressed by calculating a confidence interval.
(2) Is the population slope different from zero? This question can be answered by using a hypothesis testing procedure with null hypothesis $H_0: \beta = 0$.

When the null hypothesis $H_0: \beta = 0$ is true, the population regression line is a horizontal line:

$$y = \alpha + \beta x + e = \alpha + 0x + e = \alpha + e$$

If $\beta$ is in fact equal to 0, knowledge of x will be of no use: it will have no "utility" for predicting y. The test of $H_0: \beta = 0$ versus $H_a: \beta \neq 0$ is called the model utility test for simple linear regression.

The Model Utility Test for Simple Linear Regression

The model utility test for simple linear regression is the test of

$H_0: \beta = 0$ versus $H_a: \beta \neq 0$

The null hypothesis specifies that there is no useful linear relationship between x and y, whereas the alternative hypothesis specifies that there is a useful linear relationship between x and y.

The test statistic is the t ratio:

$$t = \frac{b - 0}{s_b} = \frac{b}{s_b}$$

If $H_0$ is rejected, you can conclude that the simple linear regression model is useful for predicting y.

When you hear a song on your car radio, you probably remember the title of the song, the artist, and even when the song was released. An investigator wants to study this phenomenon. He compiled a list of songs from Rolling Stone, Billboard, and Blender lists of songs plus some recent songs familiar to college students. Twenty-three college students were then exposed to 56 clips of songs. Most of these students had had musical training, and they listened to popular music for an average of 21.7 hours per week.
After hearing three short clips from a song (only 400 ms in duration), the students were asked in what year each of the songs was released. The accompanying data show the actual release year and the average of the release years given by the students.

Is there a relationship between the judged and actual release year for these songs? Let's perform a model utility test to answer this question.

Song Recognition Problem Continued . . .

Step 1 (Hypotheses):
$H_0: \beta = 0$
$H_a: \beta \neq 0$
where $\beta$ is the slope of the population regression line relating the judged release year to the actual release year.

Step 2 (Method): Because the answers to the four key questions are hypothesis testing, sample data, two numerical variables in a regression setting, and one sample, a hypothesis test for the slope of a population regression line will be considered. A significance level of 0.05 will be used.

Step 3 (Check): For this example you can assume that the assumptions are reasonable and proceed with the model utility test. (We will see how to check whether the four assumptions of the simple linear regression model are reasonable in the next section.)

Step 4 (Calculate): JMP regression output is shown here:

Linear Fit
Judged Release = 1095.1525 + 0.449281*Actual Release

Summary of Fit
RSquare                  0.771
RSquare Adj              0.766759
Root Mean Square Error   3.59844
Mean of Response         1986.013
Observations             56

Parameter Estimates
Term             Estimate    Std Error   t Ratio   Prob>|t|
Intercept        1095.1525   66.07159    16.58     <.0001*
Actual Release   0.449281    0.033321    13.48     <.0001*

The Actual Release row gives the slope b and its estimated standard deviation $s_b$.

$$t = \frac{b - 0}{s_b} = \frac{0.449 - 0}{0.0333} = 13.48$$

P-value = 2P(t > 13.48) ≈ 0

Step 5 (Communicate Results): Because the P-value is less than the selected significance level, the null hypothesis is rejected.
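The arithmetic in the bison confidence interval and in the Step 4 calculation for the song data can be reproduced directly from the reported slope estimates and standard errors. The following is a minimal sketch; the function names `slope_confidence_interval` and `slope_t_statistic` are hypothetical helpers, and the t critical value 2.13 is taken from the bison example rather than computed. Converting t to a P-value requires the t distribution with n - 2 degrees of freedom, which statistical software provides, so that step is left out here.

```python
def slope_confidence_interval(b, s_b, t_critical):
    """Return (lower, upper) for b ± (t critical value) * s_b."""
    margin = t_critical * s_b
    return b - margin, b + margin


def slope_t_statistic(b, s_b, beta_0=0.0):
    """Return t = (b - beta_0) / s_b; beta_0 = 0 gives the model utility test statistic."""
    return (b - beta_0) / s_b


# Bison example: values as used in the hand calculation (df = 15, t critical value 2.13).
bison_ci = slope_confidence_interval(-0.0137, 0.005989, 2.13)

# Song recognition example: slope estimate and standard error from the JMP output.
song_t = slope_t_statistic(0.449281, 0.033321)

print(f"bison 95% CI for the slope: ({bison_ci[0]:.4f}, {bison_ci[1]:.4f})")
print(f"song model utility t = {song_t:.2f}")
```

The interval agrees with the endpoints stated in Step 5 of the bison problem (-0.0265 and -0.0009), and the t statistic agrees with the t Ratio of 13.48 in the JMP output.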
Decision: Reject $H_0$
Conclusion: The sample data provide convincing evidence that there is a useful linear relationship between the actual release year and the judged release year.

Checking Model Adequacy

The simple linear regression model is

$$y = \alpha + \beta x + e$$

where e represents the random deviation of a y value from the population regression line $\alpha + \beta x$.

The methods presented so far, the confidence interval for the slope and the model utility test, require that some assumptions about the random deviations in the simple linear regression model be met in order for inference to be valid. These assumptions include:
1. At any particular x value, the distribution of e is normal.
2. At any particular x value, the standard deviation of e is $\sigma_e$, which is constant over all values of x (that is, $\sigma_e$ does not depend on x).

Residual Analysis

If the deviations $e_1, e_2, \ldots, e_n$ from the population line were available, they could be examined for any inconsistencies with model assumptions. However, these deviations are

$$e_1 = y_1 - (\alpha + \beta x_1)$$
$$\vdots$$
$$e_n = y_n - (\alpha + \beta x_n)$$

These values of e can ONLY be calculated if $\alpha$ and $\beta$ are known, which is almost never the case.

Instead, diagnostic checks MUST be based on the residuals

$$y_1 - \hat{y}_1 = y_1 - (a + bx_1)$$
$$\vdots$$
$$y_n - \hat{y}_n = y_n - (a + bx_n)$$

which are the deviations from the estimated regression line.

Any observation that gives a large positive or negative residual should be examined carefully for any unusual circumstances, such as a recording error or nonstandard experimental condition.

Identifying residuals with unusually large magnitudes is made easier by inspecting standardized residuals.

$$\text{standardized residual} = \frac{\text{residual}}{\text{estimated standard deviation of residual}}$$

Recall, $\mu_e = 0$. So, the numerator is really residual - 0.

Because residuals at different x values have different standard deviations (depending on the value of x for that observation), computing the standardized residuals can be tedious. Most statistical software will perform this calculation.
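To show what that software calculation involves, here is a minimal sketch applied to the maternal age and birth weight data from earlier in the chapter. The slides do not give the formula for the estimated standard deviation of a residual; the sketch assumes the usual simple linear regression expression $s_e \sqrt{1 - \frac{1}{n} - \frac{(x_i - \bar{x})^2}{\sum (x - \bar{x})^2}}$, so treat that line as an assumption rather than as the method used by any particular package.

```python
import math

# Maternal age (years) and birth weight (grams) from the earlier example.
ages = [15, 17, 18, 15, 16, 19, 17, 16, 18, 19]
weights = [2289, 3393, 3271, 2648, 2897, 3327, 2970, 2535, 3138, 3573]

n = len(ages)
x_bar = sum(ages) / n
y_bar = sum(weights) / n
s_xx = sum((x - x_bar) ** 2 for x in ages)

# Least squares slope and intercept.
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(ages, weights)) / s_xx
a = y_bar - b * x_bar

# Residuals y - y-hat and the estimated standard deviation s_e (df = n - 2).
residuals = [y - (a + b * x) for x, y in zip(ages, weights)]
s_e = math.sqrt(sum(r ** 2 for r in residuals) / (n - 2))

# Assumed formula (not given in the slides): the estimated standard deviation of the
# i-th residual is s_e * sqrt(1 - 1/n - (x_i - x_bar)^2 / S_xx), which depends on x_i.
standardized = [
    r / (s_e * math.sqrt(1 - 1 / n - (x - x_bar) ** 2 / s_xx))
    for x, r in zip(ages, residuals)
]

for x, r, z in zip(ages, residuals, standardized):
    print(f"x = {x}: residual = {r:8.2f}, standardized residual = {z:6.2f}")
```

Because the divisor changes with x, two residuals of the same size can have different standardized values, which is why the calculation is tedious by hand.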
Revisiting the Elk

Example 16.3 introduced data on x = chest girth (in cm) and y = weight (in kg) for a sample of 19 Rocky Mountain elk. Inspection of the scatterplot suggests the data are consistent with the assumptions of the simple linear regression model.

Revisiting the Elk Continued . . .

Let's examine the residuals more closely. The data, residuals, and the standardized residuals (computed using Minitab) are given on page 761.

The largest residual = 38.1397 and the associated standardized residual = 1.81294. The smallest residual = -38.2661 and the associated standardized residual = -1.92313. Neither one of these is surprisingly large.

[Figure: boxplots of the residuals and of the standardized residuals.]

Notice that the boxplots of the residuals and standardized residuals are nearly identical. The boxplots are approximately symmetric with no outliers, so the assumption of normally distributed errors seems reasonable.

Another way to assess whether the error values are normally distributed is to look at normal probability plots of the residuals or the standardized residuals. (Only one plot is needed.) The standardized plot is recommended, but it is acceptable to use the unstandardized residual plot if you do not have access to a computer package.

[Figure: normal probability plots of the residuals and of the standardized residuals.]

The pattern in the normal probability plots is reasonably straight, confirming that the assumption of normality of the error distribution is reasonable.

A Look at Residual Plots

• [Residual plot with no pattern] This is a desirable plot in that it exhibits no pattern and has no point that lies far away from the other points.
• [Residual plot with a curved pattern] This plot exhibits a curved pattern, which indicates that the fitted model should be changed to incorporate the curvature.
• [Residual plot with increasing spread] In this plot, the standard deviation of the residuals increases as the x values increase. While a straight-line model might still be appropriate, the best-fit line should be found using weighted least squares. Consult your local statistician!
• [Two residual plots with isolated points] Both of these plots contain points far away from the others. These points can have substantial effects on the estimates of $\alpha$ and $\beta$ as well as other quantities.

Newborns and infants have a small trachea, and there is little margin for error when inserting tracheal tubes. Using X-rays of a large number of children ages 2 months to 14 years, researchers examined the relationships between appropriate tracheal tube insertion depth and other variables such as height, weight, and age.

Below are a scatterplot and a standardized residual plot constructed using data on the insertion depth and height of children (both measured in cm).

[Figure: scatterplot of insertion depth against height, and the corresponding standardized residual plot.]

Residual plots like the one shown here are desirable. There are no unusually large residuals since no point lies much outside the horizontal band between -2 and 2. There is no point far to the left or right of the others, and there is no pattern of curvature or difference in the variability of the residuals for different height values to indicate that the model assumptions are not reasonable.

Newborns and Infants Problem Continued . . .

But consider what happens when the relationship between insertion depth and weight is examined.

[Figure: scatterplot of insertion depth against weight, and the corresponding standardized residual plot.]

While some curvature is evident in the original scatterplot, it is even more visible in the standardized residual plot. A careful inspection of these plots suggests that along with curvature, the residuals may be more variable at larger weights. The linear regression model is clearly not appropriate.
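The way curvature shows up in residuals can be seen without a plot. The following is a minimal sketch on hypothetical data (y = x squared, chosen only for illustration and unrelated to the trachea study): a straight line is fitted by least squares and the signs of the residuals are listed in order of x.

```python
# Hypothetical data with a curved relationship (y = x^2), for illustration only.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [x ** 2 for x in xs]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
s_xx = sum((x - x_bar) ** 2 for x in xs)

# Least squares line fitted to the curved data.
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / s_xx
a = y_bar - b * x_bar

# Residuals listed in order of increasing x, with their signs collected into a string.
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
signs = "".join("+" if r > 0 else "-" for r in residuals)

for x, r in zip(xs, residuals):
    print(f"x = {x}: residual = {r:6.2f}")
print("sign pattern:", signs)
```

The sign pattern `++-----++` (positive at both ends, negative in the middle) is the numerical counterpart of the curved band seen in a residual plot; residuals from an adequate straight-line fit would show no such systematic run of signs.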