MATH 2441 Probability and Statistics for Biological Sciences

Statistical Inferences Involving Linear Regression and Linear Correlation

This document summarizes important types of statistical inference that can be done on quantities arising in and from a simple linear regression model and the linear correlation coefficient. We looked at these two approaches to describing or characterizing relationships between two variables earlier in the course.

Review of Basic Formulas

Methods of simple linear regression and correlation are applied in situations in which we have two quantitative variables, x and y. It is thought that the values of x and y tend to vary together: either both increasing or decreasing together, or one tending to decrease in value as the other increases. At this level, the goal is first to determine whether or not that relationship is linear, and if so, to characterize or quantify the apparent linear relationship between the two variables.

In a regression study, there is also the notion that the variable x is the independent or control variable -- that it is the value of x which influences or partially determines the value of y. The variable y is thus considered to be a dependent variable. The value of y may be influenced by the values of a large number of variables other than x as well. For the simple linear regression study to be useful, none of those other influences can be very important. However, because of their existence, we are not able to predict the value of y precisely knowing just the value of x. Even if the relationship between y and x itself is a linear one, the observed values of y will not satisfy a linear equation involving just x, and so when we plot the data as points on a graph, they will not lie on a perfect straight line.
On the other hand, in correlation studies, we are simply attempting to establish the existence of a linear relationship between x and y without trying to attribute control or dependence of one variable on the other.

When we first considered the simple linear regression and linear correlation models as methods of descriptive statistics, we primarily viewed them as procedures for summarizing data that appeared to indicate the presence of some sort of a linear relationship. We pictured an experiment being carried out in which for each of a set of n values of x, {x1, x2, …, xn}, a value of y was observed: {y1, y2, …, yn}. Together, these two sets of observations amounted to n pairs of values: (x1, y1), (x2, y2), …, (xn, yn). When these pairs of values were plotted as points on a set of x-y axes, the resulting "scatterplot" of points, while not forming a perfect straight line, might appear very much as if the points are scattered randomly along a straight line path.

Arguments were presented earlier that the so-called correlation coefficient

    r = SSxy / √(SSx · SSy)        (LR - 1)

where

    SSx = Σ (xk − x̄)² = Σ xk² − (Σ xk)²/n = Σ xk² − n x̄²        (LR - 2a)

    SSy = Σ (yk − ȳ)² = Σ yk² − (Σ yk)²/n = Σ yk² − n ȳ²        (LR - 2b)

    SSxy = Σ (xk − x̄)(yk − ȳ) = Σ xk yk − (Σ xk)(Σ yk)/n = Σ xk yk − n x̄ ȳ        (LR - 2c)

(all sums running over k = 1 to n), was a single number that could be used to measure the degree to which the points of the scatterplot clustered about a straight line. If r = ±1, the points lie on a perfect straight line of positive or negative slope, respectively, and values of r between +1 and -1 indicated lesser degrees of clustering about a straight line pattern, with values of r near zero indicating little or no discernible linear pattern to the points.

In the case of linear regression analysis, we actually attempted to come up with an equation for the straight line most closely fitting the points in the scatterplot.

© David W. Sabo (1999) Inference in Linear Regression and Correlation
We wrote that equation in the standard form

    ŷ = b0 + b1 x        (LR - 3)

The coefficients, b0 and b1, were then to be computed so that the sum of the squares of the vertical distances of the points from the line was as small as possible -- the so-called "least squares" criterion of determining what is meant by the best fit straight line. We define the residual of point #k as

    εk = yk − ŷk        (LR - 4)

where ŷk is ŷ evaluated with x = xk. Thus, εk is the vertical distance by which the point (xk, yk) is separated from the best fit straight line given by (LR - 3). Then, the least squares principle means that we choose b0 and b1 so that the quantity

    SSE = Σ εk² = Σ (yk − ŷk)² = Σ (yk − b0 − b1 xk)²        (LR - 5)

is made as small as possible. This is not a difficult problem to solve symbolically, and we end up with the formulas

    b1 = SSxy / SSx    and    b0 = ȳ − b1 x̄        (LR - 6)

where x̄ and ȳ are the mean values of the x and y coordinates, respectively, for the points in the scatterplot, and SSxy and SSx are calculated using formulas (LR - 2a-c) above.

The "goodness" of the best-fit straight line was then simply evaluated visually (did it seem to come quite close to all of the points?), or by computation of the so-called coefficient of determination:

    r² = (SST − SSE) / SST = 1 − SSE/SST        (LR - 7)

which turned out to be the square of the value of the correlation coefficient, r, defined in formula (LR - 1) above. However, with the definitions

    SSE = SSy − 2 b1 SSxy + b1² SSx = SSy − b1 SSxy = SSy − SSxy²/SSx        (LR - 8)

obtained by substituting formulas (LR - 6) into (LR - 5), so that (LR - 8) is a measure of the variation of the observed y-values about the best fit straight line, and

    SST = Σ (yk − ȳ)²  ( = SSy )        (LR - 9)

which is a measure of the variability in the y-values without reference to the straight line, we get an alternative, and very informative, interpretation of r². If all of the points lay exactly on a straight line, SSE given by (LR - 8) would be exactly zero, and thus r² would be exactly 1.
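As a check on these formulas, here is a short Python sketch that computes SSx, SSy, SSxy, b1, b0, r and SSE from the raw sums alone. The numbers used are those of the six-point "simple example" tabulated later in this document; the variable names are ours, not from the text.

```python
from math import sqrt

# Raw sums for the six-point simple example (from the summary table below)
n = 6
sum_x, sum_y = 38.0, 200.0
sum_x2, sum_y2, sum_xy = 320.0, 7402.0, 1507.0

# (LR - 2a-c): sums of squares about the means
SSx = sum_x2 - sum_x ** 2 / n         # ~79.3333
SSy = sum_y2 - sum_y ** 2 / n         # ~735.3333
SSxy = sum_xy - sum_x * sum_y / n     # ~240.3333

# (LR - 6): least-squares slope and intercept
b1 = SSxy / SSx                       # ~3.0294
b0 = sum_y / n - b1 * sum_x / n       # ~14.1471

# (LR - 1) and (LR - 8): correlation coefficient and residual sum of squares
r = SSxy / sqrt(SSx * SSy)            # ~0.9950
SSE = SSy - SSxy ** 2 / SSx           # ~7.2647

print(round(b1, 4), round(b0, 4), round(r, 4), round(SSE, 4))
```

These values agree with the b1, b0, r and SSE entries quoted for the simple example in the summary table later on.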
When not all of the points fall on a straight line, we conclude that the value of x is insufficient to completely determine the value of y, and hence SSE is a measure of how much of the variability in the y-values is not accounted for by reference to x. Since SST measures the variability in the y-values without any reference to x at all, the ratio SSE/SST is the fraction of the variability in the y-values that is not accounted for by reference to x. Thus, r² is the fraction of the variability in the y-values that is accounted for by reference to the values of x. In a sense, you can think of the decimal value of r² as telling you the fraction of the variation in the y-values that can be attributed to the fact that they correspond to different x-values.

All of these formulas and interpretations were based on intuitive arguments when we looked at them earlier. Now, however, we can begin to reinterpret that work from a more statistical point of view, which will allow us to characterize the "significance" of the numbers more rigorously. While it is beyond the scope of this course to examine the mathematical background of linear regression and correlation in any depth, we need to go over a few basic ideas so that you will be able to understand the various methods presented later in this document.

The Simple Linear Regression Model

Because we hadn't discussed probability and random sampling when we first looked at the problem of describing relationships, it wasn't possible to point out that we are really considering y to be a random variable in the linear regression model, and both x and y to be random variables in a correlation model. That is, even if you know the value of x in advance, you cannot predict with certainty the precise value of y that will be observed.
We expect, or at least hope, that the value of x will have a strong influence on the value of y -- that's why we're trying to characterize or quantify the relationship between the two quantities -- but we know that since the scatterplot does not give points lying on a perfect straight line, there must be other things that we are not taking into account that are also influencing the value of y.

[Figure: for each of the values x1, x2, …, xn, the possible y-values form a normal distribution; the means β0 + β1x1, β0 + β1x2, …, β0 + β1xn lie on the straight line μ_y|x = β0 + β1x.]

The situation in a linear regression study is illustrated in the diagram above. For a given value of x, the possible values of y are considered to form a normal distribution, with some mean value and standard deviation. The formulas and methods to follow make the assumption that the standard deviation of these distributions does not change as x varies. On the other hand, the mean values of y for different values of x do change. A linear regression model applies when the graph of those mean values of y lies on a straight line:

    μ_y|x = β0 + β1 x        (LR - 10)

We consider (LR - 10) to be the equation of the line passing through this succession of mean values of y for varying values of x. Thus, (LR - 10) amounts to the "population regression line" or "true regression line" that results when all possible observations of y for every possible value of x are included. (This is the same notion of a statistical population that we've used throughout the course: "all possible things of a certain type." Here, those "things" are pairs of (x, y) values rather than apples or fish or cans of soup, etc.) Then the equation

    ŷ = b0 + b1 x        (LR - 3)

containing actual numerical values of the coefficients b0 and b1 that were calculated from the n observed pairs of values (xk, yk) can be viewed as the "sample regression line," forming a "point estimate" of the population regression line in (LR - 10). Thus b0, calculated from the data, will be used as an estimate of β0, and b1, calculated from the data, will be used as an estimate of β1.
If two experiments are performed, the n pairs of values, (xk, yk), that result will be different, so the values of b0 and b1 will be different, and so we see that b0 and b1 are actually random variables.

By a similar argument, we find that the correlation coefficient, r (or more correctly, the sample correlation coefficient r), is a random variable that can be used to estimate a corresponding population correlation coefficient ρ. Because in this case both x and y are considered random variables, the probability picture is a bit more involved than the one for the linear regression case sketched above (actually, the sketch above is a bit of a simplification anyway). It involves the so-called "bivariate normal distribution," which amounts to a normal distribution in three dimensions -- a shape similar to a real brass bell! The study of the detailed mathematical properties of the bivariate normal distribution is beyond the scope of this course, so the relevant consequences will just be stated as required.

In fact, all of the descriptive measures mentioned so far -- b0, b1, ŷ, r, etc. -- are random variables whose values can be used to estimate various corresponding population parameters. The rest of this document will describe a few of the more important techniques. Throughout, the basic issue is the same one behind all of the statistical inference we've done until now, namely: what is the probability that the values or patterns observed in the sample data really reflect a pattern or value in the population from which the sample data was obtained?

Is There Really a Linear Relationship Between y and x?

The first question to be handled is whether the apparent linear relationship we see in the sample data really reflects a linear relationship in the population. We've already looked at informal ways to assess this issue.
To make the answering of this question a bit more rigorous, we need to set up a hypothesis test along the lines:

    H0: there is no linear relationship between y and x
    vs.
    HA: there is a linear relationship between y and x

The most common way to test these hypotheses is a procedure based on the F-distribution. Organize the relevant information into a table with the following standard form:

    Source of     Degrees of    Sum of     Mean Square               F
    Variation     Freedom       Squares
    Regression    1             SSR        MSR = SSR/1               F = MSR/MSE
    Error         n - 2         SSE        MSE = s² = SSE/(n - 2)
    Total         n - 1         SST

For reasons that will become clear later in the course, this summary of information in this tabular format is often called an ANOVA table. Table entries that have not been defined in preceding formulas are defined by the formulas in the table itself. Notice that the third row is just the sum of the first and second rows. The mean squares are just the sum-of-squares quantities divided by the degrees of freedom -- hence the somewhat strange way of writing the formula for MSR. (In regression models involving k independent variables, the regression degrees of freedom would be k, and so the denominator in the formula for MSR would be k rather than 1.)

The quantity in the last column, F, is an F-distributed random variable with numerator degrees of freedom equal to 1 and denominator degrees of freedom equal to n - 2. If the points all lie exactly on a straight line, SSE and hence MSE will be zero, and so F will be +∞. This means that the rejection region for the null hypothesis above is a right-tailed rejection region:

    reject H0 if F > Fα,1,n-2

and of course, this means that

    p-value = Pr(F > value given in the table)

It turns out that this F-test gives identical results to the t-test for β1 (which we will describe shortly) when there is just one control variable in the problem.
However, when more than one control variable is present -- a situation which the Food Technology students will explore in greater detail next term -- this F-test gives results not duplicated by the other methods described below.

As we proceed through this document, we'll illustrate the various types of calculations or analysis using one or more of the five examples first described in the earlier document on characterizing relationships. Those examples were:

    Example 1: the simple six point example,
    Example 2: the nectarine size example,
    Example 3: the potato yield example,
    Example 4: the "no apparent linear relationship" example, and
    Example 5: the "non-linear relationship" example.

For easier reference, we repeat a somewhat expanded version of the summary table given near the end of the document on characterizing relationships:

               Simple         PotatoYield    NectarineSize   No Linear      Non-Linear
    n          6              65             50              100            70
    Σxk        38             26974          11918           24572          24757
    Σyk        200            3487.1         8790            15525          38401
    Σxk²       320            12085222       3427438         7851466        10232087
    Σyk²       7402           191947.93      1617678         3223933        22015057
    Σxkyk      1507           1499131.4      1909678         3781584        14498751
    SSx        79.33333333    891426.9846    586663.52       1813634.16     1476243.443
    SSy        735.33333333   4873.062154    72396           813676.75      948816.9857
    SSxy       240.33333333   52038.54769    -185506.4       -33219         917414.4714
    b0         14.14705882    29.42226827    251.1708114     159.7506721    328.7958901
    b1         3.02941176     0.058376680    -0.316205787    -0.018316263   0.621452021
    SSE        7.26470588     1835.224515    13737.80281     813068.3021    378687.9081
    r          0.99504800     0.789553029    -0.900133800    -0.027345493   0.775167168
    r²         0.99012053     0.623393986    0.810240858     0.000747776    0.600884139

Example 1: For the simple example, n = 6, so the Error degrees of freedom are n - 2 = 4, and the Total degrees of freedom are n - 1 = 5. SST = SSy = 735.3333. SSE has already been calculated, given as 7.2647, so that

    SSR = SST - SSE = 735.3333 - 7.2647 = 728.0686

and this gives the value of MSR = SSR/1 as well.
Since MSE = 7.2647/4 = 1.8162, we get

    F = MSR/MSE = 728.0686/1.8162 = 400.88

Thus, the full ANOVA table here becomes:

    Source of     Degrees of   Sum of     Mean Square   F
    Variation     Freedom      Squares
    Regression    1            728.0686   728.0686      400.88
    Error         4            7.2647     1.8162
    Total         5            735.3333

Now, from the table of critical values of the F-distribution, we get F0.05,1,4 = 7.71, so that we can reject H0 at a level of significance of 0.05 if the value of our F-statistic exceeds 7.71. Since 400.88 well exceeds 7.71, we can reject H0 at a level of significance of 0.05. The conclusion is that the data supports the existence of a linear relationship between y and x at a level of significance of 0.05.

For reference purposes, and to serve as practice problems for you, we just state the ANOVA tables for the other four data sets with brief comments. You can calculate all of these numbers using information from the summary table above.

PotatoYield:

    Source of     Degrees of   Sum of     Mean Square   F
    Variation     Freedom      Squares
    Regression    1            3037.84    3037.84       104.28
    Error         63           1835.22    29.1305
    Total         64           4873.06

With the denominator degrees of freedom being 63, we need to compare this test statistic with the closest critical value available in our tables of critical values of the F-distribution: F0.05,1,60 = 4.00. Again, the criterion for rejecting H0 is met in great excess at α = 0.05, and so we conclude without hesitation that the PotatoYield data reflects a linear relationship between yield and mm of available water. The p-value here is 5.5 × 10⁻¹⁵, reinforcing that conclusion, and also giving you an idea of how small the p-value of this F-test is when a fairly large number of data points can be seen to cluster at least loosely about a straight line.
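The arithmetic that fills in such an ANOVA table can be sketched in a few lines of Python; this uses the Example 1 figures above, and the variable names are ours.

```python
# ANOVA-table arithmetic for a simple linear regression (Example 1 numbers):
# SST (= SSy) and SSE are taken as already known from the summary table.
n = 6
SST = 735.3333
SSE = 7.2647

SSR = SST - SSE       # regression sum of squares, 1 degree of freedom
MSR = SSR / 1         # regression mean square
MSE = SSE / (n - 2)   # error mean square, n - 2 degrees of freedom
F = MSR / MSE         # ~400.9; compare against F(0.05, 1, 4) = 7.71

print(round(SSR, 4), round(MSE, 4), round(F, 2))
```

Since F is far beyond the tabled critical value 7.71, the null hypothesis of no linear relationship would be rejected, just as in the text.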
NectarineSize:

    Source of     Degrees of   Sum of     Mean Square   F
    Variation     Freedom      Squares
    Regression    1            58658.2    58658.2       204.95
    Error         48           13737.8    286.204
    Total         49           72396

If we compare this value of F with a critical value of just over 4.00 from our α = 0.05 F-table, we see that the rejection criterion is again met far beyond the requirement. (The p-value here is about 3.9 × 10⁻¹⁹.) Thus, we can conclude with no practical probability of error that this data supports a conclusion that the size of nectarines is related linearly to the crop load.

No Apparent Linear Relationship:

    Source of     Degrees of   Sum of      Mean Square   F
    Variation     Freedom      Squares
    Regression    1            608.4479    608.4479      0.0733
    Error         98           813068.3    8296.615
    Total         99           813676.8

From the first three examples, it was beginning to look like this F-test always gives values of the test statistic in the hundreds and p-values that are zero for all practical purposes. However, for this set of data, constructed to have no discernible linear pattern, you see that we get a value of 0.0733 for the test statistic, whereas rejection of the null hypothesis in this case at α = 0.05 would require the test statistic to be greater than approximately 4.00. So we cannot reject H0 here at α = 0.05. In fact, the p-value in this case is 0.787.

What may be a bit surprising are the following results for the example in which we saw that the points clearly did not follow a linear path.

Non-linear Relationship:

    Source of     Degrees of   Sum of       Mean Square   F
    Variation     Freedom      Squares
    Regression    1            570129.078   570129.078    102.38
    Error         68           378687.908   5568.940
    Total         69           948816.986

The p-value for this F-test is also a minuscule 2.87 × 10⁻¹⁵, which indicates incredibly strong evidence in favor of the alternative hypothesis over the null hypothesis here -- yet we know that HA: there is a linear relationship between y and x is false here. What's happening is worth a caution.
[Figure: "Curved Pattern" -- scatterplot of the non-linear example data (x from 0 to 700, y from 0 to 900) with the best-fit straight line drawn through it.]

The best-fit straight line for the data is the slanted line in the figure above. What the F-test is detecting is that even with the curved pattern of points, the points are on the average closer to this line than they are to the horizontal line at the mean of the y-values. If you like, the F-test is just measuring how much better a straight line is than no line at all, but that doesn't necessarily mean that the straight line itself is all that good a representation of the data. This is why you must always do a scatterplot to verify visually that a straight line model is plausible.

Testing Hypotheses Involving the Slope, β1

The null hypothesis can be written

    H0: β1 = β1,0

where β1,0 is some specific numerical target value. Most commonly, this value β1,0 is zero, and so the result of the hypothesis test would be a decision as to whether the data supports the conclusion of a slope that is nonzero in some way. Under the conditions already described, the standardized test statistic to use here is

    t = (b1 - β1,0) / s_b1        (LR - 11a)

where

    s_b1 = s/√SSx = √(MSE/SSx)        (LR - 11b)

is an estimate of the standard deviation of the sampling distribution of b1. Then, just do the t-test (with ν = n - 2 degrees of freedom):

Table 1.
    Hypotheses                          reject H0 at level of significance α if:     p-value
    H0: β1 = β1,0 vs HA: β1 > β1,0      t > tα,ν (single-tailed rejection region)    Pr(t > test statistic value)
    H0: β1 = β1,0 vs HA: β1 < β1,0      t < -tα,ν (single-tailed rejection region)   Pr(t < test statistic value)
    H0: β1 = β1,0 vs HA: β1 ≠ β1,0      t > tα/2,ν or t < -tα/2,ν (two-tailed)       2·Pr(t > |test statistic value|)

The hypothesis tests here when the value β1,0 is zero are particularly important, because rejection of H0 means that the part of the equation for y which involves x does not disappear -- that is, rejection of H0 means that a linear relationship really does exist between y and x. Thus, this test gives a similar sort of conclusion as does the F-test we described earlier, and it is not a surprise that the p-value for the two-tailed t-test in this situation is equal to the p-value for the F-test. (This is true only when the regression model contains just one control variable.)

Example 6: A food technologist is studying the relationship between brine concentration (%) and the iron content of pickled fish of a certain species. She takes 15 specimens of the fish pickled in various brines, and determines both the brine concentrations and the iron concentrations in the fish in ppm. For the purposes of this experiment, the iron concentrations are considered to be the dependent variable (y), since it is difficult to imagine how small variations in the iron content of the fish (at a ppm level) could have a measurable effect on the concentration of the brine surrounding the fish. The data she collected is given in the table below. From this data, we can compute SSx = 86.4493, SSy = 472.6133, SSxy = 165.1767, which gives SSE = 157.0142. Further, the equation of the regression line comes out as

    ŷ(x) = 45.11 + 1.911 x

Is this data adequate evidence to conclude that the average ppm Fe is linearly related to the concentration of the brine?
Further, can we conclude that the Fe content increases by more than one ppm for each additional percent concentration of the brine?

    Specimen   Concentration of brine (%)   Fe content (ppm)
    1          3.8                          51.8
    2          3.9                          52.5
    3          4.3                          59.3
    4          4.3                          52.8
    5          4.8                          55.6
    6          4.9                          50.7
    7          7.4                          53.9
    8          7.5                          54.9
    9          7.9                          65.2
    10         9.1                          65.8
    11         9.2                          62.9
    12         9.3                          60.4
    13         9.7                          61.3
    14         10.1                         66.8
    15         10.2                         66.1

Solution

The two questions here require two somewhat different approaches. To answer the first question -- is this data evidence of a linear relationship between brine percentage and ppm Fe? -- we could use the F-test described earlier, or we could test the hypotheses

    H0: β1 = 0  vs  HA: β1 ≠ 0

For the F-test, the ANOVA table is:

    Source of     Degrees of   Sum of     Mean Square   F
    Variation     Freedom      Squares
    Regression    1            315.5991   315.5991      26.13
    Error         13           157.0142   12.0780
    Total         14           472.6133

leading to a p-value of 0.0001996 (obtained using Excel's FDIST() function). Thus, the F-test indicates the data is strong evidence in support of a linear relationship between ppm Fe and percent concentration of brine.

We get the same result using the t-test on the hypotheses involving β1. First we calculate

    s_b1 = √(MSE/SSx) = √(12.0780/86.4493) = 0.37378

Thus, the standardized test statistic has the value:

    t = (b1 - β1,0)/s_b1 = (1.911 - 0)/0.37378 = 5.112

This is quite a large value -- much larger than t0.025,13 = 2.160 from our tables -- so that we can reject H0 in favor of HA here easily at a level of significance of 0.05. In fact, when we calculate the p-value for this test, we get

    p-value = 2·Pr(t > 5.112, ν = 13) = 2 × 0.0000998 = 0.0001996

just as we got for the F-test.
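As a numerical cross-check, the slope inferences in this example -- together with the one-sided test and the interval estimate worked in the text that follows -- can be sketched in Python; the variable names are ours, and the critical values are the tabled t(0.025, 13) and t(0.05, 13).

```python
from math import sqrt

# Figures from Example 6
SSx = 86.4493
MSE = 12.0780
b1 = 1.911

s_b1 = sqrt(MSE / SSx)      # (LR - 11b): ~0.37378
t0 = (b1 - 0) / s_b1        # test of H0: beta1 = 0  -> ~5.11
t1 = (b1 - 1) / s_b1        # test of H0: beta1 = 1  -> ~2.44

# (LR - 12): 95% confidence interval for beta1
t_025_13 = 2.160            # tabled t(0.025, 13)
lo = b1 - t_025_13 * s_b1   # ~1.104
hi = b1 + t_025_13 * s_b1   # ~2.718

print(round(s_b1, 5), round(t0, 3), round(t1, 3), round(lo, 3), round(hi, 3))
```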
To answer the second question here, we need to test a different set of hypotheses:

    H0: β1 = 1  vs  HA: β1 > 1

If we are able to reject this H0 in favor of this HA, we will have demonstrated support for the claim that the slope of the best fit line relating ppm Fe to % concentration of the brine is greater than 1, equivalent to saying that the ppm Fe increases by more than one unit for each percent increase in the brine concentration. These hypotheses are tested again using the t-test, but now the standardized test statistic is:

    t = (b1 - β1,0)/s_b1 = (1.911 - 1)/0.37378 = 2.437

We can reject H0 at a level of significance of 0.05 if this test statistic is greater than t0.05,13 = 1.771. Since 2.437 is greater than 1.771, we can reject H0, and so conclude that the ppm Fe does increase by more than one unit for each percent increase in brine concentration. (The p-value in this case is given by Excel's TDIST() function as 0.0150, which is well below the conventional cutoff at 0.05.)

From formulas (LR - 11a, b) it is easy to deduce that the formula for the 100(1 - α)% confidence interval estimate of β1 is

    β1 = b1 ± tα/2,ν · s_b1        (LR - 12)

Example 7: Construct a 95% confidence interval estimate of the slope of the true regression line based on the ppm Fe vs brine concentration data in Example 6.

Solution

We just substitute the required (already calculated) values into formula (LR - 12). The degrees of freedom, ν, here are n - 2 = 15 - 2 = 13, and from our t-tables, we get that t0.025,13 = 2.160. Thus,

    β1 = 1.911 ± (2.160)(0.37378)

or

    β1 = 1.911 ± 0.807    @95%

or

    1.104 ≤ β1 ≤ 2.718    @95%

This means that there is a 95% probability that the interval 1.104 to 2.718 captures or contains the value of the slope of the true regression line in this case.

Estimation and Prediction of y-values

The formula

    ŷ = b0 + b1 x        (LR - 3)

gives us information about the values of y that may occur for a specific value of x.
Generally, there are two issues of interest here, distinguished by the keywords estimation and prediction.

We use the word estimation to indicate the process of calculating an estimate of the mean of all of the y-values corresponding to a specific value of x, which we will indicate by the symbol μ_y|x. The symbol μ indicates that we are speaking of a population mean, the subscript 'y' indicates that it is a mean value of y, and the subscript 'x' indicates that this value depends on the specific value of x selected. The value of ŷ(x) serves as an acceptable point estimator for μ_y|x. Of course, ŷ(x) is a random variable (because the formula (LR - 3) is based on the data in a random sample of observations), and so to obtain an interval estimate of μ_y|x, or to be able to test hypotheses involving μ_y|x, we need a formula for the variance of the sampling distribution of ŷ(x). The underlying probability theory here is quite complex and really just beyond the scope of this course, so we will just state the resulting formulas and demonstrate their use.

It can be shown mathematically that

    t = (ŷ(x*) - μ_y|x*) / s_ŷ(x*)        (LR - 13)

is a t-distributed random variable with n - 2 degrees of freedom. In this formula, we have used the symbol x* to indicate that a specific value of x is being considered. The denominator, s_ŷ(x*), indicates the standard deviation of the sampling distribution of ŷ(x) for that particular value of x, and is given by the formula

    s²_ŷ(x*) = s² [ 1/n + (x* - x̄)²/SSx ]        (LR - 14)

where s² is a synonym for MSE, defined in the ANOVA table earlier. The form of this expression has some important practical implications. The first term in the brackets gives the usual s²/n type of expression for the variance of a sampling distribution. The second term is novel here, though.
Notice that the second term is zero when x* = x̄, that is, when x* is at the horizontal "center" of the scatterplot of points, but increases in value if you move either rightwards or leftwards from that location. Since the precision with which we will be able to estimate μ_y|x* goes down as this standard deviation increases in value, what this term in (LR - 14) is reflecting is that the greatest estimation precision will occur right in the middle of the cloud of points, and gets poorer as you move out towards the extremes of the observations.

Given these two formulas, we can now write down the usual statistical inference formulas for μ_y|x*. The 100(1 - α)% confidence interval estimate of μ_y|x* is given by

    μ_y|x* = ŷ(x*) ± tα/2,n-2 · s_ŷ(x*)    @ 100(1 - α)%        (LR - 15)

and the null hypothesis

    H0: μ_y|x* = μ0        (LR - 16a)

can be tested by applying the t-test formulas to the standardized test statistic

    t = (ŷ(x*) - μ0) / s_ŷ(x*)        (LR - 16b)

Example 8: Using the data given in Example 6 above, obtain 95% confidence interval estimates of the mean ppm Fe in fish pickled in a 7% brine solution and of fish pickled in a 10% brine solution.

Solution

We are being asked to construct a 95% confidence interval estimate for μ_y|x* for the values x* = 7.0 and x* = 10.0. For the data given in Example 6, it is easy to determine that x̄ = 7.0933. We already know from previous calculations that n = 15, SSx = 86.4493 and s² = MSE = 12.0780. From the t-table, we get that tα/2,n-2 = t0.025,13 = 2.160.
Thus, we just combine formulas (LR - 14) and (LR - 15), substituting in the required numbers to get:

    μ_y|7.0 = ŷ(7.0) ± t0.025,13 · s_ŷ(7.0)    @95%

for which we need

    ŷ(7.0) = 45.11 + 1.911 × 7.0 = 58.4883 ppm

and

    s²_ŷ(7.0) = s² [ 1/n + (7.0 - x̄)²/SSx ] = 12.0780 [ 1/15 + (7.0 - 7.0933)²/86.4493 ] = 0.8064

so that s_ŷ(7.0) = 0.8980. So

    μ_y|7.0 = 58.4883 ± (2.160)(0.8980) = 58.4883 ± 1.9397    @95%

We should probably round these numbers down to about 2 decimal places at the most before reporting the result, but the precision above will help you confirm that you've got the calculations right when you attempt this on your own. In words, this result means that there is a 95% probability that pickled fish of this type that have been prepared in a 7.0% brine solution will have a mean ppm Fe which is contained in the interval 58.48 ± 1.94 ppm.

The calculation for x* = 10.0% is done in exactly the same way, substituting the value 10.0 wherever 7.0 occurs in the above calculation. In summary, we get

    ŷ(10.0) = 45.11 + 1.911 × 10.0 = 64.2204 ppm

    s²_ŷ(10.0) = 12.0780 [ 1/15 + (10.0 - 7.0933)²/86.4493 ] = 1.9856

so that s_ŷ(10.0) = 1.4091, and so, finally,

    μ_y|10.0 = 64.2204 ± (2.160)(1.4091) = 64.2204 ± 3.0437    @95%

Notice particularly how much larger s_ŷ(10.0) is in value than is s_ŷ(7.0), and as a result, how much wider the confidence interval is for μ_y|10.0 than it is for μ_y|7.0. This is because x* = 7.0 is much nearer to the value x̄ = 7.0933 than is the value x* = 10.0, and so the second term in formula (LR - 14) makes a much larger contribution to the standard deviation when x* = 10.0 than when x* = 7.0.

Example 9: Does the pickled fish data given in Example 6 support a claim that when this type of fish is pickled in a 7% brine solution, the mean ppm Fe is less than 60 ppm?
Solution To answer this question, we need to test the hypotheses: H0 : y 7.0 60 ppm vs HA : y 7.0 60 ppm The value of the standardized test statistic for these hypotheses can be calculated from numbers already obtained in Example 8, just above. We get t ŷ(7.0) 0 58 .4883 60 1.683 s ŷ(7.0 ) 0.8980 The hypotheses here require a left-tailed test: we can reject H0 at a level of significance of 0.05 if the standardized test statistic has a value less than -t0.05,13 = -1.771. Since -1.683 is not quite less than -1.771, we cannot reject H0 here at a level of significance of 0.05, and so the original claim is not supported at that level of significance. (From Excel's TDIST() function, we get that the p-value is 0.05808 here.) On the other hand, we use the word prediction in statistics to refer to the attempt to forecast the actual values that may be observed for a random variable. So, whereas the word estimation has been used when we speak of estimating the value of a population parameter such as a mean value (for instance, in this discussion, the mean value of y for a specific value of x), we use the word prediction to speak of forecasting what value of y may result when an individual observation, y p(x*), is made for a specific value of x* of x. Of course, a single number result here is not too meaningful -- what is more useful is an "interval estimate" type of result for this value. Again, the best "predictor" of the value of y that we are likely to observe is ŷ( x *) , which is a point estimator of the mean, y x * , of all of the y-values that could be observed for x = x*. A measure of the uncertainty in such a prediction is the variance of the error, Var[y - ŷ( x *) ]. But by results quoted much earlier in the course, we can write Var y ŷx * Var y Var ŷx * Page 12 of 20 Inference in Linear Regression and Correlation © David W. 
        = σ² + σ² [ 1/n + (x* - x̄)²/SSx ]

        = σ² [ 1 + 1/n + (x* - x̄)²/SSx ]      (LR - 17)

where the variance, σ², of y would in practice be estimated by the value of s² = MSE. Thus, the variance of the individual values of y is given by a formula very similar to (LR - 14), the variance for ŷ(x*) -- the only difference is the additional term "1" in the square brackets. For values of x* near x̄, this term can be a relatively large contribution to the value of the expression in square brackets. Finally, since the standardized prediction error is essentially t-distributed with n - 2 degrees of freedom under the conditions normally adopted in linear regression work, we can write:

    y_p(x*) = ŷ(x*) ± t_{α/2,n-2} · s · √( 1 + 1/n + (x* - x̄)²/SSx )   @ 100(1 - α)%      (LR - 18)

This formula means that there is a probability of 100(1 - α)% that the next value of y observed when x = x* will lie between the numbers given when the + and - signs are used. But this means that the two limits given by (LR - 18) enclose 100(1 - α)% of the possible individual y-values. For this reason, formula (LR - 18) is not normally used to predict the value of a single observation of y, but rather, to compute the interval containing the middle 100(1 - α)% of the possible individual y-values that may be observed.

In fact, the limits of both the estimation interval, (LR - 15), and this prediction interval, (LR - 18), are functions of x, and so it is possible to calculate these limits in each case for a succession of values of x, and plot them on a graph. The upper and lower limits of the estimation and prediction intervals then form pairs of curves above and below the regression line, enclosing regions located symmetrically about the regression line. These regions are called, respectively, the 100(1 - α)% estimation band and the 100(1 - α)% prediction band for the regression line.

Example 10: Construct a graph of the 90% estimation and prediction bands based on the data presented in Example 6 above.
Solution

The calculation of values of the limits given by (LR - 15) and (LR - 18) for a succession of values of x* is a very tedious task to do by hand. Instead, we've set the formulas up in an Excel/97 spreadsheet. The following are just some of the calculated values obtained using α = 0.10:

    x*    ŷ(x*)      μ_{y|x*}(-)   μ_{y|x*}(+)   y_p(x*)(-)   y_p(x*)(+)
    1     47.02428   42.68894      51.35963      39.49585     54.55271
    4     52.75631   50.16431      55.34831      46.07795     59.43467
    7     58.48834   56.89796      60.07871      52.13135     64.84532
    9     62.30969   60.28028      64.33909      55.82891     68.79047

Here μ_{y|x*}(+) and μ_{y|x*}(-) are the results obtained from formula (LR - 15) using the + and - signs, respectively. Similarly, y_p(x*)(+) and y_p(x*)(-) are the results obtained from formula (LR - 18) with the + and - signs, respectively.

The graph to the right shows the relationship between a variety of items arising in this example. [Figure: "Iron Content of Pickled Fish" -- scatterplot of Fe (ppm), vertical axis from 40 to 80, versus brine (%), horizontal axis from 0 to 12, showing the fifteen data points, the fitted line, and the 90% estimation and prediction bands.] The plotted points are the fifteen data points forming the original scatterplot. The straight line in the middle of the pattern is the actual best-fit straight line calculated using formula (LR - 3).

The pair of curves closest to this straight line are the graphs of μ_{y|x*}(+) and μ_{y|x*}(-) vs x*. You see that these two curves taken as a pair seem to form a band that is narrowest around x = 7, the vicinity of the mean of the x-values for which data was obtained. For a given value of x, the vertical interval between these two curves is the 90% confidence interval estimate of the mean value of y corresponding to that value of x.

The outer two curves are the graphs of y_p(x*)(+) and y_p(x*)(-). These also form a band, which is narrowest in the vicinity of the mean of the x-values for which observations were taken. The vertical interval between these two curves contains 90% of the values of y for that value of x.
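For readers who would rather script these limits than set them up in a spreadsheet, they can be computed directly from the summary statistics quoted in the examples above (n, the fitted coefficients, x̄, SSx and MSE). The following is a minimal Python sketch, not part of the original handout; the function name is ours, and the coefficients are the rounded values quoted in the text, so results match the table only to roughly two decimals:

```python
import math

# Summary statistics from the pickled-fish example (rounded as quoted in the text)
n = 15                    # number of (x, y) pairs
b0, b1 = 45.11, 1.911     # fitted intercept and slope
xbar = 7.0933             # mean of the x-values
SSx = 86.4493             # sum of squares of the x-values
MSE = 12.0780             # s^2, the mean squared error

def limits(x_star, t_crit):
    """Estimation (LR - 15) and prediction (LR - 18) limits at x = x_star."""
    y_hat = b0 + b1 * x_star
    bracket = 1/n + (x_star - xbar)**2 / SSx
    half_est = t_crit * math.sqrt(MSE * bracket)         # half-width for the mean of y
    half_pred = t_crit * math.sqrt(MSE * (1 + bracket))  # half-width for a single y
    return (y_hat - half_est, y_hat + half_est,
            y_hat - half_pred, y_hat + half_pred)

# 90% bands (t_{0.05,13} = 1.771) -- approximately reproduces the table above;
# small differences arise from the rounded coefficients
for x in (1, 4, 7, 9):
    print(x, [round(v, 2) for v in limits(x, 1.771)])

# 95% interval for the mean at x* = 7.0 (t_{0.025,13} = 2.160), i.e. about 58.49 ± 1.94
print(limits(7.0, 2.160)[:2])
```

The same function serves both bands because (LR - 15) and (LR - 18) differ only in the extra "1" inside the square root.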
Notice that because of the shape of these "bands", we get the most precise estimate of the mean value of y, and of the interval containing the next value of y to be observed, in the vicinity of the mean of the values of x observed, and both of these "estimates" get less precise as you move away to the left or the right. Thus, you should always plan your experiments so that the mean of the observed x-values falls in the vicinity of the x-values for which you may want to calculate μ_{y|x*} or y_p(x*).

Inferences About the Correlation Coefficient

The theory here gets well beyond the study of probability theory that we are able to do in this course. It involves the so-called bivariate normal distribution, which must be graphed in three dimensions. Its graph would be a surface something like an actual brass bell. The methods below are valid only if the (x, y) data obeys the bivariate normal distribution at least approximately. There are no simple tests to verify this, but there is a sort of negative test. If the values of x by themselves or the values of y by themselves are not approximately normally distributed, then, taken together as pairs of values, (x, y), they cannot be approximately bivariate normally distributed. So, you could prepare normal probability plots of the x's and the y's separately. If either plot caused serious concern that the data was not adequately consistent with the normal distribution, then the methods we now describe are probably not valid.

We developed and illustrated the calculation and rule-of-thumb interpretation of the sample correlation coefficient, r, earlier in the document on characterizing relationships. This sample correlation coefficient, r, is, of course, a random variable, estimating the population correlation coefficient, ρ. It turns out that to test the null hypothesis H0: ρ = ρ0, we need to consider two distinct situations: one when ρ0 = 0, and the other when ρ0 is nonzero.
(i) Testing H0: ρ = 0

In this case, the quantity

    t = r √(n - 2) / √(1 - r²)      (LR - 19)

has the t-distribution with ν = n - 2 degrees of freedom. Calculate this quantity, and apply the t-test rules.

Example 11: Does the data given in Example 6 above support a claim that there is a nonzero correlation between % concentration of the brine and the ppm Fe in the pickled fish?

Solution

This question is asking us to test the hypotheses

    H0: ρ = 0  vs  HA: ρ ≠ 0      (LR - 20)

where ρ is the population correlation coefficient between the two variables described. For completeness, we start by displaying the normal probability plots for the x- and y-values here. [Figure: "Normal Probability Plot for X-values" (% concentration of brine, 0 to 12) and "Normal Probability Plot for Y-values" (ppm Fe, 50 to 70), each plotted against standard normal scores from -2 to 2.] Neither of these is particularly great. The plot for the x-coordinates shows a definite heavy-tailedness, and the plot for the y-values is not all that tightly clustered about a straight line. These would normally be cause for considerable concern. However, because the value of the standardized test statistic will turn out to be so much larger than necessary to reject H0 at a level of significance of 0.05, we can probably safely proceed here -- keeping in mind that our results may be somewhat optimistic.

The values of SSx, SSy and SSxy are given in Example 6, so we can calculate immediately that

    r = SSxy / √(SSx · SSy) = 165.1767 / √(86.4493 × 472.6133) = 0.8172

Thus, the value of the standardized test statistic is

    t = r √(n - 2) / √(1 - r²) = 0.8172 √(15 - 2) / √(1 - 0.8172²) = 5.112

Since this is a two-tailed test, we reject H0 at α = 0.05 if t > t_{0.025,13} = 2.160 or if t < -t_{0.025,13} = -2.160.
Since 5.112 is much greater than 2.160, we can comfortably reject H0 at a level of significance of 0.05, concluding that there is a nonzero correlation coefficient between the variables x and y in this example -- a linear relationship does exist. The p-value for this test works out to be 0.0001996, but we may not want to make too much of this precise result because of the concerns over the potential failure of the population to obey a bivariate normal distribution.

Note three things about the method just illustrated:

(i) Testing H0: ρ = 0 is a quite a bit more specific way of evaluating the presence of evidence of a linear relationship than is the "rule of thumb" given in the document on characterizing relationships. For one thing, the hypothesis test takes into account sample size, which we have seen is a factor of considerable importance in statistical work.

(ii) We can't really exploit this procedure to deduce a formula for a confidence interval estimate of ρ based on the t-distribution. Because H0 states that ρ = 0, a confidence interval that allowed the possibility of ρ having one of a range of values about zero would be contradictory. The more general method described just below does allow the derivation of a confidence interval estimate formula for ρ.

(iii) You may have noticed that the value of the standardized test statistic, t, just above looks a bit familiar. In fact, it is precisely the value we got when testing

    H0: β1 = 0  vs  HA: β1 ≠ 0      (LR - 21)

earlier in Example 6. In this case, the hypotheses (LR - 20) and (LR - 21) are completely equivalent, and we can show that the formulas for the two test statistics are identical.
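Before wading through the algebra, the claimed identity is easy to confirm numerically from the Example 6 sums of squares. This short Python check is ours, not part of the original handout; it computes the test statistic both ways, once via (LR - 19) and once via the slope:

```python
import math

# Sums of squares from Example 6 (pickled fish), n = 15
n = 15
SSx, SSy, SSxy = 86.4493, 472.6133, 165.1767

# Test statistic via the correlation coefficient, formula (LR - 19)
r = SSxy / math.sqrt(SSx * SSy)
t_r = r * math.sqrt(n - 2) / math.sqrt(1 - r**2)

# Test statistic via the slope: t = b1 / s_b1, with SSE from (LR - 8)
b1 = SSxy / SSx
SSE = SSy - SSxy**2 / SSx
MSE = SSE / (n - 2)
t_b1 = b1 / math.sqrt(MSE / SSx)

print(round(r, 4), round(t_r, 3), round(t_b1, 3))  # 0.8172 5.112 5.112
```

Both routes give the value 5.112 obtained in Example 11, agreeing to within floating-point rounding, as the derivation below establishes in general.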
Starting with the standardized test statistic for (LR - 21), we can proceed as follows:

    t = b1 / s_b1 = (SSxy/SSx) / √( MSE/SSx ) = (SSxy/SSx) / √( SSE / ((n - 2) SSx) )

But, from formula (LR - 8) earlier,

    SSE = SSy - SSxy²/SSx = ( SSx · SSy - SSxy² ) / SSx

Substitute this into the denominator of the previous equation:

    t = (SSxy/SSx) / √( ( SSx · SSy - SSxy² ) / ((n - 2) SSx²) )

      = (SSxy/SSx) / [ (1/√(n - 2)) · ( √(SSx · SSy) / SSx ) · √( 1 - SSxy²/(SSx · SSy) ) ]

This looks pretty bad, but all we've done is some factoring in the denominator. Now, invert and multiply the factor involving the square root of n - 2 -- it will appear in the numerator of the result. The SSx in the denominator of the numerator cancels the SSx in the denominator of the denominator. The small square root left in the denominator can be moved into the denominator of the numerator, leaving

    t = [ SSxy / √(SSx · SSy) ] · √(n - 2) / √( 1 - SSxy²/(SSx · SSy) ) = r √(n - 2) / √(1 - r²)

since

    r = SSxy / √(SSx · SSy)   and so   1 - SSxy²/(SSx · SSy) = 1 - r²

Thus, we've demonstrated algebraically that the values of the test statistics for the two tests are given by equivalent formulas. It doesn't matter which of these two pairs of hypotheses you test -- you'll get the same result, and so they both lead to the same interpretation.

(ii) Testing H0: ρ = ρ0 (ρ0 ≠ 0)

The formulas get a bit more complicated here, but not beyond what your mathematical skills should be able to handle. Here, we must use something called the Fisher Transform, given by

    V = ½ ln( (1 + r)/(1 - r) )      (LR - 22)

where "ln" is the symbol for the natural logarithm.
Then it turns out that if the x- and y-values obey a bivariate normal distribution, the quantity V (actually, a sample statistic, since its value depends on the sample statistic r) is normally distributed with mean and standard deviation given by

    μ_V = ½ ln( (1 + ρ)/(1 - ρ) )   and   σ_V = 1/√(n - 3)      (LR - 23)

Thus, to test the hypothesis H0: ρ = ρ0 (ρ0 ≠ 0), simply calculate the standardized test statistic

    z = [ V - ½ ln( (1 + ρ0)/(1 - ρ0) ) ] / ( 1/√(n - 3) )      (LR - 24)

and draw your conclusion based on the standard z-test rules.

Example 12: Is the data for Nectarine Sizes, given in the summary table on page 5 of this document, evidence to claim that the population correlation coefficient between the size of nectarines and the crop load is less than -0.8 (that is, that there is a strong negative correlation between these two variables)?

Solution

We are asked to test the hypotheses

    H0: ρ = -0.8  vs  HA: ρ < -0.8

The relevant data in the table on page 5 is that the sample was of size n = 50, and the sample correlation coefficient was r = -0.9001338. To calculate the required standardized test statistic, start with formula (LR - 22):

    V = ½ ln( (1 + r)/(1 - r) ) = ½ ln( (1 - 0.9001338)/(1 + 0.9001338) ) = -1.4729

Then, using formula (LR - 24),

    z = [ -1.4729 - ½ ln( (1 - 0.8)/(1 + 0.8) ) ] / ( 1/√(50 - 3) ) = -2.566

The hypotheses here require a left-tailed test, so we would reject H0 at a level of significance of 0.05 if the value of z turns out to be less than -z_{0.05} = -1.645. Since -2.566 is less than -1.645, we can reject H0 at a level of significance of 0.05, and so conclude that the data does support the conclusion that the correlation coefficient for these two variables is less than -0.8. (The p-value for this test is 0.0051.)

Formulas (LR - 22) and (LR - 23), along with the fact that V is normally distributed, can be exploited to derive a confidence interval formula for the correlation coefficient, ρ.
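A quick way to check the arithmetic of Example 12 is to code formulas (LR - 22) and (LR - 24) directly. The following minimal Python sketch is ours (the function name is not standard):

```python
import math

def fisher_z(r, rho0, n):
    """Standardized test statistic (LR - 24) for H0: rho = rho0."""
    V = 0.5 * math.log((1 + r) / (1 - r))           # Fisher transform (LR - 22); same as math.atanh(r)
    mu_V = 0.5 * math.log((1 + rho0) / (1 - rho0))  # mean of V under H0, from (LR - 23)
    return (V - mu_V) / (1 / math.sqrt(n - 3))      # sigma_V = 1/sqrt(n - 3)

# Example 12: nectarine data, r = -0.9001338, n = 50, rho0 = -0.8
z = fisher_z(-0.9001338, -0.8, 50)
print(round(z, 3))  # -2.566
```

Since -2.566 < -1.645, the code reproduces the left-tailed rejection reached above.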
As usual when the normal distribution is involved, we start out with the statement

    Pr( -z_{α/2} ≤ z ≤ z_{α/2} ) = 1 - α

but now, for z, substitute the generic equivalent of equation (LR - 24):

    Pr( -z_{α/2} ≤ [ V - ½ ln( (1 + ρ)/(1 - ρ) ) ] / ( 1/√(n - 3) ) ≤ z_{α/2} ) = 1 - α

This means that there is a probability of 1 - α that the following event will occur:

    V - z_{α/2}/√(n - 3) ≤ ½ ln( (1 + ρ)/(1 - ρ) ) ≤ V + z_{α/2}/√(n - 3)      (LR - 25)

Now, what we really need is just ρ in the middle of this double inequality. This means we're going to have to rearrange things a bit (including solving a logarithmic inequality -- you probably never expected to see one of those again!). It looks bad, but not much more than careful basic algebra is required. For example, take the left-hand inequality:

    V - z_{α/2}/√(n - 3) ≤ ½ ln( (1 + ρ)/(1 - ρ) )      (LR - 26)

It will be convenient to use some notation to make the equations a bit simpler looking. So, use the symbol θ1 to stand for the left-hand side of this inequality:

    θ1 = V - z_{α/2}/√(n - 3)      (LR - 27)

so that now, we can write the preceding inequality as

    θ1 ≤ ½ ln( (1 + ρ)/(1 - ρ) )

Now, we must solve for ρ. The procedure is to get the logarithm absolutely by itself on one side:

    2θ1 ≤ ln( (1 + ρ)/(1 - ρ) )

and then convert the inequality to exponential form:

    e^{2θ1} ≤ (1 + ρ)/(1 - ρ)

The direction of the inequality stays as it was, because the exponential function is a strictly increasing function. Now we have a relatively simple algebraic inequality for ρ (keep in mind that the left-hand side of this last inequality is just a number). Since 1 - ρ > 0, multiplying through and collecting the terms involving ρ gives

    ρ ≥ ( e^{2θ1} - 1 ) / ( e^{2θ1} + 1 )      (LR - 28a)

On the other hand, if you start with the right-hand inequality in (LR - 25) and isolate the logarithmic term, you come up with

    ½ ln( (1 + ρ)/(1 - ρ) ) ≤ V + z_{α/2}/√(n - 3) = θ2

(where the expression on the right defines what we mean by θ2 here). Solving this logarithmic inequality for ρ gives

    ρ ≤ ( e^{2θ2} - 1 ) / ( e^{2θ2} + 1 )      (LR - 28b)

Finally, putting (LR - 28a, b) together gives the desired result:

    ( e^{2θ1} - 1 ) / ( e^{2θ1} + 1 ) ≤ ρ ≤ ( e^{2θ2} - 1 ) / ( e^{2θ2} + 1 )   @ 100(1 - α)%      (LR - 29)
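The whole chain (LR - 25) through (LR - 29) is straightforward to evaluate by machine. A minimal Python sketch of ours, assuming only the summary values r and n (note that (e^{2θ} - 1)/(e^{2θ} + 1) is just tanh θ, so math.tanh would do the back-transformation equally well):

```python
import math

def rho_confidence_interval(r, n, z_half_alpha=1.96):
    """100(1 - alpha)% confidence interval for rho, per (LR - 29);
    z_half_alpha is z_{alpha/2} (1.96 gives 95% confidence)."""
    V = 0.5 * math.log((1 + r) / (1 - r))   # Fisher transform (LR - 22)
    half = z_half_alpha / math.sqrt(n - 3)
    theta1, theta2 = V - half, V + half     # the two limits in (LR - 25)
    back = lambda th: (math.exp(2*th) - 1) / (math.exp(2*th) + 1)  # = tanh(th)
    return back(theta1), back(theta2)

# Nectarine data: r = -0.9001338, n = 50, 95% confidence
print(rho_confidence_interval(-0.9001338, 50))  # ≈ (-0.9424, -0.8297)
```

Because the back-transformation is strictly increasing, θ1 < θ2 always maps to a lower limit below the upper limit, so no sorting of the endpoints is needed.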
Example 13: Compute the 95% confidence interval estimate of the population correlation coefficient from the Nectarine Size data.

Solution

Very briefly, we've already determined that r = -0.9001 to four decimal places, and we know that the sample size is n = 50. From the normal probability tables we get z_{0.025} = 1.96. In Example 12, we calculated V to be -1.4729. Now, calculate the quantities θ1 and θ2:

    θ1 = V - z_{α/2}/√(n - 3) = -1.4729 - 1.96/√(50 - 3) = -1.7588

and

    θ2 = V + z_{α/2}/√(n - 3) = -1.4729 + 1.96/√(50 - 3) = -1.1870

Thus, formula (LR - 29) gives

    ( e^{2(-1.7588)} - 1 ) / ( e^{2(-1.7588)} + 1 ) ≤ ρ ≤ ( e^{2(-1.1870)} - 1 ) / ( e^{2(-1.1870)} + 1 )

or

    -0.9424 ≤ ρ ≤ -0.8297   @ 95%

as the desired result. Thus, there is a probability of 95% that the interval between -0.9424 and -0.8297 contains the true value of the population correlation coefficient, ρ, in this case.

This is as far as we go with linear regression and correlation in this course. Those of you in Food Technology will be taking a subsequent course that will go deeper into the multiple regression and nonlinear multiple regression models, which tend to be very useful in research in food technology.