Chapter 12 Simple Linear Regression and Correlation
Copyright (c) 2004 Brooks/Cole, a division of Thomson Learning, Inc.

12.1 The Simple Linear Regression Model

Linear Relationship
The simplest deterministic mathematical relationship between two variables x and y is a linear relationship $y = \beta_0 + \beta_1 x$. The set of pairs (x, y) for which $y = \beta_0 + \beta_1 x$ determines a straight line.

Terminology
The variable whose value is fixed by the experimenter, denoted x, is the independent (predictor, explanatory) variable. For a fixed x, the second variable will be a random variable Y with observed value y, referred to as the dependent (response) variable.

The Simple Linear Regression Model
There exist parameters $\beta_0$, $\beta_1$, and $\sigma^2$ such that for any fixed value of x, the dependent variable is related to x through the model equation
$$Y = \beta_0 + \beta_1 x + \varepsilon$$
The quantity $\varepsilon$ is a random variable (called the random deviation) with $E(\varepsilon) = 0$ and $V(\varepsilon) = \sigma^2$.

Linear Regression Model
[Figure: the true regression line $y = \beta_0 + \beta_1 x$ and an observed point $(x_1, y_1)$ that deviates vertically from the line by $\varepsilon_1$ at $x = x_1$.]

Distribution of $\varepsilon$
[Figure: normal density with mean 0 and standard deviation $\sigma$.]

Distribution of Y for Different Values of x
[Figure: distributions of Y at $x_1$, $x_2$, $x_3$, centered on the line $y = \beta_0 + \beta_1 x$ at the means $\beta_0 + \beta_1 x_1$, $\beta_0 + \beta_1 x_2$, $\beta_0 + \beta_1 x_3$.]

12.2 Estimating Model Parameters
Principle of Least Squares
The vertical deviation of the point $(x_i, y_i)$ from the line $y = b_0 + b_1 x$ is $y_i - (b_0 + b_1 x_i)$. The sum of squared vertical deviations from the points $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ to the line is
$$f(b_0, b_1) = \sum_{i=1}^{n} \left[ y_i - (b_0 + b_1 x_i) \right]^2$$

The least-squares (regression) line for the data is the line that minimizes this sum. It is given by $y = \hat{\beta}_0 + \hat{\beta}_1 x$, where
$$b_1 = \hat{\beta}_1 = \frac{\sum x_i y_i - \left(\sum x_i\right)\left(\sum y_i\right)/n}{\sum x_i^2 - \left(\sum x_i\right)^2/n}$$
and
$$b_0 = \hat{\beta}_0 = \frac{\sum y_i - \hat{\beta}_1 \sum x_i}{n} = \bar{y} - \hat{\beta}_1 \bar{x}$$

Ex. Find the least-squares line for the data (1, 2), (2, 3), (3, 7).

        x     y     xy    x^2
        1     2     2     1
        2     3     6     4
        3     7     21    9
Sum:    6     12    29    14

$$\hat{\beta}_1 = \frac{3(29) - (6)(12)}{3(14) - 6^2} = 2.5, \qquad \hat{\beta}_0 = \frac{12 - 2.5(6)}{3} = -1$$
The least-squares line is $y = -1 + 2.5x$.

Fitted Values and Residuals
The fitted (predicted) values $\hat{y}_1, \ldots, \hat{y}_n$ are obtained by substituting $x_1, \ldots, x_n$ into the equation of the estimated regression line: $\hat{y}_1 = \hat{\beta}_0 + \hat{\beta}_1 x_1, \ldots, \hat{y}_n = \hat{\beta}_0 + \hat{\beta}_1 x_n$. The residuals are the vertical deviations $y_1 - \hat{y}_1, \ldots, y_n - \hat{y}_n$ from the estimated line.

Error Sum of Squares
The error sum of squares, denoted SSE, is
$$\mathrm{SSE} = \sum (y_i - \hat{y}_i)^2 = \sum \left[ y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i) \right]^2$$
and the estimate of $\sigma^2$ is
$$\hat{\sigma}^2 = s^2 = \frac{\mathrm{SSE}}{n - 2} = \frac{\sum (y_i - \hat{y}_i)^2}{n - 2}$$

Computational Formula
A computational formula for the SSE is
$$\mathrm{SSE} = \sum y_i^2 - \hat{\beta}_0 \sum y_i - \hat{\beta}_1 \sum x_i y_i$$

Total Sum of Squares
The total sum of squares, denoted SST, is
$$\mathrm{SST} = S_{yy} = \sum (y_i - \bar{y})^2 = \sum y_i^2 - \left(\sum y_i\right)^2 / n$$

Coefficient of Determination
The coefficient of determination, denoted by $r^2$, is given by
$$r^2 = 1 - \frac{\mathrm{SSE}}{\mathrm{SST}}$$
It is interpreted as the proportion of observed y variation that can be explained by the simple linear regression model.
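The formulas above can be checked numerically. The following is a minimal Python sketch, not part of the original slides; the function names least_squares_fit and sums_of_squares are hypothetical and chosen only for illustration. It applies the summation formulas for $\hat{\beta}_1$ and $\hat{\beta}_0$ to the example data and then computes SSE, SST, $r^2$, and $s^2$.

```python
def least_squares_fit(xs, ys):
    """Return (b0, b1) for the least-squares line y = b0 + b1*x."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    # Slope: (sum xy - sum x * sum y / n) / (sum x^2 - (sum x)^2 / n)
    b1 = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
    # Intercept: (sum y - b1 * sum x) / n, which equals y_bar - b1 * x_bar
    b0 = (sum_y - b1 * sum_x) / n
    return b0, b1


def sums_of_squares(xs, ys, b0, b1):
    """Return (SSE, SST, r_squared) for the fitted line."""
    n = len(ys)
    # SSE: sum of squared residuals y_i - y_hat_i
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    # SST: sum y^2 - (sum y)^2 / n
    sst = sum(y * y for y in ys) - sum(ys) ** 2 / n
    return sse, sst, 1 - sse / sst


# Example data from the slides: (1, 2), (2, 3), (3, 7)
xs = [1, 2, 3]
ys = [2, 3, 7]
b0, b1 = least_squares_fit(xs, ys)
sse, sst, r2 = sums_of_squares(xs, ys, b0, b1)
s2 = sse / (len(xs) - 2)  # estimate of sigma^2 with n - 2 df
print(b0, b1, sse, sst, r2, s2)
```

For the example data this gives $\hat{\beta}_0 = -1$, $\hat{\beta}_1 = 2.5$, SSE = 1.5, and SST = 14, so $r^2 = 1 - 1.5/14 \approx 0.893$. The computational formula agrees: $62 - (-1)(12) - 2.5(29) = 1.5$.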
Regression Sum of Squares
$$\mathrm{SSR} = \mathrm{SST} - \mathrm{SSE}$$
The regression sum of squares is interpreted as the amount of variation that is explained by the model. We have
$$r^2 = \frac{\mathrm{SSR}}{\mathrm{SST}}$$

12.3 Inferences About the Slope Parameter $\beta_1$

$\hat{\beta}_1$
1. The mean of $\hat{\beta}_1$ is $E(\hat{\beta}_1) = \mu_{\hat{\beta}_1} = \beta_1$.
2. The variance and standard deviation of $\hat{\beta}_1$ are
$$V(\hat{\beta}_1) = \sigma^2_{\hat{\beta}_1} = \frac{\sigma^2}{S_{xx}}, \qquad \sigma_{\hat{\beta}_1} = \frac{\sigma}{\sqrt{S_{xx}}}$$
where $S_{xx} = \sum (x_i - \bar{x})^2$. Replacing $\sigma$ by its estimate s gives the estimated standard deviation
$$s_{\hat{\beta}_1} = \frac{s}{\sqrt{S_{xx}}}$$
3. $\hat{\beta}_1$ has a normal distribution.

T Variable
The assumptions of the simple linear regression model imply that the standardized variable
$$T = \frac{\hat{\beta}_1 - \beta_1}{S / \sqrt{S_{xx}}} = \frac{\hat{\beta}_1 - \beta_1}{S_{\hat{\beta}_1}}$$
has a t distribution with n - 2 df.

Confidence Interval
A $100(1 - \alpha)\%$ CI for the slope $\beta_1$ of the true regression line is
$$\hat{\beta}_1 \pm t_{\alpha/2,\, n-2} \cdot s_{\hat{\beta}_1}$$

Hypothesis-Testing Procedures
Null hypothesis: $H_0: \beta_1 = \beta_{10}$
Test statistic value: $t = \dfrac{\hat{\beta}_1 - \beta_{10}}{s_{\hat{\beta}_1}}$

Alternative hypothesis            Rejection region for level $\alpha$ test
$H_a: \beta_1 > \beta_{10}$       $t \ge t_{\alpha,\, n-2}$
$H_a: \beta_1 < \beta_{10}$       $t \le -t_{\alpha,\, n-2}$
$H_a: \beta_1 \ne \beta_{10}$     $t \ge t_{\alpha/2,\, n-2}$ or $t \le -t_{\alpha/2,\, n-2}$

A P-value based on n - 2 df can be calculated as in Chapters 8 and 9.

The model utility test is the test of $H_0: \beta_1 = 0$ versus $H_a: \beta_1 \ne 0$, in which case the test statistic value is the ratio $t = \hat{\beta}_1 / s_{\hat{\beta}_1}$.

ANOVA Table

Source of variation   df      Sum of squares   Mean square                        f
Regression            1       SSR              SSR                                $\mathrm{SSR} / [\mathrm{SSE}/(n-2)]$
Error                 n - 2   SSE              $s^2 = \mathrm{SSE}/(n-2)$
Total                 n - 1   SST
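As an illustration of the slope inferences in this section, here is a short Python sketch. It is not from the slides: the function name slope_inference is hypothetical, and the critical value t_crit is assumed to be looked up in a t table by the caller, since the Python standard library has no t quantile function.

```python
import math


def slope_inference(xs, ys, t_crit, beta10=0.0):
    """Return (b1, se_b1, t_stat, f_stat, ci) for the slope of the fitted line.

    t_crit is the tabulated critical value t_{alpha/2, n-2} (supplied by the caller).
    beta10 is the null value of the slope; 0 gives the model utility test.
    """
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    sst = sum((y - y_bar) ** 2 for y in ys)
    s = math.sqrt(sse / (n - 2))        # estimate of sigma
    se_b1 = s / math.sqrt(s_xx)         # estimated sd of the slope estimator
    t_stat = (b1 - beta10) / se_b1      # test statistic for H0: beta1 = beta10
    f_stat = (sst - sse) / (sse / (n - 2))  # ANOVA f = SSR / [SSE/(n-2)]
    ci = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)
    return b1, se_b1, t_stat, f_stat, ci


# Example data from the slides; 12.706 is the tabulated t_{.025, 1}
b1, se_b1, t_stat, f_stat, ci = slope_inference([1, 2, 3], [2, 3, 7], t_crit=12.706)
print(b1, se_b1, t_stat, f_stat, ci)
```

With the three example points, $s_{\hat{\beta}_1} = \sqrt{1.5/2} \approx 0.866$ and $t \approx 2.89$, which is well below 12.706, so the model utility test does not reject $H_0$ at level .05 on only 1 df. The ANOVA f value equals $t^2$ when $\beta_{10} = 0$, which gives a quick consistency check.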
12.4 Inferences Concerning $\mu_{Y \cdot x^*}$ and the Prediction of Future Y Values

$\hat{Y}$
Let $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 x^*$, where $x^*$ is some fixed value of x.
1. The mean of $\hat{Y}$ is
$$E(\hat{Y}) = E(\hat{\beta}_0 + \hat{\beta}_1 x^*) = \mu_{\hat{\beta}_0 + \hat{\beta}_1 x^*} = \beta_0 + \beta_1 x^*$$
2. The variance and standard deviation of $\hat{Y}$ are
$$V(\hat{Y}) = \sigma^2_{\hat{Y}} = \sigma^2 \left[ \frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}} \right]$$
and the estimated standard deviation is
$$s_{\hat{Y}} = s \sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}}}$$
3. $\hat{Y}$ has a normal distribution.

T Variable
The variable
$$T = \frac{\hat{\beta}_0 + \hat{\beta}_1 x^* - (\beta_0 + \beta_1 x^*)}{S_{\hat{\beta}_0 + \hat{\beta}_1 x^*}} = \frac{\hat{Y} - (\beta_0 + \beta_1 x^*)}{S_{\hat{Y}}}$$
has a t distribution with n - 2 df.

Confidence Interval
A $100(1 - \alpha)\%$ CI for $\mu_{Y \cdot x^*}$, the expected value of Y when $x = x^*$, is
$$\hat{\beta}_0 + \hat{\beta}_1 x^* \pm t_{\alpha/2,\, n-2} \cdot s_{\hat{\beta}_0 + \hat{\beta}_1 x^*} = \hat{y} \pm t_{\alpha/2,\, n-2} \cdot s_{\hat{Y}}$$

Prediction Interval
A future value of Y is not a parameter but instead a random variable; its interval of plausible values is referred to as a prediction interval.

A $100(1 - \alpha)\%$ PI for a future Y observation to be made when $x = x^*$ is
$$\hat{\beta}_0 + \hat{\beta}_1 x^* \pm t_{\alpha/2,\, n-2} \cdot s \sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}}} = \hat{y} \pm t_{\alpha/2,\, n-2} \cdot \sqrt{s^2 + s^2_{\hat{Y}}}$$

12.5 Correlation

Sample Correlation Coefficient
The sample correlation coefficient, denoted r, of n pairs $(x_1, y_1), \ldots, (x_n, y_n)$ is
$$r = \frac{S_{xy}}{\sqrt{\sum (x_i - \bar{x})^2} \sqrt{\sum (y_i - \bar{y})^2}} = \frac{S_{xy}}{\sqrt{S_{xx}} \sqrt{S_{yy}}}$$

Ex. Find the sample correlation coefficient for the points (1, 2), (2, 3), (3, 7).
$$r = \frac{n \sum xy - \sum x \sum y}{\sqrt{n \sum x^2 - \left(\sum x\right)^2} \sqrt{n \sum y^2 - \left(\sum y\right)^2}} = \frac{3(29) - (6)(12)}{\sqrt{3(14) - 6^2} \sqrt{3(62) - 12^2}} = 0.9449$$

Properties of r
Important properties of r:
1.
The value of r does not depend on which of the two variables under study is labeled x and which is labeled y.
2. The value of r is independent of the units in which x and y are measured.
3. $-1 \le r \le 1$
4. r = 1 iff all $(x_i, y_i)$ pairs lie on a straight line with positive slope, and r = -1 iff all $(x_i, y_i)$ pairs lie on a straight line with negative slope.
5. The square of the sample correlation coefficient gives the value of the coefficient of determination that would result from fitting the simple linear regression model.

Different Values of r
[Figure: four scatterplots: r near 1; r near -1; r near 0 with no relationship; r near 0 with a nonlinear relationship.]

The Population Correlation Coefficient
$$\rho = \rho(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y}$$
where
$$\mathrm{Cov}(X, Y) = \sum_x \sum_y (x - \mu_X)(y - \mu_Y) f(x, y) \quad \text{or} \quad \mathrm{Cov}(X, Y) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} (x - \mu_X)(y - \mu_Y) f(x, y)\, dx\, dy$$
depending on whether (X, Y) is discrete or continuous.

Estimator
$$\hat{\rho} = R = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum (X_i - \bar{X})^2} \sqrt{\sum (Y_i - \bar{Y})^2}}$$

Assumption
The joint probability distribution of (X, Y) is specified by
$$f(x, y) = \frac{1}{2\pi \sigma_1 \sigma_2 \sqrt{1 - \rho^2}} \exp\left\{ -\frac{\left(\frac{x - \mu_1}{\sigma_1}\right)^2 - 2\rho \frac{(x - \mu_1)(y - \mu_2)}{\sigma_1 \sigma_2} + \left(\frac{y - \mu_2}{\sigma_2}\right)^2}{2(1 - \rho^2)} \right\}, \quad -\infty < x < \infty,\ -\infty < y < \infty$$
f(x, y) is called the bivariate normal probability distribution.

Testing for the Absence of Correlation
When $H_0: \rho = 0$ is true, the test statistic
$$T = \frac{R \sqrt{n - 2}}{\sqrt{1 - R^2}}$$
has a t distribution with n - 2 df.

Alternative hypothesis   Rejection region for level $\alpha$ test
$H_a: \rho > 0$          $t \ge t_{\alpha,\, n-2}$
$H_a: \rho < 0$          $t \le -t_{\alpha,\, n-2}$
$H_a: \rho \ne 0$        $t \ge t_{\alpha/2,\, n-2}$ or $t \le -t_{\alpha/2,\, n-2}$

A P-value based on n - 2 df can be calculated as described previously.
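To make the correlation formulas concrete, the sketch below computes the sample correlation coefficient with the summation form used in the example and then the test statistic for $H_0: \rho = 0$. It is a Python illustration only; the function names sample_correlation and correlation_t_statistic are hypothetical and do not come from the slides.

```python
import math


def sample_correlation(xs, ys):
    """Sample correlation r from the summation (computational) formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = (math.sqrt(n * sum_x2 - sum_x ** 2)
                   * math.sqrt(n * sum_y2 - sum_y ** 2))
    return numerator / denominator


def correlation_t_statistic(r, n):
    """Test statistic r*sqrt(n-2)/sqrt(1-r^2) for H0: rho = 0 (n - 2 df)."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)


# Example data from the slides: (1, 2), (2, 3), (3, 7)
xs = [1, 2, 3]
ys = [2, 3, 7]
r = sample_correlation(xs, ys)
t = correlation_t_statistic(r, len(xs))
print(round(r, 4), round(t, 4))
```

For the example points, $r = 15/\sqrt{252} \approx 0.9449$, matching the worked example, and $r^2 \approx 0.893$ agrees with property 5. Swapping xs and ys leaves r unchanged, which is property 1.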
Other Inferences Concerning $\rho$
When $(X_1, Y_1), \ldots, (X_n, Y_n)$ is a sample from a bivariate normal distribution, the rv
$$V = \frac{1}{2} \ln\left( \frac{1 + R}{1 - R} \right)$$
has approximately a normal distribution with mean and variance
$$\mu_V = \frac{1}{2} \ln\left( \frac{1 + \rho}{1 - \rho} \right), \qquad \sigma^2_V = \frac{1}{n - 3}$$

The test statistic for testing $H_0: \rho = \rho_0$ is
$$Z = \frac{V - \frac{1}{2} \ln\left[ (1 + \rho_0)/(1 - \rho_0) \right]}{1/\sqrt{n - 3}}$$

Alternative hypothesis   Rejection region for level $\alpha$ test
$H_a: \rho > \rho_0$     $z \ge z_{\alpha}$
$H_a: \rho < \rho_0$     $z \le -z_{\alpha}$
$H_a: \rho \ne \rho_0$   $z \ge z_{\alpha/2}$ or $z \le -z_{\alpha/2}$

CI for $\rho$
A $100(1 - \alpha)\%$ CI for $\rho$ is
$$\left( \frac{e^{2c_1} - 1}{e^{2c_1} + 1},\ \frac{e^{2c_2} - 1}{e^{2c_2} + 1} \right)$$
where $c_1$ and $c_2$ are the left and right endpoints of the CI for $\mu_V$:
$$\left( v - \frac{z_{\alpha/2}}{\sqrt{n - 3}},\ v + \frac{z_{\alpha/2}}{\sqrt{n - 3}} \right)$$
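The transformation-based procedures above can be sketched in Python as follows. This is an illustration under stated assumptions rather than anything given in the slides: the function names fisher_z_statistic and fisher_ci are hypothetical, the inputs r = 0.6 and n = 28 are made-up values chosen only to exercise the code, and the normal critical value comes from the standard library's statistics.NormalDist. The procedure needs n > 3, so the three-point example used earlier cannot be reused here.

```python
import math
from statistics import NormalDist


def fisher_transform(r):
    """V = (1/2) ln[(1 + r)/(1 - r)]."""
    return 0.5 * math.log((1 + r) / (1 - r))


def fisher_z_statistic(r, n, rho0):
    """Z statistic for H0: rho = rho0, using sd 1/sqrt(n - 3)."""
    return (fisher_transform(r) - fisher_transform(rho0)) * math.sqrt(n - 3)


def fisher_ci(r, n, alpha=0.05):
    """Approximate 100(1 - alpha)% CI for rho."""
    v = fisher_transform(r)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # z_{alpha/2}
    half_width = z_crit / math.sqrt(n - 3)
    c1, c2 = v - half_width, v + half_width       # CI for mu_V

    def back_transform(c):
        # (e^{2c} - 1)/(e^{2c} + 1), which is the same as tanh(c)
        return (math.exp(2 * c) - 1) / (math.exp(2 * c) + 1)

    return back_transform(c1), back_transform(c2)


# Hypothetical inputs, for illustration only
lower, upper = fisher_ci(r=0.6, n=28)
z = fisher_z_statistic(r=0.6, n=28, rho0=0.0)
print(lower, upper, z)
```

Because the back-transformation is increasing in c, the resulting interval always contains r and stays inside (-1, 1), although it is not symmetric about r.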