Chapter 12
Simple Linear Regression and Correlation
Copyright (c) 2004 Brooks/Cole, a division of Thomson Learning, Inc.

12.1 The Simple Linear Regression Model

Linear Relationship
The simplest deterministic mathematical relationship between two variables $x$ and $y$ is a linear relationship $y = \beta_0 + \beta_1 x$. The set of pairs $(x, y)$ for which $y = \beta_0 + \beta_1 x$ determines a straight line.

Terminology
The variable whose value is fixed by the experimenter, denoted $x$, is the independent (predictor, explanatory) variable. For a fixed $x$, the second variable will be a random variable $Y$ with observed value $y$, referred to as the dependent (response) variable.

The Simple Linear Regression Model
There exist parameters $\beta_0$, $\beta_1$, and $\sigma^2$ such that for any fixed value of $x$, the dependent variable is related to $x$ through the model equation
$$Y = \beta_0 + \beta_1 x + \varepsilon$$
where $\varepsilon$ is a random variable (called the random deviation) with $E(\varepsilon) = 0$ and $V(\varepsilon) = \sigma^2$.

Linear Regression Model
[Figure: an observed point $(x_1, y_1)$ and its deviation $\varepsilon_1$ from the true regression line $y = \beta_0 + \beta_1 x$ at $x = x_1$.]

Distribution of $\varepsilon$
[Figure: density of $\varepsilon$, a normal curve with mean 0 and standard deviation $\sigma$.]

Distribution of Y for Different Values of x
[Figure: normal densities of $Y$ centered at $\beta_0 + \beta_1 x_1$, $\beta_0 + \beta_1 x_2$, and $\beta_0 + \beta_1 x_3$ on the line $y = \beta_0 + \beta_1 x$, drawn above $x_1$, $x_2$, $x_3$.]

12.2 Estimating Model Parameters
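
Before the parameters are estimated, it may help to see what the model of Section 12.1 says about the data. The following is a minimal sketch, not part of the original slides: it draws many values of Y at one fixed x under the assumption that the random deviation is normal with mean 0 and standard deviation sigma. The parameter values, the fixed x, the seed, and the name simulate_y are all hypothetical choices made for illustration.

```python
import random


def simulate_y(x, beta0, beta1, sigma, rng):
    """Draw one Y from Y = beta0 + beta1*x + eps, with eps ~ N(0, sigma^2)."""
    eps = rng.gauss(0.0, sigma)  # the random deviation
    return beta0 + beta1 * x + eps


rng = random.Random(12)                  # hypothetical seed, for repeatability
beta0, beta1, sigma = -1.0, 2.5, 1.0     # hypothetical parameter values
x_fixed = 2.0                            # hypothetical fixed value of x

draws = [simulate_y(x_fixed, beta0, beta1, sigma, rng) for _ in range(20000)]
mean_y = sum(draws) / len(draws)
var_y = sum((y - mean_y) ** 2 for y in draws) / (len(draws) - 1)

# The sample mean should sit near beta0 + beta1*x_fixed and the sample
# variance near sigma^2, as the model equation implies.
print(mean_y, var_y)
```

With a large number of draws, the sample mean and variance approximate the mean $\beta_0 + \beta_1 x$ and the variance $\sigma^2$ of Y at the chosen x.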

Principle of Least Squares
The vertical deviation of the point $(x_i, y_i)$ from the line $y = b_0 + b_1 x$ is
$$y_i - (b_0 + b_1 x_i)$$
The sum of squared vertical deviations from the points $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ to the line is
$$f(b_0, b_1) = \sum_{i=1}^{n} \left[ y_i - (b_0 + b_1 x_i) \right]^2$$

Principle of Least Squares
The least-squares (regression) line for the data is given by $y = \hat\beta_0 + \hat\beta_1 x$, where
$$b_1 = \hat\beta_1 = \frac{\sum x_i y_i - \left(\sum x_i\right)\left(\sum y_i\right)/n}{\sum x_i^2 - \left(\sum x_i\right)^2/n}$$
and
$$b_0 = \hat\beta_0 = \frac{\sum y_i - \hat\beta_1 \sum x_i}{n} = \bar{y} - \hat\beta_1 \bar{x}$$

Ex. Find the equation of the least-squares line for the data (1, 2), (2, 3), (3, 7).

        x     y    xy   x^2
        1     2     2     1
        2     3     6     4
        3     7    21     9
Sum:    6    12    29    14

$$\hat\beta_1 = \frac{3(29) - (6)(12)}{3(14) - (6)^2} = 2.5$$
$$\hat\beta_0 = \frac{12 - 2.5(6)}{3} = -1$$
$$y = -1 + 2.5x$$

Fitted Values and Residuals
The fitted (predicted) values $\hat{y}_1, \ldots, \hat{y}_n$ are obtained by substituting $x_1, \ldots, x_n$ into the equation of the estimated regression line: $\hat{y}_1 = \hat\beta_0 + \hat\beta_1 x_1, \ldots, \hat{y}_n = \hat\beta_0 + \hat\beta_1 x_n$. The residuals are the vertical deviations $y_1 - \hat{y}_1, \ldots, y_n - \hat{y}_n$ from the estimated line.

Error Sum of Squares
The error sum of squares, denoted SSE, is
$$\text{SSE} = \sum (y_i - \hat{y}_i)^2 = \sum \left[ y_i - (\hat\beta_0 + \hat\beta_1 x_i) \right]^2$$
and the estimate of $\sigma^2$ is
$$\hat\sigma^2 = s^2 = \frac{\text{SSE}}{n - 2} = \frac{\sum (y_i - \hat{y}_i)^2}{n - 2}$$

Computational Formula
A computational formula for the SSE is
$$\text{SSE} = \sum y_i^2 - \hat\beta_0 \sum y_i - \hat\beta_1 \sum x_i y_i$$

Total Sum of Squares
The total sum of squares, denoted SST, is
$$\text{SST} = S_{yy} = \sum (y_i - \bar{y})^2 = \sum y_i^2 - \left(\sum y_i\right)^2/n$$
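
The formulas above can be checked on the worked example with a short sketch. The function name least_squares and the variable names are illustrative, not from the slides; the data are the three points (1, 2), (2, 3), (3, 7) used in the example.

```python
def least_squares(xs, ys):
    """Return (b0, b1) for the least-squares line, using the sum formulas."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b1 = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
    b0 = (sum_y - b1 * sum_x) / n
    return b0, b1


xs = [1, 2, 3]
ys = [2, 3, 7]
b0, b1 = least_squares(xs, ys)

# Fitted values and residuals (vertical deviations from the estimated line).
fitted = [b0 + b1 * x for x in xs]
residuals = [y - f for y, f in zip(ys, fitted)]

# SSE from its definition and from the computational formula.
sse = sum(e * e for e in residuals)
sse_short = (sum(y * y for y in ys)
             - b0 * sum(ys)
             - b1 * sum(x * y for x, y in zip(xs, ys)))

# SST = Syy, and the estimate s^2 of sigma^2 with n - 2 degrees of freedom.
sst = sum(y * y for y in ys) - sum(ys) ** 2 / len(ys)
s2 = sse / (len(xs) - 2)

print(b0, b1, fitted, residuals, sse, sse_short, sst, s2)
```

For these data the sketch reproduces the line $y = -1 + 2.5x$, and both SSE formulas agree.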

Coefficient of Determination
The coefficient of determination, denoted by $r^2$, is given by
$$r^2 = 1 - \frac{\text{SSE}}{\text{SST}}$$
It is interpreted as the proportion of observed $y$ variation that can be explained by the simple linear regression model.

Regression Sum of Squares
$$\text{SSR} = \text{SST} - \text{SSE}$$
The regression sum of squares is interpreted as the amount of variation that is explained by the model. We have
$$r^2 = \frac{\text{SSR}}{\text{SST}}$$

12.3 Inferences About the Slope Parameter $\beta_1$

$\hat\beta_1$
1. The mean of $\hat\beta_1$ is $E(\hat\beta_1) = \mu_{\hat\beta_1} = \beta_1$.
2. The variance and standard deviation are
$$V(\hat\beta_1) = \sigma^2_{\hat\beta_1} = \frac{\sigma^2}{S_{xx}}, \qquad \sigma_{\hat\beta_1} = \frac{\sigma}{\sqrt{S_{xx}}}$$
Replacing $\sigma$ by its estimate $s$ gives the estimated standard deviation
$$s_{\hat\beta_1} = \frac{s}{\sqrt{S_{xx}}}$$
3. $\hat\beta_1$ has a normal distribution.

T Variable
The assumptions of the simple linear regression model imply that the standardized variable
$$T = \frac{\hat\beta_1 - \beta_1}{S/\sqrt{S_{xx}}} = \frac{\hat\beta_1 - \beta_1}{S_{\hat\beta_1}}$$
has a t distribution with $n - 2$ df.

Confidence Interval
A $100(1 - \alpha)\%$ CI for the slope $\beta_1$ of the true regression line is
$$\hat\beta_1 \pm t_{\alpha/2,\, n-2} \cdot s_{\hat\beta_1}$$

Hypothesis-Testing Procedures
Null hypothesis: $H_0: \beta_1 = \beta_{10}$
Test statistic value:
$$t = \frac{\hat\beta_1 - \beta_{10}}{s_{\hat\beta_1}}$$

Hypothesis-Testing Procedures

Alternative Hypothesis           Rejection Region for Approx. Level $\alpha$ Test
$H_a: \beta_1 > \beta_{10}$      $t \ge t_{\alpha,\, n-2}$
$H_a: \beta_1 < \beta_{10}$      $t \le -t_{\alpha,\, n-2}$
$H_a: \beta_1 \ne \beta_{10}$    either $t \ge t_{\alpha/2,\, n-2}$ or $t \le -t_{\alpha/2,\, n-2}$

A P-value based on $n - 2$ df can be calculated as in Chapters 8 and 9.
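
The quantities in this section can be sketched numerically for the earlier example data (1, 2), (2, 3), (3, 7). This is an illustrative sketch, not the slides' own code: the function name slope_inference is hypothetical, and the critical value is passed in by the caller rather than computed. Here it is $t_{0.025,\,1} = 12.706$, read from a t table for a 95% interval with $n - 2 = 1$ df.

```python
import math


def slope_inference(xs, ys, t_crit, beta10=0.0):
    """Return r^2, s_b1, the t statistic for H0: beta1 = beta10, and a CI."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sst = sum((y - y_bar) ** 2 for y in ys)          # SST = Syy

    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

    r2 = 1 - sse / sst                               # coefficient of determination
    s = math.sqrt(sse / (n - 2))                     # estimate of sigma
    s_b1 = s / math.sqrt(sxx)                        # estimated sd of beta1-hat
    t_stat = (b1 - beta10) / s_b1
    ci = (b1 - t_crit * s_b1, b1 + t_crit * s_b1)
    return {"r2": r2, "s_b1": s_b1, "t": t_stat, "ci": ci, "b1": b1}


# t_crit = 12.706 is the tabled t critical value for alpha/2 = 0.025, 1 df.
result = slope_inference([1, 2, 3], [2, 3, 7], t_crit=12.706)
print(result)
```

With only three points the interval is very wide and includes 0, so at the 5% level these data alone do not let us reject $H_0: \beta_1 = 0$ even though $r^2$ is close to 0.89.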

Hypothesis-Testing
The model utility test is the test of $H_0: \beta_1 = 0$ versus $H_a: \beta_1 \ne 0$, in which case the test statistic value is the ratio
$$t = \frac{\hat\beta_1}{s_{\hat\beta_1}}$$

ANOVA Table

Source of Variation   df      Sum of Squares   Mean Square           f
Regression            1       SSR              SSR                   SSR / [SSE/(n - 2)]
Error                 n - 2   SSE              s^2 = SSE/(n - 2)
Total                 n - 1   SST

12.4 Inferences Concerning $\mu_{Y \cdot x^*}$ and the Prediction of Future Y Values

$\hat{Y}$
Let $\hat{Y} = \hat\beta_0 + \hat\beta_1 x^*$, where $x^*$ is some fixed value of $x$.
1. The mean of $\hat{Y}$ is
$$E(\hat{Y}) = \mu_{\hat\beta_0 + \hat\beta_1 x^*} = \beta_0 + \beta_1 x^*$$
2. Variance and standard deviation:
$$V(\hat{Y}) = \sigma^2_{\hat{Y}} = \sigma^2 \left[ \frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}} \right]$$
$$s_{\hat{Y}} = s \sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}}}$$
3. $\hat{Y}$ has a normal distribution.

T Variable
The variable
$$T = \frac{\hat\beta_0 + \hat\beta_1 x^* - (\beta_0 + \beta_1 x^*)}{S_{\hat\beta_0 + \hat\beta_1 x^*}} = \frac{\hat{Y} - (\beta_0 + \beta_1 x^*)}{S_{\hat{Y}}}$$
has a t distribution with $n - 2$ df.

Confidence Interval
A $100(1 - \alpha)\%$ CI for $\mu_{Y \cdot x^*}$, the expected value of $Y$ when $x = x^*$, is
$$\hat\beta_0 + \hat\beta_1 x^* \pm t_{\alpha/2,\, n-2} \cdot s_{\hat\beta_0 + \hat\beta_1 x^*} = \hat{y} \pm t_{\alpha/2,\, n-2} \cdot s_{\hat{Y}}$$

Prediction Interval
A future value of $Y$ is not a parameter but instead a random variable; its interval of plausible values is referred to as a prediction interval.

Prediction Interval
A $100(1 - \alpha)\%$ PI for a future $Y$ observation to be made when $x = x^*$ is
$$\hat\beta_0 + \hat\beta_1 x^* \pm t_{\alpha/2,\, n-2} \cdot s \sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}}} = \hat{y} \pm t_{\alpha/2,\, n-2} \cdot \sqrt{s^2 + s^2_{\hat{Y}}}$$
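
The difference between the confidence interval for the mean response and the prediction interval can be seen in a short sketch. It reuses the earlier example data (1, 2), (2, 3), (3, 7); the value $x^* = 2.5$, the function name mean_ci_and_pi, and the tabled critical value $t_{0.025,\,1} = 12.706$ supplied by the caller are assumptions made for illustration.

```python
import math


def mean_ci_and_pi(xs, ys, x_star, t_crit):
    """Return y_hat, s_yhat, the PI standard error, the CI, and the PI at x_star."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    s2 = sse / (n - 2)                               # estimate of sigma^2

    y_hat = b0 + b1 * x_star
    # Estimated sd of Y-hat, used for the CI on the mean response.
    s_yhat = math.sqrt(s2 * (1 / n + (x_star - x_bar) ** 2 / sxx))
    # The PI adds the variance of the future observation itself.
    s_pred = math.sqrt(s2 + s_yhat ** 2)

    ci = (y_hat - t_crit * s_yhat, y_hat + t_crit * s_yhat)
    pi = (y_hat - t_crit * s_pred, y_hat + t_crit * s_pred)
    return y_hat, s_yhat, s_pred, ci, pi


y_hat, s_yhat, s_pred, ci, pi = mean_ci_and_pi(
    [1, 2, 3], [2, 3, 7], x_star=2.5, t_crit=12.706
)
print(y_hat, s_yhat, s_pred, ci, pi)
```

Both intervals are centered at $\hat{y}$, but the prediction interval is always wider because its standard error includes $s^2$ in addition to $s^2_{\hat{Y}}$.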

12.5 Correlation

Sample Correlation Coefficient
The sample correlation coefficient, denoted $r$, of $n$ pairs $(x_1, y_1), \ldots, (x_n, y_n)$ is
$$r = \frac{S_{xy}}{\sqrt{\sum (x_i - \bar{x})^2} \sqrt{\sum (y_i - \bar{y})^2}} = \frac{S_{xy}}{\sqrt{S_{xx}} \sqrt{S_{yy}}}$$

Ex. Find the correlation coefficient for the least-squares line from the points (1, 2), (2, 3), (3, 7).
$$r = \frac{n \sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{n \sum x^2 - \left(\sum x\right)^2} \sqrt{n \sum y^2 - \left(\sum y\right)^2}} = \frac{3(29) - (6)(12)}{\sqrt{3(14) - (6)^2} \sqrt{3(62) - (12)^2}} = 0.9449$$

Properties of r
Important properties of $r$:
1. The value of $r$ does not depend on which of the two variables under study is labeled $x$ and which is labeled $y$.
2. The value of $r$ is independent of the units in which $x$ and $y$ are measured.
3. $-1 \le r \le 1$
4. $r = 1$ iff all $(x_i, y_i)$ pairs lie on a straight line with positive slope, and $r = -1$ iff all $(x_i, y_i)$ pairs lie on a straight line with negative slope.
5. The square of the sample correlation coefficient gives the value of the coefficient of determination that would result from fitting the simple linear regression model.

Different Values of r
[Figure: four scatter plots, labeled "r near 1", "r near 0, no relationship", "r near -1", and "r near 0, nonlinear relationship".]

The Population Correlation Coefficient
$$\rho = \rho(X, Y) = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y}$$
where
$$\text{Cov}(X, Y) = \sum_x \sum_y (x - \mu_X)(y - \mu_Y) f(x, y) \quad \text{or} \quad \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} (x - \mu_X)(y - \mu_Y) f(x, y)\, dx\, dy$$
depending on whether $(X, Y)$ is discrete or continuous.

Estimator
$$\hat\rho = R = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum (X_i - \bar{X})^2} \sqrt{\sum (Y_i - \bar{Y})^2}}$$
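
The worked example and three of the listed properties of r can be checked with a short sketch. The function name sample_correlation and the rescaling constants are hypothetical; the data are the example points (1, 2), (2, 3), (3, 7).

```python
import math


def sample_correlation(xs, ys):
    """Return r = Sxy / (sqrt(Sxx) * sqrt(Syy))."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / (math.sqrt(sxx) * math.sqrt(syy))


xs = [1, 2, 3]
ys = [2, 3, 7]
r = sample_correlation(xs, ys)

# Property 1: swapping the roles of x and y leaves r unchanged.
r_swapped = sample_correlation(ys, xs)

# Property 2: changing units (positive scale factor plus shift) leaves r unchanged.
xs_rescaled = [2.54 * x + 10 for x in xs]    # hypothetical unit change
ys_rescaled = [1000 * y for y in ys]         # hypothetical unit change
r_rescaled = sample_correlation(xs_rescaled, ys_rescaled)

# Property 5: r^2 equals 1 - SSE/SST from the fitted line y = -1 + 2.5x.
sse = sum((y - (-1 + 2.5 * x)) ** 2 for x, y in zip(xs, ys))
sst = sum((y - sum(ys) / len(ys)) ** 2 for y in ys)
r2_from_fit = 1 - sse / sst

print(round(r, 4), r_swapped, r_rescaled, r ** 2, r2_from_fit)
```

Rounded to four decimals, r matches the 0.9449 obtained in the example.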

Assumption
The joint probability distribution of $(X, Y)$ is specified by
$$f(x, y) = \frac{1}{2\pi \sigma_1 \sigma_2 \sqrt{1 - \rho^2}} \exp\left\{ -\frac{\left(\frac{x - \mu_1}{\sigma_1}\right)^2 - 2\rho \frac{(x - \mu_1)(y - \mu_2)}{\sigma_1 \sigma_2} + \left(\frac{y - \mu_2}{\sigma_2}\right)^2}{2(1 - \rho^2)} \right\}, \quad -\infty < x < \infty,\ -\infty < y < \infty$$
$f(x, y)$ is called the bivariate normal probability distribution.

Testing for the Absence of Correlation
When $H_0: \rho = 0$ is true, the test statistic
$$T = \frac{R \sqrt{n - 2}}{\sqrt{1 - R^2}}$$
has a t distribution with $n - 2$ df.

Hypothesis-Testing

Alternative Hypothesis   Rejection Region for Approx. Level $\alpha$ Test
$H_a: \rho > 0$          $t \ge t_{\alpha,\, n-2}$
$H_a: \rho < 0$          $t \le -t_{\alpha,\, n-2}$
$H_a: \rho \ne 0$        either $t \ge t_{\alpha/2,\, n-2}$ or $t \le -t_{\alpha/2,\, n-2}$

A P-value based on $n - 2$ df can be calculated as described previously.

Other Inferences Concerning $\rho$
When $(X_1, Y_1), \ldots, (X_n, Y_n)$ is a sample from a bivariate normal distribution, the rv
$$V = \frac{1}{2} \ln\left( \frac{1 + R}{1 - R} \right)$$
has approximately a normal distribution with mean and variance
$$\mu_V = \frac{1}{2} \ln\left( \frac{1 + \rho}{1 - \rho} \right), \qquad \sigma^2_V = \frac{1}{n - 3}$$

The test statistic for testing $H_0: \rho = \rho_0$ is
$$Z = \frac{V - \frac{1}{2} \ln\left[ (1 + \rho_0)/(1 - \rho_0) \right]}{1/\sqrt{n - 3}}$$

Alternative Hypothesis    Rejection Region for Level $\alpha$ Test
$H_a: \rho > \rho_0$      $z \ge z_\alpha$
$H_a: \rho < \rho_0$      $z \le -z_\alpha$
$H_a: \rho \ne \rho_0$    either $z \ge z_{\alpha/2}$ or $z \le -z_{\alpha/2}$

CI for $\rho$
A $100(1 - \alpha)\%$ CI for $\rho$ is
$$\left( \frac{e^{2c_1} - 1}{e^{2c_1} + 1},\ \frac{e^{2c_2} - 1}{e^{2c_2} + 1} \right)$$
where $c_1$ and $c_2$ are the left and right endpoints, respectively, of the $100(1 - \alpha)\%$ CI for $\mu_V$:
$$\left( v - \frac{z_{\alpha/2}}{\sqrt{n - 3}},\ v + \frac{z_{\alpha/2}}{\sqrt{n - 3}} \right)$$
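
The transformation-based test and interval for the population correlation can be sketched as follows. The inputs r = 0.8 and n = 28 are hypothetical values chosen so that $\sqrt{n - 3} = 5$ (the three-point example used earlier cannot be used here because it gives $n - 3 = 0$), and the function names are illustrative. The normal critical value comes from Python's standard-library statistics.NormalDist.

```python
import math
from statistics import NormalDist


def fisher_v(r):
    """V = (1/2) * ln((1 + r) / (1 - r))."""
    return 0.5 * math.log((1 + r) / (1 - r))


def z_statistic(r, n, rho0):
    """Z for H0: rho = rho0, using the approximate normality of V."""
    return (fisher_v(r) - fisher_v(rho0)) / (1 / math.sqrt(n - 3))


def rho_confidence_interval(r, n, alpha=0.05):
    """Approximate 100(1 - alpha)% CI for rho via the interval for mu_V."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)     # z_{alpha/2}
    half_width = z_crit / math.sqrt(n - 3)
    c1 = fisher_v(r) - half_width
    c2 = fisher_v(r) + half_width

    def back(c):
        # Back-transform an endpoint for mu_V to the rho scale.
        return (math.exp(2 * c) - 1) / (math.exp(2 * c) + 1)

    return back(c1), back(c2)


r_obs, n_obs = 0.8, 28                               # hypothetical sample values
z = z_statistic(r_obs, n_obs, rho0=0.5)
lo, hi = rho_confidence_interval(r_obs, n_obs)
print(z, lo, hi)
```

The back-transformation $(e^{2c} - 1)/(e^{2c} + 1)$ equals $\tanh(c)$, so the resulting interval always lies inside $(-1, 1)$ and is not symmetric about r.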
