Multiple Linear Regression
5/24/2017
Instructor: Ron S. Kenett
Email: [email protected]
Course Website: www.kpa.co.il/biostat
Course textbook: MODERN INDUSTRIAL STATISTICS, Kenett and Zacks, Duxbury Press, 1998
(c) 2001, Ron S. Kenett, Ph.D.

Course Syllabus
• Understanding Variability
• Variability in Several Dimensions
• Basic Models of Probability
• Sampling for Estimation of Population Quantities
• Parametric Statistical Inference
• Computer Intensive Techniques
• Multiple Linear Regression
• Statistical Process Control
• Design of Experiments

Simple Linear Regression Model
Probabilistic model: yi = β0 + β1·xi + εi
where
  yi = a value of the dependent variable, y
  xi = a value of the independent variable, x
  β0 = the y-intercept of the regression line
  β1 = the slope of the regression line
  εi = random error, the residual
Deterministic model: ŷi = b0 + b1·xi
where b0 estimates β0, b1 estimates β1, and ŷi is the predicted value of y, in contrast to the actual value of y.

Determining the Least Squares Regression Line
Least squares regression line: ŷ = b0 + b1·x
Slope: b1 = (Σ xi·yi − n·x̄·ȳ) / (Σ xi² − n·x̄²)
y-intercept: b0 = ȳ − b1·x̄

Interval Estimates Using the Regression Model
• Confidence interval for the mean of y: places an upper and lower bound around the point estimate for the average value of y given x.
• Prediction interval for an individual y: places an upper and lower bound around the point estimate for an individual value of y given x.

The Standard Error of the Estimate, s(y,x)
To form interval estimates we need the standard error of the estimate: the standard deviation of the distribution of the data points above and below the regression line, i.e. of the residuals e, the distances between actual and predicted values of y. It is the square root of the MSE given by the ANOVA table:
  s(y,x) = √( Σ (yi − ŷi)² / (n − 2) )
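The least-squares formulas above can be sketched directly in NumPy. This is a minimal illustration, not from the slides; the function and variable names are my own.

```python
import numpy as np

def fit_simple_regression(x, y):
    """Least-squares fit of y = b0 + b1*x, plus the standard error of the estimate."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    # Slope: (sum(x*y) - n*xbar*ybar) / (sum(x^2) - n*xbar^2)
    b1 = (np.sum(x * y) - n * x.mean() * y.mean()) / (np.sum(x**2) - n * x.mean()**2)
    # Intercept: b0 = ybar - b1*xbar
    b0 = y.mean() - b1 * x.mean()
    # Standard error of the estimate: sqrt(SSE / (n - 2))
    resid = y - (b0 + b1 * x)
    s_yx = np.sqrt(np.sum(resid**2) / (n - 2))
    return b0, b1, s_yx

# Data lying exactly on y = 2 + 3x, so the residual error is zero
b0, b1, s = fit_simple_regression([1, 2, 3, 4], [5, 8, 11, 14])
```

On a perfectly linear data set like this one, the fit recovers the intercept 2 and slope 3 with s(y,x) = 0.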
Equations for the Interval Estimates
Confidence interval for the mean of y:
  ŷ ± tα · s(y,x) · √( 1/n + (x_value − x̄)² / (Σ xi² − (Σ xi)²/n) )
Prediction interval for the individual y:
  ŷ ± tα · s(y,x) · √( 1 + 1/n + (x_value − x̄)² / (Σ xi² − (Σ xi)²/n) )

Comparing the Two Intervals
Notice that the confidence interval for the mean is much narrower than the prediction interval for the individual value: there is greater fluctuation among individual values than among group means. Both are centered at the point estimate.
[Figure: number line from 0 to 700, centered at the point estimate ŷ = 431.872; confidence interval (351.8, 511.9); prediction interval (194.1, 669.6).]

Coefficient of Correlation
A measure of the
• Direction of the linear relationship between x and y. If x and y are directly related, r > 0. If x and y are inversely related, r < 0.
• Strength of the linear relationship between x and y. The larger the absolute value of r, the more the value of y depends in a linear way on the value of x.

Coefficient of Determination
A measure of the
• Strength of the linear relationship between x and y. The larger the value of r², the more the value of y depends in a linear way on the value of x.
• Amount of variation in y that is related to variation in x: the ratio of the variation in y explained by the regression model to the total variation in y.

Testing for Linearity
Key argument:
• If the value of y does not change linearly with the value of x, then the mean of y is the best predictor of the actual value of y. This implies ŷ = ȳ is preferable.
• If the value of y does change linearly with the value of x, then the regression model gives a better prediction of y than the mean of y. This implies the regression estimate ŷ = b0 + b1·x is preferable.
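The two interval formulas can be sketched as follows. This is an illustration with made-up data; the critical t value is passed in rather than computed (2.776 is the two-tail value for α = 0.05 at df = n − 2 = 4), and the names are my own.

```python
import numpy as np

def interval_estimates(x, y, x_value, t_crit):
    """Confidence interval for the mean of y and prediction interval for
    an individual y at x_value, following the slide formulas."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    b1 = (np.sum(x * y) - n * x.mean() * y.mean()) / (np.sum(x**2) - n * x.mean()**2)
    b0 = y.mean() - b1 * x.mean()
    resid = y - (b0 + b1 * x)
    s_yx = np.sqrt(np.sum(resid**2) / (n - 2))
    y_hat = b0 + b1 * x_value
    # Shared term: (x_value - xbar)^2 / (sum(x^2) - (sum(x))^2 / n)
    core = (x_value - x.mean())**2 / (np.sum(x**2) - np.sum(x)**2 / n)
    ci_half = t_crit * s_yx * np.sqrt(1 / n + core)      # mean of y
    pi_half = t_crit * s_yx * np.sqrt(1 + 1 / n + core)  # individual y
    return (y_hat - ci_half, y_hat + ci_half), (y_hat - pi_half, y_hat + pi_half)

# t_crit = 2.776 is the critical t for alpha = 0.05, df = 6 - 2 = 4 (illustrative)
ci, pi = interval_estimates([1, 2, 3, 4, 5, 6], [2, 4, 5, 8, 9, 13], 3.5, 2.776)
```

As the comparison slide notes, the prediction interval comes out wider than the confidence interval, and both share the same center.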
Three Tests for Linearity
1. Testing the coefficient of correlation
  H0: ρ = 0 (there is no linear relationship between x and y)
  H1: ρ ≠ 0 (there is a linear relationship between x and y)
  Test statistic: t = r / √( (1 − r²) / (n − 2) )
2. Testing the slope of the regression line
  H0: β1 = 0 (there is no linear relationship between x and y)
  H1: β1 ≠ 0 (there is a linear relationship between x and y)
  Test statistic: t = b1 / ( s(y,x) / √( Σ xi² − n·x̄² ) )
3. The global F-test
  H0: There is no linear relationship between x and y.
  H1: There is a linear relationship between x and y.
  Test statistic: F = MSR / MSE = (SSR / 1) / ( SSE / (n − 2) )
Note: For simple linear regression, the global F-test is equivalent to the t-test on β1. When we conduct regression analysis with multiple variables, the global F-test takes on a distinct role.

A General Test of β1
Testing whether the slope of the population regression line equals a specific value:
  H0: β1 = β10 (the slope of the population regression line is β10)
  H1: β1 ≠ β10 (the slope of the population regression line is not β10)
  Test statistic: t = (b1 − β10) / ( s(y,x) / √( Σ xi² − n·x̄² ) )

The Multiple Regression Model
Probabilistic model: yi = β0 + β1·x1i + β2·x2i + … + βk·xki + εi
where
  yi = a value of the dependent variable, y
  β0 = the y-intercept
  x1i, x2i, …, xki = individual values of the independent variables x1, x2, …, xk
  β1, β2, …, βk = the partial regression coefficients for the independent variables x1, x2, …, xk
  εi = random error, the residual

Sample regression equation: ŷi = b0 + b1·x1i + b2·x2i + … + bk·xki
where
  ŷi = the predicted value of the dependent variable, y, given the values of x1, x2, …, xk
  b0 = the y-intercept
  x1i, x2i, …, xki = individual values of the independent variables x1, x2, …, xk
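The equivalence noted above (the t-test on r, the t-test on b1, and the global F-test all agree in simple regression, with t² = F) can be checked numerically. A minimal sketch with illustrative data:

```python
import numpy as np

def linearity_tests(x, y):
    """t statistics for r and for b1, plus the global F, in simple regression."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    b1 = (np.sum(x * y) - n * x.mean() * y.mean()) / (np.sum(x**2) - n * x.mean()**2)
    b0 = y.mean() - b1 * x.mean()
    y_hat = b0 + b1 * x
    sse = np.sum((y - y_hat)**2)
    sst = np.sum((y - y.mean())**2)
    ssr = sst - sse
    # 1. t-test on the correlation coefficient
    r = np.corrcoef(x, y)[0, 1]
    t_r = r / np.sqrt((1 - r**2) / (n - 2))
    # 2. t-test on the slope
    s_yx = np.sqrt(sse / (n - 2))
    t_b1 = b1 / (s_yx / np.sqrt(np.sum(x**2) - n * x.mean()**2))
    # 3. Global F-test
    F = (ssr / 1) / (sse / (n - 2))
    return t_r, t_b1, F

t_r, t_b1, F = linearity_tests([1, 2, 3, 4, 5], [2, 3, 5, 6, 9])
```

All three statistics tell the same story here: t_r equals t_b1, and F equals their square.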
  b1, b2, …, bk = the partial regression coefficients for the independent variables x1, x2, …, xk

The Amount of Scatter in the Data
The multiple standard error of the estimate,
  se = √( Σ (yi − ŷi)² / (n − k − 1) )
where
  yi = each observed value of y in the data set
  ŷi = the value of y that would have been estimated from the regression equation
  n = the number of data values in the set
  k = the number of independent (x) variables
measures the dispersion of the data points around the regression hyperplane.

Approximating a Confidence Interval for a Mean of y
A reasonable estimate for interval bounds on the conditional mean of y, given a set of x values, is:
  ŷ ± t · se / √n
where
  ŷ = the estimated value of y based on the set of x values provided
  t = critical t value for (1 − α) confidence, df = n − k − 1
  se = the multiple standard error of the estimate

Approximating a Prediction Interval for an Individual y Value
A reasonable estimate for interval bounds on an individual y value, given a set of x values, is:
  ŷ ± t · se
with ŷ, t, and se as above.

Coefficient of Multiple Determination
The proportion of variance in y that is explained by the multiple regression equation:
  R² = SSR / SST = 1 − SSE / SST = 1 − Σ (yi − ŷi)² / Σ (yi − ȳ)²

Coefficients of Partial Determination
For each independent variable, the coefficient of partial determination denotes the proportion of the total variation in y that is explained by that one independent variable alone, holding the values of all other independent variables constant. These coefficients are reported on computer printouts.
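A multiple regression fit, together with se and R² as defined above, can be sketched with NumPy's least-squares solver. The data and names are illustrative, not from the slides:

```python
import numpy as np

def multiple_regression(X, y):
    """Least-squares fit of y = b0 + b1*x1 + ... + bk*xk, with se and R^2."""
    X, y = np.asarray(X, float), np.asarray(y, float)
    n, k = X.shape
    A = np.column_stack([np.ones(n), X])       # prepend a column of 1s for b0
    b, *_ = np.linalg.lstsq(A, y, rcond=None)  # [b0, b1, ..., bk]
    y_hat = A @ b
    sse = np.sum((y - y_hat)**2)
    se = np.sqrt(sse / (n - k - 1))            # multiple standard error of the estimate
    r2 = 1 - sse / np.sum((y - y.mean())**2)   # coefficient of multiple determination
    return b, se, r2

# Two predictors; y = 1 + 2*x1 + 3*x2 exactly, so we expect R^2 = 1 and se = 0
X = np.array([[1, 1], [2, 1], [3, 2], [4, 3], [5, 5], [6, 8]])
y = 1 + 2 * X[:, 0] + 3 * X[:, 1]
b, se, r2 = multiple_regression(X, y)
```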
Testing the Overall Significance of the Multiple Regression Model
Is using the regression equation to predict y better than using the mean of y? The global F-test:
I. Hypotheses
  H0: β1 = β2 = … = βk = 0 (the mean of y does as good a job of predicting the actual values of y as the regression equation)
  H1: At least one βi does not equal 0 (the regression model does a better job of predicting the actual values of y than the mean of y)
II. Rejection region
  Given α, numerator df = k, and denominator df = n − k − 1.
  Decision rule: if F > the critical value, reject H0.
III. Test statistic
  F = (SSR / k) / ( SSE / (n − k − 1) )
  where SSR = SST − SSE, SST = Σ (yi − ȳ)², SSE = Σ (yi − ŷi)².
If H0 is rejected:
• At least one βi differs from zero.
• The regression equation does a better job of predicting the actual values of y than using the mean of y.

Testing the Significance of a Single Regression Coefficient
Is the independent variable xi useful in predicting the actual values of y? The individual t-test:
I. Hypotheses
  H0: βi = 0 (the dependent variable y does not depend on values of the independent variable xi). This can, with reason, be structured as a one-tail test instead.
  H1: βi ≠ 0 (the dependent variable y does change with the values of the independent variable xi)
II. Rejection region
  Given α and df = n − k − 1, with α/2 in each tail.
  Decision rule: if t > the upper critical value or t < the lower critical value, reject H0.
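The global F statistic can be computed from the sums of squares exactly as defined above. A minimal sketch with made-up noisy data (for comparison, standard tables give a critical value of roughly 9.55 at α = 0.05 with df = 2 and 3):

```python
import numpy as np

def global_f(X, y):
    """Global F statistic: F = (SSR/k) / (SSE/(n-k-1))."""
    X, y = np.asarray(X, float), np.asarray(y, float)
    n, k = X.shape
    A = np.column_stack([np.ones(n), X])
    b, *_ = np.linalg.lstsq(A, y, rcond=None)
    sse = np.sum((y - A @ b)**2)        # unexplained variation
    sst = np.sum((y - y.mean())**2)     # total variation
    ssr = sst - sse                     # explained variation
    return (ssr / k) / (sse / (n - k - 1))

# y is roughly 2*x1 + x2 with small noise, so F should be far above any critical value
X = np.array([[1, 2], [2, 1], [3, 4], [4, 3], [5, 7], [6, 5]])
y = np.array([4.1, 4.9, 10.2, 10.8, 17.1, 16.8])
F = global_f(X, y)
```

With a strong linear relationship and little noise, F is large and H0 is rejected.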
III. Test statistic
  t = (bi − 0) / s(bi)
  where
    bi = the estimate of βi from the multiple regression equation
    s(bi) = the standard deviation of bi
If H0 is rejected, the dependent variable y does change with the independent variable xi.
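The slides take s(bi) from the computer printout. One way to compute it, shown here only as a sketch, is the usual OLS estimate that takes s(bi) from the diagonal of se²·(AᵀA)⁻¹; this machinery is not spelled out in the slides, and the data are illustrative.

```python
import numpy as np

def coefficient_t_stats(X, y):
    """Individual t statistic t = bi / s(bi) for each coefficient."""
    X, y = np.asarray(X, float), np.asarray(y, float)
    n, k = X.shape
    A = np.column_stack([np.ones(n), X])
    b, *_ = np.linalg.lstsq(A, y, rcond=None)
    sse = np.sum((y - A @ b)**2)
    se2 = sse / (n - k - 1)                       # squared multiple standard error
    # Standard errors s(bi): sqrt of the diagonal of se^2 * (A'A)^-1
    s_b = np.sqrt(np.diag(se2 * np.linalg.inv(A.T @ A)))
    return b / s_b  # one t per coefficient, intercept included

X = np.array([[1, 2], [2, 1], [3, 4], [4, 3], [5, 7], [6, 5]])
y = np.array([4.1, 4.9, 10.2, 10.8, 17.1, 16.8])
t = coefficient_t_stats(X, y)
```

Each t is then compared against the critical value at df = n − k − 1, exactly as in step II above.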