Chapter 12: Simple Regression

Topics: Visual Displays and Correlation Analysis • Simple Regression • Regression Terminology • Ordinary Least Squares Formulas • Tests for Significance • Analysis of Variance: Overall Fit • Confidence and Prediction Intervals for Y • Violations of Assumptions • Unusual Observations • Other Regression Problems

McGraw-Hill/Irwin. Copyright © 2010 by The McGraw-Hill Companies, Inc. All rights reserved.

Visual Displays and Correlation Analysis (12-2)

Visual Displays
• Scatter plot: displays each observed data pair (xi, yi) as a dot on an X/Y grid (Figure 12.1).

Correlation Analysis (12-3)
• The sample correlation coefficient (r) measures the degree of linearity in the relationship between X and Y, with −1 ≤ r ≤ +1. Values near −1 indicate a strong negative relationship; values near +1 indicate a strong positive relationship; r = 0 indicates no linear relationship.
• To test the hypothesis H0: ρ = 0, the test statistic is

  tcalc = r √[(n − 2)/(1 − r²)]

Tests for Significance (12-4)
• The critical value tα is obtained from Appendix D using ν = n − 2 degrees of freedom for any α.
• Equivalently, you can calculate the critical value for the correlation coefficient itself:

  rcritical = ± tα / √(tα² + n − 2)

• This method gives a direct benchmark against which to compare r.

Simple Regression (12-5)

What is Simple Regression?
• Simple regression analyzes the relationship between two variables.
• It specifies one dependent (response) variable and one independent (predictor) variable.
• The hypothesized relationship may be linear, quadratic, or of some other form.
• The unknown parameters are β0 (intercept) and β1 (slope).
• The assumed model for a linear relationship is yi = β0 + β1xi + εi for all observations (i = 1, 2, …, n).
• The error term εi is not observable; it is assumed normally distributed with mean 0 and standard deviation σ.
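The correlation test above can be sketched in plain Python. This is a minimal illustration, not part of the text: the data, the chosen α, and the table value t.025 = 2.447 (df = 6) are hypothetical, and only the standard library is used.

```python
import math

# Hypothetical sample data (not from the text): x = predictor, y = response
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [52, 55, 61, 60, 68, 70, 75, 78]

n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n

# Sample correlation coefficient r
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
sxx = sum((xi - xbar) ** 2 for xi in x)
syy = sum((yi - ybar) ** 2 for yi in y)
r = sxy / math.sqrt(sxx * syy)

# Test statistic for H0: rho = 0, with n - 2 degrees of freedom
t_calc = r * math.sqrt((n - 2) / (1 - r ** 2))

# Equivalent benchmark for r itself, given the critical t
# (t.025 = 2.447 for df = 6, from a standard t table)
t_crit = 2.447
r_crit = t_crit / math.sqrt(t_crit ** 2 + n - 2)

print(f"r = {r:.3f}, t_calc = {t_calc:.2f}, r_critical = {r_crit:.3f}")
```

Comparing |r| to r_critical and comparing t_calc to t_crit are equivalent decisions: for these data both comfortably reject H0: ρ = 0.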
Regression Terminology (12-6)

Models and Parameters
• The fitted model used to predict the expected value of Y for a given value of X is ŷi = b0 + b1xi.
• The fitted coefficients, which can be computed using formulas or technology, are b0 (the estimated intercept) and b1 (the estimated slope).
• The residual is ei = yi − ŷi. Residuals may be used to estimate σ, the standard deviation of the errors.

Ordinary Least Squares Formulas (12-7)

Slope and Intercept
• The OLS estimator for the slope is

  b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²  or equivalently  b1 = r (sy/sx)

• The OLS estimator for the intercept is

  b0 = ȳ − b1x̄

Coefficient of Determination (12-8)
• R² is a measure of relative fit based on a comparison of the regression sum of squares (SSR) and the total sum of squares (SST):

  R² = SSR/SST = 1 − SSE/SST, with 0 ≤ R² ≤ 1

• Often expressed as a percent, R² = 1 (i.e., 100%) indicates a perfect fit.
• In a simple regression, R² = r².

Tests for Significance (12-9, 12-10)
• Confidence interval for the true slope: b1 ± tα/2 sb1
• Confidence interval for the true intercept: b0 ± tα/2 sb0
• Hypothesis tests: if β1 = 0, then X cannot influence Y, and the regression model collapses to a constant β0 plus random error.
• The hypotheses H0: β1 = 0 versus H1: β1 ≠ 0 are tested, using technology or formulas, with tcalc = b1/sb1 and n − 2 degrees of freedom.

Analysis of Variance: Overall Fit (12-11)

F Statistic for Overall Fit
• For a simple regression, the F statistic is

  Fcalc = SSR / [SSE/(n − 2)]

• For a given sample size, a larger F statistic indicates a better fit.
• Reject H0 if Fcalc > F1,n−2 from Appendix F for a given significance level α, or if the p-value < α.

Confidence and Prediction Intervals for Y (12-12)

How to Construct an Interval Estimate for Y
• Confidence interval for the conditional mean of Y:

  ŷ ± tα/2 se √[1/n + (x − x̄)²/Σ(xi − x̄)²]

• Prediction interval for individual values of Y:

  ŷ ± tα/2 se √[1 + 1/n + (x − x̄)²/Σ(xi − x̄)²]

• The prediction interval is wider because it must allow for the variability of an individual observation as well as the uncertainty in the estimated mean.

Violations of Assumptions (12-13)

Three Important Assumptions
1. The errors are normally distributed.
2. The errors have constant variance (i.e., they are homoscedastic).
3. The errors are independent (i.e., they are nonautocorrelated).
• The error εi is unobservable; the residuals ei from the fitted regression give clues about possible violations of these assumptions.

Leverage and Influence
• A high leverage statistic indicates that the observation is far from the mean of X. These observations are influential because they are at the "end of the lever."
• The leverage for observation i is denoted hi:

  hi = 1/n + (xi − x̄)²/Σ(xj − x̄)²

Unusual Observations (12-14)
• A leverage that exceeds 3/n is unusual.

Studentized Deleted Residuals
• Studentized deleted residuals are another way to identify unusual observations.
• A studentized deleted residual whose absolute value is 2 or more may be considered unusual (an outlier).

Other Regression Problems (12-15)
• Outliers: can cause loss of fit and other problems.
• Model misspecification: occurs when a relevant predictor has been omitted.
• Ill-conditioned data: can cause loss of regression accuracy.
• Spurious correlation: occurs when two variables appear related merely because of the way they are defined.
• Model form and variable transforms: sometimes a linear relationship will not fit, and a transformation of one or both variables is necessary before analysis.
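The chapter's OLS formulas, the slope t test, the F statistic, the two interval estimates, and the leverage rule can all be worked through in one short plain-Python sketch. The data and the table value t.025 = 2.447 (df = 6) are hypothetical illustrations, not from the text.

```python
import math

# Hypothetical data (not from the text)
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [52, 55, 61, 60, 68, 70, 75, 78]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

sxx = sum((xi - xbar) ** 2 for xi in x)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

# OLS slope and intercept: b1 = Sxy/Sxx, b0 = ybar - b1*xbar
b1 = sxy / sxx
b0 = ybar - b1 * xbar

# Sums of squares and coefficient of determination
yhat = [b0 + b1 * xi for xi in x]
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))
sst = sum((yi - ybar) ** 2 for yi in y)
ssr = sst - sse
r2 = ssr / sst

# Standard error of the estimate, standard error of the slope
se = math.sqrt(sse / (n - 2))
sb1 = se / math.sqrt(sxx)

# t test for H0: beta1 = 0 (df = n - 2), and the overall-fit F statistic
t_calc = b1 / sb1
f_calc = ssr / (sse / (n - 2))  # in simple regression, F = t^2

# Interval estimates for Y at x0 (t.025 = 2.447 for df = 6)
t_crit = 2.447
x0 = 5
y0 = b0 + b1 * x0
mean_half = t_crit * se * math.sqrt(1 / n + (x0 - xbar) ** 2 / sxx)      # conditional mean of Y
pred_half = t_crit * se * math.sqrt(1 + 1 / n + (x0 - xbar) ** 2 / sxx)  # individual Y (wider)

# Leverage for each observation; values above 3/n are flagged as unusual
h = [1 / n + (xi - xbar) ** 2 / sxx for xi in x]
unusual = [i for i, hi in enumerate(h) if hi > 3 / n]

print(f"b0 = {b0:.2f}, b1 = {b1:.3f}, R^2 = {r2:.3f}")
print(f"t_calc = {t_calc:.2f}, F = {f_calc:.1f}")
print(f"at x0 = {x0}: mean CI ±{mean_half:.2f}, prediction interval ±{pred_half:.2f}")
print(f"high-leverage observations (h > 3/n): {unusual}")
```

Note the connections the chapter states: R² equals the squared correlation, F equals the square of the slope's t statistic, the prediction interval is always wider than the confidence interval for the mean, and the leverages sum to 2 in a simple regression (so the observations flagged here are the extreme x values).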