Chapter 14: Simple Linear Regression

§14.1 Simple Linear Regression Model
Dependent variable (Y): the variable that is being predicted or explained by the regression equation.
Independent variable (X): the variable that is doing the predicting or explaining.
Simple linear regression: regression involving only two variables, X and Y. The relationship between the two variables is approximated by a straight line.
Regression model: Y = β0 + β1X + ε. Describes how the variable Y is related to the variable X in simple linear regression.
Regression equation: E(Y) = β0 + β1X
Estimated regression equation: Ŷ = b0 + b1X. Estimated from sample data; also called the "fitted" line.

§14.2 Least Squares Method
Let Yi = the ith observation of the response variable and Ŷi = b0 + b1Xi = the estimated value of Y. Then the regression error for the ith observation is
  ei = Yi − Ŷi = Yi − (b0 + b1Xi)
"e" is also called the regression error or simply the residual.
Sum of squares due to error: SSE = ∑ei² = ∑(Yi − Ŷi)²
The objective of the least squares method is to determine the values of b0 and b1 that minimize SSE. The following formulas give the least value of SSE and are therefore called the least squares estimates:
  b1 = ∑(Xi − X̄)(Yi − Ȳ) / ∑(Xi − X̄)²  =  [∑XY − (∑X)(∑Y)/n] / [∑X² − (∑X)²/n]
  b0 = Ȳ − b1X̄
Excel functions: b1 = SLOPE(y_range, x_range) and b0 = INTERCEPT(y_range, x_range)

§14.3 Coefficient of Determination (r²)
The coefficient of determination is a measure of the goodness of fit of the estimated regression equation. It can be interpreted as the proportion of the variability in the dependent variable Y that is explained by the estimated regression equation.
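The least squares formulas can be sketched directly in code. This is a minimal illustration: the x and y values below are made up, and the function simply mirrors the computational formula for b1 and b0 (Excel's SLOPE and INTERCEPT return the same values for the same ranges).

```python
# Minimal sketch of the §14.2 least squares formulas (illustrative data).
def least_squares(x, y):
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi * xi for xi in x)
    # b1 = [∑XY − (∑X)(∑Y)/n] / [∑X² − (∑X)²/n]
    b1 = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
    # b0 = Ȳ − b1·X̄
    b0 = sum_y / n - b1 * sum_x / n
    return b0, b1

x = [1, 2, 3, 4, 5]   # illustrative sample
y = [2, 4, 5, 4, 5]
b0, b1 = least_squares(x, y)
print(b0, b1)          # fitted line: Ŷ = 2.2 + 0.6X for these data
```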
Sums of Squares

[Figure: scatter plot with the fitted line, showing the decomposition Yi − Ȳ = (Ŷi − Ȳ) + (Yi − Ŷi) for a sample point]

SST = total sum of squares = ∑(Yi − Ȳ)²
SSR = sum of squares due to regression = ∑(Ŷi − Ȳ)²
SSE = sum of squares due to error = ∑(Yi − Ŷi)²
SST = SSR + SSE

ANOVA Table
  Source       Sum of Squares   df     Mean Square        F-test
  Regression   SSR              1      MSR = SSR/1        F = MSR/MSE
  Error        SSE              n−2    MSE = SSE/(n−2)
  Total        SST              n−1

r² = coefficient of determination = SSR/SST

Correlation Coefficient (r)
The correlation coefficient is a measure of the strength of the linear association between two variables, X and Y. The value of r is always between −1 and +1. Negative values of r indicate a negative linear relationship and positive values of r indicate a positive linear relationship. The magnitude of r indicates the strength of the linear relationship, with a value of +1 or −1 indicating a perfect relationship. Values of r close to zero indicate that X and Y are not linearly related.
  r = (sign of b1) √(coefficient of determination) = (sign of b1) √r²

§14.4 Model Assumptions
Given the regression model Y = β0 + β1X + ε:
1. The error term ε is a random variable with a mean or expected value of zero; that is, E(ε) = 0. This also means E(Y) = β0 + β1X.
2. The variance of ε, denoted by σ², is the same for all values of X.
3. The values of ε are independent.
4. The error term ε is a normally distributed random variable. Therefore Y, being a linear function of ε, is also a normally distributed random variable.
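The SST = SSR + SSE decomposition and r² can be checked numerically. A minimal sketch, using the same illustrative data as before (b0 = 2.2 and b1 = 0.6 are the least squares estimates for those points):

```python
# Sketch of the sums-of-squares decomposition and r² (illustrative data).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1 = 2.2, 0.6                     # least squares estimates for these data
y_bar = sum(y) / len(y)
y_hat = [b0 + b1 * xi for xi in x]

sst = sum((yi - y_bar) ** 2 for yi in y)                # total
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)            # explained by regression
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))   # error
r2 = ssr / sst
r = (1 if b1 >= 0 else -1) * r2 ** 0.5   # r carries the sign of b1

# SST = SSR + SSE (here 6 = 3.6 + 2.4, up to floating-point rounding)
print(sst, ssr, sse, r2, r)
```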
§14.5 Testing for Significance
H0: β1 = 0
Ha: β1 ≠ 0

t-Test
Estimate of σ²: s² = MSE = SSE/(n−2)
s = √MSE = √(SSE/(n−2)), with degrees of freedom = n − 2
Standard error of b1: s_b1 = s / √∑(Xi − X̄)²
Test statistic: t_calc = b1 / s_b1
p-value = T.DIST.2T(ABS(t_calc), df), where df = n − 2
Confidence interval for β1: b1 ± t_{α/2} s_b1

F-test for significance of regression
Mean square due to regression: MSR = SSR / (degrees of freedom for regression), where the degrees of freedom for regression = the number of independent variables (1 in simple regression).
Test statistic: F_calc = MSR/MSE
p-value = F.DIST.RT(F_calc, df1, df2), where df1 = df for MSR and df2 = df for MSE

Caveat: regression analysis, which can be used to identify how variables are associated with one another, cannot be used as evidence of a cause-and-effect relationship.

§14.6 Estimation and Prediction
Point estimate of Y for a given value Xp: Ŷ = b0 + b1Xp
Confidence interval for the mean of Y at Xp: Ŷ ± t_{α/2} S_Ŷp, where
  S_Ŷp = s √( 1/n + (Xp − X̄)² / ∑(Xi − X̄)² )
Prediction interval for an individual Y at Xp: Ŷ ± t_{α/2} S_ind, where
  S_ind = s √( 1 + 1/n + (Xp − X̄)² / ∑(Xi − X̄)² )

Estimation and prediction in Excel
Prediction interval for an individual Y at Xp: read the Lower 95% and Upper 95% values for Prediction from the special regression output.
Confidence interval estimate for the mean of Y at Xp: Ŷ ± t_{α/2} S_Ŷp, where
  S_Ŷp = √(S_ind² − MSE), df = df of MSE, and S_ind is taken from the special regression output.

§14.8 Residual Analysis: Validating Model Assumptions
Residual = Yi − Ŷi
Given the regression model Y = β0 + β1X + ε, the model assumptions are:
1. The error term ε is a random variable with a mean or expected value of zero; that is, E(ε) = 0. This also means E(Y) = β0 + β1X.
2. The variance of ε, denoted by σ², is the same for all values of X.
3. The values of ε are independent.
4. The error term ε is a normally distributed random variable. Therefore Y, being a linear function of ε, is also a normally distributed random variable.
Four plots are used:
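The t-test and F-test computations can be sketched as follows. The data are the same illustrative sample used earlier (so b1 = 0.6, SSR = 3.6, SSE = 2.4), and the critical value t_{0.025} with 3 degrees of freedom is taken from a standard t table; note that in simple regression F_calc = t_calc².

```python
# Sketch of the §14.5 t- and F-tests for β1 (illustrative data; b1 and the
# sums of squares come from the earlier least squares computation).
import math

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n, b1 = len(x), 0.6
ssr, sse = 3.6, 2.4

mse = sse / (n - 2)                      # s² = MSE, estimate of σ²
s = math.sqrt(mse)
x_bar = sum(x) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
s_b1 = s / math.sqrt(sxx)                # standard error of b1
t_calc = b1 / s_b1

msr = ssr / 1                            # one independent variable
f_calc = msr / mse                       # F = MSR/MSE; equals t_calc² here

t_crit = 3.182                           # t_{0.025}, n−2 = 3 df (t table)
ci = (b1 - t_crit * s_b1, b1 + t_crit * s_b1)   # 95% CI for β1
print(round(t_calc, 3), round(f_calc, 3), ci)
```

Since the 95% confidence interval for β1 contains zero for these data, H0: β1 = 0 would not be rejected at the 0.05 level.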
1. A plot of the residuals against values of the independent variable X
2. A plot of the residuals against the predicted values Ŷ of the dependent variable
3. A standardized residual plot
4. A normal probability plot

Plot #1: Residual Plot Against X
[Figure: Panels A, B, and C showing typical residual patterns against X]

Plot #2: Residual Plot Against Ŷ
For simple regression this plot is the same as Plot #1.

Plot #3: Standardized Residuals Against X
Residual: ei = Yi − Ŷi
Standardized residual: (Yi − Ŷi) / S_{Yi−Ŷi}
where S_{Yi−Ŷi} = the standard deviation of the ith residual = s √(1 − hi), and
  hi = 1/n + (Xi − X̄)² / ∑(Xi − X̄)²
The standardized residual plot can provide insight about the assumption that the error term has a normal distribution. If this assumption is satisfied, we should expect to see approximately 95% of the standardized residuals between −2 and +2.

Plot #4: Normal Probability Plot
Another approach for determining the validity of the assumption that the error term has a normal distribution is the normal probability plot.

§14.9 Residual Analysis: Outliers and Influential Observations
Outlier
A data point or observation that is unusual compared to the remaining data and must be examined carefully. The standardized residuals can be used to identify outliers: if an observation deviates greatly from the pattern of the rest of the data (i.e., is an outlier), the corresponding standardized residual will be large in absolute value.
i. Outliers may represent erroneous data; if so, the data should be corrected.
ii. Outliers may signal a violation of model assumptions; if so, another model should be considered.
iii. Outliers may simply be unusual values that occurred by chance; in this case, they should be retained.

Influential Observation
An observation that has a strong influence or effect on the regression results. A single influential observation can dramatically change the estimated regression equation.

Leverage
A measure of the influence an observation has on the regression results.
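The standardized residual formula from Plot #3 can be sketched numerically. This uses the same illustrative data as the earlier examples (b0 = 2.2, b1 = 0.6 are the least squares estimates for these points):

```python
# Sketch of the standardized residual (Yi − Ŷi) / (s·√(1 − hi)) on
# illustrative data.
import math

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1 = 2.2, 0.6
n = len(x)
x_bar = sum(x) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)

residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
s = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))   # s = √MSE

std_resid = []
for xi, e in zip(x, residuals):
    h = 1 / n + (xi - x_bar) ** 2 / sxx   # leverage h_i
    std_resid.append(e / (s * math.sqrt(1 - h)))

# Under the normality assumption, roughly 95% of standardized residuals
# should fall between −2 and +2.
print(all(abs(z) < 2 for z in std_resid))
```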
Influential observations, i.e., observations with extreme values of the independent variable, have high leverage and are called high leverage points.
The leverage of an observation is given by
  hi = 1/n + (Xi − X̄)² / ∑(Xi − X̄)²
If hi > 6/n or hi > 0.99, whichever cutoff is smaller, the observation is considered a high leverage point. In these cases:
i. If the observation is invalid, correct it and find the new estimated regression equation.
ii. If the observation is valid, obtain data on intermediate values of x to better understand the relationship between x and y.
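The high-leverage rule can be sketched as a simple check. The x values below are illustrative, with one deliberately extreme point; an observation is flagged when its leverage exceeds the smaller of 6/n and 0.99:

```python
# Sketch of the §14.9 high-leverage rule hi > min(6/n, 0.99)
# (illustrative x values, with one extreme point at x = 50).
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]
n = len(x)
x_bar = sum(x) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
cutoff = min(6 / n, 0.99)

leverage = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
flags = [h > cutoff for h in leverage]
print(flags)   # only the extreme observation is flagged
```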