Chapter Topics
• Types of Regression Models
• Determining the Simple Linear Regression Equation
• Measures of Variation in Regression and Correlation
• Assumptions of Regression and Correlation
• Residual Analysis and the Durbin-Watson Statistic
• Estimation of Predicted Values
• Correlation: Measuring the Strength of the Association

Purpose of Regression and Correlation Analysis
• Regression analysis is used primarily for prediction: a statistical model predicts the values of a dependent (response) variable from the values of at least one independent (explanatory) variable.
• Correlation analysis is used to measure the strength of the association between numerical variables.

Types of Regression Models
[Figure: four scatter-plot panels]
• Positive linear relationship
• Negative linear relationship
• Relationship not linear
• No relationship

Simple Linear Regression Model
• The relationship between the variables is a linear function.
• The model is the straight line that best fits the data.

$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$$

Y_i = dependent (response) variable
β_0 = Y intercept
β_1 = slope
X_i = independent (explanatory) variable
ε_i = random error

Sample Linear Regression Model

$$\hat{Y}_i = b_0 + b_1 X_i$$

Ŷ_i = predicted value of Y for observation i
X_i = value of X for observation i
b_0 = sample Y intercept, used as an estimate of the population β_0
b_1 = sample slope, used as an estimate of the population β_1

Simple Linear Regression Equation: Example
You wish to examine the relationship between the square footage of produce stores and their annual sales. Sample data for 7 stores were obtained. Find the equation of the straight line that fits the data best.

Store   Square Feet   Annual Sales ($000)
1       1,726         3,681
2       1,542         3,395
3       2,816         6,653
4       5,555         9,543
5       1,292         3,318
6       2,208         5,563
7       1,313         3,760

Equation for the Best Straight Line

$$\hat{Y}_i = b_0 + b_1 X_i = 1636.415 + 1.487 X_i$$

From Excel printout:

               Coefficients
Intercept      1636.414726
X Variable 1   1.486633657

Graph of the Best Straight Line
[Figure: scatter plot of Annual Sales ($000) against Square Feet with the fitted line]

Interpreting the Results
Ŷ_i = 1636.415 + 1.487 X_i
The slope of 1.487 means that for each increase of one unit in X, Y is estimated to increase by 1.487 units. Because sales are recorded in $000, each additional square foot of store size is estimated to raise expected annual sales by $1,487.

Measures of Variation: The Sum of Squares
• SST = total sum of squares: measures the variation of the Y_i values around their mean Ȳ.
• SSR = regression sum of squares: explained variation, attributable to the relationship between X and Y.
• SSE = error sum of squares: variation attributable to factors other than the relationship between X and Y.

$$SST = \sum_{i=1}^{n} (Y_i - \bar{Y})^2, \qquad SSR = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2, \qquad SSE = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$$

[Figure: the three deviations shown at a single point X_i relative to the fitted line and Ȳ]

The Sum of Squares: Example
Excel output for produce stores:

             df   SS
Regression   1    30380456.12   (SSR)
Residual     5    1871199.595   (SSE)
Total        6    32251655.71   (SST)

The Coefficient of Determination

$$r^2 = \frac{SSR}{SST} = \frac{\text{regression sum of squares}}{\text{total sum of squares}}$$

Measures the proportion of the variation in Y that is explained by the independent variable X in the regression model.

Coefficients of Determination (r²) and Correlation (r)
[Figure: four scatter plots, each with the fitted line Ŷ_i = b_0 + b_1 X_i. Panels: r² = 1, r = +1; r² = 1, r = −1; r² = .8, r = +0.9; r² = 0, r = 0]

Standard Error of Estimate

$$S_{YX} = \sqrt{\frac{SSE}{n-2}} = \sqrt{\frac{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{n-2}}$$

The standard deviation of the variation of observations around the regression line.

Measures of Variation: Example
Excel output for produce stores:

Regression Statistics
Multiple R          0.97055727
R Square            0.94198129
Adjusted R Square   0.93037754
Standard Error      611.751517
Observations        7

r² = .94: 94% of the variation in annual sales can be explained by the variability in the size of the store as measured by square footage.
S_YX = 611.75

Linear Regression Assumptions
For linear models:
1. Normality
   – Y values are normally distributed for each X
   – The probability distribution of the error is normal
2. Homoscedasticity (constant variance)
3. Independence of errors

Variation of Errors Around the Regression Line
[Figure: error densities f(e) at X_1 and X_2, centered on the regression line]
Y values are normally distributed around the regression line. For each X value, the spread (variance) around the regression line is the same.

Residual Analysis
• Purposes
   – Examine linearity
   – Evaluate violations of assumptions
• Graphical analysis of residuals
   – Plot residuals vs. X_i values, where a residual is the difference between the actual Y_i and the predicted Ŷ_i
   – Studentized residuals: allow consideration of the magnitude of the residuals

Residual Analysis for Linearity
[Figure: residuals e plotted against X. Panels: Not Linear; Linear]

Residual Analysis for Homoscedasticity
[Figure: standardized residuals (SR) plotted against X. Panels: Heteroscedasticity; Homoscedasticity]

The Durbin-Watson Statistic
• Used when data are collected over time, to detect autocorrelation (residuals in one time period are related to residuals in another period)
• Measures violation of the independence assumption

$$D = \frac{\sum_{i=2}^{n} (e_i - e_{i-1})^2}{\sum_{i=1}^{n} e_i^2}$$

D should be close to 2. If not, examine the model for autocorrelation.

Residual Analysis for Independence
[Figure: standardized residuals (SR) plotted against X. Panels: Not Independent; Independent]

Inferences about the Slope: t Test
• t test for a population slope: is there a linear relationship between X and Y?
• Null and alternative hypotheses
   – H0: β_1 = 0 (no linear relationship)
   – H1: β_1 ≠ 0 (linear relationship)
• Test statistic

$$t = \frac{b_1 - \beta_1}{S_{b_1}}, \qquad S_{b_1} = \frac{S_{YX}}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2}}, \qquad df = n - 2$$

Example: Produce Stores
Data for 7 stores: as tabulated in the Simple Linear Regression Equation example.
Regression model obtained: Ŷ_i = 1636.415 + 1.487 X_i
The slope of this model is 1.487.
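The fitted line, the sums of squares, r² and the standard error of estimate quoted for the produce stores can be recomputed from the seven observations. What follows is a minimal sketch in plain Python, assuming ordinary least squares; the function names (`fit_line`, `sums_of_squares`, `durbin_watson`) are illustrative and do not come from the slides. Because the store data are not a time series, the Durbin-Watson function is exercised only on a made-up residual sequence.

```python
import math

# Produce-store data from the example (annual sales in $000).
square_feet = [1726, 1542, 2816, 5555, 1292, 2208, 1313]
annual_sales = [3681, 3395, 6653, 9543, 3318, 5563, 3760]


def fit_line(x, y):
    """Return (b0, b1) for the least-squares line y = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


def sums_of_squares(x, y, b0, b1):
    """Return (SST, SSR, SSE) for the fitted line."""
    y_bar = sum(y) / len(y)
    fitted = [b0 + b1 * xi for xi in x]
    sst = sum((yi - y_bar) ** 2 for yi in y)
    ssr = sum((fi - y_bar) ** 2 for fi in fitted)
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    return sst, ssr, sse


def durbin_watson(residuals):
    """Durbin-Watson D for residuals listed in time order."""
    num = sum((residuals[i] - residuals[i - 1]) ** 2
              for i in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den


b0, b1 = fit_line(square_feet, annual_sales)
sst, ssr, sse = sums_of_squares(square_feet, annual_sales, b0, b1)
r_squared = ssr / sst
s_yx = math.sqrt(sse / (len(square_feet) - 2))  # n - 2 degrees of freedom

print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")
print(f"SST = {sst:.2f}, SSR = {ssr:.2f}, SSE = {sse:.2f}")
print(f"r^2 = {r_squared:.4f}, S_YX = {s_yx:.2f}")

# Hypothetical residuals that alternate in sign (not the store data).
print(f"D = {durbin_watson([1.0, -1.0, 1.0, -1.0]):.2f}")
```

The alternating residuals give a D well above 2, which points to negative autocorrelation; values well below 2 point to positive autocorrelation.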
Is there a linear relationship between the square footage of a store and its annual sales?

Inferences about the Slope: t Test
• H0: β_1 = 0
• H1: β_1 ≠ 0
• α = .05
• df = 7 − 2 = 5
• Critical values: ±2.5706 (rejection region of .025 in each tail)

Test statistic, from Excel printout:

               t Stat      P-value
Intercept      3.6244333   0.0151488
X Variable 1   9.009944    0.0002812

Decision: Reject H0, since t = 9.009944 exceeds 2.5706.
Conclusion: There is evidence of a linear relationship.

Inferences about the Slope: Confidence Interval Example
Confidence interval estimate of the slope:

$$b_1 \pm t_{n-2} S_{b_1}$$

Excel printout for produce stores:

               Lower 95%    Upper 95%
Intercept      475.810926   2797.01853
X Variable 1   1.06249037   1.91077694

At the 95% level of confidence, the confidence interval for the slope is (1.062, 1.911), which does not include 0.
Conclusion: There is a significant linear relationship between annual sales and the size of the store.

Estimation of Predicted Values
Confidence interval estimate for μ_{Y|X}, the mean of Y given a particular X_i:

$$\hat{Y}_i \pm t_{n-2} S_{YX} \sqrt{\frac{1}{n} + \frac{(X_i - \bar{X})^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2}}$$

S_YX is the standard error of the estimate and t_{n-2} is the t value from the table with df = n − 2. The size of the interval varies with the distance of X_i from the mean X̄.

Estimation of Predicted Values
Confidence interval estimate for an individual response Y_i at a particular X_i:

$$\hat{Y}_i \pm t_{n-2} S_{YX} \sqrt{1 + \frac{1}{n} + \frac{(X_i - \bar{X})^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2}}$$

The additional 1 under the square root makes this interval wider than the interval for the mean of Y.

Interval Estimates for Different Values of X
[Figure: the confidence interval for the mean of Y and the wider interval for an individual Y_i, drawn around the fitted line, with X̄ and a given X marked on the horizontal axis]

Example: Produce Stores
Data for 7 stores:

Store   Square Feet   Annual Sales ($000)
1       1,726         3,681
2       1,542         3,395
3       2,816         6,653
4       5,555         9,543
5       1,292         3,318
6       2,208         5,563
7       1,313         3,760

Predict the annual sales for a store with 2,000 square feet.
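The slope test and the confidence interval for the slope can be checked by hand from the same seven observations. This is a sketch rather than a reproduction of the Excel routine: it assumes the critical value 2.5706 (t with 5 degrees of freedom, α = .05, two-tailed) is read from a table, and the helper name `slope_inference` is hypothetical.

```python
import math

# Produce-store data from the example (annual sales in $000).
square_feet = [1726, 1542, 2816, 5555, 1292, 2208, 1313]
annual_sales = [3681, 3395, 6653, 9543, 3318, 5563, 3760]

# Table value for df = 5, alpha = .05 two-tailed (assumed, not computed here).
T_CRITICAL = 2.5706


def slope_inference(x, y, t_crit):
    """Return (b1, s_b1, t_stat, lower, upper) for the slope of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s_yx = math.sqrt(sse / (n - 2))   # standard error of estimate
    s_b1 = s_yx / math.sqrt(sxx)      # standard error of the slope
    t_stat = (b1 - 0.0) / s_b1        # under H0: beta1 = 0
    half_width = t_crit * s_b1
    return b1, s_b1, t_stat, b1 - half_width, b1 + half_width


b1, s_b1, t_stat, lower, upper = slope_inference(
    square_feet, annual_sales, T_CRITICAL)
reject_h0 = abs(t_stat) > T_CRITICAL

print(f"t = {t_stat:.4f}, reject H0: {reject_h0}")
print(f"95% CI for slope: ({lower:.3f}, {upper:.3f})")
```

Rejecting H0 and finding that the interval excludes 0 are two views of the same result, since both compare b1 with the same multiple of S_b1.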
Regression model obtained: Ŷ_i = 1636.415 + 1.487 X_i

Estimation of Predicted Values: Example
Confidence Interval Estimate for μ_{Y|X}
Find the 95% confidence interval for the average annual sales for stores of 2,000 square feet.

Predicted sales: Ŷ_i = 1636.415 + 1.487(2000) = 4610.415 ($000)
X̄ = 2350.29, S_YX = 611.75, t_{n-2} = t_5 = 2.5706

With n = 7 and Σ(X_i − X̄)² ≈ 13,746,317 from the data, the quantity under the square root is 1/7 + (2000 − 2350.29)²/13,746,317 ≈ 0.1518.

$$\hat{Y}_i \pm t_{n-2} S_{YX} \sqrt{\frac{1}{n} + \frac{(X_i - \bar{X})^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2}} = 4610.415 \pm 612.66$$

This is the confidence interval for the mean of Y.

Estimation of Predicted Values: Example
Confidence Interval Estimate for Individual Y
Find the 95% confidence interval for the annual sales of one particular store of 2,000 square feet.

Predicted sales: Ŷ_i = 1636.415 + 1.487(2000) = 4610.415 ($000)
X̄ = 2350.29, S_YX = 611.75, t_{n-2} = t_5 = 2.5706

$$\hat{Y}_i \pm t_{n-2} S_{YX} \sqrt{1 + \frac{1}{n} + \frac{(X_i - \bar{X})^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2}} = 4610.415 \pm 1687.70$$

This is the confidence interval for an individual Y.

Correlation: Measuring the Strength of Association
• Answers the question "How strong is the linear relationship between 2 variables?"
• Coefficient of correlation used
   – Population correlation coefficient denoted ρ ("rho")
   – Values range from −1 to +1
   – Measures degree of association
• The sample coefficient r is the square root of the coefficient of determination, taking the sign of the slope b_1

Test of Coefficient of Correlation
• Tests whether there is a linear relationship between 2 numerical variables
• Gives the same conclusion as testing the population slope β_1
• Hypotheses
   – H0: ρ = 0 (no correlation)
   – H1: ρ ≠ 0 (correlation)
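Both interval estimates at X = 2,000 and the correlation test can be sketched directly from the formulas in this section. The code below is an illustration under stated assumptions, not the slides' own computation: it works with unrounded coefficients (so its point prediction differs slightly from one based on the rounded 1636.415 and 1.487), it takes the table value 2.5706 as given, and for H0: ρ = 0 it uses the standard statistic t = r√(n − 2)/√(1 − r²), which the slides do not write out. The function names are hypothetical.

```python
import math

# Produce-store data from the example (annual sales in $000).
square_feet = [1726, 1542, 2816, 5555, 1292, 2208, 1313]
annual_sales = [3681, 3395, 6653, 9543, 3318, 5563, 3760]

# Table value for df = 5, alpha = .05 two-tailed (assumed, not computed here).
T_CRITICAL = 2.5706


def interval_half_widths(x, y, x_new, t_crit):
    """Return (y_hat, half-width for mean of Y, half-width for individual Y)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s_yx = math.sqrt(sse / (n - 2))
    # Term under the square root for the mean of Y at x_new.
    h = 1.0 / n + (x_new - x_bar) ** 2 / sxx
    y_hat = b0 + b1 * x_new
    mean_hw = t_crit * s_yx * math.sqrt(h)
    indiv_hw = t_crit * s_yx * math.sqrt(1.0 + h)  # extra 1 widens the interval
    return y_hat, mean_hw, indiv_hw


def correlation_t(x, y):
    """Return (r, t), where t tests H0: rho = 0 with n - 2 degrees of freedom."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    r = sxy / math.sqrt(sxx * syy)
    t = r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)
    return r, t


y_hat, mean_hw, indiv_hw = interval_half_widths(
    square_feet, annual_sales, 2000, T_CRITICAL)
r, t_r = correlation_t(square_feet, annual_sales)

print(f"Mean of Y at X=2000:   {y_hat:.2f} +/- {mean_hw:.2f}")
print(f"Individual Y at X=2000: {y_hat:.2f} +/- {indiv_hw:.2f}")
print(f"r = {r:.4f}, t = {t_r:.4f}")
```

In simple linear regression the t statistic for r equals the t statistic for the slope, which is why the two tests lead to the same conclusion.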