Forecasting with Regression
Richard S. Barr

Causal (Explanatory) Forecasting
• Assumes a cause-and-effect relationship between a system's inputs and its output
• The job of forecasting: use that input-output relationship to predict the output

Regression Analysis
• Determines and measures the relationship between two or more variables
– "Simple" linear regression: 2 variables
– Multiple linear regression: 3+ variables

Simple Linear Regression
• Evaluates the relationship ("going together") of two variables
– Dependent variable (Y)
– Independent variable (X)
• Relationship depicted by a straight-line model: Y = a + bX

Which Is Independent?
• Sales vs. advertising
• Equipment wear vs. age
• Demand vs. time
• Units sold vs. price

Forecasting
• Build the model using historical data
• Then use knowledge of the independent variable (X) to forecast the value of the dependent variable (Y)
• Assumptions:
– The relationship between X and Y is strong
– The future follows the past

Regression Forecasting Steps
1. Plot the scatter diagram
2. Compute the regression equation
3. Forecast Y using the regression model and estimates of X

Scatter Diagram
• The first step in simple regression modeling
• Used to:
– Display the historical raw data
– Spot patterns of relationships
• Helps you determine whether regression is appropriate

Types of Relationships
• Direct linear (positive relationship): as X increases, Y tends to increase by a constant amount
• Inverse linear (negative relationship): as X increases, Y tends to decrease by a constant amount
• No correlation: a change in X tells nothing about Y
• Nonlinear relationship: as X increases, Y changes by a varying amount

Regression Model
• Expresses the relationship between X and Y as a straight line (the regression line):
Yc = a + bX
where
– Yc = estimated average Y for a given X
– X = actual value of the independent variable
– a = estimated Y-intercept (value of Yc when X = 0)
– b = estimated slope of the regression line = (change in Y) / (change in X)

Purposes of the Regression
• Provides a mathematical definition of the relationship
– Precise; accuracy depends on how well the line fits the data
• Serves as a standard of perfect correlation
– Can compare the line with the actual data values
– If all values fall on the line, correlation is perfect
• Is a model for forecasting Y using X
– Plug an X-value into Yc = a + bX
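The straight-line model can be sketched in a few lines of Python; the a and b values below are illustrative placeholders, not values fitted to any data.

```python
# Minimal sketch of the regression-line model Yc = a + b*X.
# The a and b values are hypothetical placeholders, not fitted values.

def forecast(a, b, x):
    """Estimated average Y (Yc) for a given X."""
    return a + b * x

a, b = 5.0, 2.0           # intercept and slope (hypothetical)
print(forecast(a, b, 3))  # Yc = 5 + 2*3 = 11.0
```

Once a and b have been estimated from historical data, the same function plays the forecasting role: plug in an estimate of X to get Yc.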
Which Line Is Best?
• There are many possibilities for a and b
– Each pair defines a different line and model
• To evaluate mathematically, let:
– Yi = historical value of Y for a given Xi
– Yc = calculated Y using Xi in the regression line
– (Yi − Yc) = deviation, the error between the actual value and the model's forecast

Measuring Goodness of Fit
• Measure the fit of the line to the data by:
– The sum of the deviations, Σ(Yi − Yc)
• Is 0 for any line passing through (X̄, Ȳ), due to +/− cancellations
– The sum of the squared deviations, Σ(Yi − Yc)²
• Eliminates the sign problem
• Is the generally accepted "least squares" criterion

Least-Squares Regression Line
• To minimize the squared deviations, use:
b = [ Σ(XY) − n·X̄·Ȳ ] / [ Σ(X²) − n·X̄² ]
a = Ȳ − b·X̄
where:
– n = number of data points
– X̄, Ȳ = means of the Xi's and Yi's
– Σ(XY) = sum of the products Xi × Yi
– Σ(X²) = sum of the squared Xi's

Mail Order Sales vs. Advertising

Date of Advertising   $ Spent on Advertising   $ Sales in Next Week
Sept. 9               $1,700                   $60,000
Sept. 26              3,000                    110,000
Oct. 2                2,000                    85,000
Oct. 9                1,500                    55,000
Oct. 16               600                      30,000
Oct. 23               1,500                    60,000

Scatter Plot
[Scatter plot: X = Advertising ($1000s), Y = Sales ($1000s)]

Computing the Regression Line
• Tabulate Xi (advertising, $1000s) in column (1) and Yi (sales, $1000s) in column (2):
Xi: 1.7, 3.0, 2.0, 1.5, 0.6, 1.5
Yi: 60, 110, 85, 55, 30, 60
• Step 1: Sum column (1) for ΣX
• Step 2: Sum column (2) for ΣY
• Step 3: Multiply (1) × (2) = column (3); sum it for ΣXY
• Step 4: Square column (1) = column (4); sum it for ΣX²
• Step 5: Compute the mean of X: X̄ = ΣX / n
• Step 6: Compute the mean of Y: Ȳ = ΣY / n
• Compute b = [ Σ(XY) − n·X̄·Ȳ ] / [ Σ(X²) − n·X̄² ]
• Compute a = Ȳ − b·X̄

The Regression Equation
• The resulting equation: Yc = 7.455 + 34.49X
• Interpretation and reasonableness check:
– a = 7.455 = estimated sales ($1000s) when nothing is spent on advertising
– b = 34.49 = estimated increase in sales ($1000s) per additional $1,000 of advertising
• Forecast sales with $1,800 of advertising: Yc = 7.455 + 34.49(1.8) ≈ 69.5, i.e., about $69,500

Evaluating the Model: How Well Did We Do?
• Compare the actuals with the model's estimates:

Xi    Yi    Estimate Yc   Error (Yi − Yc)   Error²
1.7   60    66.09         −6.09             37.11
3.0   110   110.93        −0.93             0.87
2.0   85    76.44         8.56              73.28
1.5   55    59.19         −4.19             17.58
0.6   30    28.15         1.85              4.42
1.5   60    59.19         0.81              0.65

Correlation Analysis
• Measures the degree of association between two variables

Measuring Correlation
• Compare two approaches to estimating or forecasting Y for a given X:
– Using the mean of Y
– Using our least-squares regression line

Variation Analysis
• We could use Ȳ to estimate Y (for any X) and, on average, be OK
• Can regression do better?
• Look at the deviations around the regression line to see how much better it explains the Y's than the mean does

Explained, Unexplained, and Total Deviation
• Explained deviation from the mean: (Yc − Ȳ), the deviation "explained" by the regression line
• Unexplained deviation: (Yi − Yc), the deviation from the mean not explained by the regression line
• Total deviation from the mean = explained + unexplained:
(Yi − Ȳ) = (Yc − Ȳ) + (Yi − Yc)

Variation
• Variation is the sum of squared deviations from the mean of Y
• Total variation = Explained variation + Unexplained variation:
Σ(Yi − Ȳ)² = Σ(Yc − Ȳ)² + Σ(Yi − Yc)²

Portion Explained, r²
• Sample coefficient of determination:
r² = Explained variation / Total variation = Σ(Yc − Ȳ)² / Σ(Yi − Ȳ)²
• The fraction of the variation from the mean explained by the regression line

Extreme Values of r²
• r² = 1
– Perfect linear correlation
– All points are explained by the line; all points lie on the line
• r² = 0
– No correlation
– The regression explains the data no better than the mean of Y
– X provides no useful information about Y in this context

Correlation Coefficient
• The correlation coefficient r = ±√r²
• Unitless
• Sign: + if b > 0, − if b < 0
• Simply a different way of expressing the relationship (correlation) between two variables
• r = ±1 only if a perfect linear relationship Y = a + bX exists (all points on the line)
• Some think it "looks better" than r²: for r² = 0.36, r = 0.60

Example Scatterplot A (n = 20)
y: 58 32 67 54 39 38 55 31 60 54 62 72 46 36 44 63 38 48 43 38
x: 51 42 65 52 45 24 45 31 51 60 67 44 40 53 52 59 41 51 55 42

Example Scatterplot B (n = 20)
y: 38 52 52 62 57 34 70 50 56 25 53 43 60 54 35 63 52 40 35 51
x: 45 58 40 41 61 56 64 40 65 57 55 45 55 56 54 34 55 48 55 55

Correlation Coefficient: Cautions
• Shows:
– The direction of the relationship
– The strength of association
• Cautions:
– It measures only linear association
– It is unstable with a small sample size
– It is distorted by extreme values or by mixing different data sets in the analysis

Nonlinear Relationship: Monkey Data (n = 20)
Wt: 55 45 35 39 53 41 51 35 57 57 45 47 35 49 43 51 31 53 47 51
Ht: 29 27 17 29 31 21 31 13 37 41 45 35 25 25 31 33 29 27 17 45

Monkey & King Kong Data
• The same 20 observations plus one extreme value: Wt = 130, Ht = 150
• A single extreme point like this distorts the correlation

Multiple Regression
• Same concept, more variables

Multiple Regression Models
• An extension of the simple case
• Permits the use of more variables to try to explain more of the variation
• Example model: Y = a + b1X1 + b2X2 + …
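The least-squares and r² computations above can be reproduced directly from the slides' formulas on the mail-order advertising data (a small sketch, using only the quantities defined in the slides):

```python
# Least-squares fit and r^2 for the mail-order data (X, Y in $1000s),
# following the slides' formulas step by step.

X = [1.7, 3.0, 2.0, 1.5, 0.6, 1.5]
Y = [60, 110, 85, 55, 30, 60]
n = len(X)

x_bar = sum(X) / n                                # Step 5: mean of X
y_bar = sum(Y) / n                                # Step 6: mean of Y
sum_xy = sum(x * y for x, y in zip(X, Y))         # Step 3: Sum(XY)
sum_x2 = sum(x * x for x in X)                    # Step 4: Sum(X^2)

# b = [Sum(XY) - n*Xbar*Ybar] / [Sum(X^2) - n*Xbar^2],  a = Ybar - b*Xbar
b = (sum_xy - n * x_bar * y_bar) / (sum_x2 - n * x_bar ** 2)
a = y_bar - b * x_bar
print(round(a, 3), round(b, 2))                   # → 7.455 34.49

# r^2 = explained variation / total variation
Yc = [a + b * x for x in X]
explained = sum((yc - y_bar) ** 2 for yc in Yc)
total = sum((y - y_bar) ** 2 for y in Y)
r2 = explained / total
print(round(r2, 3))                               # → 0.965

# Forecast next-week sales for $1,800 of advertising
print(round(a + b * 1.8, 1))                      # → 69.5
```

The fitted a and b match the slides' Yc = 7.455 + 34.49X, and the high r² reflects how little of the variation in sales is left unexplained by advertising in this small sample.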
Real Estate Example
• Monthly sales (Y) are related to:
– Mortgage rates (X1)
– Number of salespersons (X2)
• With simple regression models:
– Y = a + bX1: r² = 0.36
– Y = a + bX2: r² = 0.25
• Multiple regression model:
– Y = a + b1X1 + b2X2: r² = 0.49, not 0.61!

Why Isn't More Variation Explained?
• Multicollinearity exists: X1 is correlated with X2
– Part of the total variation explained by X1 overlaps the part explained by X2
– We want the X's to be independent (uncorrelated)

MLR Software

MLR Input
• Title line
• Numbers of variables and observations
• Labels for the variables, dependent variable last
• For each observation:
– Xij values, followed by Yj
– Xij's in label order
• Blanks separate all values and labels

MLR Reports
• Descriptive statistics
• Correlation matrix and its determinant
• Regression equation; for each variable:
– Label
– Coefficient
– Beta value
– Standard error of the coefficient
– t-statistic and the probability that bi = 0
• Analysis of variance
– P(insignificant regression model)
• Summary statistics
– r²
– s_y,x
• Residual summary (optional)
– Residuals (errors)
– Graph

Standard Error of the Estimate
• The standard deviation of the observed values of Y around the regression line:
s_y,x = sqrt[ Σ(Yi − Yc)² / (n − 2) ] = sqrt[ (ΣY² − aΣY − bΣXY) / (n − 2) ]
• Measures, on average, how much the data vary around the regression line

Confidence Intervals
• Using the 68-95-99.7 rule of normality:
– µ ± 1σ includes 68% of all values
– µ ± 2σ includes 95%
– µ ± 3σ includes 99.7%
• Yc ± Z·s_y,x gives an approximate confidence interval for a forecast, for a given probability and its associated Z-value
– If Z = 1, roughly 68% confidence that the interval contains the actual value
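As a rough illustration of the multiple-regression model Y = a + b1X1 + b2X2, the fit can be sketched with NumPy's least-squares solver. The data below is synthetic, generated from a known linear relationship so the result can be checked; it is not the real-estate data referenced in the slides (which is not provided).

```python
# Sketch: multiple linear regression Y = a + b1*X1 + b2*X2 via least squares.
# Synthetic, noise-free data generated from Y = 2 - 2*X1 + 3*X2, so the
# solver should recover those coefficients; NOT the slides' real-estate data.
import numpy as np

X1 = np.array([7.5, 7.0, 6.5, 6.0, 5.5, 5.0, 4.5, 4.0])          # e.g., mortgage rate
X2 = np.array([12.0, 14.0, 13.0, 16.0, 15.0, 18.0, 17.0, 20.0])  # e.g., salespersons
Y = 2 - 2 * X1 + 3 * X2                                          # monthly sales

# Design matrix with a leading column of 1s for the intercept a
A = np.column_stack([np.ones_like(X1), X1, X2])
coef, *_ = np.linalg.lstsq(A, Y, rcond=None)
a, b1, b2 = coef

# r^2 = 1 - (unexplained variation / total variation)
Yc = A @ coef
r2 = 1 - np.sum((Y - Yc) ** 2) / np.sum((Y - Y.mean()) ** 2)

# Standard error of the estimate; with k independent variables the
# denominator generalizes from n - 2 to n - k - 1
n, k = len(Y), 2
s_yx = float(np.sqrt(np.sum((Y - Yc) ** 2) / (n - k - 1)))

print(round(a, 3), round(b1, 3), round(b2, 3), round(r2, 3))  # → 2.0 -2.0 3.0 1.0
```

Note that X1 and X2 here are negatively correlated (rates fall while staffing grows), mirroring the slides' multicollinearity caution: with correlated X's, the variation each one "explains" overlaps, and the individual coefficients become harder to interpret.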