Purpose of Regression Analysis
• Regression analysis is used primarily to model causality and provide prediction
– Predicts the value of a dependent (response) variable based on the value of at least one independent (explanatory) variable
– Explains the effect of the independent variables on the dependent variable

Types of Regression Models
[Figure: four scatter diagrams showing a positive linear relationship, a negative linear relationship, a relationship that is not linear, and no relationship]

Simple Linear Regression Model
• Relationship between variables is described by a linear function
• The change of one variable causes the change in the other variable
• A dependency of one variable on the other

Population Linear Regression
The population regression line is a straight line that describes the dependence of the average value (conditional mean) of one variable on the other:
$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$
• $Y_i$ = dependent (response) variable
• $\beta_0$ = population Y intercept
• $\beta_1$ = population slope coefficient
• $X_i$ = independent (explanatory) variable
• $\varepsilon_i$ = random error
• $\mu_{Y|X} = \beta_0 + \beta_1 X_i$ = population regression line (conditional mean)

Population Linear Regression (continued)
[Figure: population regression line $\mu_{Y|X} = \beta_0 + \beta_1 X_i$ with one observed value of Y; the random error $\varepsilon_i$ is the vertical distance between the observed value $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$ and the conditional mean]

Sample Linear Regression
The sample regression line provides an estimate of the population regression line as well as a predicted value of Y:
$Y_i = b_0 + b_1 X_i + e_i$
• $b_0$ = sample Y intercept
• $b_1$ = sample slope coefficient
• $e_i$ = residual
• $\hat{Y} = b_0 + b_1 X$ = sample regression line (fitted regression line, predicted value)

Sample Linear Regression (continued)
• $b_0$ and $b_1$ are obtained by finding the values of $b_0$ and $b_1$ that minimize the sum of the squared residuals: $\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} e_i^2$
• $b_0$ provides an estimate of $\beta_0$
• $b_1$ provides an estimate of $\beta_1$

Sample Linear Regression (continued)
[Figure: observed value $Y_i$ shown against both the sample line $\hat{Y}_i = b_0 + b_1 X_i$ (residual $e_i$) and the population line $\mu_{Y|X} = \beta_0 + \beta_1 X_i$ (random error $\varepsilon_i$)]

Interpretation of the Slope and the Intercept
• $\beta_0 = E(Y \mid X = 0)$ is the average value of Y when the value of X is zero.
• $\beta_1 = \Delta E(Y \mid X) / \Delta X$ measures the change in the average value of Y as a result of a one-unit change in X.
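
As a worked illustration of the least-squares criterion described above, setting the partial derivatives of the sum of squared residuals to zero gives the standard closed-form estimates. The sample means $\bar{X}$ and $\bar{Y}$ and the name $SSE(b_0, b_1)$ are notation introduced for this sketch; the slides themselves do not show the derivation.

```latex
\begin{aligned}
SSE(b_0, b_1) &= \sum_{i=1}^{n} \left( Y_i - b_0 - b_1 X_i \right)^2 \\
\frac{\partial SSE}{\partial b_0} = 0 \;&\Rightarrow\; b_0 = \bar{Y} - b_1 \bar{X} \\
\frac{\partial SSE}{\partial b_1} = 0 \;&\Rightarrow\; b_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}
\end{aligned}
```
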
Interpretation of the Slope and the Intercept (continued)
• $b_0 = \hat{E}(Y \mid X = 0)$ is the estimated average value of Y when the value of X is zero.
• $b_1 = \Delta \hat{E}(Y \mid X) / \Delta X$ is the estimated change in the average value of Y as a result of a one-unit change in X.

Simple Linear Regression: Example
You want to examine the linear dependency of the annual sales of produce stores on their size in square footage. Sample data for seven stores were obtained. Find the equation of the straight line that fits the data best.

Store   Square Feet   Annual Sales ($1000)
1       1,726         3,681
2       1,542         3,395
3       2,816         6,653
4       5,555         9,543
5       1,292         3,318
6       2,208         5,563
7       1,313         3,760

Scatter Diagram: Example
[Figure: Excel scatter diagram of annual sales ($000) against square feet]

Equation for the Sample Regression Line: Example
$\hat{Y}_i = b_0 + b_1 X_i = 1636.415 + 1.487 X_i$
From Excel Printout:
               Coefficients
Intercept      1636.414726
X Variable 1   1.486633657

Excel Output
Regression Statistics
Multiple R          0.970557
R Square            0.941981
Adjusted R Square   0.930378
Standard Error      611.7515
Observations        7

ANOVA
             df   SS         MS         F          Significance F
Regression   1    30380456   30380456   81.17909   0.000281
Residual     5    1871200    374239.9
Total        6    32251656

               Coefficients   Standard Error   t Stat     P-value    Lower 95%   Upper 95%
Intercept      1636.415       451.4953         3.624433   0.015149   475.8109    2797.019
X Variable 1   1.486634       0.164999         9.009944   0.000281   1.06249     1.910777

Graph of the Sample Regression Line: Example
[Figure: scatter diagram of annual sales ($000) against square feet with the fitted sample regression line]

Interpretation of Results: Example
$\hat{Y}_i = 1636.415 + 1.487 X_i$
The slope of 1.487 means that for each increase of one unit in X, we predict the average of Y to increase by an estimated 1.487 units.
Because sales are recorded in thousands of dollars, the model estimates that for each increase of one square foot in the size of the store, the expected annual sales are predicted to increase by $1,487.

How Good is the regression?
• $R^2$
• Confidence Intervals
• Residual Plots
• Analysis of Variance
• Hypothesis (t) tests

Measure of Variation: The Sum of Squares
SST = SSR + SSE
Total sample variability = explained variability + unexplained variability

Measure of Variation: The Sum of Squares (continued)
• SST = total sum of squares
– Measures the variation of the $Y_i$ values around their mean $\bar{Y}$
• SSR = regression sum of squares
– Explained variation attributable to the relationship between X and Y
• SSE = error sum of squares
– Variation attributable to factors other than the relationship between X and Y

Measure of Variation: The Sum of Squares (continued)
[Figure: one observation at $X_i$ with its deviations from the fitted line and from the mean $\bar{Y}$]
• $SST = \sum (Y_i - \bar{Y})^2$
• $SSR = \sum (\hat{Y}_i - \bar{Y})^2$
• $SSE = \sum (Y_i - \hat{Y}_i)^2$

The Coefficient of Determination
• $r^2 = \dfrac{SSR}{SST} = \dfrac{\text{Regression Sum of Squares}}{\text{Total Sum of Squares}}$
• Measures the proportion of variation in Y that is explained by the independent variable X in the regression model

Coefficients of Determination ($r^2$) and Correlation (r)
[Figure: four scatter diagrams with the fitted line $\hat{Y}_i = b_0 + b_1 X_i$: $r^2 = 1$, $r = +1$; $r^2 = 1$, $r = -1$; $r^2 = .8$, $r = +0.9$; $r^2 = 0$, $r = 0$]

Linear Regression Assumptions
1. Linearity
2. Normality
– Y values are normally distributed for each X
– Probability distribution of error is normal
3. Homoscedasticity (Constant Variance)
4. Independence of Errors

Residual Analysis
• Purposes
– Examine linearity
– Evaluate violations of assumptions
• Graphical Analysis of Residuals
– Plot residuals vs. $X_i$, $\hat{Y}_i$ and time

Residual Analysis for Linearity
[Figure: scatter diagrams of Y against X with the matching residual plots of e against X, for a relationship that is not linear and for one that is linear]

Residual Analysis for Homoscedasticity
[Figure: scatter diagrams of Y against X with the matching plots of SR against X, showing heteroscedasticity and homoscedasticity]

Variation of Errors around the Regression Line
• Y values are normally distributed around the regression line.
• For each X value, the “spread” or variance around the regression line is the same.
[Figure: error distributions f(e) of equal spread centered on the sample regression line at $X_1$ and $X_2$]

Residual Analysis: Excel Output for Produce Stores Example
Observation   Predicted Y    Residuals
1             4202.344417    -521.3444173
2             3928.803824    -533.8038245
3             5822.775103    830.2248971
4             9894.664688    -351.6646882
5             3557.14541     -239.1454103
6             4918.90184     644.0981603
7             3588.364717    171.6352829
[Figure: Excel residual plot of the residuals against square feet]

Residual Analysis for Independence
Graphical Approach
[Figure: residuals e plotted against time; a cyclical pattern indicates errors that are not independent, no particular pattern indicates independent errors]
Residuals are plotted against time to detect any autocorrelation.

Inference about the Slope: t Test
• t test for a population slope
– Is there a linear dependency of Y on X?
• Null and alternative hypotheses
– $H_0: \beta_1 = 0$ (no linear dependency)
– $H_1: \beta_1 \neq 0$ (linear dependency)
• Test statistic
– $t = \dfrac{b_1 - \beta_1}{S_{b_1}}$ where $S_{b_1} = \dfrac{S_{YX}}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2}}$
– d.f. = n − 2

Example: Produce Store
Data for Seven Stores:
Store   Square Feet   Annual Sales ($000)
1       1,726         3,681
2       1,542         3,395
3       2,816         6,653
4       5,555         9,543
5       1,292         3,318
6       2,208         5,563
7       1,313         3,760
Estimated Regression Equation: $\hat{Y}_i = 1636.415 + 1.487 X_i$
The slope of this model is 1.487. Is square footage of the store affecting its annual sales?

Inferences about the Slope: t Test Example
• $H_0: \beta_1 = 0$
• $H_1: \beta_1 \neq 0$
• $\alpha = .05$
• df = 7 − 2 = 5
• Critical values: ±2.5706 (rejection region of area .025 in each tail)
Test Statistic, from Excel Printout: $t = b_1 / S_{b_1} = 1.4866 / 0.1650 = 9.0099$
            Coefficients   Standard Error   t Stat   P-value
Intercept   1636.4147      451.4953         3.6244   0.01515
Footage     1.4866         0.1650           9.0099   0.00028
Decision: Reject $H_0$
Conclusion: There is evidence that square footage affects annual sales.
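
The produce-store fit and the slope t test can be reproduced with a short script. This is a minimal sketch using NumPy rather than Excel; the function name `simple_ols` and the dictionary keys are illustrative choices, and the critical value 2.5706 is copied from the slide rather than computed.

```python
import numpy as np

# Produce-store data from the example: store size (square feet) and annual sales ($1000).
x = np.array([1726, 1542, 2816, 5555, 1292, 2208, 1313], dtype=float)
y = np.array([3681, 3395, 6653, 9543, 3318, 5563, 3760], dtype=float)


def simple_ols(x, y):
    """Least-squares fit of y = b0 + b1 * x, plus r^2 and the slope t statistic."""
    n = len(x)
    sxx = np.sum((x - x.mean()) ** 2)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx
    b0 = y.mean() - b1 * x.mean()
    residuals = y - (b0 + b1 * x)
    sse = np.sum(residuals ** 2)            # unexplained variation
    sst = np.sum((y - y.mean()) ** 2)       # total variation
    s_yx = np.sqrt(sse / (n - 2))           # standard error of the estimate
    s_b1 = s_yx / np.sqrt(sxx)              # standard error of the slope
    return {
        "b0": b0,
        "b1": b1,
        "r2": 1.0 - sse / sst,
        "s_b1": s_b1,
        "t": b1 / s_b1,                     # tests H0: beta1 = 0
        "residuals": residuals,
    }


fit = simple_ols(x, y)

# Two-tailed critical value for alpha = .05 with n - 2 = 5 degrees of freedom (from the slide).
T_CRIT = 2.5706
reject_h0 = abs(fit["t"]) > T_CRIT

print(f"b0 = {fit['b0']:.3f}, b1 = {fit['b1']:.4f}, r2 = {fit['r2']:.4f}")
print(f"t = {fit['t']:.4f}, reject H0: {reject_h0}")
```

The printed values can be compared against the Excel coefficients, R Square and t Stat shown in the example.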
The Multiple Regression Model
The relationship between 1 dependent and 2 or more independent variables is a linear function.
Population model: $Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_k X_{ki} + \varepsilon_i$
• $\beta_0$ = population Y-intercept
• $\beta_1, \ldots, \beta_k$ = population slopes
• $\varepsilon_i$ = random error
Sample model: $Y_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + \cdots + b_k X_{ki} + e_i$
• $Y_i$ = dependent (response) variable for sample
• $X_{1i}, \ldots, X_{ki}$ = independent (explanatory) variables for sample model
• $e_i$ = residual

Population Multiple Regression Model
Bivariate model: $Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \varepsilon_i$ (observed Y)
Response plane: $\mu_{Y|X} = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i}$
[Figure: response plane over the ($X_1$, $X_2$) axes with intercept $\beta_0$ and the error $\varepsilon_i$ of one observation at ($X_{1i}$, $X_{2i}$)]

Sample Multiple Regression Model
Bivariate model: $Y_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + e_i$ (observed Y)
Sample regression plane: $\hat{Y}_i = b_0 + b_1 X_{1i} + b_2 X_{2i}$
[Figure: sample regression plane over the ($X_1$, $X_2$) axes with intercept $b_0$ and the residual $e_i$ of one observation at ($X_{1i}$, $X_{2i}$)]

Simple and Multiple Regression Compared
• Coefficients in a simple regression pick up the impact of that variable plus the impacts of other variables that are correlated with it and the dependent variable.
• Coefficients in a multiple regression net out the impacts of other variables in the equation.

Simple and Multiple Regression Compared: Example
• Two simple regressions:
– $Oil = \beta_0 + \beta_1 Temp$
– $Oil = \beta_0 + \beta_1 Insulation$
• Multiple regression:
– $Oil = \beta_0 + \beta_1 Temp + \beta_2 Insulation$

Multiple Linear Regression Equation
Too complicated by hand! Ouch!

Interpretation of Estimated Coefficients
• Slope ($b_i$)
– The average value of Y is estimated to change by $b_i$ for each 1 unit increase in $X_i$, holding all other variables constant (ceteris paribus)
– Example: if $b_1 = -2$, then fuel oil usage (Y) is expected to decrease by an estimated 2 gallons for each 1 degree increase in temperature ($X_1$), given the inches of insulation ($X_2$)
• Y-intercept ($b_0$)
– The estimated average value of Y when all $X_i = 0$

Multiple Regression Model: Example
Develop a model for estimating heating oil used for a single family home in the month of January based on average temperature and amount of insulation in inches.
Oil (Gal)   Temp (°F)   Insulation
275.30      40          3
363.80      27          3
164.30      40          10
40.80       73          6
94.30       64          6
230.90      34          6
366.70      9           6
300.60      8           10
237.80      23          10
121.40      63          3
31.40       65          10
203.50      41          6
441.10      21          3
323.00      38          3
52.50       58          10

Sample Multiple Regression Equation: Example
$\hat{Y}_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + \cdots + b_k X_{ki}$
Excel Output
               Coefficients
Intercept      562.1510092
X Variable 1   -5.436580588
X Variable 2   -20.01232067
$\hat{Y}_i = 562.151 - 5.437 X_{1i} - 20.012 X_{2i}$
For each degree increase in temperature, the estimated average amount of heating oil used is decreased by 5.437 gallons, holding insulation constant.
For each increase of one inch of insulation, the estimated average use of heating oil is decreased by 20.012 gallons, holding temperature constant.

Confidence Interval Estimate for the Slope
Provide the 95% confidence interval for the population slope $\beta_1$ (the effect of temperature on oil consumption).
$b_1 \pm t_{n-p-1} S_{b_1}$
               Coefficients   Lower 95%       Upper 95%
Intercept      562.151009     516.1930837     608.108935
X Variable 1   -5.4365806     -6.169132673    -4.7040285
X Variable 2   -20.012321     -25.11620102    -14.90844
$-6.169 \le \beta_1 \le -4.704$
The estimated average consumption of oil is reduced by between 4.70 and 6.17 gallons for each increase of 1°F.
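
The heating-oil coefficients and the 95% confidence limits can be checked outside Excel. The following is a minimal sketch with NumPy; the variable names are illustrative, and the critical value 2.1788 (two-tailed, $\alpha$ = .05, 15 − 2 − 1 = 12 degrees of freedom) is taken from a standard t table because the slides do not state it.

```python
import numpy as np

# Heating-oil data from the example: gallons used, average temperature (deg F), insulation (inches).
oil = np.array([275.3, 363.8, 164.3, 40.8, 94.3, 230.9, 366.7, 300.6,
                237.8, 121.4, 31.4, 203.5, 441.1, 323.0, 52.5])
temp = np.array([40, 27, 40, 73, 64, 34, 9, 8, 23, 63, 65, 41, 21, 38, 58], dtype=float)
insulation = np.array([3, 3, 10, 6, 6, 6, 6, 10, 10, 3, 10, 6, 3, 3, 10], dtype=float)

# Design matrix: a column of ones for the intercept, then X1 (temperature) and X2 (insulation).
X = np.column_stack([np.ones_like(temp), temp, insulation])
n, n_coef = X.shape

# Least-squares estimates b0, b1, b2.
b, *_ = np.linalg.lstsq(X, oil, rcond=None)

# Standard errors of the coefficients from the residual mean square.
residuals = oil - X @ b
mse = residuals @ residuals / (n - n_coef)
se_b = np.sqrt(np.diag(mse * np.linalg.inv(X.T @ X)))

# 95% confidence limits; T_CRIT is an assumed table value for 12 degrees of freedom.
T_CRIT = 2.1788
ci_low = b - T_CRIT * se_b
ci_high = b + T_CRIT * se_b

for name, est, lo, hi in zip(["Intercept", "Temp", "Insulation"], b, ci_low, ci_high):
    print(f"{name:<10} {est:10.4f}  [{lo:10.4f}, {hi:10.4f}]")
```

The interval printed for temperature can be compared with the Lower 95% and Upper 95% columns of the Excel output.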
Coefficient of Multiple Determination
• Proportion of total variation in Y explained by all X variables taken together
– $r^2_{Y \cdot 12 \ldots k} = \dfrac{SSR}{SST} = \dfrac{\text{Explained Variation}}{\text{Total Variation}}$
• Never decreases when a new X variable is added to model
– Disadvantage when comparing models

Adjusted Coefficient of Multiple Determination
• Proportion of variation in Y explained by all X variables adjusted for the number of X variables used
– $r^2_{adj} = 1 - \left[ \left( 1 - r^2_{Y \cdot 12 \ldots k} \right) \dfrac{n - 1}{n - k - 1} \right]$
– Penalizes excessive use of independent variables
– Smaller than $r^2_{Y \cdot 12 \ldots k}$
– Useful in comparing among models

Coefficient of Multiple Determination
Excel Output
Regression Statistics
Multiple R          0.982654757
R Square            0.965610371
Adjusted R Square   0.959878766
Standard Error      26.01378323
Observations        15
• R Square is $r^2_{Y \cdot 12} = SSR / SST$
• Adjusted $r^2$ reflects the number of explanatory variables and the sample size, and is smaller than $r^2$

Interpretation of Coefficient of Multiple Determination
• $r^2_{Y \cdot 12} = \dfrac{SSR}{SST} = .9656$
– 96.56% of the total variation in heating oil can be explained by different temperature and amount of insulation
• $r^2_{adj} = .9599$
– 95.99% of the total fluctuation in heating oil can be explained by different temperature and amount of insulation after adjusting for the number of explanatory variables and sample size

Using The Model to Make Predictions
Predict the amount of heating oil used for a home if the average temperature is 30°F and the insulation is six inches.
$\hat{Y}_i = 562.151 - 5.437 X_{1i} - 20.012 X_{2i} = 562.151 - 5.437(30) - 20.012(6) = 278.969$
The predicted heating oil used is 278.97 gallons.

Testing for Overall Significance
• Shows if there is a linear relationship between all of the X variables together and Y
• Use F test statistic
• Hypotheses:
– $H_0: \beta_1 = \beta_2 = \ldots = \beta_k = 0$ (no linear relationship)
– $H_1$: at least one $\beta_i \neq 0$ (at least one independent variable affects Y)
• The null hypothesis is a very strong statement
• The null hypothesis is almost always rejected

Test for Significance: Individual Variables
• Shows if there is a linear relationship between the variable $X_i$ and Y
• Use t test statistic
• Hypotheses:
– $H_0: \beta_i = 0$ (no linear relationship)
– $H_1: \beta_i \neq 0$ (linear relationship between $X_i$ and Y)

Residual Plots
• Residuals vs. $\hat{Y}$
– May need to transform Y variable
• Residuals vs. $X_1$
– May need to transform $X_1$ variable
• Residuals vs. $X_2$
– May need to transform $X_2$ variable
• Residuals vs. time
– May have autocorrelation

Residual Plots: Example
[Figure: temperature residual plot, with maybe some nonlinear relationship; insulation residual plot, with no discernible pattern]

The Quadratic Regression Model
• Relationship between one response variable and two or more explanatory variables is a quadratic polynomial function
• Useful when scatter diagram indicates nonlinear relationship
• Quadratic model:
– $Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{1i}^2 + \varepsilon_i$
• The second explanatory variable is the square of the first variable

Quadratic Regression Model (continued)
Quadratic models may be considered when the scatter diagram takes on the following shapes:
[Figure: four curves of Y against $X_1$, two with $\beta_2 > 0$ and two with $\beta_2 < 0$]
$\beta_2$ = the coefficient of the quadratic term

Dummy Variable Models
• Categorical explanatory variable (dummy variable) with two or more levels:
– Yes or no, on or off, male or female
– Coded as 0 or 1
• Only intercepts are different
• Assumes equal slopes across categories
• The number of dummy variables needed is (number of levels − 1)
• Regression model has same form: $Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_k X_{ki} + \varepsilon_i$
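
The 0/1 coding and the "number of levels − 1" rule can be sketched in a few lines. The helper `dummy_code`, the choice of baseline level and the sample list of house styles are all hypothetical; they are only meant to show how a categorical variable becomes dummy columns for a regression.

```python
def dummy_code(values, baseline):
    """Return (levels, rows): one 0/1 column per level, except the baseline level.

    The baseline level is represented by all zeros, so a variable with
    L levels produces L - 1 dummy columns.
    """
    levels = sorted(set(values) - {baseline})
    rows = [[1 if value == level else 0 for level in levels] for value in values]
    return levels, rows


# Hypothetical sample: a categorical variable with three levels.
styles = ["Condo", "Ranch", "Split-level", "Ranch", "Condo"]
levels, rows = dummy_code(styles, baseline="Condo")

print(levels)                # the levels that received a dummy column
for style, row in zip(styles, rows):
    print(style, row)        # the baseline level shows up as all zeros
```

Each dummy column is then entered into the regression like any other X variable, and its coefficient shifts the intercept for that level relative to the baseline.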
Dummy-Variable Models (with 2 Levels)
Given: $\hat{Y}_i = b_0 + b_1 X_{1i} + b_2 X_{2i}$
• Y = Assessed Value of House
• $X_1$ = Square footage of House
• $X_2$ = Desirability of Neighborhood (0 if undesirable, 1 if desirable)
Desirable ($X_2 = 1$): $\hat{Y}_i = b_0 + b_1 X_{1i} + b_2 (1) = (b_0 + b_2) + b_1 X_{1i}$
Undesirable ($X_2 = 0$): $\hat{Y}_i = b_0 + b_1 X_{1i} + b_2 (0) = b_0 + b_1 X_{1i}$
Same slopes

Dummy-Variable Models (with 2 Levels) (continued)
[Figure: Y (assessed value) against $X_1$ (square footage); two parallel lines with the same slope $b_1$ and different intercepts, $b_0 + b_2$ and $b_0$]

Interpretation of the Dummy Variable Coefficient (with 2 Levels)
Example: $\hat{Y}_i = b_0 + b_1 X_{1i} + b_2 X_{2i} = 20 + 5 X_{1i} + 6 X_{2i}$
• Y: Annual salary of college graduate in thousand $
• $X_1$: GPA
• $X_2$: 0 if female, 1 if male
On average, male college graduates are making an estimated six thousand dollars more than female college graduates with the same GPA.

Dummy-Variable Models (with 3 Levels)
Given:
• Y = Assessed Value of the House (1000 $)
• $X_1$ = Square Footage of the House
• Style of the House = Split-level, Ranch, Condo (3 Levels; Need 2 Dummy Variables)
• $X_2$ = 1 if Split-level, 0 if not
• $X_3$ = 1 if Ranch, 0 if not
$\hat{Y}_i = b_0 + b_1 X_1 + b_2 X_2 + b_3 X_3$

Interpretation of the Dummy Variable Coefficients (with 3 Levels)
Given the Estimated Model: $\hat{Y}_i = 20.43 + 0.045 X_{1i} + 18.84 X_{2i} + 23.53 X_{3i}$
• For Split-level ($X_2 = 1$): $\hat{Y}_i = 20.43 + 0.045 X_{1i} + 18.84$
• For Ranch ($X_3 = 1$): $\hat{Y}_i = 20.43 + 0.045 X_{1i} + 23.53$
• For Condo: $\hat{Y}_i = 20.43 + 0.045 X_{1i}$
With the same footage, a Split-level will have an estimated average assessed value of 18.84 thousand dollars more than a Condo.
With the same footage, a Ranch will have an estimated average assessed value of 23.53 thousand dollars more than a Condo.

Dummy Variables
• Predict Weekly Sales in a Grocery Store
• Possible independent variables:
– Price
– Grocery Chain
• Data Set:
– Grocery.xls
• Interaction Effect?
Interaction Regression Model
• Hypothesizes interaction between pairs of X variables
– Response to one X variable varies at different levels of another X variable
• Contains two-way cross product terms
– $Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{1i} X_{2i} + \varepsilon_i$
• Can be combined with other models
– E.g., dummy variable model

Effect of Interaction
• Given:
– $Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{1i} X_{2i} + \varepsilon_i$
• Without interaction term, effect of $X_1$ on Y is measured by $\beta_1$
• With interaction term, effect of $X_1$ on Y is measured by $\beta_1 + \beta_3 X_2$
• Effect changes as $X_2$ increases

Interaction Example
$Y = 1 + 2X_1 + 3X_2 + 4X_1 X_2$
• With $X_2 = 1$: $Y = 1 + 2X_1 + 3(1) + 4X_1(1) = 4 + 6X_1$
• With $X_2 = 0$: $Y = 1 + 2X_1 + 3(0) + 4X_1(0) = 1 + 2X_1$
[Figure: the two lines plotted as Y against $X_1$ for $X_1$ between 0 and 1.5]
Effect (slope) of $X_1$ on Y does depend on $X_2$ value

Interaction Regression Model Worksheet
Case, i   $Y_i$   $X_{1i}$   $X_{2i}$   $X_{1i} X_{2i}$
1         1       1          3          3
2         4       8          5          40
3         1       3          2          6
4         3       5          6          30
:         :       :          :          :
Multiply $X_1$ by $X_2$ to get $X_1 X_2$. Run regression with Y, $X_1$, $X_2$, $X_1 X_2$.

Evaluating Presence of Interaction
• Hypothesize interaction between pairs of independent variables
• Contains 2-way product terms
– $Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{1i} X_{2i} + \varepsilon_i$

Using Transformations
• Requires data transformation
• Either or both independent and dependent variables may be transformed
• Can be based on theory, logic or scatter diagrams

Inherently Linear Models
• Non-linear models that can be expressed in linear form
– Can be estimated by least squares in linear form
• Require data transformation

Transformed Multiplicative Model (Log-Log)
Original: $Y_i = \beta_0 X_{1i}^{\beta_1} X_{2i}^{\beta_2} \varepsilon_i$
Transformed: $\ln Y_i = \ln \beta_0 + \beta_1 \ln X_{1i} + \beta_2 \ln X_{2i} + \ln \varepsilon_i$
[Figure: curves of Y against $X_1$ for different values of $\beta_1$; similarly for $X_2$]

Square Root Transformation
$Y_i = \beta_0 + \beta_1 \sqrt{X_{1i}} + \beta_2 \sqrt{X_{2i}} + \varepsilon_i$
[Figure: curves of Y against $X_1$ for $\beta_1 > 0$ and $\beta_1 < 0$; similarly for $X_2$]
• Transforms one of the above models to one that appears linear.
• Often used to overcome heteroscedasticity.
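
To illustrate how an inherently linear model is estimated, the sketch below simulates data from a multiplicative model, takes logarithms of every variable, and fits the transformed model by ordinary least squares. The parameter values, sample size and random seed are hypothetical and chosen only for illustration; no data set from the slides is used.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters of the multiplicative model Y = beta0 * X1^beta1 * X2^beta2 * error.
beta0, beta1, beta2 = 3.0, 0.5, -1.2
n = 200

x1 = rng.uniform(1.0, 10.0, size=n)
x2 = rng.uniform(1.0, 10.0, size=n)
error = np.exp(rng.normal(0.0, 0.05, size=n))   # positive multiplicative error
y = beta0 * x1 ** beta1 * x2 ** beta2 * error

# Log-log transformation: ln Y = ln beta0 + beta1 ln X1 + beta2 ln X2 + ln error,
# which is linear in the unknown coefficients.
design = np.column_stack([np.ones(n), np.log(x1), np.log(x2)])
coef, *_ = np.linalg.lstsq(design, np.log(y), rcond=None)

b0_hat = np.exp(coef[0])       # back-transform the intercept to the original scale
b1_hat, b2_hat = coef[1], coef[2]

print(f"beta0 ~ {b0_hat:.3f}, beta1 ~ {b1_hat:.3f}, beta2 ~ {b2_hat:.3f}")
```

The slope estimates are read directly from the transformed regression, while the intercept must be exponentiated to recover $\beta_0$.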
Linear-Logarithmic Transformation
$Y_i = \beta_0 + \beta_1 \ln(X_{1i}) + \beta_2 \ln(X_{2i}) + \varepsilon_i$
[Figure: curves of Y against $X_1$ for $\beta_1 > 0$ and $\beta_1 < 0$; similarly for $X_2$]
Transformed from an original multiplicative model

Exponential Transformation (Log-Linear)
Original Model: $Y_i = e^{\beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i}} \varepsilon_i$
Transformed Into: $\ln Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \ln \varepsilon_i$
[Figure: curves of Y against $X_1$ for $\beta_1 > 0$ and $\beta_1 < 0$]

Model Building / Model Selection
• Find “the best” set of explanatory variables among all the ones given.
• “Best subset” regression (only linear models)
– Requires a lot of computation ($2^N$ regressions)
• “Stepwise regression”
• “Common Sense” methodology
– Run regression with all variables
– Throw out variables not statistically significant
– “Adjust” model by including some excluded variables, one at a time
• Tradeoff: Parsimony vs. Fit

Association ≠ Causation!

Regression Limitations
• $R^2$ measures the association between independent and dependent variables; association ≠ causation
• Be careful about doing predictions that involve extrapolation
• Inclusion / Exclusion of independent variables is subject to a type I / type II error

Multi-collinearity
• What?
– When one independent variable is highly correlated (“collinear”) with one or more other independent variables
– Examples:
• square feet and square meters as independent variables to predict house price (1 sq ft is roughly 0.09 sq meters)
• “total rooms” and bedrooms plus bathrooms for a house
• How to detect?
– Run a regression with the “not-so-independent” independent variable (in the examples above: square feet and total rooms) as a function of all other remaining independent variables, e.g.:
• $X_1 = \beta_0 + \beta_2 X_2 + \ldots + \beta_k X_k$
– If $R^2$ of the above regression is > 0.8, then one suspects multicollinearity to be present

Multi-collinearity (continued)
• What effect?
– Coefficient estimates are unreliable
– Can still be used for predicting values for Y
– If possible, delete the “not-so-independent” independent variable
• When to check?
– When one suspects that two variables measure the same thing, or when the two variables are highly correlated
– When one suspects that one independent variable is a (linear) function of the other independent variables
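
The detection rule described above (regress each independent variable on all the others and look for $R^2$ > 0.8) can be sketched as follows. The helper name `auxiliary_r2`, the simulated house data and the random seed are assumptions made for illustration; only the 0.8 threshold and the square feet / square meters example come from the slides.

```python
import numpy as np


def auxiliary_r2(X, j):
    """R^2 from regressing column j of X on the remaining columns plus an intercept."""
    target = X[:, j]
    others = np.delete(X, j, axis=1)
    design = np.column_stack([np.ones(len(target)), others])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    residuals = target - design @ coef
    sst = np.sum((target - target.mean()) ** 2)
    return 1.0 - (residuals @ residuals) / sst


# Hypothetical data: square feet, (noisy) square meters, and an unrelated variable.
rng = np.random.default_rng(1)
sq_feet = rng.uniform(800, 4000, size=50)
sq_meters = sq_feet * 0.0929 + rng.normal(0.0, 1.0, size=50)   # nearly the same information
age = rng.uniform(0, 60, size=50)                              # generated independently

X = np.column_stack([sq_feet, sq_meters, age])
names = ["sq_feet", "sq_meters", "age"]

THRESHOLD = 0.8   # rule of thumb from the slide
r2_values = [auxiliary_r2(X, j) for j in range(X.shape[1])]
flagged = [j for j, r2 in enumerate(r2_values) if r2 > THRESHOLD]

for name, r2 in zip(names, r2_values):
    print(f"{name:<10} auxiliary R^2 = {r2:.4f}")
print("suspected collinear:", [names[j] for j in flagged])
```

In this simulated case the two size variables flag each other, which matches the advice to drop one of the “not-so-independent” variables when possible.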