Applied Business Forecasting and Planning
Simple Linear Regression

Simple Regression
Simple regression analysis is a statistical tool that estimates the mathematical relationship between a dependent variable (usually called y) and an independent variable (usually called x). The dependent variable is the variable for which we want to make a prediction. While various non-linear forms may be used, simple linear regression models are the most common.

Introduction
• The primary goal of quantitative analysis is to use current information about a phenomenon to predict its future behavior.
• Current information is usually in the form of a set of data.
• In the simplest case, when the data form a set of pairs of numbers, we may interpret them as the observed values of an independent (or predictor) variable X and a dependent (or response) variable Y.

  Lot size   Man-hours
  30         73
  20         50
  60         128
  80         170
  40         87
  50         108
  60         135
  30         69
  70         148
  60         132

Introduction
The goal of the analyst who studies the data is to find a functional relation y = f(x) between the response variable y and the predictor variable x.

[Scatter plot: statistical relation between lot size and man-hours, with lot size (0–90) on the horizontal axis and man-hours (0–180) on the vertical axis.]

Regression Function
The statement that the relation between X and Y is statistical should be interpreted as providing the following guidelines:
1. Regard Y as a random variable.
2. For each X, take f(x) to be the expected value (i.e., mean value) of Y.
3. Given that E(Y) denotes the expected value of Y, call the equation E(Y) = f(x) the regression function.

Pictorial Presentation of Linear Regression Model
[Figure: pictorial presentation of the linear regression model.]

Historical Origin of Regression
Regression analysis was first developed by Sir Francis Galton, who studied the relation between the heights of sons and fathers. Heights of sons of both tall and short fathers appeared to "revert" or "regress" to the mean of the group.

Construction of Regression Models
Selection of independent variables
• Since reality must be reduced to manageable proportions whenever we construct models, only a limited number of independent or predictor variables can or should be included in a regression model. A central problem is therefore choosing the most important predictor variables.
Functional form of regression relation
• Sometimes relevant theory may indicate the appropriate functional form. More frequently, however, the functional form is not known in advance and must be decided once the data have been collected and analyzed.
Scope of model
• In formulating a regression model, we usually need to restrict the coverage of the model to some interval or region of values of the independent variables.

Uses of Regression Analysis
Regression analysis serves three major purposes:
1. Description
2. Control
3. Prediction
The several purposes of regression analysis frequently overlap in practice.

Formal Statement of the Model
General regression model:

  Y = \beta_0 + \beta_1 X + \varepsilon

1. \beta_0 and \beta_1 are parameters.
2. X is a known constant.
3. The deviations \varepsilon are independent N(0, \sigma^2).

Meaning of Regression Coefficients
The values of the regression parameters β0 and β1 are not known; we estimate them from data. β1 indicates the change in the mean response per unit increase in X.
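To make the formal model concrete, here is a minimal Python sketch that simulates observations from Y = β0 + β1X + ε for a handful of x values. The parameter values (β0 = 50, β1 = 2, σ = 10) are purely illustrative assumptions, not taken from the lecture data; the point is that for each fixed x the mean response is the regression function E(Y) = β0 + β1x, while individual observations scatter around it.

```python
import random

# Illustrative parameter values (assumed for this sketch, not from the lecture):
beta0, beta1, sigma = 50.0, 2.0, 10.0   # intercept, slope, error std. deviation

random.seed(1)
x_values = [20, 30, 40, 50, 60, 70, 80]

# For each fixed x, Y = beta0 + beta1*x + eps with eps ~ N(0, sigma^2),
# so E(Y) = beta0 + beta1*x is the regression function (the mean response).
for x in x_values:
    eps = random.gauss(0.0, sigma)          # independent N(0, sigma^2) deviation
    y = beta0 + beta1 * x + eps
    print(f"x = {x:2d}   E(Y) = {beta0 + beta1 * x:6.1f}   observed y = {y:6.1f}")
```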
Regression Line
If the scatter plot of our sample data suggests a linear relationship between the two variables, i.e.

  y = \beta_0 + \beta_1 x

we can summarize the relationship by drawing a straight line on the plot. The least squares method gives us the "best" estimated line for our set of sample data.

Regression Line
We will write an estimated regression line based on sample data as

  \hat{y} = b_0 + b_1 x

The method of least squares chooses the values of b0 and b1 that minimize the sum of squared errors:

  SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2

Regression Line
Using calculus, we obtain the estimating formulas:

  b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
      = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - (\sum x_i)^2}

or, equivalently,

  b_1 = r \frac{s_y}{s_x},   b_0 = \bar{y} - b_1 \bar{x}

Estimation of Mean Response
The fitted regression line can be used to estimate the mean value of y for a given value of x.

Example
The weekly advertising expenditure (x) and weekly sales (y) are presented in the following table.

  y      x
  1250   41
  1380   54
  1425   63
  1425   54
  1450   48
  1300   46
  1400   62
  1510   61
  1575   64
  1650   71

Point Estimation of Mean Response
From the table above we have:

  \sum x = 564,  \sum x^2 = 32604,  \sum y = 14365,  \sum xy = 818755,  n = 10

The least squares estimates of the regression coefficients are:

  b_1 = \frac{n \sum xy - \sum x \sum y}{n \sum x^2 - (\sum x)^2}
      = \frac{10(818755) - (564)(14365)}{10(32604) - (564)^2} = 10.8

  b_0 = 1436.5 - 10.8(56.4) = 828

Point Estimation of Mean Response
The estimated regression function is:

  \hat{y} = 828 + 10.8x,   i.e.   Sales = 828 + 10.8 (Expenditure)

This means that if the weekly advertising expenditure is increased by $1, we would expect the weekly sales to increase by $10.80.

Point Estimation of Mean Response
Fitted values for the sample data are obtained by substituting the x value into the estimated regression function. For example, if the advertising expenditure is $50, then the estimated sales are:

  Sales = 828 + 10.8(50) = 1368

This is called the point estimate (forecast) of the mean response (sales).
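As a check on the hand calculation, a minimal Python sketch can compute the least-squares estimates b0 and b1 directly from the advertising data in the table above, using the computational formula for b1, and then evaluate the fitted line at x = 50.

```python
# Weekly advertising expenditure (x) and weekly sales (y) from the lecture table.
x = [41, 54, 63, 54, 48, 46, 62, 61, 64, 71]
y = [1250, 1380, 1425, 1425, 1450, 1300, 1400, 1510, 1575, 1650]
n = len(x)

sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi ** 2 for xi in x)

# Least-squares slope and intercept via the computational formulas.
b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b0 = sum_y / n - b1 * sum_x / n

print(f"b1 = {b1:.4f}   (lecture rounds this to 10.8)")
print(f"b0 = {b0:.2f}    (lecture rounds this to 828)")

# Point estimate of mean sales at x = 50; the lecture's rounded line gives 1368.
print(f"estimated sales at x = 50: {b0 + b1 * 50:.1f}")
```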
Example: Retail sales and floor space
It is customary in retail operations to assess the performance of stores partly in terms of their annual sales relative to their floor area (square feet). We might expect sales to increase linearly as stores get larger, with of course individual variation among stores of the same size. The regression model for a population of stores says that

  SALES = \beta_0 + \beta_1 \, AREA + \varepsilon

Example: Retail sales and floor space
The slope β1 is as usual a rate of change: it is the expected increase in annual sales associated with each additional square foot of floor space. The intercept β0 is needed to describe the line but has no statistical importance because no stores have area close to zero. Floor space does not completely determine sales. The term ε in the model accounts for differences among individual stores with the same floor space. A store's location, for example, is important.

Residual
The residual is the difference between the observed value yi and the corresponding fitted value ŷi:

  e_i = y_i - \hat{y}_i

Residuals are highly useful for studying whether a given regression model is appropriate for the data at hand.

Example: weekly advertising expenditure

  y      x    ŷ        Residual (e)
  1250   41   1270.8   -20.8
  1380   54   1411.2   -31.2
  1425   63   1508.4   -83.4
  1425   54   1411.2    13.8
  1450   48   1346.4   103.6
  1300   46   1324.8   -24.8
  1400   62   1497.6   -97.6
  1510   61   1486.8    23.2
  1575   64   1519.2    55.8
  1650   71   1594.8    55.2

Estimation of the variance of the error terms, σ²
The variance σ² of the error terms εi in the regression model needs to be estimated for a variety of purposes. It gives an indication of the variability of the probability distributions of y, and it is needed for making inferences concerning the regression function and the prediction of y.

Regression Standard Error
To estimate σ we work with the variance and take the square root to obtain the standard deviation. For simple linear regression the estimate of σ² is the average squared residual:

  s_{y.x}^2 = \frac{1}{n-2} \sum_{i=1}^{n} e_i^2 = \frac{1}{n-2} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2

To estimate σ, use s_{y.x} = \sqrt{s_{y.x}^2}. The statistic s_{y.x} estimates the standard deviation σ of the error term in the statistical model for simple linear regression.

Regression Standard Error
With ŷ = 828 + 10.8x:

  y      x    ŷ        Residual (e)   e²
  1250   41   1270.8   -20.8          432.64
  1380   54   1411.2   -31.2          973.44
  1425   63   1508.4   -83.4          6955.56
  1425   54   1411.2    13.8          190.44
  1450   48   1346.4   103.6          10732.96
  1300   46   1324.8   -24.8          615.04
  1400   62   1497.6   -97.6          9525.76
  1510   61   1486.8    23.2          538.24
  1575   64   1519.2    55.8          3113.64
  1650   71   1594.8    55.2          3047.04

  Total: \sum e_i^2 = 36124.76,   s_{y.x} = \sqrt{36124.76 / 8} = 67.198
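The residuals, SSE, and the regression standard error for the advertising example can be reproduced with a short Python sketch. It uses the rounded coefficients b0 = 828 and b1 = 10.8 from the lecture, so the results match the table above.

```python
import math

x = [41, 54, 63, 54, 48, 46, 62, 61, 64, 71]
y = [1250, 1380, 1425, 1425, 1450, 1300, 1400, 1510, 1575, 1650]
n = len(x)
b0, b1 = 828.0, 10.8                           # rounded estimates from the lecture

y_hat = [b0 + b1 * xi for xi in x]             # fitted values
residuals = [yi - yh for yi, yh in zip(y, y_hat)]

sse = sum(e ** 2 for e in residuals)           # sum of squared residuals
s_yx = math.sqrt(sse / (n - 2))                # regression standard error

print(f"SSE    = {sse:.2f}    (lecture: 36124.76)")
print(f"s_y.x  = {s_yx:.4f}  (lecture: about 67.2)")
```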
Basic Assumptions of a Regression Model
A regression model is based on the following assumptions:
1. There is a probability distribution of Y for each level of X.
2. Given that µy is the mean value of Y, the standard form of the model is

  y = f(x) + \varepsilon

where ε is a random variable with a normal distribution with mean 0 and standard deviation σ.

Conditions for Regression Inference
You can fit a least-squares line to any set of explanatory-response data when both variables are quantitative. If the scatter plot does not show an approximately linear pattern, however, the fitted line may be almost useless.

Conditions for Regression Inference
The simple linear regression model, which is the basis for inference, imposes several conditions. We should verify these conditions before proceeding with inference. The conditions concern the population, but we can observe only our sample.

Conditions for Regression Inference
In doing inference, we assume:
1. The sample is an SRS from the population.
2. There is a linear relationship in the population.
   • We cannot observe the population, so we check the scatter plot of the sample data.
3. The standard deviation of the responses about the population line is the same for all values of the explanatory variable.
   • The spread of observations above and below the least-squares line should be roughly uniform as x varies.

Conditions for Regression Inference
Plotting the residuals against the explanatory variable is helpful in checking these conditions because a residual plot magnifies patterns.

Analysis of Residuals
To examine whether the regression model is appropriate for the data being analyzed, we can check residual plots. Useful residual plots are:
• a histogram of the residuals;
• residuals plotted against the fitted values;
• residuals plotted against the independent variable;
• residuals plotted over time, if the data are chronological.

Analysis of Residuals
A histogram of the residuals provides a check on the normality assumption. A Normal quantile plot of the residuals can also be used to check Normality. Regression inference is robust against moderate lack of Normality; on the other hand, outliers and influential observations can invalidate the results of inference for regression. A plot of residuals against the fitted values or against the independent variable can be used to check the assumption of constant variance and the aptness of the model.

Analysis of Residuals
A plot of residuals against time provides a check on the independence of the error terms. The assumption of independence is the most critical one.

Residual plots
The residuals should have no systematic pattern. The residual plot below shows a scatter of points with no unusual individual observations and no systematic change as x increases.

[Residual plot: residuals (about -1 to 1) plotted against degree days (0 to 60), showing no systematic pattern.]

Residual plots
The points in this residual plot follow a curved pattern, so a straight line fits poorly.

Residual plots
The points in this plot show more spread for larger values of the explanatory variable x, so prediction will be less accurate when x is large.

Variable transformations
If the residual plot suggests that the variance is not constant, a transformation can be used to stabilize the variance. If the residual plot suggests a nonlinear relationship between x and y, a transformation may reduce it to one that is approximately linear.
Common linearizing transformations are: 1/x, \log(x), \sqrt{x}.
Common variance-stabilizing transformations are: 1/y, \log(y), \sqrt{y}, y^2.

Inference about the Regression Model
When a scatter plot shows a linear relationship between a quantitative explanatory variable x and a quantitative response variable y, we can use the least-squares line fitted to the data to predict y for a given value of x. Now we want to do tests and confidence intervals in this setting.

Inference about the Regression Model
We think of the least-squares line calculated from a sample as an estimate of the regression line for the population, just as the sample mean x̄ is an estimate of the population mean µ.

Inference about the Regression Model
We will write the population regression line as

  \mu_y = \beta_0 + \beta_1 x

The numbers β0 and β1 are parameters that describe the population. We will write the least-squares line fitted to sample data as

  \hat{y} = b_0 + b_1 x

This notation reminds us that the intercept b0 of the fitted line estimates the intercept β0 of the population line, and the slope b1 estimates the slope β1.

Confidence Intervals and Significance Tests
In previous lectures we presented confidence intervals and significance tests for means and differences in means. In each case, inference rested on the standard errors of the estimates and on t or z distributions. Inference for the slope and intercept in linear regression is similar in principle, although the recipes are more complicated. All confidence intervals, for example, have the form

  estimate ± t^* \, SE_{estimate}

where t* is a critical value of a t distribution.

Confidence Intervals and Significance Tests
Confidence intervals and tests for the slope and intercept are based on the sampling distributions of the estimates b1 and b0. Here are the facts:
• If the simple linear regression model is true, each of b0 and b1 has a Normal distribution.
• The mean of b0 is β0 and the mean of b1 is β1. That is, the intercept and slope of the fitted line are unbiased estimators of the intercept and slope of the population regression line.

Confidence Intervals and Significance Tests
The standard deviations of b0 and b1 are multiples of the model standard deviation σ. Their estimated standard errors are:

  SE_{b_1} = \frac{s}{\sqrt{\sum (x_i - \bar{x})^2}}

  SE_{b_0} = s \sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\sum (x_i - \bar{x})^2}}

Example: Weekly Advertising Expenditure
Let us return to the weekly advertising expenditure and weekly sales example. Management is interested in testing whether or not there is a linear association between advertising expenditure and weekly sales, using the regression model. Use α = .05.

Example: Weekly Advertising Expenditure
Hypotheses:

  H_0: \beta_1 = 0,   H_a: \beta_1 \neq 0

Decision rule: Reject H0 if t > t(.025; 8) = 2.306 or t < -t(.025; 8) = -2.306.

Example: Weekly Advertising Expenditure
Test statistic:

  S(b_1) = \frac{s_{y.x}}{\sqrt{\sum (x - \bar{x})^2}} = \frac{67.2}{\sqrt{794.4}} = 2.38

  t = \frac{b_1}{S(b_1)} = \frac{10.8}{2.38} = 4.5

Example: Weekly Advertising Expenditure
Conclusion: Since t = 4.5 > 2.306, we reject H0. There is a linear association between advertising expenditure and weekly sales.
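The test statistic above can be verified with a few lines of Python; the sketch assumes the rounded values s_{y.x} = 67.2 and b1 = 10.8 from the earlier slides.

```python
import math

x = [41, 54, 63, 54, 48, 46, 62, 61, 64, 71]
n = len(x)
s_yx = 67.2                  # regression standard error (rounded, from the lecture)
b1 = 10.8                    # estimated slope (rounded, from the lecture)

x_bar = sum(x) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)       # sum of squared deviations of x

se_b1 = s_yx / math.sqrt(sxx)                  # standard error of the slope
t = b1 / se_b1                                 # test statistic for H0: beta1 = 0

print(f"Sxx    = {sxx:.1f}      (lecture: 794.4)")
print(f"SE(b1) = {se_b1:.2f}      (lecture: 2.38)")
print(f"t      = {t:.2f}      reject H0 since |t| > t(.025; 8) = 2.306")
```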
Confidence interval for β1
Now that our test has shown a linear association between advertising expenditure and weekly sales, management wishes an estimate of β1 with a 95% confidence coefficient. The interval has the form

  b_1 \pm t(\alpha/2; \, n-2) \, S(b_1)

Confidence interval for β1
For a 95 percent confidence coefficient, we require t(.025; 8). From Table B in Appendix III, we find t(.025; 8) = 2.306. The 95% confidence interval is:

  10.8 \pm 2.306(2.38) = 10.8 \pm 5.49 = (5.31, \, 16.3)

Example: Do wages rise with experience?
Many factors affect the wages of workers: the industry they work in, their type of job, their education and their experience, and changes in general levels of wages. We will look at a sample of 59 married women who hold customer service jobs in Indiana banks. The following table gives their weekly wages at a specific point in time, along with their length of service with their employer, in months. The size of the place of work is recorded simply as "large" (100 or more workers) or "small." Because industry, job type, and the time of measurement are the same for all 59 subjects, we expect to see a clear relationship between wages and length of service.

[Table: weekly wages, length of service, and workplace size for the 59 workers.]

Example: Do wages rise with experience?
From the data we have:

  \sum x = 4159,  \sum x^2 = 451031,  \sum y = 23069,  \sum y^2 = 9460467,  \sum xy = 1719376,  n = 59

The least squares estimates of the regression coefficients are:

  b_1 = r \frac{s_y}{s_x},   b_0 = \bar{y} - b_1 \bar{x}

Example: Do wages rise with experience?
• What is the least-squares regression line for predicting wages from length of service (LOS)?
• Suppose a woman has been with her bank for 125 months. What do you predict she will earn?
• If her actual wages are $433, what is her residual?
• The sum of squared residuals for the entire sample is

  \sum_{i=1}^{59} (y_i - \hat{y}_i)^2 = 385453.641

Example: Do wages rise with experience?
Do wages rise with experience? The hypotheses are:

  H_0: \beta_1 = 0,   H_a: \beta_1 > 0

The test statistic is t = b1 / SE_b1. Compute the P-value and state the conclusion.

Example: Do wages rise with experience?
A 95% confidence interval for the average increase in wages per month of service, for the regression line in the population of all married female customer service workers in Indiana banks, is

  b_1 \pm t^* \, SE_{b_1}

The t distribution for this problem has n − 2 = 57 degrees of freedom.

Example: Do wages rise with experience?
Regression calculations in practice are always done by software. The computer output for this case study is given on the following slide.

Using the regression line
One of the most common reasons to fit a line to data is to predict the response to a particular value of the explanatory variable. In our example, the least-squares line for predicting the weekly earnings of female bank customer service workers from their length of service is

  \hat{y} = 349.4 + 0.5905x

Using the regression line
For a length of service of 125 months, our least-squares regression equation gives

  \hat{y} = 349.4 + (0.5905)(125) \approx \$423 \text{ per week}

There are two different uses of this prediction:
• We can estimate the mean earnings of all workers in the subpopulation of workers with 125 months on the job.
• We can predict the earnings of one individual worker with 125 months of service.

Using the regression line
For each use, the actual prediction is the same, ŷ ≈ $423, but the margin of error is different for the two cases. To estimate the mean response µy = β0 + β1x*, we use a confidence interval.
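Before turning to prediction intervals, here is a quick Python sketch of the 95% confidence interval for β1 in the advertising example, together with the 125-month prediction and residual asked for in the wages example. The wage coefficients (349.4 and 0.5905) are those of the fitted line reported in the lecture; all other figures are the rounded values from the slides above.

```python
# 95% confidence interval for the slope beta1 in the advertising example.
b1 = 10.8          # estimated slope
se_b1 = 2.38       # standard error of the slope
t_crit = 2.306     # t(.025; 8) from the t table

margin = t_crit * se_b1
print(f"95% CI for beta1: ({b1 - margin:.2f}, {b1 + margin:.2f})   "
      f"(lecture: about (5.31, 16.3))")

# Wages example: predicted weekly wage at 125 months of service and the residual
# for a worker whose actual wage is $433, using the lecture's fitted coefficients.
b0_w, b1_w = 349.4, 0.5905
y_hat_125 = b0_w + b1_w * 125
print(f"predicted wage at 125 months: ${y_hat_125:.0f}")
print(f"residual for actual wage $433: about ${433 - y_hat_125:.0f}")
```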
To estimate an individual response y, we use a prediction interval. A prediction interval estimates a single random response y rather than a parameter like µy.

Using the regression line
The main distinction is that it is harder to predict for an individual than for the mean of a population of individuals. Each interval has the usual form

  \hat{y} \pm t^* \, SE

but the margin of error for the prediction interval is wider than the margin of error for the confidence interval.

Using the regression line
The standard error for estimating the mean response when the explanatory variable x takes the value x* is:

  SE_{\hat{\mu}} = s \sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum (x_i - \bar{x})^2}}

Using the regression line
The standard error for predicting an individual response when the explanatory variable x takes the value x* is:

  SE_{\hat{y}} = s \sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum (x_i - \bar{x})^2}}

Prediction of a new response (ŷ)
We now consider the prediction of a new observation y corresponding to a given level x of the independent variable. In our advertising expenditure and weekly sales example, management wishes to predict the weekly sales corresponding to an advertising expenditure of x = $50.

Interval estimation of a new response (ŷ)
The following formula gives the point estimator (forecast) for y:

  \hat{y} = b_0 + b_1 x

A (1 − α)100% prediction interval for a new observation ŷ is:

  \hat{y} \pm t(\alpha/2; \, n-2) \, S_f,   where   S_f = s_{y.x} \sqrt{1 + \frac{1}{n} + \frac{(x - \bar{x})^2}{\sum (x - \bar{x})^2}}

Example
In our advertising expenditure and weekly sales example, management wishes to predict the weekly sales if the advertising expenditure is $50, with a 90% prediction interval.

  \hat{y} = 828 + 10.8(50) = 1368

  S_f = 67.2 \sqrt{1 + \frac{1}{10} + \frac{(50 - 56.4)^2}{794.4}} = 72.11

We require t(.05; 8) = 1.860.

Example
The 90% prediction interval is:

  \hat{y} \pm t(.05; 8) \, S_f = 1368 \pm 1.860(72.11) = (1233.9, \, 1502.1)
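The 90% prediction interval can be reproduced as follows; the sketch reuses the rounded quantities from the lecture (s_{y.x} = 67.2, b0 = 828, b1 = 10.8, t(.05; 8) = 1.860).

```python
import math

x = [41, 54, 63, 54, 48, 46, 62, 61, 64, 71]
n = len(x)
s_yx = 67.2                      # regression standard error
b0, b1 = 828.0, 10.8             # fitted coefficients (rounded)
x_star = 50                      # new advertising expenditure
t_crit = 1.860                   # t(.05; 8) for a 90% interval

x_bar = sum(x) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)

y_hat = b0 + b1 * x_star
# Standard error for predicting a single new response at x*.
s_f = s_yx * math.sqrt(1 + 1 / n + (x_star - x_bar) ** 2 / sxx)

lower, upper = y_hat - t_crit * s_f, y_hat + t_crit * s_f
print(f"point forecast: {y_hat:.0f},  S_f = {s_f:.2f}")
print(f"90% prediction interval: ({lower:.1f}, {upper:.1f})   "
      f"(lecture: about (1233.9, 1502.1))")
```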
Analysis of variance approach to regression analysis
Analysis of variance is the term for statistical analyses that break down the variation in data into separate pieces that correspond to different sources of variation. It is based on the partitioning of sums of squares and degrees of freedom associated with the response variable. In the regression setting, the observed variation in the responses (yi) comes from two sources.

Analysis of variance approach to regression analysis
Consider the weekly advertising expenditure and weekly sales example. There is variation in the amount ($) of weekly sales, as in all statistical data. The variation of the yi is conventionally measured in terms of the deviations

  y_i - \bar{y}

Analysis of variance approach to regression analysis
The measure of total variation, denoted by SST, is the sum of the squared deviations:

  SST = \sum (y_i - \bar{y})^2

If SST = 0, all observations are the same (no variability). The greater SST is, the greater is the variation among the y values. When we use the regression model, the measure of variation is the variability of the y observations around the fitted line:

  y_i - \hat{y}_i

Analysis of variance approach to regression analysis
The measure of variation in the data around the fitted regression line is the error sum of squares, denoted SSE:

  SSE = \sum (y_i - \hat{y}_i)^2

For our weekly expenditure example, SSE = 36124.76 and SST = 128552.5. What accounts for the substantial difference between these two sums of squares?

Analysis of variance approach to regression analysis
The difference is another sum of squares:

  SSR = \sum (\hat{y}_i - \bar{y})^2

SSR stands for the regression sum of squares. SSR measures the variation among the predicted responses ŷi. The predicted responses lie on the least-squares line; they show how y moves in response to x. The larger SSR is relative to SST, the greater is the role of the regression line in explaining the total variability of the y observations.

Analysis of variance approach to regression analysis
In our example:

  SSR = SST - SSE = 128552.5 - 36124.76 = 92427.74

This indicates that most of the variability in weekly sales can be explained by the relation between weekly advertising expenditure and weekly sales.

Formal development of the partitioning
We can decompose the total variability in the observations yi as follows:

  y_i - \bar{y} = (\hat{y}_i - \bar{y}) + (y_i - \hat{y}_i)

The total deviation yi − ȳ can be viewed as the sum of two components:
• the deviation of the fitted value ŷi around the mean ȳ, and
• the deviation of yi around the fitted regression line.

Formal development of the partitioning
Skipping quite a bit of messy algebra, we simply state that this analysis of variance equation always holds:

  \sum (y_i - \bar{y})^2 = \sum (\hat{y}_i - \bar{y})^2 + \sum (y_i - \hat{y}_i)^2

The corresponding breakdown of degrees of freedom is:

  n - 1 = 1 + (n - 2)

Mean squares
A sum of squares divided by its degrees of freedom is called a mean square (MS).
Regression mean square: MSR = SSR / 1
Error mean square: MSE = SSE / (n − 2)
Note: mean squares are not additive.

Mean squares
In our example:

  MSR = \frac{SSR}{1} = 92427.74,   MSE = \frac{SSE}{n-2} = \frac{36124.76}{8} = 4515.6

Analysis of Variance Table
The breakdowns of the total sum of squares and the associated degrees of freedom are displayed in a table called the analysis of variance (ANOVA) table:

  Source of Variation   SS    df    MS                F-Test
  Regression            SSR   1     MSR = SSR/1       MSR/MSE
  Error                 SSE   n-2   MSE = SSE/(n-2)
  Total                 SST   n-1

Analysis of Variance Table
In our weekly advertising expenditure and weekly sales example, the ANOVA table is:

  Source of variation   SS         df   MS
  Regression            92427.74   1    92427.74
  Error                 36124.76   8    4515.6
  Total                 128552.5   9

Analysis of Variance Table
The analysis of variance table reports, in a different way, quantities such as r² and s that are needed in regression analysis. It also reports, in a different way, the test for the overall significance of the regression. If regression on x has no value for predicting y, we expect the slope of the population regression line to be close to 0.

Analysis of Variance Table
That is, the null hypothesis of "no linear relationship" is:

  H_0: \beta_1 = 0

We can standardize the slope of the least-squares line to get a t statistic.

F-Test for β1 = 0 versus β1 ≠ 0
The analysis of variance approach instead starts with sums of squares. If regression on x has no value for predicting y, we expect SSR to be only a small part of SST, most of which will be made up of SSE. The proper way to standardize this comparison is to use the ratio

  F = \frac{MSR}{MSE}

F-Test for β1 = 0 versus β1 ≠ 0
In order to construct a statistical decision rule, we need to know the distribution of our test statistic F. When H0 is true, the test statistic F follows the F distribution with 1 and n − 2 degrees of freedom. Table C-5 on page 513 of the text gives the critical values of the F distribution at α = 0.05 and 0.01.

F-Test for β1 = 0 versus β1 ≠ 0
Construction of the decision rule, at the α = 5% level: Reject H0 if

  F \geq F(\alpha; \, 1, \, n-2)

Large values of F support Ha; values of F near 1 support H0.

F-Test for β1 = 0 versus β1 ≠ 0
Using our example again, let us repeat the earlier test on β1, this time with the F-test. The null and alternative hypotheses are:

  H_0: \beta_1 = 0,   H_a: \beta_1 \neq 0

Let α = .05. Since n = 10, we require F(.05; 1, 8). From Table C-5 we find that F(.05; 1, 8) = 5.32. Therefore the decision rule is: Reject H0 if F ≥ 5.32.

F-Test for β1 = 0 versus β1 ≠ 0
From the ANOVA table we have MSR = 92427.74 and MSE = 4515.6. Our test statistic is:

  F = \frac{92427.74}{4515.6} = 20.47

Decision: Since 20.47 > 5.32, we reject H0; that is, there is a linear association between weekly advertising expenditure and weekly sales.
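The ANOVA quantities can be computed directly from the data. The sketch below fits the least-squares line without rounding, so SSR and SSE agree with the software output shown on the next slide (about 92431.7 and 36120.8) rather than with the hand calculation based on the rounded line (92427.74 and 36124.76); the F statistic is about 20.47 either way.

```python
x = [41, 54, 63, 54, 48, 46, 62, 61, 64, 71]
y = [1250, 1380, 1425, 1425, 1450, 1300, 1400, 1510, 1575, 1650]
n = len(x)

x_bar, y_bar = sum(x) / n, sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = sxy / sxx                                   # exact least-squares slope
b0 = y_bar - b1 * x_bar                          # exact intercept
y_hat = [b0 + b1 * xi for xi in x]

sst = sum((yi - y_bar) ** 2 for yi in y)                  # total variation
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))     # unexplained variation
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)              # explained variation

msr, mse = ssr / 1, sse / (n - 2)
f_stat = msr / mse
print(f"SST = {sst:.1f}  SSR = {ssr:.2f}  SSE = {sse:.2f}")
print(f"MSR = {msr:.2f}  MSE = {mse:.1f}  F = {f_stat:.2f}  "
      f"(critical value F(.05; 1, 8) = 5.32)")
```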
F-Test for β1 = 0 versus β1 ≠ 0
Equivalence of the F test and the t test: for a given α level, the F test of β1 = 0 versus β1 ≠ 0 is algebraically equivalent to the two-sided t-test. Thus, at a given level, we can use either the t-test or the F-test for testing β1 = 0 versus β1 ≠ 0. The t-test is more flexible, since it can also be used for a one-sided test.

Analysis of Variance Table
The complete ANOVA table for our example is:

  Source of Variation   SS         df   MS         F-Test
  Regression            92427.74   1    92427.74   20.47
  Error                 36124.76   8    4515.6
  Total                 128552.5   9

Computer Output
The Excel output for our example is:

SUMMARY OUTPUT

  Regression Statistics
  Multiple R           0.847950033
  R Square             0.719019259
  Adjusted R Square    0.683896667
  Standard Error       67.19447214
  Observations         10

  ANOVA
               df   SS            MS         F         Significance F
  Regression   1    92431.72331   92431.72   20.4717   0.0019382
  Residual     8    36120.77669   4515.097
  Total        9    128552.5

                 Coefficients   Standard Error   t Stat     P-value    Lower 95%     Upper 95%
  Intercept      828.1268882    136.1285978      6.083416   0.000295   514.2135758   1142.0402
  AD-Expen (X)   10.7867573     2.384042146      4.524567   0.001938   5.289142698   16.2843719

Coefficient of Determination
Recall that SST measures the total variation in the yi when no account of the independent variable x is taken, while SSE measures the variation in the yi when a regression model with the independent variable x is used. A natural measure of the effect of x in reducing the variation in y can be defined as:

  R^2 = \frac{SST - SSE}{SST} = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}

Coefficient of Determination
R² is called the coefficient of determination. Since 0 ≤ SSE ≤ SST, it follows that 0 ≤ R² ≤ 1. We may interpret R² as the proportionate reduction of the total variability in y associated with the use of the independent variable x. The larger R² is, the more the total variation of y is reduced by including the variable x in the model.

Coefficient of Determination
If all the observations fall on the fitted regression line, SSE = 0 and R² = 1. If the slope of the fitted regression line is b1 = 0, so that ŷi = ȳ, then SSE = SST and R² = 0. The closer R² is to 1, the greater is said to be the degree of linear association between x and y. The square root of R² is called the coefficient of correlation:

  r = \sqrt{R^2}

Correlation Coefficient
Recall that the algebraic expression for the correlation coefficient is:

  r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \sum (y - \bar{y})^2}}
    = \frac{n \sum xy - \sum x \sum y}{\sqrt{n \sum x^2 - (\sum x)^2} \, \sqrt{n \sum y^2 - (\sum y)^2}}
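Finally, a short sketch that computes r and R² for the advertising data from the computational formula above; the results match the Multiple R and R Square values in the Excel output.

```python
import math

x = [41, 54, 63, 54, 48, 46, 62, 61, 64, 71]
y = [1250, 1380, 1425, 1425, 1450, 1300, 1400, 1510, 1575, 1650]
n = len(x)

sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi ** 2 for xi in x)
sum_y2 = sum(yi ** 2 for yi in y)

# Correlation coefficient from the computational formula.
num = n * sum_xy - sum_x * sum_y
den = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r = num / den

print(f"r   = {r:.4f}   (Excel output: Multiple R = 0.8480)")
print(f"R^2 = {r ** 2:.4f}   (Excel output: R Square = 0.7190)")
```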