Some Notes on Regression Analysis

Another approach to forecasting and data analysis is the "structural" approach, which differs from time series analysis. In time series, all analysis and inference is based on the observations, and we use past observations to forecast future ones. In structural models we assume that the quantity of interest, say Y, referred to as the response, is a function of a number of other variables X1, X2, X3, ... that are called predictors. In other words, Y = f(X1, X2, X3, ...). We observe values of X1, X2, X3, ... and based on them make a forecast about the value of Y. For example, consider the sales (Y) of a product. This amount will depend on the price of the product (X1), the amount of advertising done for this product (X2), the price of competing brands (X3), etc. After observing these quantities and inputting them into our model, we arrive at a forecast.

We start with the simple case of one predictor, Y = f(X), and the simplest functional relationship, a linear one, namely Y = a + bX. Using the observed data we try to find the "best" linear relationship between X and Y. This is equivalent to finding the "best" line that fits the data that have been observed. This model is called Simple Linear Regression: "simple" because there is only one predictor, and "linear" because the relationship is taken to be linear.

For illustration, consider the following case dealing with the number of books sold in a bookstore and the amount of shelf space dedicated to books versus stationery, computers, and other items:

Observation   Number of Books Sold (Y)   Meters of Shelf Space (X)
1             275                        6.8
2             142                        3.3
3             168                        4.1
4             197                        4.2
5             215                        4.8
6             188                        3.9

[Scatter diagram of the above data: shelf space X on the horizontal axis (0 to 8 meters), books sold Y on the vertical axis (0 to 300).]

The question is: what is meant by the "best" line? We use the sum of squared errors as the criterion for choosing the best line.
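The sum-of-squared-errors criterion can be sketched numerically for the bookstore data. The candidate values a = 35 and b = 36 below are arbitrary illustration choices, not the fitted coefficients; the point is only that different lines yield different SSE values and the "best" line is the one with the smallest.

```python
# Bookstore data from the table above.
X = [6.8, 3.3, 4.1, 4.2, 4.8, 3.9]
Y = [275, 142, 168, 197, 215, 188]

def sse(a, b):
    """Sum of squared errors sum((Yi - a - b*Xi)^2) for the line Y = a + b*X."""
    return sum((y - a - b * x) ** 2 for x, y in zip(X, Y))

# A line with a worse-fitting slope has a larger SSE:
print(sse(35, 36), sse(35, 40))
```

Comparing SSE across candidate lines in this way motivates the minimization carried out next.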
Specifically, suppose Y = a + bX for the above case. The squared error for each observation (supposing "a" and "b" are known) is:

Observation   Squared Error
1             (275 - a - 6.8b)² = e1²
2             (142 - a - 3.3b)² = e2²
3             (168 - a - 4.1b)² = e3²
4             (197 - a - 4.2b)² = e4²
5             (215 - a - 4.8b)² = e5²
6             (188 - a - 3.9b)² = e6²

Let Sum of Squared Error = SSE = ∑(Yi - a - bXi)² = ∑ei². The "optimal" choice of "a" and "b" is the one that minimizes SSE. To find these values, SSE is differentiated with respect to "a" and "b" and the derivatives are set equal to zero; the solution gives the best value of "a", called the intercept, and of "b", called the slope. Specifically, we solve the following equations:

∂SSE/∂a = -2∑(Yi - a - bXi) = 0
∂SSE/∂b = -2∑Xi(Yi - a - bXi) = 0

The solution to the above equations is as follows. Let X̅ = ∑Xi/n and Y̅ = ∑Yi/n. Then

b = ∑(Xi - X̅)(Yi - Y̅) / ∑(Xi - X̅)²
a = Y̅ - bX̅

The basic idea of regression is that Yi = a + bXi + ei. The last term is called the error; it is the randomness that prevents all the observations from falling perfectly on a line. We make the following assumptions about the errors ei:

o E[ei] = 0, i.e., the errors are not biased upward or downward and on average they are zero. In other words, the observations do not have a tendency to lie above or below the line and can fall on either side of it.

o VAR[ei] = σ². This means that the variability of the error does not depend on the value of the predictor X. For example, it implies that the magnitude of the error does not change with the value of X.

o Cov(ei, ej) = 0 for i ≠ j. This implies that there is no correlation between errors; e.g., if we underestimate Y at one point, that has no bearing on whether we underestimate or overestimate Y at another point.
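The closed-form solution above can be applied directly to the bookstore data. The following is a minimal sketch using only the formulas just derived (no external libraries):

```python
# Least-squares fit of Y = a + b*X for the bookstore data.
X = [6.8, 3.3, 4.1, 4.2, 4.8, 3.9]
Y = [275, 142, 168, 197, 215, 188]

n = len(X)
x_bar = sum(X) / n   # X-bar
y_bar = sum(Y) / n   # Y-bar

# b = sum((Xi - X-bar)(Yi - Y-bar)) / sum((Xi - X-bar)^2)
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y)) / \
    sum((x - x_bar) ** 2 for x in X)
a = y_bar - b * x_bar   # intercept

print(f"a = {a:.2f}, b = {b:.2f}")   # roughly a = 34.70, b = 36.04
```

The fitted slope says that each additional meter of shelf space is associated with roughly 36 more books sold.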
o Later on we assume that the ei's are independent and identically distributed as a normal distribution with mean 0 and variance σ² (conditions 1 and 2 above); in the case of normal distributions, condition 3 implies that the errors are independent of each other.

Explaining Capability of Regression

o Variability is a source of uncertainty in forecasting. The more variable and unpredictable the numbers are, the more difficult it is to predict them. We define the sum of squares as the amount of variability of a set of numbers (dividing the sum of squares by the number of observations gives the sample variance). In the above case we define the Total Sum of Squares as ∑(Yi - Y̅)², which is the variability of the quantity of interest, i.e., the response variable that we would like to forecast.

o Variability that can be explained and predicted is not a source of uncertainty. For example, consider the forecasted values of a regression, namely ŷi = a + bxi. Though the ŷi's are variable, this variability can be explained completely, since they are points on the regression line: knowing xi enables us to forecast ŷi, the point on the regression line, with certainty and without error. It should also be pointed out that the following identity always holds for regression: ∑Yi = ∑ŷi. Based on the above argument, the sum of squares ∑(ŷi - Y̅)² is entirely explainable, owing to the fact that the points lie on the regression line. This sum is called the Sum of Squares Due to Regression.

o It can be shown that

∑(Yi - Y̅)² = ∑(ŷi - Y̅)² + ∑(Yi - ŷi)²

(Note that the last summation is the Sum of Squared Error.) The above expression indicates that

Total Sum of Squares = Sum of Squares Due to Regression + Sum of Squares about Regression

Recall that the first term on the right-hand side, the Sum of Squares Due to Regression, is explainable; only the Sum of Squared Errors, i.e., ∑(Yi - ŷi)², cannot be explained.
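The sum-of-squares decomposition can be verified numerically on the bookstore data. The sketch below fits the line with the closed-form formulas and then checks both the identity SST = SSR + SSE and the identity ∑Yi = ∑ŷi (each holds exactly for the least-squares fit, up to floating-point rounding):

```python
# Verify SST = SSR + SSE for the least-squares line on the bookstore data.
X = [6.8, 3.3, 4.1, 4.2, 4.8, 3.9]
Y = [275, 142, 168, 197, 215, 188]
n = len(X)
x_bar, y_bar = sum(X) / n, sum(Y) / n
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y)) / \
    sum((x - x_bar) ** 2 for x in X)
a = y_bar - b * x_bar
Y_hat = [a + b * x for x in X]   # fitted values on the regression line

sst = sum((y - y_bar) ** 2 for y in Y)                # Total Sum of Squares
ssr = sum((yh - y_bar) ** 2 for yh in Y_hat)          # Sum of Squares Due to Regression
sse = sum((y - yh) ** 2 for y, yh in zip(Y, Y_hat))   # Sum of Squared Error

print(abs(sst - (ssr + sse)) < 1e-6, abs(sum(Y) - sum(Y_hat)) < 1e-6)
```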
Now the portion of the Total Variability that is explainable by the regression is:

∑(ŷi - Y̅)² / ∑(Yi - Y̅)² = Sum of Squares Due to Regression / Total Sum of Squares = R²

We use this quantity to measure the effectiveness of the regression equation, i.e., the percentage of the variability of the response Y that can be explained by the regression. It should be pointed out that R² = [Correlation(X, Y)]².