ECONOMETRICS
Lecturer: Dr. Veronika Alhanaqtah

Topic 1. Linear regression

1. Simple linear regression
2. The linear correlation coefficient
3. Modeling linear relationships with randomness present
4. The least squares regression line
5. Statistical inferences about \(\beta_2\)
6. The coefficient of determination
7. Estimation and prediction
8. A complete example (Homework)

1. Simple linear regression

Learning objective: to learn what it means for two variables to exhibit a relationship that is close to linear but which contains an element of randomness.

Simple linear regression

\[ y = m \cdot x + b \]

- y is the dependent variable
- x is the independent variable
- b is the y-intercept
- m is the slope of the regression line

Example 1: deterministic relationship

\[ y = \frac{9}{5}x + 32 \]

x   -40   -15     0    20    50
y   -40     5    32    68   122

The relationship between x and y is called a simple linear relationship because the plotted points all lie on a single straight line.
- "simple": y depends on only one other variable.
- "multiple": y depends on two or more variables.
9/5 is the slope of the line and measures its steepness. 32 is the y-intercept of the line.

Example 2: presence of randomness

Relationship: the height x of a man aged 25 and his weight y. If we randomly select several 25-year-old men and measure the height and weight of each one, we might obtain a collection of (x, y) pairs something like this:

x   151  163  146  180  157  170  164  175  171  178  160  188
y    68   72   69   72   70   73   70   73   71   74   72   75

[Graph: scatter plot of the (x, y) pairs]

2. Linear correlation coefficient

Learning objective: to learn what the linear correlation coefficient is, how to compute it, and what it tells us about the relationship between two variables x and y.

Properties of the correlation coefficient:
(1) The value of r lies between −1 and 1, inclusive.
(2) The sign of r indicates the direction of the linear relationship between x and y:
- if r < 0 then y tends to decrease as x is increased;
- if r > 0 then y tends to increase as x is increased.
(3) The size of |r| indicates the strength of the linear relationship between x and y:
- if |r| is near 1 (that is, if r is near either 1 or −1) then the linear relationship between x and y is strong;
- if |r| is near 0 (that is, if r is near 0 and of either sign) then the linear relationship between x and y is weak.

Correlation coefficient: calculation

\[ r = \frac{SS_{xy}}{\sqrt{SS_{xx} \cdot SS_{yy}}} \]

where

\[ SS_{xx} = \sum x^2 - \frac{1}{n}\left(\sum x\right)^2 \]
\[ SS_{xy} = \sum xy - \frac{1}{n}\left(\sum x\right)\left(\sum y\right) \]
\[ SS_{yy} = \sum y^2 - \frac{1}{n}\left(\sum y\right)^2 \]

Example 3: calculation of correlation coefficient

Compute the linear correlation coefficient for the height and weight pairs plotted in the graph in Example 2.

   x     y      x²      y²       xy
 151    68   22801    4624    10268
 163    72   26569    5184    11736
 146    69   21316    4761    10074
 180    72   32400    5184    12960
 157    70   24649    4900    10990
 170    73   28900    5329    12410
 164    70   26896    4900    11480
 175    73   30625    5329    12775
 171    71   29241    5041    12141
 178    74   31684    5476    13172
 160    72   25600    5184    11520
 188    75   35344    5625    14100
Σ 2003  859  336025   61537   143626

With n = 12:

\[ SS_{xx} = \sum x^2 - \frac{1}{n}\left(\sum x\right)^2 = 336025 - \frac{1}{12}(2003)^2 \approx 1690.917 \]
\[ SS_{xy} = \sum xy - \frac{1}{n}\left(\sum x\right)\left(\sum y\right) = 143626 - \frac{1}{12}(2003)(859) \approx 244.583 \]
\[ SS_{yy} = \sum y^2 - \frac{1}{n}\left(\sum y\right)^2 = 61537 - \frac{1}{12}(859)^2 \approx 46.917 \]
\[ r = \frac{SS_{xy}}{\sqrt{SS_{xx} \cdot SS_{yy}}} = \frac{244.583}{\sqrt{(1690.917)(46.917)}} = 0.868 \]

The number r = 0.868 quantifies what is visually apparent from the graph: weight tends to increase linearly with height (r is positive) and, although the relationship is not perfect, it is reasonably strong (r is near 1).

3. Modeling linear relationships with randomness present

Learning objective: to learn the framework in which the statistical analysis of the linear relationship between two variables x and y will be done.
Modeling linear relationships with randomness present

Situation: the value of x can be used to draw conclusions about the value of y, such as predicting the resale value y of a residential house based on its size x. The relationship between x and y is not deterministic. The set of assumptions in simple linear regression is a mathematical description of the relationship between x and y.

Modeling linear relationships with randomness present: Assumptions

(1) The relationship between x and the mean of the y-values in the subpopulation determined by x is linear. This means that there exist numbers \(\beta_1\) and \(\beta_2\) such that:

\[ E(y) = \beta_1 + \beta_2 \cdot x_i \]

This line gives the mean of the variable y over the subpopulation determined by x (the population regression line).

(2) For each value of x the y-values scatter about the mean E(y) according to a normal distribution centered at E(y) and with a standard deviation \(\sigma\) that is the same for every value of x. This is the same as saying that there exists a normally distributed random variable \(\varepsilon\) with mean 0 and an unknown standard deviation \(\sigma\) such that the relationship between x and y in the whole population is:

\[ y_i = \beta_1 + \beta_2 \cdot x_i + \varepsilon \]

where \(\beta_1\) and \(\beta_2\) are fixed (but unknown) parameters and \(\varepsilon\) is a normally distributed random variable (non-observable).

(3) Random deviations associated with different observations are independent.

The simple linear model concept

[Figure: the simple linear model concept]

The symbol \(N(\mu, \sigma^2)\) denotes a normal distribution with mean \(\mu\) and variance \(\sigma^2\), hence standard deviation \(\sigma\).

\[ y_i = \beta_1 + \beta_2 \cdot x_i + \varepsilon \]

- Deterministic part, \(\beta_1 + \beta_2 \cdot x_i\): describes the trend in y as x increases. There is nothing random in this part.
- Random part, \(\varepsilon\) (epsilon): a random variable, called the error term or the noise. This part explains why the actual observed values of y do not lie exactly on a line but fluctuate near it.
Information about this term is important, since only when one knows how much noise there is in the data can one know how trustworthy the detected trend is.
- The slope parameter \(\beta_2\) represents the expected change in y brought about by a unit increase in x. The standard deviation \(\sigma\) represents the magnitude of the noise in the data.

4. Least squares regression line

Learning objective: to learn how to measure how well a straight line fits a collection of data; how to construct the least squares regression line; how to use the least squares regression line to estimate the response variable y in terms of the predictor variable x.

The next step in the analysis is to find the straight line that best fits the data.

Example 4.

x   2   2   6   8   10
y   0   1   2   3    3

- How well does the straight line \(\hat{y} = \frac{1}{2}x - 1\) fit the data set?
- The line \(\hat{y} = \frac{1}{2}x - 1\) was selected as one that seems to fit the data reasonably well.

error at data point (x, y) = (true y) − (predicted y) = \(y - \hat{y}\)

Example 4. Computation of the error

  x    y    ŷ = x/2 − 1    y − ŷ    (y − ŷ)²
  2    0         0            0         0
  2    1         0            1         1
  6    2         2            0         0
  8    3         3            0         0
 10    3         4           −1         1
 Σ                            0         2

Least squares regression line

- \(\sum (y - \hat{y})^2\) is the sum of squared errors; it measures the goodness of fit.
- The least squares regression line \(\hat{y} = \hat{\beta}_2 x + \hat{\beta}_1\) best fits the data in the sense of minimizing the sum of the squared errors.
- \(\hat{\beta}_2\) is the slope and \(\hat{\beta}_1\) is the y-intercept:

\[ \hat{\beta}_2 = \frac{SS_{xy}}{SS_{xx}}, \qquad \hat{\beta}_1 = \bar{y} - \hat{\beta}_2 \bar{x} \]

where

\[ SS_{xx} = \sum x^2 - \frac{1}{n}\left(\sum x\right)^2, \qquad SS_{xy} = \sum xy - \frac{1}{n}\left(\sum x\right)\left(\sum y\right) \]

\(\bar{x}\) is the mean of all the x-values, \(\bar{y}\) is the mean of all the y-values, and n is the number of pairs in the data set.

Example 5.

Find the least squares regression line for the five-point data set.

  x    y     x²    xy
  2    0      4     0
  2    1      4     2
  6    2     36    12
  8    3     64    24
 10    3    100    30
Σ 28   9    208    68
\[ SS_{xx} = \sum x^2 - \frac{1}{n}\left(\sum x\right)^2 = 208 - \frac{1}{5}(28)^2 = 51.2 \]
\[ SS_{xy} = \sum xy - \frac{1}{n}\left(\sum x\right)\left(\sum y\right) = 68 - \frac{1}{5}(28)(9) = 17.6 \]
\[ \bar{x} = \frac{\sum x}{n} = \frac{28}{5} = 5.6, \qquad \bar{y} = \frac{\sum y}{n} = \frac{9}{5} = 1.8 \]
\[ \hat{\beta}_2 = \frac{SS_{xy}}{SS_{xx}} = \frac{17.6}{51.2} = 0.34375 \]
\[ \hat{\beta}_1 = \bar{y} - \hat{\beta}_2 \bar{x} = 1.8 - (0.34375)(5.6) = -0.125 \]
\[ \hat{y} = \hat{\beta}_2 x + \hat{\beta}_1 = 0.34375x - 0.125 \]

Example 5. Table: the errors in fitting the data with the least squares regression line

  x    y    ŷ = 0.34375x − 0.125     y − ŷ       (y − ŷ)²
  2    0          0.5625            −0.5625     0.31640625
  2    1          0.5625             0.4375     0.19140625
  6    2          1.9375             0.0625     0.00390625
  8    3          2.6250             0.3750     0.14062500
 10    3          3.3125            −0.3125     0.09765625
 Σ                                              0.75

Sum of squared errors

- To measure the goodness of fit of a line to a set of data, we compute the predicted y-value \(\hat{y}\) at every point in the data set, compute each error, square it, and then add up all the squares.
- In the case of the least squares regression line, the line that best fits the data, the sum of the squared errors can be computed directly from the data using the formula:

\[ SSE = SS_{yy} - \hat{\beta}_2 SS_{xy} \]

Example 6.

x (age, years)               2     3     3     3     4     4     5     5     5     6
y (value, $ thousands)    28.7  24.8  26.0  30.5  23.8  24.6  23.8  20.4  21.6  22.1

The table above shows the age in years and the retail value in thousands of dollars of a random sample of ten automobiles of the same make and model.
(a) Construct the scatter diagram.
(b) Compute the linear correlation coefficient r. Interpret its value in the context of the problem.
(c) Compute the least squares regression line. Plot it on the scatter diagram.
(d) Interpret the meaning of the slope of the least squares regression line in the context of the problem.
(e) Suppose a four-year-old automobile of this make and model is selected at random. Use the regression equation to predict its retail value.

Example 6 (a). Scatter diagram for age and value of used automobiles

[Figure: scatter diagram of value y against age x]

Example 6 (b). Correlation coefficient

Compute the linear correlation coefficient r.
Interpret its value in the context of the problem.

\[ r = \frac{SS_{xy}}{\sqrt{SS_{xx} \cdot SS_{yy}}} \]

From the data in Example 6, with n = 10:

\[ \sum x = 40, \quad \sum y = 246.3, \quad \sum x^2 = 174, \quad \sum y^2 = 6154.15, \quad \sum xy = 956.5 \]

\[ SS_{xx} = \sum x^2 - \frac{1}{n}\left(\sum x\right)^2 = 174 - \frac{1}{10}(40)^2 = 14 \]
\[ SS_{xy} = \sum xy - \frac{1}{n}\left(\sum x\right)\left(\sum y\right) = 956.5 - \frac{1}{10}(40)(246.3) = -28.7 \]
\[ SS_{yy} = \sum y^2 - \frac{1}{n}\left(\sum y\right)^2 = 6154.15 - \frac{1}{10}(246.3)^2 = 87.781 \]
\[ r = \frac{SS_{xy}}{\sqrt{SS_{xx} \cdot SS_{yy}}} = \frac{-28.7}{\sqrt{(14)(87.781)}} = -0.819 \]

Since r is negative and |r| is fairly close to 1, the value of these automobiles tends to decrease as age increases, and the linear relationship is reasonably strong.

Example 6 (c). Least squares regression line

Compute the least squares regression line. Plot it on the scatter diagram.

\[ \bar{x} = \frac{\sum x}{n} = \frac{40}{10} = 4, \qquad \bar{y} = \frac{\sum y}{n} = \frac{246.3}{10} = 24.63 \]
\[ \hat{\beta}_2 = \frac{SS_{xy}}{SS_{xx}} = \frac{-28.7}{14} = -2.05 \]
\[ \hat{\beta}_1 = \bar{y} - \hat{\beta}_2 \bar{x} = 24.63 - (-2.05)(4) = 32.83 \]
\[ \hat{y} = \hat{\beta}_2 x + \hat{\beta}_1 = -2.05x + 32.83 \]

Example 6 (d). Interpretation of the slope

Interpret the meaning of the slope of the least squares regression line in the context of the problem.

The slope −2.05 means that for each unit increase in x (each additional year of age) the average value of the vehicle decreases by about 2.05 units (about $2,050).

Example 6 (e). Prediction

Suppose a four-year-old automobile of this make and model is selected at random. Use the regression equation to predict its retail value.

Since we know nothing about the automobile other than its age, we assume that it is of about average value and use the average value of all four-year-old vehicles as our estimate:

\[ \hat{y} = -2.05(4) + 32.83 = 24.63 \]

which corresponds to $24,630.
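
The hand calculations in Example 6 (b), (c) and (e) can be checked with a short script. The following is a minimal sketch in Python using only the standard library; the function names `sums_of_squares` and `fit_line` are illustrative choices, not part of the lecture, and the formulas are the shortcut formulas for SS_xx, SS_xy and SS_yy given in the slides.

```python
import math

# Age (years) and retail value (thousands of dollars) from Example 6.
ages = [2, 3, 3, 3, 4, 4, 5, 5, 5, 6]
values = [28.7, 24.8, 26.0, 30.5, 23.8, 24.6, 23.8, 20.4, 21.6, 22.1]


def sums_of_squares(x, y):
    """Return (SS_xx, SS_xy, SS_yy) computed with the shortcut formulas."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    ss_xx = sum(a * a for a in x) - sum_x ** 2 / n
    ss_xy = sum(a * b for a, b in zip(x, y)) - sum_x * sum_y / n
    ss_yy = sum(b * b for b in y) - sum_y ** 2 / n
    return ss_xx, ss_xy, ss_yy


def fit_line(x, y):
    """Return (slope, intercept, r) for the least squares regression line."""
    ss_xx, ss_xy, ss_yy = sums_of_squares(x, y)
    slope = ss_xy / ss_xx                                   # beta_2 hat
    intercept = sum(y) / len(y) - slope * sum(x) / len(x)   # beta_1 hat
    r = ss_xy / math.sqrt(ss_xx * ss_yy)                    # correlation
    return slope, intercept, r


slope, intercept, r = fit_line(ages, values)
predicted_at_4 = slope * 4 + intercept  # Example 6 (e): a four-year-old car

print(f"r = {r:.3f}")
print(f"y-hat = {slope:.2f} x + {intercept:.2f}")
print(f"predicted value at age 4: {predicted_at_4:.2f} (thousands of dollars)")
```

Up to rounding, the script should agree with the values obtained by hand: r ≈ −0.819, slope −2.05, intercept 32.83 and a predicted value of 24.63. Passing the five-point data set from Example 5 to `fit_line` gives a quick check of that example as well.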

5. Statistical inferences about \(\beta_2\)

Learning objective: to learn how to construct a confidence interval for \(\beta_2\) (the slope of the population regression line) and how to test hypotheses regarding \(\beta_2\).

- The parameter \(\beta_2\) (the slope of the population regression line) gives the true rate of change in the mean E(y) in response to a unit increase in the predictor variable x. For every unit increase in x the mean of the response variable y changes by \(\beta_2\) units, increasing if \(\beta_2 > 0\) and decreasing if \(\beta_2 < 0\).
- The slope \(\hat{\beta}_2\) of the least squares regression line is a point estimate of \(\beta_2\).
- A 100(1−α)% confidence interval for \(\beta_2\) is given by the following formula:

\[ \hat{\beta}_2 \pm t_{\alpha/2} \cdot \frac{s_\varepsilon}{\sqrt{SS_{xx}}} \]

where \( s_\varepsilon = \sqrt{\dfrac{SSE}{n-2}} \) and the number of degrees of freedom is df = n − 2.

- The statistic \(s_\varepsilon\) is called the sample standard deviation of errors. It estimates the standard deviation \(\sigma\) of the errors in the population of y-values for each fixed value of x.

Example 7.

x   2   2   6   8   10
y   0   1   2   3    3

Construct the 95% confidence interval for the slope \(\beta_2\) of the population regression line based on the five-point sample data set.

\[ s_\varepsilon = \sqrt{\frac{SSE}{n-2}} = \sqrt{\frac{0.75}{3}} = 0.5 \]

- Confidence level 95% means α = 1 − 0.95 = 0.05, so α/2 = 0.025. From the row labeled df = 3 in Appendix "Critical Values of t" we obtain \(t_{0.025} = 3.182\).

\[ \hat{\beta}_2 \pm t_{\alpha/2} \cdot \frac{s_\varepsilon}{\sqrt{SS_{xx}}} = 0.34375 \pm 3.182 \cdot \frac{0.5}{\sqrt{51.2}} = 0.34375 \pm 0.2223 \]

- We are 95% confident that the slope \(\beta_2\) of the population regression line is between 0.1215 and 0.5661.

Example 8.

- Using the sample data from Example 6, construct a 90% confidence interval for the slope \(\beta_2\) of the population regression line relating age and value of the automobiles. Interpret the result in the context of the problem.
- Confidence level 90% means α = 1 − 0.90 = 0.10, so α/2 = 0.05. From the row labeled df = 8 in Appendix "Critical Values of t" we obtain \(t_{0.05} = 1.860\).
- Using \(SSE = SS_{yy} - \hat{\beta}_2 SS_{xy} = 87.781 - (-2.05)(-28.7) = 28.946\):

\[ s_\varepsilon = \sqrt{\frac{SSE}{n-2}} = \sqrt{\frac{28.946}{8}} = 1.902169814 \]
\[ \hat{\beta}_2 \pm t_{\alpha/2} \cdot \frac{s_\varepsilon}{\sqrt{SS_{xx}}} = -2.05 \pm 1.860 \cdot \frac{1.902169814}{\sqrt{14}} = -2.05 \pm 0.95 \]

- We are 90% confident that the slope \(\beta_2\) of the population regression line is between −3.00 and −1.10.
- In the context of the problem this means that for vehicles of this make and model between two and six years old we are 90% confident that for each additional year of age the average value of such a vehicle decreases by between $1,100 and $3,000.

Testing hypotheses about \(\beta_2\)

- \(H_0: \beta_2 = 0\) corresponds to the situation in which x is not useful for predicting y.
- Test: \(H_0: \beta_2 = 0\) vs. \(H_a: \beta_2 \neq 0\)
- Standardized test statistic for hypothesis tests concerning the slope \(\beta_2\) of the population regression line:

\[ t_{obs} = \frac{\hat{\beta}_2 - B_0}{s_\varepsilon / \sqrt{SS_{xx}}} \]

where \(B_0\) is a number determined from the statement of the problem.

- The test statistic has Student's t-distribution with df = n − 2 degrees of freedom.

Example 9. Critical value approach

Test, at the 2% level of significance, whether the variable x is useful for predicting y based on the information in the five-point data set.

x   2   2   6   8   10
y   0   1   2   3    3

- \(H_0: \beta_2 = 0\) vs. \(H_a: \beta_2 \neq 0\), with α = 0.02.

\[ t_{obs} = \frac{\hat{\beta}_2 - B_0}{s_\varepsilon / \sqrt{SS_{xx}}} = \frac{0.34375 - 0}{0.5 / \sqrt{51.2}} = 4.919 \]

- The t-distribution has n − 2 = 5 − 2 = 3 degrees of freedom; \(t_{cr} = t_{\alpha/2} = t_{0.01} = 4.541\).
- Compare t-observed and t-critical.
- The test statistic falls in the rejection region (\(t_{obs} > t_{cr}\)). The decision is to reject \(H_0\).

6. The coefficient of determination

Learning objective: to learn what the coefficient of determination is, how to compute it, and what it tells us about the relationship between two variables x and y.

- The coefficient of determination measures the proportion of the variability in y that is accounted for by the linear relationship between x and y.

\[ R^2 = \frac{SS_{yy} - SSE}{SS_{yy}} = \frac{SS_{xy}^2}{SS_{xx}\, SS_{yy}} = \hat{\beta}_2 \frac{SS_{xy}}{SS_{yy}} \]
\[ R^2 = r^2 \]

Example 10.
The value of used vehicles of the make and model discussed in Example 6 varies widely. The most expensive automobile in the sample has value $30,500, which is nearly half again as much as the least expensive one, which is worth $20,400. Find the proportion of the variability in value that is accounted for by the linear relationship between age and value.

- The proportion of the variability in value y that is accounted for by the linear relationship between it and age x is given by the coefficient of determination, \(R^2\).

\[ r = -0.819, \qquad R^2 = (-0.819)^2 = 0.671 \]

- About 67% of the variability in the value of this vehicle can be explained by its age.

Example 11.

Use each of the three formulas for the coefficient of determination to compute its value for the example of ages and values of vehicles (Example 6).

In Example 6 we computed the exact values \(SS_{xx} = 14\), \(SS_{xy} = -28.7\), \(SS_{yy} = 87.781\), \(\hat{\beta}_2 = -2.05\). We also know \(SSE = 28.946\).

\[ R^2 = \frac{SS_{yy} - SSE}{SS_{yy}} = \frac{87.781 - 28.946}{87.781} = 0.6702475479 \]
\[ R^2 = \frac{SS_{xy}^2}{SS_{xx}\, SS_{yy}} = \frac{(-28.7)^2}{(14)(87.781)} = 0.6702475479 \]
\[ R^2 = \hat{\beta}_2 \frac{SS_{xy}}{SS_{yy}} = -2.05 \cdot \frac{-28.7}{87.781} = 0.6702475479 \]

- The coefficient of determination \(R^2\) can always be computed by squaring the correlation coefficient r if it is known.
- What should be avoided is trying to compute r by taking the square root of \(R^2\), if it is already known, since it is easy to make a sign error this way.

7. Estimation and prediction

Learning objective: to learn the distinction between estimation and prediction; the distinction between a confidence interval and a prediction interval; to learn how to implement formulas for computing confidence intervals and prediction intervals.

Example 12.

Consider the following pairs of problems in the context of the least squares regression line for the automobile age and value example (Example 6).
- 1a. Estimate the average value of all four-year-old automobiles of this make and model.
- 1b. Construct a 95% confidence interval for the average value of all four-year-old automobiles of this make and model.
- 2a. Ahmad intends to buy a four-year-old automobile of this make and model next week. Predict the value of the first such automobile that he encounters.
- 2b. Construct a 95% confidence interval for the value of the first such automobile that he encounters.

Problems 1a and 2a have the same answer, the value of the regression line at x = 4:

\[ \hat{y} = -2.05(4) + 32.83 = 24.63 \]

Comments on 1b and 2b: in the first case we seek to construct a confidence interval in the same sense that we have done before. In the second case the interval constructed has a different name, prediction interval. In this case we are trying to "predict" the value that a random variable will take.

Confidence interval and prediction interval

100(1−α)% confidence interval for the mean value of y at \(x = x_p\):

\[ \hat{y}_p \pm t_{\alpha/2}\, s_\varepsilon \sqrt{\frac{1}{n} + \frac{(x_p - \bar{x})^2}{SS_{xx}}} \]

100(1−α)% prediction interval for an individual new value of y at \(x = x_p\):
\[ \hat{y}_p \pm t_{\alpha/2}\, s_\varepsilon \sqrt{1 + \frac{1}{n} + \frac{(x_p - \bar{x})^2}{SS_{xx}}} \]

where
a. \(x_p\) is a particular value of x that lies in the range of x-values in the sample data set used to construct the least squares regression line;
b. \(\hat{y}_p\) is the numerical value obtained when the least squares regression equation is evaluated at \(x = x_p\);
c. the number of degrees of freedom for \(t_{\alpha/2}\) is df = n − 2.

Example 13.

Using the sample data on age and value of used automobiles of a specific make and model (Example 6), construct a 95% confidence interval for the average value of all three-and-one-half-year-old automobiles of this make and model.

\[ \hat{y}_p \pm t_{\alpha/2}\, s_\varepsilon \sqrt{\frac{1}{n} + \frac{(x_p - \bar{x})^2}{SS_{xx}}} \]

We are 95% confident that the average value of all three-and-one-half-year-old vehicles of this make and model is between $24,149 and $27,161.

Example 14.

Using the sample data on age and value of used automobiles of a specific make and model (Example 6), construct a 95% prediction interval for the value of a randomly selected three-and-one-half-year-old automobile of this make and model.

\[ \hat{y}_p \pm t_{\alpha/2}\, s_\varepsilon \sqrt{1 + \frac{1}{n} + \frac{(x_p - \bar{x})^2}{SS_{xx}}} \]

We are 95% confident that the value of a randomly selected three-and-one-half-year-old vehicle of this make and model is between $21,017 and $30,293.

Note what an enormous difference the presence of the extra number 1 under the square root sign made. The prediction interval ($21,017 to $30,293) is roughly three times wider than the confidence interval ($24,149 to $27,161) at the same level of confidence.

Homework

- Visit the instructor's web page on Econometrics: www.alveronika.wordpress.com
- Optional but useful: practice with the applet (correlation and regression): http://mih5.github.io/statapps/
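
Examples 13 and 14 state only the formulas and the final intervals, so the intermediate arithmetic can be worth checking. The sketch below, in Python with the standard library only, evaluates both interval formulas at x_p = 3.5 for the Example 6 data. Two things here are assumptions rather than part of the lecture: the function name `intervals_at` is illustrative, and the critical value t_{0.025} = 2.306 for df = 8 is taken from a standard t table because the slides do not quote it.

```python
import math

# Age (years) and retail value (thousands of dollars) from Example 6.
ages = [2, 3, 3, 3, 4, 4, 5, 5, 5, 6]
values = [28.7, 24.8, 26.0, 30.5, 23.8, 24.6, 23.8, 20.4, 21.6, 22.1]

# Assumption: t_{0.025} for df = n - 2 = 8, read from a standard t table
# (the lecture text does not state this number).
T_CRIT_95_DF8 = 2.306


def intervals_at(x, y, x_p, t_crit):
    """Return (y_p, ci_half_width, pi_half_width) at x = x_p."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    x_bar, y_bar = sum_x / n, sum_y / n

    # Shortcut formulas for the sums of squares, as in the slides.
    ss_xx = sum(a * a for a in x) - sum_x ** 2 / n
    ss_xy = sum(a * b for a, b in zip(x, y)) - sum_x * sum_y / n
    ss_yy = sum(b * b for b in y) - sum_y ** 2 / n

    slope = ss_xy / ss_xx                  # beta_2 hat
    intercept = y_bar - slope * x_bar      # beta_1 hat
    sse = ss_yy - slope * ss_xy            # sum of squared errors
    s_eps = math.sqrt(sse / (n - 2))       # sample std. deviation of errors

    y_p = slope * x_p + intercept
    core = 1 / n + (x_p - x_bar) ** 2 / ss_xx
    ci_half = t_crit * s_eps * math.sqrt(core)       # mean of y at x_p
    pi_half = t_crit * s_eps * math.sqrt(1 + core)   # individual new y at x_p
    return y_p, ci_half, pi_half


y_p, ci_half, pi_half = intervals_at(ages, values, 3.5, T_CRIT_95_DF8)

print(f"y_p = {y_p:.3f}")
print(f"95% confidence interval: ({y_p - ci_half:.3f}, {y_p + ci_half:.3f})")
print(f"95% prediction interval: ({y_p - pi_half:.3f}, {y_p + pi_half:.3f})")
print(f"width ratio (prediction / confidence): {pi_half / ci_half:.2f}")
```

In thousands of dollars, the two intervals should match those quoted in Examples 13 and 14 up to rounding. The only difference between the two half-widths is the extra 1 under the square root, which is what makes the prediction interval so much wider.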