TO: New England Actuarial Seminars
FROM: VEE Regression Candidate
DATE: August 10, 2012
SUBJECT: Analysis of Factors that Affect Consumption of Petrol in the United States

The consumption of petrol is related to several factors, for example petrol tax, average income, paved highways, and the proportion of the population with a driver's license. To study and quantify the impact of these factors on petrol consumption, we carried out the following research and summarized our study and conclusions in this paper.

This paper uses multiple linear regression and several hypothesis tests (F-test, ANOVA test, etc.) to estimate the consumption of petrol in 48 states of the United States. The paper has three parts: 1, a description of the data and the methods used in the analysis; 2, the results of our analysis; 3, the conclusion of our analysis, the disadvantages of our methods, and possible alternative methods.

Data

The data source is the website: http://orion.math.iastate.edu/burkardt/data/regression/x16.txt

This data set was measured in 48 states of the United States for one year, and it was originally taken from pages 32-33 of the book Applied Linear Regression (S Weisberg, 1980 edition). There are 48 rows and 5 columns in the data, standing for 48 states and 5 variables, respectively:

Consumption of petrol (Y): the response variable. It was measured over one year, in millions of gallons.
Petrol tax (X1): measured in cents per gallon.
Average income (X2): the annual average personal income across each state, in dollars.
Paved highways (X3): measured in miles.
Percentage of population with driver's license (X4): the percentage of the population with a driver's license within each state.

I did not audit the above data independently. However, I believe the data source is reliable, since it has already been published and reviewed by professionals.
Methodology

First, I checked that the observations are independent and identically distributed, with constant mean and variance.

Second, I used the Normal Probability Plot to check the distribution of the response variable. If the distribution is skewed or heavy-tailed, we consider a transformation (power, log, etc.) to bring it closer to a normal distribution.

Third, the following multiple linear regression formula was fitted to the data set:

Transform(Y) = α + β1 X1 + β2 X2 + β3 X3 + β4 X4

The P-value of the coefficient of each independent variable was evaluated in Excel to decide the significance of each variable. Variables with clearly low significance were candidates for removal from the formula, and the removal was tested with a (partial) F-test. In the end, we settled on a formula with significant independent variables only. R-square was used to judge how well this formula explains the response variable.

Results

1). Decide transform for response variable Y

First, we applied the Normal Probability Plot to the response variable (Y). This method computes the empirical percentiles of the response variable in the data set and compares them with the theoretical percentiles of a normal distribution. The more linear the plot, the better the data fit a normal distribution.

As the Normal Probability Plot shows, the response variable has no obvious skew, but it does show a heavy-tailed trend. We decided to test power and log transformations: y^0.5, y^2, and log(y).

Comparing the three transforms, y^0.5 shows the best linear trend. So we decided to fit the data to the formula:

Y^0.5 = α + β1 X1 + β2 X2 + β3 X3 + β4 X4

2). Linear regression and full F-test

Then, I regressed Y^0.5 on all of the input variables, which produced the following output:

Regression Statistics
Multiple R          0.8137628
R Square            0.66221
Adjusted R Square   0.6307876
Standard Error      1.3767748
Observations        48

ANOVA
             df   SS          MS          F           Significance F
Regression    4   159.78762   39.946906   21.074503   1.119E-09
Residual     43   81.506878   1.8955088
Total        47   241.2945

               Coefficients   Standard Error   t Stat       P-value     Lower 95%    Upper 95%
Intercept      14.511483      3.1081965        4.6687791    2.971E-05   8.2432073    20.779759
X Variable 1   -0.1566893     0.1524274        -1.0279604   0.3097164   -0.4640885   0.1507098
X Variable 2   -0.0017041     0.0003669        -4.6442596   3.216E-05   -0.0024441   -0.0009641
X Variable 3   4.948E-05      5.866E-05        0.8435082    0.4036124   -6.882E-05   0.0001678
X Variable 4   30.722033      3.7329725        8.2299114    2.245E-10   23.193776    38.250289

In the ANOVA table, the F-value is 21.0745 and the significance F is 1.119E-09, a very low p-value. At the 0.01 significance level we therefore reject the null hypothesis that all slope coefficients are zero, so the regression as a whole is significant. In the Regression Statistics table, an R-square of 0.6622 means that the regression sum of squares accounts for 66.22% of the total sum of squares, which is also good. Overall, we conclude that this model fits the data set and can be used to predict petrol usage in the states.

The coefficients table shows the significance of each independent variable. X variables 2 and 4 both have very low P-values, which is good. X variable 1 has a P-value of 30.97%, and X variable 3 has a P-value of 40.36%. These high P-values indicate that, at an alpha level of 0.05, variables 1 and 3 are not significant. To refine the model, we drop X variables 1 and 3 and check the reduced model with a partial F-test in part 3.
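The summary statistics in the Excel output above all follow from the regression and residual sums of squares, so they can be re-derived as a consistency check. The sketch below is illustrative only: the function name anova_summary and its argument names are hypothetical, and the only inputs are the two sums of squares, the number of observations (48), and the number of predictors (4) reported above.

```python
def anova_summary(ss_regression, ss_residual, n_obs, n_predictors):
    """Re-derive regression summary statistics from two sums of squares."""
    df_regression = n_predictors
    df_residual = n_obs - n_predictors - 1
    ss_total = ss_regression + ss_residual
    ms_regression = ss_regression / df_regression
    ms_residual = ss_residual / df_residual
    r_square = ss_regression / ss_total
    return {
        "r_square": r_square,
        "adjusted_r_square": 1 - (1 - r_square) * (n_obs - 1) / df_residual,
        "standard_error": ms_residual ** 0.5,
        "f_statistic": ms_regression / ms_residual,
    }


# Sums of squares copied from the full-model ANOVA table in this report.
full_model = anova_summary(159.78762, 81.506878, n_obs=48, n_predictors=4)
for name, value in full_model.items():
    print(f"{name}: {value:.4f}")
```

Running the same function on the sums of squares of any reduced model gives a quick way to confirm that a transcribed ANOVA table is internally consistent.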
3) Partial F-tests and null hypothesis test

First, I dropped variables X1 and X3 and ran the partial F-test on the following hypotheses:

H0: β1 = β3 = 0
H1: at least one of β1 and β3 ≠ 0

The regression output for the reduced model is listed below:

Regression Statistics
Multiple R          0.8033703
R Square            0.6454039
Adjusted R Square   0.6296441
Standard Error      1.3789053
Observations        48

ANOVA
             df   SS          MS          F           Significance F
Regression    2   155.73241   77.866203   40.952471   7.396E-11
Residual     45   85.562094   1.9013799
Total        47   241.2945

               Coefficients   Standard Error   t Stat       P-value     Lower 95%    Upper 95%
Intercept      12.897462      2.3878753        5.4012295    2.393E-06   8.0880348    17.70689
X Variable 2   -0.0015868     0.000355         -4.4691966   5.259E-05   -0.0023019   -0.0008717
X Variable 4   31.112492      3.6715522        8.4739342    7.114E-11   23.717606    38.507377

After dropping X1 and X3, the F-value in the ANOVA table is 40.95 and the R-square in the Regression Statistics is 64.54%, both of which are still good. In the coefficients table, both remaining independent variables have very low P-values, which is also good.

To decide whether to reject the null hypothesis, we calculated the partial F statistic for dropping X1 and X3. The numerator is divided by the number of coefficients dropped (4 - 2 = 2), and the denominator is the residual mean square of the full model, which has 43 residual degrees of freedom:

\[
F = \frac{(SSE_{reduced} - SSE_{full}) / (k_{full} - k_{reduced})}{SSE_{full} / df_{residual,\,full}} = \frac{(85.56 - 81.51) / (4 - 2)}{81.51 / 43} \approx 1.07
\]

At α = 0.05, the F statistic of about 1.07 is below the critical value of F with (2, 43) degrees of freedom, which is approximately 3.21. We therefore do not reject the null hypothesis, and X1 and X3 can be dropped.

Overall, the formula should be: Y^0.5 = 12.8975 - 0.0015868X2 + 31.1125X4.

Conclusion

Our study indicates that the square root of the consumption of petrol in a state (Y) is linearly related to the average income (X2) and the percentage of the population with a driver's license within the state (X4). The power transformation (Y^0.5) made the normal probability plot of Y more linear and removed the heavy tail. This transformation also decreased the p-value of the intercept. So we chose this transformation over the other transformations.
In the Regression Statistics, R-square is 64.54%, indicating that 64.54% of the variation in the square root of petrol consumption can be explained by variations in average income and the percentage of the population with a driver's license. The coefficients table gives p-values for the intercept and the two input variables of 2.393E-06, 5.259E-05, and 7.114E-11, respectively. All p-values are far below the 0.05 level, indicating that both input variables are statistically significant.

Therefore, my final model for petrol consumption is:

Y^0.5 = 12.8975 - 0.0015868X2 + 31.1125X4

Where
Y = petrol consumed over one year within the state, in millions of gallons.
X2 = annual average personal income within the state, in dollars.
X4 = the percentage of the population with a driver's license within the state.

The remaining 35.46% of the variation may be explained by other factors, such as car ownership and population level. To better estimate the consumption of petrol, we suggest collecting data on those variables.
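The partial F-test in part 3) and the final fitted formula both come down to short calculations, sketched below under stated assumptions. The function names are hypothetical. The partial F statistic uses the standard form, in which the denominator is the full model's residual sum of squares divided by its residual degrees of freedom (43). The critical value uses a closed form that holds only when the numerator has 2 degrees of freedom, as it does here. The prediction function assumes that X4 is entered as a fraction between 0 and 1, which the size of its coefficient suggests but the report does not state, and the income and license values passed to it are made-up inputs, not rows of the data set.

```python
def partial_f(sse_reduced, sse_full, n_dropped, df_residual_full):
    """Partial F statistic for dropping n_dropped predictors from the full model."""
    numerator = (sse_reduced - sse_full) / n_dropped
    denominator = sse_full / df_residual_full
    return numerator / denominator


def f_critical_2_numerator_df(alpha, df_denominator):
    """Upper-alpha critical value of F(2, df_denominator).

    Closed form valid ONLY for 2 numerator degrees of freedom, where
    P(F > x) = (1 + 2x / df_denominator) ** (-df_denominator / 2).
    """
    return (df_denominator / 2) * (alpha ** (-2 / df_denominator) - 1)


def predict_consumption(average_income, licence_fraction):
    """Apply the final model, then square to undo the Y^0.5 transform.

    Assumption: licence_fraction is on a 0-1 scale.
    """
    root_y = 12.897462 - 0.0015868 * average_income + 31.112492 * licence_fraction
    return root_y ** 2


# Residual sums of squares copied from the two ANOVA tables in this report.
f_stat = partial_f(85.562094, 81.506878, n_dropped=2, df_residual_full=43)
f_crit = f_critical_2_numerator_df(0.05, 43)
print(f"partial F = {f_stat:.4f}, critical value = {f_crit:.4f}")
print("reject H0" if f_stat > f_crit else "do not reject H0")

# Hypothetical state: average income of 4000 dollars, 55% licensed drivers.
example = predict_consumption(4000, 0.55)
print(f"predicted consumption = {example:.1f} million gallons")
```

Squaring the fitted value is the simplest way to return to the original scale. It ignores any retransformation bias, so the result is a rough point estimate and not an unbiased mean.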