TO: New England Actuarial Seminars
FROM: VEE Regression Candidate
DATE: August 10, 2012
SUBJECT: Analysis of Factors that Affect Consumption of Petrol in United States
Petrol consumption is related to several factors, such as the petrol tax, average income, miles of paved
highways, and the proportion of the population with a driver's license. To study and quantify the impact
of these factors on petrol consumption, we carried out the following analysis and summarize our study and
conclusions in this paper.
This paper uses multiple linear regression and several hypothesis tests (F-test, ANOVA, etc.) to model
petrol consumption in 48 states of the United States. The paper has three parts: (1) a description of the
data and the methods used in the analysis; (2) the results of the analysis; and (3) the conclusions, the
limitations of our methods, and possible alternative methods.
Data
The data source is the website: http://orion.math.iastate.edu/burkardt/data/regression/x16.txt
The data set covers 48 states of the United States for one year and was originally taken from
pages 32-33 of the book Applied Linear Regression (S. Weisberg, 1980 edition). The data contain 48 rows
and 5 columns, representing 48 states and 5 variables, respectively:
- Consumption of petrol (Y): the response variable, measured over one year in millions of gallons.
- Petrol tax (X1): measured in cents per gallon.
- Average income (X2): annual average personal income in each state, in dollars.
- Paved highways (X3): measured in miles.
- Percentage of population with driver's license (X4): the percentage of the population holding a
  driver's license within each state.
I did not independently audit the data above. However, I believe the data source is reliable
because it has already been published and reviewed by professionals.
Methodology

First, I checked the independence of the observations and verified that they are identically
distributed with a constant mean and variance.

Second, I used a normal probability plot to check the distribution of the response variable.
If the distribution is skewed or heavy-tailed, we consider a transformation (power, log, etc.)
to bring it closer to a normal distribution.

Third, the following multiple linear regression formula was fitted to the data set:
Transform(Y) = α + β1 X1 + β2 X2 + β3 X3 + β4 X4
The P-value of the coefficient of each independent variable was evaluated in Excel to decide
the significance of each variable. Variables with clearly low significance were candidates to be
dropped from the formula, and the decision to drop them was tested with a (partial) F-test.

In the end, we settled on a formula containing only the significant independent variables.
R-square was used to assess how well this formula explains the response variable.
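The fitting step described above can be sketched in a few lines. This is only an illustration, not the Excel procedure used in this paper: the function names `fit_ols` and `r_square` are hypothetical, the data are toy values rather than the petrol data, and the fit solves the normal equations directly, which is adequate for a handful of well-scaled predictors.

```python
import math

def fit_ols(x_rows, y):
    """Least-squares coefficients [intercept, b1, b2, ...] via the normal equations."""
    design = [[1.0] + list(row) for row in x_rows]   # prepend the intercept column
    p = len(design[0])
    # Augmented matrix [X'X | X'y]
    aug = [
        [sum(r[i] * r[j] for r in design) for j in range(p)]
        + [sum(r[i] * yi for r, yi in zip(design, y))]
        for i in range(p)
    ]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(p):
        pivot = max(range(col, p), key=lambda k: abs(aug[k][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for k in range(p):
            if k != col:
                factor = aug[k][col] / aug[col][col]
                aug[k] = [a - factor * b for a, b in zip(aug[k], aug[col])]
    return [aug[i][p] / aug[i][i] for i in range(p)]

def r_square(x_rows, y, coefficients):
    """Share of the total sum of squares explained by the fitted values."""
    fitted = [coefficients[0] + sum(b * x for b, x in zip(coefficients[1:], row))
              for row in x_rows]
    mean_y = sum(y) / len(y)
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    sst = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - sse / sst

# Toy data (NOT the petrol data): sqrt(y) is exactly 1 + 2*x1 - 0.5*x2.
toy_x = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5)]
toy_y = [4.0, 20.25, 25.0, 56.25, 72.25]
sqrt_y = [math.sqrt(v) for v in toy_y]          # transform the response first
coefficients = fit_ols(toy_x, sqrt_y)
fit_quality = r_square(toy_x, sqrt_y, coefficients)
print(coefficients, fit_quality)
```

The response is transformed before fitting, so both the coefficients and R-square describe the transformed response, not Y itself.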
Results
1). Decide transform for response variable Y
First, we applied a normal probability plot to the response variable (Y). This method computes the
empirical percentiles of the response variable in the data set and compares these sample percentiles
with the theoretical percentiles of a normal distribution. The more linear the plot, the better the
data set fits a normal distribution.
As shown in the normal probability plot, the response variable shows no obvious skew, but it does
show some heavy-tailed behavior. We decided to test power and log transformations:
y^0.5, y^2, and log(y).
Comparing the three transformations, y^0.5 gives the most linear plot, so we decided to fit the
data to the formula:
Y^0.5 = α + β1 X1 + β2 X2 + β3 X3 + β4 X4
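The transformation check described in this section can be sketched as follows. This is an illustration under stated assumptions: the plotting positions (i - 0.5)/n are one common convention and may differ from the one used for the plots in this paper, the sample values are hypothetical, and the correlation of the plotted points stands in for the visual judgment of linearity made here.

```python
import math
from statistics import NormalDist, mean

def normal_probability_points(sample):
    """Pair each ordered observation with its theoretical standard normal quantile."""
    n = len(sample)
    std_normal = NormalDist()
    # Assumed plotting positions: (i - 0.5) / n for i = 1..n
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, sorted(sample)))

def plot_linearity(points):
    """Correlation between theoretical and sample quantiles (1.0 = perfectly linear plot)."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical positive sample (NOT the petrol data).
toy_y = [3.0, 5.0, 6.0, 8.0, 9.0, 11.0, 14.0, 20.0]

transforms = {
    "y": lambda v: v,
    "y^0.5": math.sqrt,
    "y^2": lambda v: v * v,
    "log(y)": math.log,
}
linearity = {
    name: plot_linearity(normal_probability_points([f(v) for v in toy_y]))
    for name, f in transforms.items()
}
for name, value in linearity.items():
    print(f"{name:7s} {value:.4f}")
```

A transformation whose value is closest to 1 gives the straightest plot for the sample at hand; the ranking for the toy sample says nothing about the petrol data.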
2). Linear regression and full F-test
Then, I regressed Y^0.5 on all of the input variables, which produces the following output:
Regression Statistics
Multiple R           0.8137628
R Square             0.66221
Adjusted R Square    0.6307876
Standard Error       1.3767748
Observations         48

ANOVA
              df    SS           MS           F            Significance F
Regression    4     159.78762    39.946906    21.074503    1.119E-09
Residual      43    81.506878    1.8955088
Total         47    241.2945

               Coefficients   Standard Error   t Stat        P-value      Lower 95%     Upper 95%
Intercept      14.511483      3.1081965        4.6687791     2.971E-05    8.2432073     20.779759
X Variable 1   -0.1566893     0.1524274        -1.0279604    0.3097164    -0.4640885    0.1507098
X Variable 2   -0.0017041     0.0003669        -4.6442596    3.216E-05    -0.0024441    -0.0009641
X Variable 3   4.948E-05      5.866E-05        0.8435082     0.4036124    -6.882E-05    0.0001678
X Variable 4   30.722033      3.7329725        8.2299114     2.245E-10    23.193776     38.250289
In the ANOVA table, the F-value is 21.0745 and the significance F is 1.119E-09, a very low
p-value. At the 0.01 significance level, we therefore reject the null hypothesis that all slope
coefficients are zero, so the regression as a whole is significant. In the Regression Statistics
table, the R-square of 0.6622 means that 66.22% of the total sum of squares is explained by the
regression sum of squares, which is also good. Overall, we conclude that this model fits the data
set and can be used to predict future petrol usage in the states.
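The summary figures quoted above follow directly from the sums of squares and degrees of freedom in the ANOVA output. A short sketch of that arithmetic, using the full-model values reported in this section (variable names are illustrative):

```python
# Values copied from the ANOVA output of the full model.
ss_regression, df_regression = 159.78762, 4
ss_residual, df_residual = 81.506878, 43
ss_total = ss_regression + ss_residual

ms_regression = ss_regression / df_regression      # mean square, regression
ms_residual = ss_residual / df_residual            # mean square, residual
f_value = ms_regression / ms_residual              # overall F statistic
r_square = ss_regression / ss_total                # share of variation explained
n_obs = df_regression + df_residual + 1            # 48 observations
adjusted_r_square = 1 - (1 - r_square) * (n_obs - 1) / df_residual
standard_error = ms_residual ** 0.5                # residual standard error

print(f"F = {f_value:.4f}")
print(f"R Square = {r_square:.5f}")
print(f"Adjusted R Square = {adjusted_r_square:.5f}")
print(f"Standard Error = {standard_error:.5f}")
```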
The coefficient table shows the significance of each independent variable. X variables 2 and 4
both have very low P-values, so both are significant. X variable 1 has a P-value of 30.97%, and
X variable 3 has a P-value of 40.36%. These high P-values indicate that, at an alpha level of
0.05, variables 1 and 3 are not significant.
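The significance decisions above can be reproduced from the coefficients and standard errors alone, since each t statistic is the coefficient divided by its standard error. The sketch below is a rough cross-check rather than the Excel calculation: it backs the two-sided 5% critical t value out of the reported upper 95% limit of the intercept instead of computing it from a t distribution, and the variable names are illustrative.

```python
# (coefficient, standard error) pairs copied from the full-model output.
full_model = {
    "Intercept": (14.511483, 3.1081965),
    "X1": (-0.1566893, 0.1524274),
    "X2": (-0.0017041, 0.0003669),
    "X3": (4.948e-05, 5.866e-05),
    "X4": (30.722033, 3.7329725),
}

# t statistic = coefficient / standard error
t_stats = {name: coef / se for name, (coef, se) in full_model.items()}

# Upper 95% limit = coefficient + t_crit * standard error, so t_crit can be
# recovered from the reported upper limit of the intercept (20.779759).
intercept_coef, intercept_se = full_model["Intercept"]
t_crit = (20.779759 - intercept_coef) / intercept_se

# A coefficient is significant at the 5% level when |t| exceeds t_crit.
significant = {name: abs(t) > t_crit for name, t in t_stats.items()}

for name, t in t_stats.items():
    print(f"{name:9s} t = {t:8.4f}  significant at 5%: {significant[name]}")
```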
To simplify the model, we drop X variables 1 and 3 and apply a partial F-test as follows.
3). Partial F-tests and null hypothesis test
First, I dropped variables x1 and x3 and ran a partial F-test of the following hypotheses:
H0: β1 = β3 = 0
H1: at least one of β1 and β3 ≠ 0
The results of the reduced regression are listed as follows:
Regression Statistics
Multiple R           0.8033703
R Square             0.6454039
Adjusted R Square    0.6296441
Standard Error       1.3789053
Observations         48

ANOVA
              df    SS           MS           F            Significance F
Regression    2     155.73241    77.866203    40.952471    7.396E-11
Residual      45    85.562094    1.9013799
Total         47    241.2945

               Coefficients   Standard Error   t Stat        P-value      Lower 95%     Upper 95%
Intercept      12.897462      2.3878753        5.4012295     2.393E-06    8.0880348     17.70689
X Variable 2   -0.0015868     0.000355         -4.4691966    5.259E-05    -0.0023019    -0.0008717
X Variable 4   31.112492      3.6715522        8.4739342     7.114E-11    23.717606     38.507377
After dropping x1 and x3, the F-value in the ANOVA table is 40.95 and the R-square in
Regression Statistics is 64.54%, both of which are still good. In the coefficient table, both
remaining independent variables have very low P-values, which is also good.
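To test whether to reject the null hypothesis, we calculated the partial F statistic F_{1,3} from
the residual sums of squares (SSE) and the residual degrees of freedom (df) of the reduced and
full models:
$$F_{1,3} = \frac{(SSE_{\mathrm{reduced}} - SSE_{\mathrm{full}}) / (df_{\mathrm{reduced}} - df_{\mathrm{full}})}{SSE_{\mathrm{full}} / df_{\mathrm{full}}} = \frac{(85.56 - 81.51) / (45 - 43)}{81.51 / 43} \approx 1.07$$
At α = 0.05, the critical value of the F distribution with 2 and 43 degrees of freedom is about
3.21, so the F statistic of about 1.07 is well below it. We therefore do not reject the null
hypothesis, and x1 and x3 can be dropped. Overall, the formula should be: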
Y^0.5 = 12.8975 - 0.0015868X2 + 31.1125X4.
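The partial F-test can be checked numerically. The sketch below assumes the residual sums of squares and residual degrees of freedom from the two regression outputs (81.506878 on 43 df for the full model, 85.562094 on 45 df for the reduced model). The function names are illustrative, and the closed-form tail probability and critical value hold only when the numerator has exactly 2 degrees of freedom, as it does here.

```python
def partial_f_statistic(sse_reduced, df_reduced, sse_full, df_full):
    """Partial F statistic; df_* are RESIDUAL degrees of freedom."""
    numerator = (sse_reduced - sse_full) / (df_reduced - df_full)
    denominator = sse_full / df_full
    return numerator / denominator

def f_tail_two_numerator_df(f_value, df_denominator):
    """P(F > f_value) for an F(2, df_denominator) distribution (closed form)."""
    return (1 + 2 * f_value / df_denominator) ** (-df_denominator / 2)

def f_critical_two_numerator_df(alpha, df_denominator):
    """Upper-alpha critical value of F(2, df_denominator), inverting the tail formula."""
    return (df_denominator / 2) * (alpha ** (-2 / df_denominator) - 1)

f_stat = partial_f_statistic(85.562094, 45, 81.506878, 43)
p_value = f_tail_two_numerator_df(f_stat, 43)
critical = f_critical_two_numerator_df(0.05, 43)
reject_h0 = f_stat > critical

print(f"partial F = {f_stat:.4f}, critical value = {critical:.4f}")
print(f"p-value = {p_value:.4f}, reject H0: {reject_h0}")
```

Because the statistic falls below the critical value, the reduced model with X2 and X4 is retained.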
Conclusion
Our study indicates that the square root of petrol consumption (Y) in the United States is
linearly related to average income (X2) and the percentage of the population with a driver's
license within the state (X4).
The power transformation (Y^0.5) made the normal probability plot of Y more linear and reduced
the heavy tails. This transformation also decreased the p-value of the intercept, so we chose it
over the other transformations.
In the Regression Statistics, R-square is 64.54%, indicating that 64.54% of the variation in the
square root of petrol consumption can be explained by variations in average income and the
percentage of the population with a driver's license.
The coefficient table gives p-values of 2.393E-06, 5.259E-05, and 7.114E-11 for the intercept and
the two input variables, respectively. All of these are far below the 0.05 level, indicating that
the intercept and both input variables are statistically significant.
Therefore, my final model for petrol consumption is:
Y^0.5 = 12.8975 - 0.0015868X2 + 31.1125X4
Where
Y = petrol consumed over one year within the state, in millions of gallons.
X2 = annual average personal income within the state, in dollars.
X4 = the percentage of the population with a driver's license within the state.
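Because the model is fitted to Y^0.5, a prediction has to be squared to return to the original scale. A minimal sketch of that step follows. The function name and the input values are hypothetical, and the code assumes X4 is entered as a fraction between 0 and 1, which this paper does not state explicitly but which the size of the coefficient suggests.

```python
def predict_petrol_consumption(average_income, license_share):
    """Predict Y from the final model, then undo the square-root transform.

    average_income: X2, dollars per year.
    license_share:  X4, ASSUMED to be a fraction between 0 and 1.
    """
    sqrt_consumption = 12.8975 - 0.0015868 * average_income + 31.1125 * license_share
    if sqrt_consumption < 0:
        # Squaring a negative fitted value would hide that the inputs lie
        # outside the range where the model is meaningful.
        raise ValueError("fitted square root is negative; inputs are out of range")
    return sqrt_consumption ** 2

# Hypothetical inputs, chosen only for illustration.
example_prediction = predict_petrol_consumption(average_income=4000, license_share=0.55)
print(f"predicted consumption: {example_prediction:.2f}")
```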
Overall, the model explains 64.54% of the variation in the square root of petrol consumption.
The remaining 35.46% of the variation may be explained by other factors, such as car ownership
and population level. To better estimate petrol consumption, we suggest collecting data on
those variables.