The Wealth of Nations
Jamie Brabston, Matt Caulfield, Mark Testa

Overview
- Introduction
- Regression of Individual Variables
- Multicollinearity
- Multiple Regression
- Stepwise Regression
- Final Model

Introduction
Collected data for 30 countries on 12 variables: life expectancy, median age, population growth, population density, literacy rate, unemployment rate, oil consumption – oil production, cell phone / land line, military expenditures, area, sex ratio, external debt.
Goal: create a model to predict GDP per capita.

Life Expectancy
[Figure: scatterplot of GDP vs. life.expectancy with four regression diagnostic panels: Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage]

Life Expectancy
Analysis:
- R^2: 0.45.
- P-value: highly significant.
- An outlier was identified using a leverage-residual plot and removed.
- The Residuals vs. Fitted Values plot showed nonlinearity.
- Tried a Box-Cox transform.

Life Expectancy
Leverage-residual plot
- Top: influential data points.
- Bottom: non-influential data points.
- Right: outliers.
- Left: non-outliers.
Upshot: eliminate points in the top right quadrant as influential outliers.
[Figure: leverage-residual plot; y-axis: influence, x-axis: absolute value of externally studentised residuals]

Life Expectancy
Box-Cox transform: y -> (y^p - 1)/p
Produces a linear fit if the variables are related by a power law.
Box-Cox plot: this plot shows the goodness of the fit as a function of p. In this case, the optimal p is fairly small.
[Figure: Box-Cox plot; y-axis: profile likelihood, x-axis: p]

Life Expectancy
Linear regression was done on the Box-Cox transformed data. Significant nonlinearity remained.
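The Box-Cox step described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: it applies y -> (y^p - 1)/p, fits a straight line to the transformed response, and scores each p by the usual profile log-likelihood. The data are made up (an exponential curve with a small deterministic wiggle), and the names `box_cox`, `profile_log_likelihood`, and `best_power` are hypothetical.

```python
import math


def box_cox(y, p):
    """Box-Cox transform (y**p - 1) / p, with log(y) as the p -> 0 limit."""
    if y <= 0:
        raise ValueError("Box-Cox requires positive values")
    if abs(p) < 1e-12:
        return math.log(y)
    return (y ** p - 1.0) / p


def residual_sum_of_squares(xs, zs):
    """RSS of the least-squares line z = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    z_bar = sum(zs) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxz = sum((x - x_bar) * (z - z_bar) for x, z in zip(xs, zs))
    slope = sxz / sxx
    intercept = z_bar - slope * x_bar
    return sum((z - intercept - slope * x) ** 2 for x, z in zip(xs, zs))


def profile_log_likelihood(xs, ys, p):
    """Box-Cox profile log-likelihood for power p, up to an additive constant."""
    n = len(ys)
    zs = [box_cox(y, p) for y in ys]
    rss = residual_sum_of_squares(xs, zs)
    jacobian = (p - 1.0) * sum(math.log(y) for y in ys)
    return -0.5 * n * math.log(rss / n) + jacobian


def best_power(xs, ys, grid):
    """Return the grid value of p with the highest profile log-likelihood."""
    return max(grid, key=lambda p: profile_log_likelihood(xs, ys, p))


# Made-up data (assumption): y is roughly exponential in x, so a power
# near 0 (the log transform) should give the straightest fit.
xs = [float(i) for i in range(1, 21)]
ys = [math.exp(0.2 * x) * (1.0 + 0.05 * math.sin(7.0 * x)) for x in xs]
grid = [k / 10.0 for k in range(-20, 21)]
p_hat = best_power(xs, ys, grid)
```

Plotting `profile_log_likelihood` over `grid` would give a curve of the same kind as the Box-Cox plot, with its peak at the chosen p.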
[Figure: scatterplot of GDP.1bc vs. life.expectancy.1 with four regression diagnostic panels: Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage]

Life Expectancy
Conclusions:
- Clearly, there is a significant positive relationship between per capita GDP and life expectancy.
- We could not identify the precise form of the relationship. This prevents extrapolation and prediction.

Median Age
[Figure: scatterplot of GDP vs. median.age with four regression diagnostic panels: Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage]

Median Age
Analysis:
- R^2: 0.58.
- P-value: highly significant.
- No suspected outliers.
- The Residuals vs. Fitted Values plot is consistent with a linear relationship, but the residuals deviate significantly from normality.

Median Age
Box-Cox transform gives:
[Figure: scatterplot of GDP.2bc vs. median.age with four regression diagnostic panels: Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage]

Median Age
The Box-Cox transform significantly improved the normality of the residual distribution.
The Box-Cox p = 0.15.
R^2 improved to 0.72.
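The median age model reported in this deck, (GDP^0.15 - 1)/0.15 = -2.1 + 0.17(median age), predicts on the Box-Cox scale, so a prediction has to be mapped back before it can be read as GDP per capita. The following is a sketch of that inversion, assuming the rounded coefficients from the slide; the function names and the example median age of 30 are hypothetical.

```python
import math

P_MEDIAN_AGE = 0.15  # Box-Cox power reported for the median age fit


def box_cox(y, p):
    """Box-Cox transform (y**p - 1) / p, with log(y) as the p -> 0 limit."""
    if abs(p) < 1e-12:
        return math.log(y)
    return (y ** p - 1.0) / p


def inverse_box_cox(z, p):
    """Undo the transform: y = (p*z + 1)**(1/p), or exp(z) when p = 0."""
    if abs(p) < 1e-12:
        return math.exp(z)
    base = p * z + 1.0
    if base <= 0:
        raise ValueError("value is outside the range of the transform")
    return base ** (1.0 / p)


def predict_gdp_from_median_age(median_age, intercept=-2.1, slope=0.17,
                                p=P_MEDIAN_AGE):
    """Evaluate the slide's median age model and return GDP on the original scale."""
    z = intercept + slope * median_age  # prediction on the transformed scale
    return inverse_box_cox(z, p)


# Hypothetical input (assumption): a country with a median age of 30.
example_age = 30.0
example_gdp = predict_gdp_from_median_age(example_age)
```

The result is in the same units as the GDP per capita values used to fit the model. Because the coefficients on the slide are rounded, back-transformed values are only approximate.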
Median Age final model: (GDP^0.15 - 1)/0.15 = -2.1 + 0.17(median age)

Population Growth
[Figure: scatterplot of GDP vs. population.growth with four regression diagnostic panels: Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage]

Population Growth
Analysis:
- R^2 = 0.058.
- P-value: 0.11.
- The correlation is very low, and the p-value is above any conventional significance level.
- An outlier was found and eliminated using a leverage-residual plot.

Population Growth
Box-Cox transform:
[Figure: scatterplot of GDP.3bc vs. population.growth.3 with four regression diagnostic panels: Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage]

Population Growth
A Box-Cox transform reduced the nonlinearity slightly and gave a significant p-value.
From this, we concluded that population growth has a slight negative relationship with GDP.
No detailed predictions are possible because significant nonlinearity remains.
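The leverage-residual screening used for the life expectancy and population growth fits can be sketched for a one-predictor regression as follows. This is an illustration under stated assumptions: "influence" is taken to mean leverage (the hat value), the cut-offs (leverage above 2k/n, absolute studentized residual above 2) are common rules of thumb rather than the authors' thresholds, and the data are made up with one planted outlier.

```python
import math


def leverage_and_studentized(xs, ys):
    """Return (leverages, externally studentized residuals) for y = a + b*x."""
    n = len(xs)
    k = 2  # fitted parameters: intercept and slope
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    intercept = y_bar - slope * x_bar
    residuals = [y - intercept - slope * x for x, y in zip(xs, ys)]
    rss = sum(e * e for e in residuals)
    # Hat values for simple regression: 1/n + (x - x_bar)^2 / Sxx.
    leverages = [1.0 / n + (x - x_bar) ** 2 / sxx for x in xs]
    studentized = []
    for e, h in zip(residuals, leverages):
        # Residual variance re-estimated with this observation left out.
        s2_deleted = (rss - e * e / (1.0 - h)) / (n - k - 1)
        studentized.append(e / math.sqrt(s2_deleted * (1.0 - h)))
    return leverages, studentized


def flag_influential_outliers(xs, ys, resid_cut=2.0):
    """Indices in the top right quadrant: high leverage and large residual."""
    n = len(xs)
    lev_cut = 2.0 * 2 / n  # 2k/n rule of thumb with k = 2 (assumption)
    leverages, studentized = leverage_and_studentized(xs, ys)
    return [i for i, (h, t) in enumerate(zip(leverages, studentized))
            if h > lev_cut and abs(t) > resid_cut]


# Made-up data (assumption): 14 points near a line plus one distant point.
xs = [float(i) for i in range(1, 15)] + [30.0]
ys = [2.0 + 0.5 * x + 0.2 * math.sin(x) for x in xs[:14]] + [27.0]
leverages, studentized = leverage_and_studentized(xs, ys)
flagged = flag_influential_outliers(xs, ys)
```

Plotting `leverages` against the absolute values of `studentized` gives a plot of the same kind as the leverage-residual plot, with the flagged point in the top right quadrant.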
Population Density
[Figure: scatterplot of GDP vs. population.density with four regression diagnostic panels: Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage]

Population Density
Analysis:
- The outlier on the far right corresponds to Singapore, a country with an exceptionally high population density.
- A less extreme outlier is China.
- Both of these data points were removed.

Population Density
[Figure: scatterplot of GDP.4 vs. population.density.4 with four regression diagnostic panels: Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage]

Population Density
The p-value for the data without outliers is 0.68, far from significant.
A Box-Cox transform was attempted, but the p-value did not get close to significance.
Conclusion: population density and GDP are essentially unrelated.

Literacy Rate
Final model: GDP = -3.320 + 0.0657(literacy rate)

Unemployment Rate
Final model: GDP = 1.388 - 0.0236(unemployment rate)

Oil Consumption – Production
Final model: GDP = -3.320 + 0.0657(literacy rate)

Cell phones vs. Landlines
Final model: GDP = 1.52811 - 0.0928(cells vs landlines)

Military Expenditures
Analysis: doesn't pass conditions for regression
- Data isn't linear
- Residuals aren't random
- Q-Q plot is curved
- Outliers
Analysis of Box-Cox model: doesn't pass conditions for regression
- Data isn't linear

Area
Analysis: doesn't pass conditions for regression
- Data isn't linear
- Residuals aren't random
- Q-Q plot is curved
- Outliers
Analysis of Box-Cox model: doesn't pass conditions for regression
- Data isn't linear
- Residuals aren't random
- Q-Q plot isn't normal

Sex Ratio
Analysis: doesn't pass conditions for regression
- Data isn't linear
- Residuals aren't random
- Q-Q plot is curved
- Outliers
Analysis of Box-Cox model: doesn't pass conditions for regression
- Data isn't linear
- Residuals aren't random
- Q-Q plot isn't normal

External Debt
Analysis: doesn't pass conditions for regression
- Data isn't linear
- Residuals aren't random
- Q-Q plot is curved
- Outliers
Analysis of Box-Cox model: doesn't pass conditions for regression
- Data isn't linear
- Residuals aren't random
- Q-Q plot isn't normal

Multicollinearity
Multicollinearity occurs when two explanatory variables are linearly related. A stepwise regression may conclude that both are significant, even though the model would work just as well with only one.
Variance inflation factors were computed for each pair of explanatory variables, and none were high enough to be a concern.
There is no significant multicollinearity.
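A pairwise variance inflation factor of the kind described above can be computed directly from the correlation between two explanatory variables: VIF = 1 / (1 - r^2). The sketch below assumes this pairwise form, which is what the slide's wording suggests; the more common definition regresses each variable on all the others. The column values are made up, and the names `pairwise_vif` and `vif_table` are hypothetical.

```python
import math


def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    a_bar = sum(a) / n
    b_bar = sum(b) / n
    saa = sum((u - a_bar) ** 2 for u in a)
    sbb = sum((v - b_bar) ** 2 for v in b)
    sab = sum((u - a_bar) * (v - b_bar) for u, v in zip(a, b))
    return sab / math.sqrt(saa * sbb)


def pairwise_vif(a, b):
    """VIF for a pair of variables: 1 / (1 - r^2)."""
    r = pearson_r(a, b)
    return 1.0 / (1.0 - r * r)


def vif_table(columns):
    """Map each unordered pair of column names to its pairwise VIF."""
    names = sorted(columns)
    return {(u, v): pairwise_vif(columns[u], columns[v])
            for i, u in enumerate(names) for v in names[i + 1:]}


# Made-up values (assumption), not the data behind the slides.
columns = {
    "median_age": [19.0, 24.5, 28.0, 33.5, 38.0, 41.5],
    "life_expectancy": [55.0, 63.0, 68.5, 72.0, 77.5, 80.0],
    "population_density": [120.0, 45.0, 300.0, 80.0, 15.0, 210.0],
}
table = vif_table(columns)
```

A VIF of 1 means the two variables are uncorrelated. A frequently quoted rule of thumb treats values above roughly 5 to 10 as a warning sign, though the slides do not say which threshold was used.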
Multiple Regression
Taking into account all 12 variables at once:
- High R^2
- Not accurate
In our data:
- Too many variables
- Too few observations

Stepwise Regression
Stepwise regression model:
predicted GDP = -6.499e+01 + 2.296(median age) + 9.385(population growth) + 9.723e-04(external debt) + 1.808e-03(population density)
R-squared: 80.78% of the variability in GDP per capita is accounted for by the linear association with median age, population growth, external debt, and population density.

Removing Outliers
One influential outlier: Singapore
- Very high population density
- A small, densely populated country that is financially well-to-do

Stepwise Model w/o Outlier
New model after removing Singapore:
predicted GDP = -6.277e+01 + 2.257(median age) + 8.885(population growth) + 9.274e-04(external debt) + 2.232e-03(population density)
R-squared: 83.89% of the variability in GDP per capita is accounted for by the linear association with median age, population growth, external debt, and population density.

Box-Cox Transformation

Box-Cox Model
New model (all data points):
((predicted GDP)^(0.5) - 1) / (0.5) = 1.388e+01 + 5.560e-01(median age) + 1.915(population growth) + 1.665e-04(external debt) + 2.228e-04(population density)
R-squared: 82.8% of the variability in the transformed GDP per capita is accounted for by the linear association with median age, population growth, external debt, and population density.

Box-Cox w/o Outlier
New model after removing Singapore:
((predicted GDP)^(0.5) - 1) / (0.5) = 1.258e+01 + 5.382e-01(median age) + 1.686(population growth) + 1.682e-04(external debt) - 3.106e-03(population density)
R-squared: 87.35% of the variability in the transformed GDP per capita is accounted for by the linear association with median age, population growth, external debt, and population density.

Final Model
Box-Cox model without outlier:
((predicted GDP)^(0.5) - 1) / (0.5) = 1.258e+01 + 5.382e-01(median age) + 1.686(population growth) + 1.682e-04(external debt) - 3.106e-03(population density)

Greece
Observed GDP: 30.6
Predicted GDP: 34.6
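The stepwise selection that produced the four-variable model can be sketched as forward selection. The slides do not state the direction or the selection criterion, so this sketch assumes forward steps scored by AIC (n log(RSS/n) + 2k); the data, variable names, and helper functions are all made up for illustration.

```python
import math


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def rss_of_fit(columns, y):
    """RSS of least squares with an intercept plus the given predictor columns."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(rows[0])
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(k)] for a in range(k)]
    xty = [sum(r[a] * yi for r, yi in zip(rows, y)) for a in range(k)]
    beta = solve(xtx, xty)  # normal equations
    return sum((yi - sum(bj * rj for bj, rj in zip(beta, r))) ** 2
               for r, yi in zip(rows, y))


def aic(columns, y):
    """AIC up to a constant: n*log(RSS/n) + 2*(number of coefficients)."""
    n = len(y)
    return n * math.log(rss_of_fit(columns, y) / n) + 2 * (len(columns) + 1)


def forward_stepwise(data, y):
    """Add the predictor that lowers AIC most until no addition helps."""
    chosen = []
    best = aic([], y)
    while True:
        candidates = [(aic([data[v] for v in chosen + [name]], y), name)
                      for name in data if name not in chosen]
        if not candidates:
            break
        score, name = min(candidates)
        if score >= best:
            break
        chosen.append(name)
        best = score
    return chosen


# Made-up data (assumption): y depends on x1 and x2 only; x3 is unrelated.
data = {
    "x1": [float(i) for i in range(1, 13)],
    "x2": [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0, 5.0, 3.0, 5.0, 8.0],
    "x3": [2.0, 7.0, 1.0, 8.0, 2.0, 8.0, 1.0, 8.0, 2.0, 8.0, 4.0, 5.0],
}
y = [3.0 + 2.0 * a - 1.5 * b + 0.3 * math.sin(1.7 * i)
     for i, (a, b) in enumerate(zip(data["x1"], data["x2"]))]
selected = forward_stepwise(data, y)
```

With only 30 observations and 12 candidate variables, as in the slides, a procedure like this can easily pick up noise variables, which is one reason to check the selected model against held-out cases such as the Greece comparison.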