The Wealth of Nations
Jamie Brabston
Matt Caulfield
Mark Testa
Overview






Introduction
Regression of Individual Variables
Multicollinearity
Multiple Regression
Stepwise Regression
Final Model
Introduction


Collected data for 30 countries
12 variables: life expectancy, median age, population growth, population density, literacy rate, unemployment rate, oil consumption – oil production, cell phone / land line, military expenditures, area, sex ratio, external debt
Goal: create a model to predict GDP per capita
Life Expectancy
[Figure: GDP vs. life.expectancy scatterplot with regression diagnostic plots: Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage (with Cook's distance)]
Life Expectancy




Analysis: R^2 = 0.45; the p-value is highly significant.
An outlier was identified using a Leverage-residual plot and removed.
The Residuals vs. Fitted Values plot showed nonlinearity.
Tried a Box-Cox transform.
Life Expectancy
Leverage-residual plot
[Figure: influence (vertical axis) vs. absolute value of externally studentised residuals (horizontal axis), points labelled by observation number]
- Top: Influential data points.
- Bottom: Non-influential data points.
- Right: Outliers.
- Left: Non-outliers.
Upshot: Eliminate points in the top right quadrant as influential outliers.
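The two axes of a leverage-residual plot can be computed directly from a least-squares fit. The sketch below is illustrative only: it uses synthetic data rather than the presentation's data set, and the cut-offs (leverage above 2k/n, absolute studentised residual above 2) are common rules of thumb assumed here, not thresholds stated in the slides.

```python
import numpy as np

def leverage_and_studentized(x, y):
    """Return (leverage, externally studentised residuals) for y ~ a + b*x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    X = np.column_stack([np.ones(n), x])        # design matrix with intercept
    k = X.shape[1]
    XtX_inv = np.linalg.inv(X.T @ X)
    # Diagonal of the hat matrix: h_i = x_i^T (X^T X)^{-1} x_i
    h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    sse = resid @ resid
    # Error variance estimated with observation i left out
    s2_loo = (sse - resid**2 / (1.0 - h)) / (n - k - 1)
    t = resid / np.sqrt(s2_loo * (1.0 - h))
    return h, t

# Synthetic data (not the presentation's): a noisy line plus one planted point
rng = np.random.default_rng(0)
x = np.linspace(50, 80, 30)
y = 0.8 * x - 30 + rng.normal(0, 2, size=30)
x[-1], y[-1] = 95.0, 10.0   # far from the other x values and far off the line

h, t = leverage_and_studentized(x, y)
k = 2                        # intercept and slope
# "Top right quadrant": high leverage AND large studentised residual
flagged = np.where((h > 2 * k / len(x)) & (np.abs(t) > 2))[0]
print(flagged)
```

Only points that are extreme on both axes are flagged, which matches the "top right quadrant" reading of the plot.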
Life Expectancy
Box-Cox plot
[Figure: profile likelihood as a function of p, for p from -4 to 4]
This plot shows the goodness of the fit as a function of p. In this case, the optimal p is fairly small.
Box-Cox Transform: y -> (y^p - 1)/p
Produces linear fit if variables are related by a power law.
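As a rough illustration of how such a profile-likelihood curve can be produced, the sketch below evaluates the Box-Cox log-likelihood on a grid of p values and picks the maximum. It assumes a simple linear model in one predictor and uses synthetic data in which the square root of y is linear in x; the function names and the grid are choices made for this example, not taken from the presentation.

```python
import numpy as np

def boxcox(y, p):
    """Box-Cox transform (y^p - 1)/p, with log(y) as the p -> 0 limit."""
    y = np.asarray(y, dtype=float)
    return np.log(y) if abs(p) < 1e-12 else (y**p - 1.0) / p

def profile_loglik(x, y, p):
    """Profile log-likelihood of p for the model boxcox(y, p) ~ a + b*x."""
    n = len(y)
    X = np.column_stack([np.ones(n), x])
    z = boxcox(y, p)
    beta, *_ = np.linalg.lstsq(X, z, rcond=None)
    rss = np.sum((z - X @ beta) ** 2)
    # Gaussian log-likelihood plus the Jacobian term (p - 1) * sum(log y)
    return -0.5 * n * np.log(rss / n) + (p - 1.0) * np.sum(np.log(y))

# Synthetic data: sqrt(y) is linear in x, so the best p should be near 0.5
rng = np.random.default_rng(1)
x = np.linspace(1, 10, 200)
y = (2.0 + 0.5 * x + rng.normal(0, 0.05, size=200)) ** 2

grid = np.linspace(-2, 2, 81)
ll = np.array([profile_loglik(x, y, p) for p in grid])
best_p = grid[np.argmax(ll)]
print(best_p)
```

Plotting `ll` against `grid` gives a curve of the same kind as the Box-Cox plot, with its peak at the preferred p.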
Life Expectancy
Linear regression was done on the Box-Cox transformed data. Significant nonlinearity remained.
[Figure: GDP.1bc vs. life.expectancy.1 scatterplot with regression diagnostic plots: Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage (with Cook's distance)]
Life Expectancy



Conclusions: Clearly, there is a significant positive relationship between per capita GDP and life expectancy.
We could not identify the precise nature of the relationship.
This prevents extrapolation and prediction.
Median Age
[Figure: GDP vs. median.age scatterplot with regression diagnostic plots: Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage (with Cook's distance)]
Median Age



Analysis: R^2 = 0.58; the p-value is highly significant.
No suspected outliers.
The Residuals vs. Fitted plot indicates an approximately linear relationship, but the residuals deviate significantly from normality.
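A Normal Q-Q plot compares sorted standardized residuals with the quantiles a normal distribution would give. A minimal sketch of that comparison follows; it assumes plotting positions (i - 0.5)/n, which is one common convention among several, and it uses synthetic residuals rather than those from the presentation.

```python
import numpy as np
from statistics import NormalDist

def qq_points(residuals):
    """Return (theoretical normal quantiles, sorted standardized residuals)."""
    r = np.sort(np.asarray(residuals, dtype=float))
    n = len(r)
    r = (r - r.mean()) / r.std(ddof=1)
    # Plotting positions (i - 0.5)/n; an assumed convention
    probs = (np.arange(1, n + 1) - 0.5) / n
    theo = np.array([NormalDist().inv_cdf(p) for p in probs])
    return theo, r

def qq_correlation(residuals):
    """Correlation of the Q-Q coordinates; values near 1 suggest normality."""
    theo, r = qq_points(residuals)
    return np.corrcoef(theo, r)[0, 1]

rng = np.random.default_rng(2)
normal_resid = rng.normal(size=500)
skewed_resid = rng.exponential(size=500)    # right-skewed, clearly non-normal
print(qq_correlation(normal_resid), qq_correlation(skewed_resid))
```

A curved Q-Q plot shows up here as a correlation noticeably below 1.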
Median Age

Box-Cox Transform gives:
[Figure: GDP.2bc vs. median.age scatterplot with regression diagnostic plots: Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage (with Cook's distance)]
Median Age




Box-Cox transform significantly improved the normality of the residual distribution.
The Box-Cox p = 0.15.
R^2 is improved to 0.72.
Final Model:
(GDP^0.15 - 1)/0.15 = -2.1 + 0.17(Med.Age)
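To get a GDP prediction out of this model, the Box-Cox transform has to be inverted: if z = (GDP^p - 1)/p, then GDP = (1 + p*z)^(1/p). The sketch below applies that inversion with the coefficients quoted above; the median ages passed in are hypothetical inputs chosen for illustration, and the function names are not from the presentation.

```python
def inverse_boxcox(z, p):
    """Invert z = (y^p - 1)/p for y; assumes p != 0 and 1 + p*z > 0."""
    return (1.0 + p * z) ** (1.0 / p)

def predict_gdp(median_age, p=0.15, intercept=-2.1, slope=0.17):
    """Predicted GDP from the fitted line on the Box-Cox scale."""
    z = intercept + slope * median_age      # prediction on the transformed scale
    return inverse_boxcox(z, p)

for age in (20, 30, 40):                    # hypothetical median ages
    print(age, round(predict_gdp(age), 1))
```

Because the inverse transform is convex for p < 1, equal steps in median age give increasingly large steps in predicted GDP.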
Population Growth
[Figure: GDP vs. population.growth scatterplot with regression diagnostic plots: Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage (with Cook's distance)]
Population Growth



Analysis: R^2 = 0.058, p-value = 0.11.
The correlation is very low, and the p-value exceeds any conventional significance level.
An outlier was found and eliminated using a Leverage-Residual plot.
Population Growth
Box-Cox Transform:
[Figure: GDP.3bc vs. population.growth.3 scatterplot with regression diagnostic plots: Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage (with Cook's distance)]
Population Growth



A Box-Cox transform reduced the nonlinearity slightly and gave a significant p-value.
From this, we concluded that population growth has a slight negative relationship with GDP.
No detailed predictions are possible because significant nonlinearity remains.
Population Density
[Figure: GDP vs. population.density scatterplot with regression diagnostic plots: Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage (with Cook's distance)]
Population Density


Analysis: The outlier on the far right corresponds to Singapore, a country with an exceptionally high population density.
A less extreme outlier is China. Both of these data points were removed.
Population Density
[Figure: GDP.4 vs. population.density.4 scatterplot with regression diagnostic plots: Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage (with Cook's distance)]
Population Density



The p-value for the data without outliers is 0.68, far from significant.
A Box-Cox transform was attempted, but the p-value did not get close to significance.
Conclusion: Population density and GDP are essentially unrelated.
Literacy Rate
Final model: GDP = -3.320 + 0.0657(literacy rate)
Unemployment Rate
Final model: GDP = 1.388 - 0.0236(unemployment rate)
Oil Consumption – Production
Final model: GDP = -3.320 + 0.0657(literacy rate)
Cell phones vs. Landlines
Final model: GDP = 1.52811 - 0.0928(cells vs landlines)
Military Expenditures
Analysis
Doesn’t pass conditions for regression
- Data isn’t linear
- Residuals aren’t random
- Q-Q plot is curved
- Outliers
Analysis of Box-Cox Model
Doesn’t pass conditions for regression
- Data isn’t linear
Area
Analysis
Doesn’t pass conditions for regression
- Data isn’t linear
- Residuals aren’t random
- Q-Q plot is curved
- Outliers
Analysis of Box-Cox Model
Doesn’t pass conditions for regression
- Data isn’t linear
- Residuals are not random
- Q-Q plot isn’t normal
Sex Ratio
Analysis
Doesn’t pass conditions for regression
- Data isn’t linear
- Residuals aren’t random
- Q-Q plot is curved
- Outliers
Analysis of Box-Cox Model
Doesn’t pass conditions for regression
- Data isn’t linear
- Residuals are not random
- Q-Q plot isn’t normal
External Debt
Analysis
Doesn’t pass conditions for regression
- Data isn’t linear
- Residuals aren’t random
- Q-Q plot is curved
- Outliers
Analysis of Box-Cox Model
Doesn’t pass conditions for regression
- Data isn’t linear
- Residuals are not random
- Q-Q plot isn’t normal
Multicollinearity




Multicollinearity occurs when two or more explanatory variables are strongly linearly related.
A stepwise regression may then keep both variables as significant, even though the model would work just as well with only one.
Variance inflation factors between each pair of explanatory variables were computed, and none were too high.
There is no significant multicollinearity.
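For reference, the variance inflation factor of a variable is 1/(1 - R^2), where R^2 comes from regressing that variable on the other explanatory variables; for a single pair it reduces to 1/(1 - r^2) with r the pairwise correlation. The sketch below computes it on synthetic columns, not the presentation's variables, and the threshold one treats as "too high" (5 and 10 are often quoted) is a judgment call the slides do not specify.

```python
import numpy as np

def vif(X):
    """Variance inflation factor of each column of X (n rows, k columns)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        target = X[:, j]
        # Regress column j on an intercept plus all the other columns
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ beta
        r2 = 1.0 - resid @ resid / np.sum((target - target.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(3)
a = rng.normal(size=30)
b = rng.normal(size=30)
c = a + 0.1 * rng.normal(size=30)           # nearly a copy of a

vif_independent = vif(np.column_stack([a, b]))
vif_collinear = vif(np.column_stack([a, b, c]))
print(np.round(vif_independent, 2))
print(np.round(vif_collinear, 2))
```

Adding the near-duplicate column raises the factors for both `a` and `c` sharply while leaving `b` close to 1.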
Multiple Regression

Taking into account all 12 variables at once
- High R^2
- Not accurate
In our data:
- Too many variables
- Too few observations
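The reason a high R^2 is not reassuring here is that R^2 can never decrease when a predictor is added, even one unrelated to the response. The following sketch shows this on synthetic data with 30 observations and up to 12 predictors, of which only the first carries signal; adjusted R^2 is included as one standard correction, which is an addition for illustration and not something the slides report.

```python
import numpy as np

def r2_and_adjusted(X, y):
    """R^2 and adjusted R^2 for y regressed on X plus an intercept."""
    n, k = X.shape
    A = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    rss = np.sum((y - A @ beta) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - rss / tss
    adj = 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)
    return r2, adj

rng = np.random.default_rng(4)
n = 30
signal = rng.normal(size=n)
y = 2.0 * signal + rng.normal(size=n)
noise = rng.normal(size=(n, 11))            # 11 predictors unrelated to y

r2_values = []
for k in range(1, 13):
    # First column is the real predictor; the rest are pure noise
    X = np.column_stack([signal, noise[:, : k - 1]])
    r2, adj = r2_and_adjusted(X, y)
    r2_values.append(r2)
    print(k, round(r2, 3), round(adj, 3))
```

With few observations per variable, the extra columns fit noise, so the apparent fit improves without the model predicting any better.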
Stepwise Regression

Stepwise regression model:
predicted GDP = -6.499e+01 + 2.296(median age) + 9.385(population growth) + 9.723e-04(external debt) + 1.808e-03(population density)
R-squared
80.78% of the variability in GDP per capita is accounted for by the linear association with median age, population growth, external debt, and population density
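The slides do not say which stepwise procedure or selection criterion was used. As one possible reading, the sketch below does forward selection by AIC: starting from the intercept-only model, it repeatedly adds the candidate that lowers AIC the most and stops when no candidate helps. The data are synthetic and the function names are made up for this example.

```python
import numpy as np

def aic(X, y):
    """AIC (up to an additive constant) of a least-squares fit with intercept."""
    n = len(y)
    A = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    rss = np.sum((y - A @ beta) ** 2)
    return n * np.log(rss / n) + 2 * A.shape[1]

def forward_stepwise(X, y):
    """Greedily add the column that lowers AIC most; stop when none does."""
    chosen, remaining = [], list(range(X.shape[1]))
    best = aic(X[:, chosen], y)             # intercept-only model
    while remaining:
        scores = [(aic(X[:, chosen + [j]], y), j) for j in remaining]
        score, j = min(scores)
        if score >= best:                   # no candidate improves the model
            break
        best = score
        chosen.append(j)
        remaining.remove(j)
    return chosen

# Synthetic data: only columns 0 and 3 actually drive y
rng = np.random.default_rng(5)
X = rng.normal(size=(30, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(0, 0.5, size=30)

selected = forward_stepwise(X, y)
print(selected)
```

A greedy search like this can also admit a noise column by chance, which is one reason the selected model still needs the diagnostic checks used elsewhere in the presentation.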
Removing Outliers

One influential outlier: Singapore
- Very high population density
- Small country with many financially well-off residents
Stepwise Model w/o Outlier

New model after removing Singapore:
predicted GDP = -6.277e+01 + 2.257(median age) + 8.885(population growth) + 9.274e-04(external debt) + 2.232e-03(population density)
R-squared
83.89% of the variability in GDP per capita is accounted for by the linear association with median age, population growth, external debt, and population density
Box-Cox Transformation
Box-Cox Model

New Model (all data points):
((predicted GDP)^(0.5)-1) / (0.5) = 1.388e+01 + 5.560e-01(median age) + 1.915(population growth) + 1.665e-04(external debt) + 2.228e-04(population density)
R-squared
82.8% of the variability in GDP per capita is accounted for by the linear association with median age, population growth, external debt, and population density
Box-Cox w/o Outlier

New model after removing Singapore:
((predicted GDP)^(0.5)-1) / (0.5) = 1.258e+01 + 5.382e-01(median age) + 1.686(population growth) + 1.682e-04(external debt) - 3.106e-03(population density)
R-squared
87.35% of the variability in GDP per capita is accounted for by the linear association with median age, population growth, external debt, and population density
Final Model

Box-Cox model without outlier:
((predicted GDP)^(0.5)-1) / (0.5) = 1.258e+01 + 5.382e-01(median age) + 1.686(population growth) + 1.682e-04(external debt) - 3.106e-03(population density)
Greece


Observed GDP: 30.6
Predicted GDP: 34.6