OPIM5103 – Term Paper Assignment
Sample Student
Table of Contents
Introduction
Literature Review
The Data
Relationship Analysis
Conclusion
Introduction
This paper analyzes the individual relationships between the dependent variable
“Gross Domestic Product/Capita” (GDP/CAP) in 109 countries around the world and
the following explanatory variables.
1. Population in thousands (POPULATN)
2. Percentage of people who can read (LITERACY)
3. The birth rate per thousand people (BIRTH_RT)
Throughout the paper, I will try to identify the influence of any or all of the
above-mentioned explanatory variables on the value of GDP/CAP for each country.
Literature Review
In my research for existing statistical analysis on the relationships between the
above-mentioned dependent and explanatory variables, I came across the following
material.
1. “Evolutionary Theories of Long-Run World Economic History: The Theory/History
Interconnection Re-Examined”
http://www.helsinki.fi/iehc2006/papers3/Korotayev.pdf
This paper discusses the relationship between world GDP and population as well
as literacy levels. It concludes that an increase in literacy levels leads to an
increase in GDP and population, and that birth rate also plays an important role in this.
The Data
I have used the country data from the “WORLD95.XLS” data file associated
with the website for this course for the analysis in this paper. Here is some more
information about the data being analyzed.
GDP/CAP: This data element represents the per capita gross domestic product
for a country and is the dependent variable for the purpose of this paper. The
GDP/CAP variable is measured in terms of an absolute numeric index value.
The table below provides descriptive statistics for this data element and is
followed by a frequency table and histogram to give a sense of the frequencies
and percentage distributions for the variable. There is a big difference between
the mean GDP/CAP value, its median and its mode. The mean (5859.98) is almost
double the median (2995), which indicates a positive or right skewness in the
GDP data. The Pearson Measure of Skewness, calculated as (Mean - Median)/SD,
is (5859.98 - 2995)/6479.84 = .4421, which is well above .1 and therefore tells
us that the data is not symmetrical. The box and whisker diagram for GDP/CAP
further below gives a good idea of the skewness of the data.
GDP/CAP Descriptive Statistics
Statistic             Value
Mean                  5859.981651
Standard Error        620.6557167
Median                2995
Mode                  1500
Standard Deviation    6479.835919
Sample Variance       41988273.54
Kurtosis              -0.028158742
Skewness              1.145650254
Range                 23352
Minimum               122
Maximum               23474
Sum                   638738
Count                 109
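The Pearson Measure of Skewness quoted for GDP/CAP can be reproduced from the descriptive statistics alone. The sketch below is only an illustration: the function name and the `is_skewed` flag are my own, and the .1 cut-off is the rule of thumb used in this paper rather than a universal threshold.

```python
def pearson_skewness(mean: float, median: float, sd: float) -> float:
    """Pearson measure as defined in this paper: (mean - median) / SD."""
    return (mean - median) / sd


# Values copied from the GDP/CAP descriptive statistics.
gdp_skew = pearson_skewness(mean=5859.981651, median=2995, sd=6479.835919)

# Rule of thumb used in the paper: an absolute score above .1 means "skewed".
is_skewed = abs(gdp_skew) > 0.1

print(f"Pearson skewness = {gdp_skew:.4f}, skewed: {is_skewed}")
```

The same function can be applied to the other variables by passing in their mean, median and standard deviation.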
GDP/CAP Frequency Table
Interval                    Frequency   Percentage   Cumulative %
0 - Less than 1600          42          38.53%       38.53%
1600 - Less than 3200       17          15.60%       54.13%
3200 - Less than 4800       7           6.42%        60.55%
4800 - Less than 6400       5           4.59%        65.14%
6400 - Less than 8000       12          11.01%       76.15%
8000 - Less than 9600       2           1.83%        77.98%
9600 - Less than 11200      0           0.00%        77.98%
11200 - Less than 12800     1           0.92%        78.90%
12800 - Less than 14400     4           3.67%        82.57%
14400 - Less than 16000     4           3.67%        86.24%
16000 - Less than 17600     6           5.50%        91.74%
17600 - Less than 19200     5           4.59%        96.33%
19200 - Less than 20800     2           1.83%        98.17%
20800 - Less than 22400     1           0.92%        99.08%
22400 - Less than 24000     1           0.92%        100.00%
Midpts: -800, 2400, 4000, 5600, 7200, 8800, 10400, 12000, 13600, 15200, 16800, 18400, 20000, 21600
[Figure: GDP/CAP Histogram (Frequency and Cumulative % by Bins)]
The frequency table and histogram above clearly show that more than 38% of the
countries fall in the lowest interval of GDP/CAP and that more than 60% of the
countries fall below the mean GDP/CAP value of 5860. This tells us that the median
value of 2995 may be a better measure of the center for the GDP/CAP data, because
the mean is pulled upward by the high values in the right tail. The box and whisker
plot below also shows that the maximum value of 23474 makes the right whisker longer
than the left. This shows that the data is not symmetrical.
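The Percentage and Cumulative % columns of the GDP/CAP frequency table follow directly from the bin counts. A minimal sketch of that arithmetic is shown here, assuming the 15 equal-width bins of 1600 listed in the table; the variable names are illustrative only.

```python
# Frequencies copied from the GDP/CAP frequency table (15 bins of width 1600).
frequencies = [42, 17, 7, 5, 12, 2, 0, 1, 4, 4, 6, 5, 2, 1, 1]
bin_width = 1600

total = sum(frequencies)  # number of countries with a GDP/CAP value
percentages = [100 * f / total for f in frequencies]

# A running total of the counts gives the cumulative percentage column.
cumulative = []
running = 0
for f in frequencies:
    running += f
    cumulative.append(100 * running / total)

for i, (pct, cum) in enumerate(zip(percentages, cumulative)):
    lower = i * bin_width
    print(f"{lower:>6} - less than {lower + bin_width:<6} {pct:6.2f}% {cum:7.2f}%")
```

Reading the cumulative column at the bin that contains the mean (4800 to less than 6400) brackets the share of countries below the mean between 60.55% and 65.14%.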
[Figure: GDP/CAP Box & Whisker Plot (GDP_CAP axis from 120 to 25120)]
The standard deviation for GDP/CAP is 6479.835919 and the mean is approximately
5860. Most data values for a distribution usually lie within 1 standard deviation
of the mean, and the GDP/CAP data is no exception to the rule because about 79%
of the data lies in this interval.
POPULATN: This variable provides a measure of the population of each of the
109 countries in the data file. The unit of measure is “thousands”. Mentioned
below are the descriptive statistics for this variable followed by a normal
probability plot for it. The mean value (47724 approx.) exceeds the median
(10400) by a factor of more than 4. More than 77% of the values are below the
mean and about 45% of the values are below 10000, just under the median.
This tells us that the median may be a better measure of the center of this data.
The box and whisker plot further below indicates a positive, right skewness in
the POPULATN data. The Pearson Measure of Skewness score is
(47723.88 - 10400)/146726 = .254, which is greater than .1, indicating that the
values for POPULATN are skewed.
POPULATN Descriptive Statistics
Statistic             Value
Mean                  47723.88073
Standard Error        14053.83679
Median                10400
Mode                  2900
Standard Deviation    146726.3637
Sample Variance       21528625814
Kurtosis              46.65097372
Skewness              6.592335985
Range                 1204944
Minimum               256
Maximum               1205200
Sum                   5201903
Count                 109
The standard deviation for POPULATN is 146726 and more than 88% of the values for
the POPULATN variable lie within 1 standard deviation of the mean value of 47724. The
left whisker on the box & whisker plot is not visible at all and the right whisker is
longer because of the high maximum value of 1205200 for POPULATN. The distribution is
not symmetrical.
Frequencies (POPULATN)
Intervals                   Bins     Frequency   Percentage   Cumulative %
0 Less Than 5000            4999     27          24.77%       24.77%
5000 Less Than 10000        9999     22          20.18%       44.95%
10000 Less Than 15000       14999    14          12.84%       57.80%
15000 Less Than 20000       19999    6           5.50%        63.30%
20000 Less Than 25000       24999    7           6.42%        69.72%
25000 Less Than 30000       29999    4           3.67%        73.39%
30000 Less Than 35000       34999    1           0.92%        74.31%
35000 Less Than 40000       39999    3           2.75%        77.06%
40000 Less Than 45000       44999    1           0.92%        77.98%
45000 Less Than 50000       49999    1           0.92%        78.90%
50000 Less Than 55000       54999    1           0.92%        79.82%
55000 Less Than 60000       59999    5           4.59%        84.40%
60000 Less Than 65000       64999    2           1.83%        86.24%
65000 Less Than 70000       69999    2           1.83%        88.07%
70000 Less Than 1210000     1E+06    13          11.93%       100.00%
Midpts: -2500, 7500, 12500, 17500, 22500, 27500, 32500, 37500, 42500, 47500, 52500, 57500, 62500, 67500
[Figure: POPULATN Histogram (Frequency and Cumulative % by Bins)]
[Figure: POPULATN - Box & Whisker plot (POPULATN axis from 250 to 1200250)]
LITERACY: This variable gives the share of people in a country who are able to
read. It is represented as a percentage value and the table below provides the
descriptive statistics for LITERACY.
LITERACY Descriptive Statistics
Statistic             Value
Mean                  76.89908257
Standard Error        2.395521396
Median                87
Mode                  99
Standard Deviation    25.00997762
Sample Variance       625.4989806
Kurtosis              0.514929291
Skewness              -1.151088787
Range                 100
Minimum               0
Maximum               100
Sum                   8382
Count                 109
The median value of this variable (87) is greater than its mean value (76.90).
About 55% of the countries have a LITERACY value below 90, just above the median,
and the median therefore provides a better measure of the center for the literacy
data than the mean, which is distorted by low values on the left side of it.
The box and whisker diagram further below indicates a negative, left skewness in
the distribution. The Pearson Measure of Skewness for LITERACY can be calculated
as (76.9 - 87)/25 = -.4, an absolute value of .4, which is further evidence that
the data is skewed.
The standard deviation for LITERACY is 25.01. The interval within 1 standard
deviation of the mean (76.90) runs from 51.89 to 101.91, and the frequency
distribution shows that between 76% and 85% of the values lie inside it. The 16
countries with LITERACY below 50 fall outside it.
Frequencies (LITERACY)
Intervals               Bins   Frequency   Percentage   Cumulative %
0 - Less than 10        9      2           1.83%        1.83%
10 - Less than 20       19     1           0.92%        2.75%
20 - Less than 30       29     5           4.59%        7.34%
30 - Less than 40       39     4           3.67%        11.01%
40 - Less than 50       49     4           3.67%        14.68%
50 - Less than 60       59     10          9.17%        23.85%
60 - Less than 70       69     7           6.42%        30.28%
70 - Less than 80       79     12          11.01%       41.28%
80 - Less than 90       89     15          13.76%       55.05%
90 - Less than 100      99     46          42.20%       97.25%
100 - Less than 110     109    3           2.75%        100.00%
Midpts: -5, 15, 25, 35, 45, 55, 65, 75, 85, 95
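A frequency table cannot give the exact share of values within 1 standard deviation of the mean, but it can bound it. The sketch below does this for LITERACY under two assumptions of mine: that the values are whole numbers (as the Bins column suggests), and that the top bin holds only the reported maximum of 100. The function and variable names are illustrative.

```python
def share_within(bins, low, high):
    """Bound the share of values in [low, high] from binned counts.

    bins: list of (smallest_value, largest_value, count) per bin, inclusive.
    Returns (minimum_share, maximum_share).
    """
    total = sum(count for _, _, count in bins)
    surely_inside = 0  # bins lying entirely inside [low, high]
    maybe_inside = 0   # bins that at least overlap [low, high]
    for smallest, largest, count in bins:
        if smallest >= low and largest <= high:
            surely_inside += count
        if largest >= low and smallest <= high:
            maybe_inside += count
    return surely_inside / total, maybe_inside / total


# LITERACY bins (inclusive whole-number ranges) and frequencies from the table.
# Assumption: the last bin is capped at the reported maximum of 100.
literacy_bins = [
    (0, 9, 2), (10, 19, 1), (20, 29, 5), (30, 39, 4), (40, 49, 4),
    (50, 59, 10), (60, 69, 7), (70, 79, 12), (80, 89, 15), (90, 99, 46),
    (100, 100, 3),
]
mean, sd = 76.89908257, 25.00997762

min_share, max_share = share_within(literacy_bins, mean - sd, mean + sd)
print(f"Between {min_share:.2%} and {max_share:.2%} of values lie within 1 SD")
```

The gap between the two bounds comes entirely from the 50 to 59 bin, which straddles the lower limit of 51.89.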
[Figure: LITERACY Histogram (Frequency and Cumulative % by Bins)]
[Figure: LITERACY - Box & Whisker Plot (LITERACY axis from -10 to 110)]
BIRTH_RT: This variable reflects the birth rate per thousand people in
the population. The descriptive statistics for this variable are given below. The
mean and median are approximately the same, which tells us that either value could
be used as a measure of center for the variable. The box and whisker plot further
below shows a slight right or positive skewness in the data. The right whisker is
longer than the left one, indicating that the data is not perfectly symmetrical.
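The standard deviation is 12.36, so the interval within 1 standard deviation of
the mean (25.92) runs from 13.56 to 38.28. The frequency table below shows that at
most about 75% of the values for BIRTH_RT fall within this interval, because the
8 to 16 bin includes values below 13.56 such as the minimum (10) and the mode (13).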
BIRTH_RT Descriptive Statistics
Statistic             Value
Mean                  25.92293578
Standard Error        1.183959241
Median                25
Mode                  13
Standard Deviation    12.36089737
Sample Variance       152.7917839
Kurtosis              1.146535118
Skewness              0.445576718
Range                 43
Minimum               10
Maximum               53
Sum                   2825.6
Count                 109
Frequencies (BIRTH_RT)
Intervals             Bins    Frequency   Percentage   Cumulative %
0 - Less than 8       7.99    0           0.00%        0.00%
8 - Less than 16      15.99   35          32.11%       32.11%
16 - Less than 24     23.99   16          14.68%       46.79%
24 - Less than 32     31.99   23          21.10%       67.89%
32 - Less than 40     39.99   11          10.09%       77.98%
40 - Less than 48     47.99   21          19.27%       97.25%
48 - Less than 56     55.99   3           2.75%        100.00%
Midpts: -4, 12, 20, 28, 36, 44
[Figure: BIRTH_RT Histogram (Frequency and Cumulative % by Bins)]
[Figure: BIRTH_RT Box & Whisker Plot (BIRTH_RT axis from 0 to 50)]
Data Requirements for Multiple Regression Models:
For multiple regression models it is important that the data in question meets
the following requirements.
a. The data should not contain a serial (auto) correlation issue. That is to say,
the error terms of the model that uses LITERACY, POPULATN and BIRTH_RT
should be independent of each other.
Since the data in my paper is not time series data, I will assume
that serial correlation is not an issue with it.
b. The data should not have error terms with unequal variances, that is, it
should not be heteroskedastic. It should be homoskedastic, meaning the errors
should have constant variance. I used the prescribed LIMDEP test to check
this and it indicated that the data had some problems in this area. Based
on my discussion with Professor Jantzen, I chose not to use the corrected
results for this paper.
c. The error terms should be normally distributed. The normal probability
chart for the error terms below shows that the errors are more or less
normally distributed and therefore I conclude that I can use the data to
create a multiple regression model for GDP/CAP.
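For readers who want to see how a normal probability plot of residuals is put together, here is a minimal sketch. It pairs each sorted residual with the Z value of its plotting position; if the points fall close to a straight line, the errors are roughly normal. The residuals in the example are hypothetical, since the paper's residuals are not listed, and the (i - 0.5)/n plotting position is one common convention that I am assuming, not necessarily the one used by the course software.

```python
from statistics import NormalDist


def normal_probability_points(residuals):
    """Pair each sorted residual with the Z value of its plotting position."""
    ordered = sorted(residuals)
    n = len(ordered)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    points = []
    for i, value in enumerate(ordered, start=1):
        # Assumed plotting position: (i - 0.5) / n.
        z = std_normal.inv_cdf((i - 0.5) / n)
        points.append((z, value))
    return points


# Hypothetical residuals, for illustration only (not the paper's data).
example_residuals = [1800.0, -4200.0, 300.0, 5200.0, -1500.0]
points = normal_probability_points(example_residuals)
for z, resid in points:
    print(f"Z = {z:6.3f}   residual = {resid:9.1f}")
```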
[Figure: Multiple Regression Error - Normal Probability Plot (Residuals against Z Value)]
Relationship Analysis
Since I am trying to identify the influence of the POPULATN, LITERACY and BIRTH_RT
variables on GDP/CAP, I will use the multiple regression method to predict how large or
small the dependent variable will be, given differing values for the explanatory variables.
I will use the standard equation to represent “My Multiple Regression Model”.
Substituting the dependent and explainer variables within this equation, I get the
following equation
Yi(GDP/CAP) = b0 + b1(POPULATN) + b2(LITERACY) + b3(BIRTH_RT) + E
In other words, GDP/CAP for a country (the Y variable) can be expressed in terms of a
constant (b0) plus a slope (b1) times POPULATN (X1 variable), plus a slope (b2) times
LITERACY (X2 variable), plus a slope (b3) times BIRTH_RT (X3 variable), plus an error
term (E).
The output of the regression model using the sample data and the above-mentioned
dependent and explainer variables is presented on the next page.
In the sections following the results, I will concentrate on examining the overall fit of the
regression model and measure how each coefficient in the equation impacts the value
of GDP/CAP.
a) The Overall Significance of the Model
This step assesses whether all of the regression coefficients (except the
constant) in the "true" model describing the underlying population are equal to
zero. It tests the overall fit of the multiple regression model in question.
Yi(GDP/CAP) = b0 + b1(POPULATN) + b2(LITERACY) + b3(BIRTH_RT) + E
The above equation being our model for this paper, we can use the following
hypothesis statements for the overall fit test.
H0: b1 = b2 = b3 = 0 (no linear relationship between GDP/CAP and any of
the explainer variables)
HA: At least one bi (b1, b2 or b3) ≠ 0 (at least one independent variable
affects GDP/CAP)
In order to test this we perform and evaluate the results of the F-test on the
overall regression.
The formula for calculating the F statistic is as follows
F = (R² / (1 - R²)) / (k / (n - k - 1))
where R² is the R Square of the model, k = # of explainers & n = sample size.
The F statistic value for the model = (.4395/(1-.4395)) / (3/(109-3-1)) ≈ 27.4
The critical F statistic, using a significance level of 0.05 with (k) or 3
degrees of freedom in the numerator and (n-k-1) or 105 degrees of freedom in
the denominator, calculates (using the F value calculator on the OPIM303 class
website) as 2.6911.
Therefore the F statistic value is greater than the critical F and I will reject
the null. This tells me that there is sufficient evidence that the coefficients
for the explanatory variables in the underlying population are not all zero.
Our conclusion from this is that at least one of the explanatory variables
POPULATN, LITERACY and BIRTH_RT influences the value of the dependent
variable GDP/CAP.
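The overall F-test above can be checked with a few lines of arithmetic. This sketch uses the rounded R Square of .4395 quoted in the paper, so it gives roughly 27.44 rather than exactly the figure from the regression output; the critical value is copied from the paper, not computed. The function name is my own.

```python
def f_statistic(r_square: float, k: int, n: int) -> float:
    """Overall F statistic: (R^2 / k) / ((1 - R^2) / (n - k - 1))."""
    return (r_square / k) / ((1 - r_square) / (n - k - 1))


r_square = 0.4395    # rounded R Square quoted in the paper
k, n = 3, 109        # number of explainers and sample size
critical_f = 2.6911  # 5% critical value for (3, 105) df, as quoted in the paper

f_value = f_statistic(r_square, k, n)
reject_null = f_value > critical_f

print(f"F = {f_value:.2f}, reject H0: {reject_null}")
```

This form, (R²/k) / ((1 - R²)/(n - k - 1)), is algebraically the same as (R²/(1 - R²)) / (k/(n - k - 1)).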
b) Test on a single coefficient
The t test on a single regression coefficient assesses whether a population
regression coefficient is =, not =, >= or <= a particular number. T tests are
conducted for each estimated regression coefficient and typically use a reference
value of zero.
We will test the coefficients for all of our explainer variables namely POPULATN,
LITERACY and BIRTH_RT.
1) To test hypotheses about POPULATN’s regression coefficient (b1), the
following pairs of null and alternative hypotheses could be assessed with the “t test”
(i) H0: b1 = 0
Ha: b1 <> 0
(ii) H0: b1<= 0
Ha: b1 > 0
(iii) H0: b1>= 0
Ha: b1< 0
The value for the t-statistic for POPULATN is stated as -1.604014027 in
the Multiple Regression model output. We use the absolute value of
1.6040 to analyze our hypothesis.
For hypothesis (i) above, we are trying to test whether POPULATN
has some impact on GDP/CAP, either positive or negative.
The two-tail critical t-value for (n-k-1) or 105 degrees of freedom at the
“.05” significance level is 1.9828 (using the calculator).
The absolute value of the t-statistic, 1.604, is less than the two-tail
critical t-value of 1.9828 and therefore we cannot reject the null
for this hypothesis. This tells us that there is not sufficient evidence
that b1, the coefficient for POPULATN, is NOT EQUAL TO “0”.
We can therefore conclude that we do not have enough evidence that
POPULATN influences the value of GDP/CAP for a country.
2) To test whether LITERACY influences GDP/CAP, we could use the
following hypotheses.
(i) H0: b2 = 0
Ha: b2 <> 0
(ii) H0: b2<= 0
Ha: b2 > 0
(iii) H0: b2>= 0
Ha: b2< 0
Bullet (i) above tests whether LITERACY has any impact, positive or
negative, on GDP/CAP.
The absolute value for the t-statistic for LITERACY is stated as
0.683511186 in the results for the Multiple Regression Model.
The two-tail critical t-value (absolute) for (n-k-1) or 105 degrees of
freedom is 1.9828. This value is greater than the t-statistic value of
0.6835. Therefore we once again cannot reject the null for this
hypothesis. This tells us that there isn’t sufficient evidence that b2,
the coefficient for LITERACY, is NOT EQUAL TO “0”.
We can therefore conclude that we do not have enough evidence that
LITERACY impacts the value of GDP/CAP for a country.
3) To assess whether the BIRTH_RT of a country influences its GDP/CAP we
can use the following hypothesis statements.
(i) Ho: b3 = 0
Ha: b3 <> 0
(ii) Ho: b3<= 0
Ha: b3 > 0
(iii) Ho: b3 >= 0
Ha: b3< 0
In bullet (i) above, we try to assess whether BIRTH_RT for a country has any
influence, positive or negative, on its GDP/CAP at all.
The absolute value for the t-statistic for BIRTH_RT is stated as 6.09938 in
the multiple regression model results.
The two-tail critical t-value (absolute) for (n-k-1) or 105 degrees of
freedom is 1.9828. This value is less than the t-statistic value of
6.09938.
Therefore we reject the null for this hypothesis. This allows us to
conclude that there is sufficient evidence that the coefficient for
BIRTH_RT is not equal to 0 and that the BIRTH_RT of a country therefore
influences the value of the dependent variable GDP/CAP.
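The three two-tailed decisions above all follow the same rule: reject H0: b = 0 when the absolute t-statistic exceeds the two-tail critical value. A small sketch of that rule is given here, using the t-statistics reported in the paper and the critical value of 1.9828 for 105 degrees of freedom. The dictionary and function names are illustrative only.

```python
critical_t = 1.9828  # two-tail, 5% level, 105 df, as quoted in the paper

# t-statistics as reported in the paper. Absolute values are given for
# LITERACY and BIRTH_RT, which is all a two-tailed test needs.
t_statistics = {
    "POPULATN": -1.604014027,
    "LITERACY": 0.683511186,
    "BIRTH_RT": 6.09938,
}


def reject_two_tailed(t_stat: float, critical: float) -> bool:
    """Reject H0: b = 0 when |t| exceeds the two-tail critical value."""
    return abs(t_stat) > critical


decisions = {
    name: reject_two_tailed(t, critical_t) for name, t in t_statistics.items()
}
for name, rejected in decisions.items():
    print(f"{name}: {'reject H0' if rejected else 'cannot reject H0'}")
```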
In the following sections, we perform the upper- and lower-tail t-tests for
BIRTH_RT to find out whether its coefficient is greater than or less than 0.
Hypothesis (ii) mentioned above tests whether the coefficient for BIRTH_RT
is > 0, or in other words whether BIRTH_RT has a positive impact on GDP/CAP.
H0: b3 <= 0
HA: b3 > 0
For this hypothesis we need to run an upper-tail t-test. Once again the
t-value in absolute form is given as 6.09937. The upper-tail critical t
is 1.659.
The absolute t-value is greater than the critical value, but the sample
coefficient for BIRTH_RT is “-376.928”, a value less than 0, so the signed
t-statistic is negative and does not exceed the upper-tail critical value.
The coefficient value for b3 does not agree with the alternative hypothesis
in this case, which says that b3 (the coefficient for BIRTH_RT) is
greater than 0.
Therefore we cannot reject the null, which means that there is no
evidence that BIRTH_RT has a positive impact on the GDP/CAP of a
country.
In bullet (iii) above, we hypothesize the following.
H0: b3 >= 0
HA: b3 < 0
The t-statistic value from the regression run is |6.09937|. The one-tail
critical t-value from the calculator is |1.6594|. The sample t-value is
greater than the critical t-value in absolute terms. In addition, the model
coefficient value for b3 is -376.928, a value that is less than 0. It
therefore agrees with the alternative hypothesis that b3 is less than 0.
We can therefore reject the null, which means that there is sufficient
evidence that the value of b3, the coefficient for BIRTH_RT, is less
than 0. We can therefore conclude that BIRTH_RT negatively impacts
GDP/CAP for a country.
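The one-tailed tests for BIRTH_RT depend on the sign of the t-statistic, not just its size. The sketch below makes that explicit. It assumes the signed t-statistic is the reported absolute value (6.09937) carrying the sign of the reported coefficient (-376.928); the function names are my own.

```python
def reject_upper_tail(t_stat: float, critical: float) -> bool:
    """H0: b <= 0 vs HA: b > 0. Reject when the signed t exceeds +critical."""
    return t_stat > critical


def reject_lower_tail(t_stat: float, critical: float) -> bool:
    """H0: b >= 0 vs HA: b < 0. Reject when the signed t is below -critical."""
    return t_stat < -critical


# Assumption: signed t = reported |t| with the sign of the coefficient.
t_birth_rt = -6.09937
critical_one_tail = 1.6594  # one-tail, 5% level, as quoted in the paper

upper_rejected = reject_upper_tail(t_birth_rt, critical_one_tail)
lower_rejected = reject_lower_tail(t_birth_rt, critical_one_tail)

print(f"Upper-tail test (HA: b3 > 0) rejects H0: {upper_rejected}")
print(f"Lower-tail test (HA: b3 < 0) rejects H0: {lower_rejected}")
```

Working with the signed statistic gives the same two conclusions as the text: no support for a positive effect, and sufficient evidence of a negative one.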
c) Standardized Coefficients
To explain which variable amongst POPULATN, LITERACY and BIRTH_RT
has the greatest influence on GDP/CAP, we need to calculate the standardized
coefficients. The standardized coefficients show how many standard deviations
GDP/CAP will change if POPULATN, LITERACY or BIRTH_RT changes by one
standard deviation. Larger standardized coefficients (in absolute value) indicate
more influence, smaller ones less. The formula for calculating the standardized
coefficient is as follows.
Standardized coefficient = bi * (Sxi / Sy)
where bi is the estimated regression coefficient, Sxi is the standard
deviation of the explanatory variable, and Sy is the standard deviation of the
dependent variable.
Standardized coefficient for POPULATN = -0.0052*(146726.364/6479.836) = -0.1177.
Standardized coefficient for LITERACY = -20.8782*(25.01/6479.836) = -0.0806.
Standardized coefficient for BIRTH_RT = -376.9277*(12.361/6479.836) = -0.7190.
The standardized coefficient values calculated and shown above imply the following.
• A one SD increase in POPULATN leads to a 0.1177 SD decrease in GDP/CAP.
• A one SD increase in LITERACY leads to a 0.0806 SD decrease in GDP/CAP.
• A one SD increase in BIRTH_RT leads to a 0.7190 SD decrease in GDP/CAP.
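The standardized coefficients can be recomputed from the figures quoted in this section. The sketch below applies bi * (Sxi / Sy) to each explainer and ranks them by absolute size; the variable names are illustrative, and the inputs are the rounded values shown in the text.

```python
sd_gdp_cap = 6479.836  # Sy, standard deviation of the dependent variable

# (estimated coefficient bi, standard deviation Sxi), as quoted in the paper.
explainers = {
    "POPULATN": (-0.0052, 146726.364),
    "LITERACY": (-20.8782, 25.01),
    "BIRTH_RT": (-376.9277, 12.361),
}

standardized = {
    name: b * (sx / sd_gdp_cap) for name, (b, sx) in explainers.items()
}

# Rank by absolute size: the largest magnitude indicates the most influence.
ranked = sorted(standardized, key=lambda name: abs(standardized[name]), reverse=True)
for name in ranked:
    print(f"{name}: {standardized[name]:.4f}")
```

On these figures BIRTH_RT has by far the largest standardized coefficient in absolute value, followed by POPULATN and then LITERACY.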
Conclusion
This paper tried to evaluate whether the Population, Literacy level and Birth Rate for a
country influence its Gross Domestic Product Per Capita. After conducting a test for
overall fit, we concluded that there was enough evidence that at least one of these
explainer variables had some impact on the gross domestic product per capita for the
country. We then tested further to identify the influence that each individual explainer
(population, literacy level, birth rate) had on the dependent variable (gross domestic
product per capita). This is what we concluded from those tests.
• There wasn’t sufficient evidence for us to conclude that population had any
impact on gross domestic product/capita.
• There wasn’t sufficient evidence to conclude that literacy levels for a country had
any influence on its gross domestic product/capita.
• There was sufficient evidence that the birth rate of a country had a negative impact
on its gross domestic product/capita.
• R Square = Regression Sum of Squares/Total Sum of Squares = .440 approx.
This says that approximately 44% of the variation in the GDP/CAP data for a country
can be explained by variations in the values for its POPULATN, LITERACY and
BIRTH_RT.
• Adjusted R Square = .4234 or approximately .42. This says that 42% of the
variation in GDP/CAP is explained by the variation in POPULATN, LITERACY
and BIRTH_RT, taking into account the sample size of 109 and the 3
independent variables.
• The confidence intervals for the regression coefficients show how large the
population coefficients are likely to be. Specifically, we're 95% confident that the
"true" marginal effects on GDP/CAP of changes in POPULATN, LITERACY and
BIRTH_RT lie between -0.0116 and 0.0012, -81.4441 and 39.6878, and
-499.4613 and -254.3941 respectively.
• Note that zero lies within the POPULATN and LITERACY intervals, indicating
that the population regression coefficients could be zero (hence POPULATN
and LITERACY may have no effect on GDP/CAP). This agrees with our
individual slope evaluation for POPULATN and LITERACY.
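Two of the figures in this conclusion can be checked by hand. The sketch below recomputes Adjusted R Square from the rounded R Square of .4395 (so the last digit may differ slightly from the regression output) and tests which 95% confidence intervals contain zero. The lower bound for BIRTH_RT is taken as negative, an assumption of mine that is consistent with the reported coefficient of -376.9277 sitting at the midpoint of the interval.

```python
def adjusted_r_square(r_square: float, n: int, k: int) -> float:
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    return 1 - (1 - r_square) * (n - 1) / (n - k - 1)


adj = adjusted_r_square(r_square=0.4395, n=109, k=3)
print(f"Adjusted R Square = {adj:.4f}")

# 95% confidence intervals (lower, upper) as quoted in the paper.
# Assumption: the BIRTH_RT lower bound is negative (see the note above).
intervals = {
    "POPULATN": (-0.0116, 0.0012),
    "LITERACY": (-81.4441, 39.6878),
    "BIRTH_RT": (-499.4613, -254.3941),
}

# An interval that contains zero means the coefficient could be zero.
contains_zero = {name: lo <= 0 <= hi for name, (lo, hi) in intervals.items()}
for name, has_zero in contains_zero.items():
    print(f"{name}: interval contains zero = {has_zero}")
```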