Download reg gnipc daysopen

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Least squares wikipedia , lookup

Linear regression wikipedia , lookup

Regression analysis wikipedia , lookup

Coefficient of determination wikipedia , lookup

Transcript
STATA APPLICATIONS
Task 1 last year- Computer assignment
The data set busind.dta contains information on Gross National Income
(GNI) per capita and the number of days to open a business and to
enforce a contract in a sample of 135 countries. It was extracted
from the “Doing Business” dataset, a dataset collected by the World
Bank based on expert opinions in each country. The variable gnipc
measures GNI per capita in thousand $. The variable daysopen
measures the average number of days needed to open a business in
that country, and daysenforce measures the average number of days
needed to enforce a given type of contract.
(i) Find the average GNI per capita and the average number of days to
open a business, and the average number of days to enforce a
contract.
Answer to question (i)
 Stata command:
use busind,clear
su daysenforce daysopen gnipc
Variable
Obs
Mean
daysenforce
daysopen
gnipc
129
135
135
352.9612
50.75556
6.56983
Std. Dev.
Min
Max
162.1636
38.42408
10.02707
27
2
.09
909
203
43.35
 (ii) In how many countries does it take on average less
than 5 days to open a business? What is the maximum
number of days to open a business in the dataset? In
which countries does it take more than 200 days to open a
business?
Answer to Question (ii)
 Stata command:
su daysopen if daysopen<=5
Variable
Obs
Mean Std. Dev.
Min
Max
daysopen
4
3.5 1.290994
2
5
list country if daysopen>=200
country
135.
Haiti
Question (iii)
 Estimate the following simple regression model:
gnipc   0  1daysopen  u
 Give a careful interpretation of estimates b1 and b0. Are
the signs what you expected them to be?
Answer to Question (iii)
 Stata commands:
reg gnipc daysopen
Source
SS
df
MS
Model
Residual
1766.98652
11705.6636
1 1766.98652
133 88.0125084
Total
13472.6501
134 100.542165
gnipc
Coef.
daysopen
_cons
-.0945063
11.36655
Std. Err.
.0210919
1.340889
t
-4.48
8.48
Number of obs
F( 1, 133)
Prob > F
R-squared
Adj R-squared
Root MSE
P>|t|
0.000
0.000
=
=
=
=
=
=
135
20.08
0.0000
0.1312
0.1246
9.3815
[95% Conf. Interval]
-.1362253
8.714322
-.0527873
14.01878
Question (iv)
 Question: What kind of factors are contained in u? Are
these likely to be correlated with the number of days
to open a business?
 Answer: Factors contained in u are factors that explain
the GNI par capita apart from the number of days to
open a business. You might be conscious that there are
many other factors, such as economic institutions,
education, savings, consumption, R&D… Some
factors are likely to be correlated with the number of
days to open a business, such as the quality of
economic institutions.
Question (v)
 Question: What is according to this model the predicted
.
income for a country where it takes 5 days to open a
business? And the predicted income for a country where it
takes 200 days to open a business? Show how you can
calculate the answers by hand (once you have obtained the
estimation results). Do the obtained levels of income seem
reasonable? Explain.
Answer to Question (v)
 You can compute predicted values for the dependent
variable in two ways: by “displaying” gniˆpc  ˆ0  ˆ1* daysopen
when daysopen=5 and daysopen=200
Stata commands:
display _b[daysopen]*5+_b[_cons]
10.894018
display _b[daysopen]*200+_b[_cons]
-7.5347099
Answer to question (v)
 or by generating the fitted value of the dependent
variable :
reg gnipc daysopen
predict gnipc_hat
. list gnipc_hat if
daysopen==5
gnipc_~t
4.
10.89402
. list gnipc_hat if
daysopen==200
A problem arises with this second method as there is no
observation with daysopen=200, so that it is impossible
to get the value of gnipc_hat for daysopen=200.
 To illustrate our fitted values, we can draw the OLS regressio
-10
0
10
20
30
40
line:
scatter gnipc daysopen||lfit gnipc daysopen
0
50
100
150
number of days to open a business
gross national income ('000 US $)
Fitted values
200
Question (vi)
Estimate the following simple regression model
and give a careful interpretation of 1.
gnipc  0  1daysenforce  u
Answer to Question (vi)
 Stata command:
reg gnipc daysenforce
Source
SS
df
MS
Model
Residual
2768.61703
10372.442
1 2768.61703
127 81.6727718
Total
13141.0591
128 102.664524
gnipc
Coef.
daysenforce
_cons
-.0286796
16.66315
Std. Err.
.0049258
1.912056
t
-5.82
8.71
Number of obs
F( 1, 127)
Prob > F
R-squared
Adj R-squared
Root MSE
P>|t|
0.000
0.000
=
=
=
=
=
=
129
33.90
0.0000
0.2107
0.2045
9.0373
[95% Conf. Interval]
-.0384269
12.87954
-.0189322
20.44676
Question (viii)
. reg lngnipc daysopen
Source
SS
df
MS
Model
Residual
42.8638154
311.155695
1 42.8638154
133 2.3395165
Total
354.01951
134 2.64193664
lngnipc
Coef.
daysopen
_cons
-.0147194
1.4396
Std. Err.
.0034388
.218617
t
-4.28
6.59
Number of obs
F( 1, 133)
Prob > F
R-squared
Adj R-squared
Root MSE
P>|t|
0.000
0.000
=
=
=
=
=
=
135
18.32
0.0000
0.1211
0.1145
1.5295
[95% Conf. Interval]
-.0215212
1.007184
-.0079176
1.872016
Question (vii)
 Comparing the estimates of the models in (iii) and (v), which one
explains more of the variation in income per capita across countries.
Can you infer whether the duration to open a business or the duration
for enforcing contracts is more strongly correlated with income per
capita?
 Answer: How much of the variation of GNI per capita (y) is explained
by an independent variable is given by the R2. The greater the R2, the
more variation of y is explained by x. The R2 of the regression of GNI
per capita on the number of days to open a business is about 13% and
the R2 of the regression of GNI per capita on the number of days to
enforce a contract 21%. That means that this variable explains more of
the variation of the gni per capita than the former. It means that the
duration for enforcing contract is more strongly correlated with
income per capita than the number of days to open a business. Here,
the correlation between gnipc and daysenforce is equal to -0.46 and the
correlation between gnipc and daysopen is equal to -0.36.
Question (viii)
 Estimate the following simple regression model
and give a careful interpretation of 1.
log( gnipc )   0  1daysopen  u
Answer to Question (viii)
 Stata commands:
gen lngnipc=ln(gnipc)
reg lngnipc daysopen
 Do these results allow you to draw conclusions
regarding the desirability of policies aimed at reducing
the number of days for opening a business in certain
developing countries?
 The dataset contains 135 countries, and hence does
not contain information about all the countries in the
world. Do you think one should account for that
when interpreting the regression results. Why?
Task 2 last year- Computer exercise
 The dataset nepalind.dta contains data from 706 children of
15 years old in Nepal. The data come from the 2003 Nepal
Living Standard Survey (NLSS) Living Standard
Measurement Survey (LSMS). We want to analyze this data
to understand the number of years of education. Illiteracy
and low levels of education are a major concern in Nepal, so
it would be good to know which type of factors could be
explaining education of the present generation, to know
what type of policies to implement. The dataset has some
information on household characteristics and
characteristics of the child, and of the household head.
The NLSS is a LSMS-type survey, which are country-wide representative
surveys that statistical offices in developing countries conduct with the
support of the World Bank to determine poverty levels, determinants
of poverty, etc. See www.worldbank.org/lsms for more info.
Question 1
 Write a paragraph describing the dataset
using the standard descriptive statistics (also
called summary statistics, or “D-stats”). Add a
table with the d-stats.
. su
Variable
Obs
Mean
r2_sex
r2_healths~t
nrchild
nractad
nrold
706
655
706
706
706
1.473088
1.309924
3.478754
3.002833
.3427762
head_age
head_educ
value_jewe~y
distschool
r2_supown
706
706
706
656
706
educ
706
Std. Dev.
Min
Max
.4996292
.4726228
1.739072
1.487534
.6291367
1
1
1
0
0
2
3
14
13
5
46.56232
2.827195
13984.75
.2925051
.7404253
10.56877
4.070464
26725.61
.3067064
1.053909
15
0
0
.0166667
0
85
14
400000
2.5
9.811792
5.51983
3.565441
0
16
Question (1)
Child characteristics
Male (%)
52
Health status (%)
Good
69.5
Fair
30
Poor
0.5
Years of education
5.5 (3.6)
Question (1)
Household characteristics
Number of household members
6.8 (2.73)
under 18 years old
3.5 (1.74)
between 18 and 59
3.0 (1.49)
60 or older
0.3 (0.63)
Age of the head
46 (10.5)
Education of the head
2.8 (4.1)
Land owned (in ha)
0.74 (1.05)
Value of jewelries (in rupees)
13985 (26726)
Distance to school (in hours)
0.29 (0.31)
Number of observations
Standard errors into parenthesis
706
Question 2: Show the distribution of the different values of
years of education in the dataset. Drop the variables that
have values higher than 10. Explain why that might be a
smart thing to do, before doing any regression analysis.
. hist educ,discrete
(start=0, width=1)
 Question (3): Specify a model that allows explaining
the number of years of education as a function of
father’s age, the number of active adults (between 18
and 60 years old) and the number of elderly (60 or
older) and all other variables you think are interesting
and appropriate.
Make sure only to include variables that are exogenous
and discuss why the variables you include can be
considered exogenous. Estimate the model and give a
careful interpretation of each of the coefficients (sign,
size, and significance!). Do you find any of your results
counterintuitive?
Tips to answer question (3)
 Each variable that you add into the model must be related to
educ in some way, and should not violate the ZCM
assumption=>they must be exogenous=>ask yourself:
 x caused by y? i.e. possibility of reverse causality?
 One third factor determines both x and y? in this case
correlation is not causation, and x is not exogenous.
 u and x related for some other reason?
 Gender? Head´s age? Nb of active adults? Number
of elderly? Head´s education? Land owned?
distance to school? Value jewelry? Nb of children?
Health?
A reasonable model to estimate:
educ   0  1headage   2 headeduc  3nractad   4 nrold  5 sup own   6 dist   7 female  u
 Expected signs of coefficients? Argue.
. do "C:\Users\Yaya\AppData\Local\Temp\STD03000000.tmp"
. drop if educ>10
(9 observations deleted)
. ge male= r2_sex==1
. reg
//we create a dummy, =1 if r2_sex equals 1
educ head_age head_educ nractad nrold r2_supown distschool male
Source
SS
df
MS
Model
Residual
1463.51225
6628.93151
7
641
209.073179
10.3415468
Total
8092.44376
648
12.4883391
educ
Coef.
head_age
head_educ
nractad
nrold
r2_supown
distschool
male
_cons
.033665
.3317472
-.093363
.1106236
.3162876
-.5371565
1.061818
2.509268
Std. Err.
.0135985
.0339146
.0904375
.2244013
.1255367
.418622
.2538363
.6762122
t
2.48
9.78
-1.03
0.49
2.52
-1.28
4.18
3.71
Number of obs
F( 7,
641)
Prob > F
R-squared
Adj R-squared
Root MSE
P>|t|
0.014
0.000
0.302
0.622
0.012
0.200
0.000
0.000
=
=
=
=
=
=
649
20.22
0.0000
0.1808
0.1719
3.2158
[95% Conf. Interval]
.0069621
.2651501
-.2709525
-.3300268
.0697746
-1.359193
.5633668
1.181409
.0603679
.3983444
.0842265
.5512741
.5628006
.2848797
1.560269
3.837127
Question 4:What is the minimum significance level at which one can
reject that hypothesis that age of the household head does not affect
education levels?
 The p-value gives the smallest significant level at which an
hypothesis H0 can be rejected. In other words, a low p-value
indicates that the tested hypothesis is unlikely. The minimum
significance level at which one can reject the hypothesis that the
age of the household head does not affect education levels is
given by the p-value of the test β1 =0. Then, one can directly
read on the stata output that this minimum significance level is
1.4%.
Question (5)
 Do your results allow you to conclude that the
effects of the number of active adults in the
household is different than the effect of
elderly? State the null hypothesis and the
alternative hypothesis you are testing, and the
significance level you are considering. Does
your answer differ depending on which
significance level you consider?
Answer to Question (5)
 Need to test null hypothesis: H0: β3=β4 against H1:β3≠β4
 You just need command "test".
. test nractad = nrold
( 1)
nractad - nrold = 0
F(
1,
641) =
Prob > F =
0.72
0.3965
Question (6)
 Test whether the characteristics of the household head are
jointly significant. Show how to do this in stata, and calculate
the test by hand in 2 different ways. What can you conclude
about the role of household head characteristics on
education of the children?
Answer to Question (6)
. test (head_age=0)(head_educ=0)
( 1)
( 2)
head_age = 0
head_educ = 0
F(
2,
641) =
Prob > F =
47.96
0.0000
Question 6: compute F-test
 Run the unrestricted and restricted models, and compute either
SSR or R2 form of the F-statistic.
( Rur2  Rr2 ) / q
F
(1  Rur2 ) /( n  k  1)
 reg educ head_age head_educ nractad nrold r2_supown distschool male
scalar r2_ur=e(r2)
scalar df=e(df_r)
reg educ nractad nrold r2_supown distschool male
scalar r2_r=e(r2)
. dis ((r2_ur-r2_r)/2)/((1-r2_ur)/df)
47.961109
Question (9)
Non missing
Missing
53
48
Good
69.5
74
Fair
30
26
Poor
0.5
.
Years of education
5.3
6.8
Child characteristics
Male (%)
Health status (%)
Question (9)
Non missing
Missing
6.9
6.1
under 18 years old
3.5
2.8
between 18 and 59
3.0
2.9
60 or older
0.3
0.4
Age of the head
46.5
48.2
Education of the head
2.6
5.5
Land owned (in ha)
0.77
0.29
12212
35488
Distance to school (in hours)
0.29
.
Number of observations
600
46
Household characteristics
Number of household members
Value of jewelries (in rupees)