Exercises for Chapter 1 of Vinod’s
“HANDS-ON INTERMEDIATE
ECONOMETRICS USING R”
H. D. Vinod Professor of Economics,
Fordham University, Bronx, New York 10458
Abstract
These are exercises to accompany the above-mentioned book. The book’s URL is http://www.worldscibooks.com/economics/6895.html. Not all of the following exercises are suggested by H. D. Vinod
(HDV) himself. Some are suggested by his students, whose names are
Clifford D. Goss (CDG), Adam R. Bragar (ARB), Jennifer W. Murray (JWM), and Steven Carlsen (SC). The initials are attached to the
individual exercises to identify persons suggesting the exercise and the
answers or hints. Many R outputs are suppressed for brevity. This
set focuses on Chapter 1 of the text.
0 Exercises Regarding Basics of R
0.1 Exercise (R basics, data entry, basic stats)
HDV-1) Define a vector x with elements 2, 7, 11, 3, 7, 222, 34. Find the
mean, median and (the discrete) mode of these data.
#ANSWER
x=c(2, 7, 11, 3, 7, 222, 34) #c is needed by R before numbers
# Minimum, 1st Quartile, Median, Mean, 3rd Qu., Maximum
#are given by the summary function.
summary(x)
#the number 7 is most frequent and hence the mode is at 7
#to get R to show such discrete mode use table function
table(x) #for discrete mode
HDV-2) Are the data positively or negatively skewed? (Hint: use ‘basicStats’ from the package ‘fBasics’, [9] and note the sign of the skewness coefficient).
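A minimal answer sketch (assuming the ‘fBasics’ package is installed):
library(fBasics)
x=c(2, 7, 11, 3, 7, 222, 34)
basicStats(x) #look at the sign of the Skewness entry
skewness(x) #positive here, so the data are positively skewed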
HDV-3) Make the sixth element of x ‘missing’ by replacing the current value with the notation ‘NA’. How would you use the R mean and median commands to automatically ignore the missing data? (Hint: use na.rm=T)
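One possible answer sketch:
x=c(2, 7, 11, 3, 7, 222, 34)
x[6]=NA #the sixth element (222) is now missing
mean(x) #returns NA
mean(x, na.rm=T); median(x, na.rm=T) #the missing value is ignored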
HDV-4) Use the scale function in R to convert Fahrenheit to Celsius. (Hint:
scale(212, center=32, scale=9/5) should give 100 degrees Celsius)
0.2 Exercise (read.table in R)
HDV-5) Create a directory called data in the C drive of your computer. Place the following data in that directory as ”c:/data/xyz.txt”. The data set is short. It should be saved as a simple text file using the ”save as ..” option.
x y z
1 2 3
4 5 7
8 9 10
7 11 11
6 5 10
Use a suitable R command to read and analyze the data. (Hint: use ‘read.table’)
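A possible answer sketch, assuming the file was saved with the header line x y z:
xyz=read.table("c:/data/xyz.txt", header=TRUE)
summary(xyz) #basic statistics for each of the three columns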
0.3 Exercise (Subsetting, logical & or | in R)
HDV-15) Explain the distinction between parentheses and brackets in R, showing how brackets are used for subsetting. Set the seed to 25 and use a random sample of the integers from 1 to 2000. Indicate which (if any) locations have numbers exceeding 178 and at the same time less than 181. How many numbers will be included if we want numbers exceeding 178 or less than 181?
set.seed(25); x=sample(1:2000); which(x>178&x<181)
n1=which(x>178&x<181)#define n1 vector of locations
x[n1]#values of x at those locations
x[x>178 & x<181]#using brackets for subsetting
#note that x is repeated inside brackets, logical & is used.
n2=which(x>178|x<181) #logical and replaced by logical or |
length(n2)
R produces the following output:
> set.seed(25); x=sample(1:2000); which(x>178&x<181)
[1] 744 1923
> n1=which(x>178&x<181)#define n1 vector of locations
> x[n1]#values of x at those locations
[1] 180 179
> x[x>178 & x<181]#using brackets for subsetting
[1] 180 179
> n2=which(x>178|x<181) #logical and replaced by logical or |
> length(n2)
[1] 2000
The output shows that only two numbers are inside the open interval (178,
181). These numbers have locations 744 and 1923, respectively, in the 2000×1
vector array called ‘x.’ Note that if we use greater than 178 or less than 181,
then all 2000 numbers satisfy this condition.
0.4 Exercise (Apple-Pie Sales regression example)
HDV-6) Download the following file from my website:
http://www.fordham.edu/economics/vinod/R-piesales.txt
into your own data directory and copy and paste various commands. Learn
what these commands do for a fully worked out regression example involving
a regression of Apple pie sales.
0.5 Exercise (R data manipulations)
HDV-7) Load the dataset called ‘Angell’ from the package ‘car’ [3], summarize the data, and regress moral (Moral Integration: composite of crime rate and welfare expenditures) on hetero (Ethnic Heterogeneity: from percentages of nonwhite and foreign-born white residents), mobility (Geographic Mobility: from percentages of residents moving into and out of the city), and region in the US.
HDV-8) Use the ‘attributes’ function in R to determine which variable is
categorical or factor-type.
HDV-9) Use multivariate analysis of variance to study how hetero and mobility are affected by moral and region.
HDV-10) Use the aggregate function of R to find categorical means by region. Comment on the low mean of the ‘moral’ variable in the South and the low variance of ‘mobility’ in the Eastern region.
#Answer Hints:
library(car);data(Angell);attach(Angell);summary(Angell)
#note that all variables are ratio-type except region
attributes(region)
reg1=lm(moral~hetero+mobility+region, data=Angell);
summary(reg1)
manova(cbind(hetero,mobility)~moral+region)
aggregate(cbind(hetero,mobility,moral), by=list(region), mean)
0.6 Exercise (Regression, DW, qq.plot)
HDV-11) Set the seed at 34 and create a sample from the set of integers from 2 to 46 and place it in a 15 by 3 matrix called yxz. Make y, x and z the names of the three columns. What is the p-value for the coefficient of z in a regression of y on x and z? What does it suggest? What is the p-value of a fourth-order Durbin-Watson serial correlation test? What does it suggest? Use the ‘qq.plot’ command to decide whether the regression errors are close to Student’s t.
#Answer Code
library(car)
set.seed(34);yxz=matrix(sample(2:46), 15,3);
y=yxz[,1]; x=yxz[,2];z=yxz[,3]
reg1=lm(y~x+z); su1=summary(reg1);su1$coeff
p4z=su1$coef[3,4];p4z
#Since p4z >0.05 z is insignificant regressor
durbin.watson(reg1, max.lag=4)
#large p value means accept the null hypothesis
[Figure 1 appears here: a quantile-quantile plot of resid(reg1) against norm quantiles, with a dashed confidence band.]
Figure 1: Since all observed points (circles) lie within the confidence band
(dashed lines), do not reject normality of residuals.
#(H0=no serial correlation of order 4)
qq.plot(resid(reg1))
#residuals are inside the confidence bounds, hence good
For more information about the use of quantile-quantile plots to assess normality, see
http://en.wikipedia.org/wiki/Q-Q_plot
0.7 Exercise (histogram)
HDV-12) Show the use of the hist command by using the built-in ‘co2’ data.
hist(co2)
Exercises Created by Clifford D. Goss (CDG)
0.8 Exercise (Simple Regression)
CDG-1) It is important in Econometrics to understand the basic long-hand formulas of regression analysis. What is the equation of a regression line? What is the purpose of a regression line? What is the difference between simple regression and multiple regression? When we write the regression equation y = b0 + b1 x + ε, the purpose is to summarize a relationship between two variables x and y.
ANSWER: Simple regression has only one independent variable, while multiple regression has more than one regressor (e.g. x1 and x2).
0.9 Exercise (Regression inference basics)
CDG-2) What are two inferential techniques to determine how accurate the
sample estimates b0 and b1 will be? How are they related?
ANSWER: Two inferential techniques are Confidence Interval Estimation
and Hypothesis Testing. See
http://en.wikipedia.org/wiki/Confidence_interval
0.10 Exercise (Error term)
CDG-3) A simple linear model will include an Error term. Why is it included
and what assumptions are made regarding its nature? What is a short-hand
notation in writing these assumptions about the error term?
ANSWER: The Error term represents the variation in y not accounted for by the linear regression model. It is incorporated in the model because no real set of data is exactly linear and no model is error free. Assumptions made for the error term εt include that it is normally distributed, has a mean of zero, and has variance σ² for each t. The variance covariance matrix is proportional to the identity matrix. The short-hand notation is εt ∼ N(0, σ²).
0.11 Exercise (Intercept and slope)
CDG-4) What does b1 in the simple linear regression equation represent? How is it calculated? What does b0 represent? How is it calculated?
ANSWER: Let cov(x, y) = E(x−x̄)(y−ȳ) and var(x) = E(x−x̄)². b1 represents the slope of the regression line. It is calculated as the covariance of (y, x) divided by the variance of x. Its formula then is b1 = cov(x, y)/var(x). b0 is called the intercept. It is calculated as ȳ − b1 x̄. According to these formulas, the intercept cannot be calculated without first calculating the slope.
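A quick numerical check of these formulas with made-up data (the variables here are illustrative, not from the text):
set.seed(1)
x=runif(20); y=2+3*x+rnorm(20) #artificial data
b1=cov(x,y)/var(x) #slope from the formula
b0=mean(y)-b1*mean(x) #intercept from the formula
c(b0, b1); coef(lm(y~x)) #the two sets of numbers agree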
0.12 Exercise (Sampling distribution)
CDG-5) What does the term “Sampling Distribution of a Statistic” refer to?
ANSWER: The sampling distribution of a statistic refers to the probability distribution of the statistic over the sample space defined by all possible samples. For instance, the sampling distribution of the mean is the probability distribution of all possible means computed from all possible samples taken from the given parent population.
0.13 Exercise (Correlation coefficient)
CDG-6) What is the sample correlation coefficient between x and y? Derive
its formula and indicate what its range is.
ANSWER: Recall the covariance defined above. The correlation coefficient is
r = cov(x, y) / √[var(x) var(y)].   (1)
Its range is: −1 ≤ r ≤ +1.
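A quick numerical check of the formula with made-up data:
set.seed(2)
x=rnorm(30); y=x+rnorm(30) #artificial data
c(cov(x,y)/sqrt(var(x)*var(y)), cor(x,y)) #both entries agree and lie in [-1, +1]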
0.14 Exercise (t test)
CDG-7) What is the purpose and formula of the t-Test? What do “degrees
of freedom” represent?
ANSWER: The t-test helps determine acceptance or rejection of the null hypothesis that the true unknown regression coefficient β1 is zero. Roughly speaking, a t-statistic of about 2 or more in absolute value means the regression coefficient is statistically significantly different from zero. In R, Student’s t distribution is available without having to look up t tables. Use the command help(qt) to get the details. See the following URL for further details.
http://en.wikipedia.org/wiki/Student’s_t-test
0.15 Exercise (ANOVA)
CDG-8) What is the total sum of squares? Describe its role in analysis of variance.
ANSWER: See http://en.wikipedia.org/wiki/Total_sum_of_squares
http://en.wikipedia.org/wiki/Analysis_of_variance
0.16 Exercise (p-value)
CDG-9) What does the p-value represent? What is its range? How do we
know if we should accept or reject our null hypothesis by using the P-value?
ANSWER: Roughly speaking, small p-value (e.g. p-val< 0.05) means
reject the null. See
http://en.wikipedia.org/wiki/P-value
0.17 Exercise (DW)
CDG-10) What is the role of the Durbin-Watson (DW) statistic in regression?
Which package allows one to compute DW in R?
ANSWER: Roughly speaking, DW close to 2 means no problem of autocorrelated errors in a regression. Use the package ‘car’ to compute the DW
statistic. For theory, see
http://en.wikipedia.org/wiki/Durbin%E2%80%93Watson_statistic
0.18 Exercise (GMT)
CDG-11) What is the Gauss-Markov Theorem?
ANSWER: Roughly speaking, the GM theorem means least squares estimator is best linear unbiased (BLUE) without assuming normality of errors.
It is enough to assume that the covariance matrix of regression errors is
proportional to the identity matrix. For details see:
http://en.wikipedia.org/wiki/Gauss%E2%80%93Markov_theorem
0.19 Exercise (skewness and kurtosis)
CDG-12) Discuss skewness and kurtosis as concepts
ANSWER: See http://en.wikipedia.org/wiki/Kurtosis
http://en.wikipedia.org/wiki/Skewness
0.20 Exercise (power of a test)
CDG-13)
What is a statistical power function? Write the R commands for a one-sided and a two-sided t-test using the Type I error α = 0.05, 2 degrees of freedom, and various non-centrality values. Plot the power function.
Answer: See for the theory
http://en.wikipedia.org/wiki/Statistical_power
Power is probability of rejecting a hypothesis when it is false and should
be rejected. When the true value equals the hypothesized value of zero,
the probability of rejecting the null is set at the alpha level of Type I error
(=0.05). When we know the probability distribution of a test statistic under the alternative hypothesis, then we can write the power function as a
function of the alternative parameter value. If we are considering the t test,
the probability distribution under the alternative is noncentral t. In R the
function power.t.test computes these probabilities. However when the true
value is zero, R computes the power to be 0.025 instead of 0.05, and seems to
be incorrect. Two sided tests do have a lower power curve than corresponding one-sided tests. Thus, if the direction of alternative hypothesis (side) is
known, it is advisable to use a one-sided test.
x=seq(-.2,.2,by=.01)
pw=power.t.test(delta = x, n=1000, typ="one.sample",
alternative = "one.sided")
[Figure 2 appears here: “Power function for two-sided t test”, plotting Prob of Type II error against the parameter value under the alternative hypothesis.]
Figure 2: As non-centrality (true value of statistic) increases the t test for
the null of zero true value becomes more and more reliable (powerful).
plot(x,pw$p, typ="l", main="Power function for one-sided t test",
ylab=" prob of Type II error")
pw=power.t.test(delta = x, n=1000, typ="one.sample",
alternative = "two.sided")
plot(x,pw$p, typ="l", main="Power function for two-sided t test",
ylab="Prob of Type II error",
xlab="Parameter value alternative hypothesis")
#ALTERNATIVE formulation
#power of one-sided t test works
t.power=function(alph,df,noncen){crit=qt(1-alph,df)
power=1-pt(crit,df,ncp=noncen);return(power)} #function ends here
x=seq(-5,5,by=.2); df=10; alph=0.05; y=t.power(alph,df,x);
plot(x,y,main="One-sided t test Power Curve", typ="l")
t2.power=function(alph,df,noncen){
al2=alph/2
crit=qt(1-al2,df)
power=1-pt(crit,df,ncp=noncen);return(power)} #function ends here
x=seq(-5,5,by=.2); df=10; alph=0.05; y=t2.power(alph,df,abs(x));
plot(x,y,main="Two-sided t test Power Curve", typ="l",
ylab="Prob of Type II error",
xlab="Parameter value alternative hypothesis")
0.21 Exercise (Production function)
CDG-14) What is a production function? What is the Cobb-Douglas production function? Derive the slope of the Isoquant assuming two standard
inputs, Labor and Capital. What is Elasticity of Substitution (EOS)? What
is Output Elasticity? For a Cobb-Douglas production function, what is the
EOS and Output Elasticity? What measures economies of scale?
ANSWER: Check the index in Vinod’s text. The answers are in Chapter 1.
HDV-16) Estimate the Nerlove-Ringstad variant of the Cobb-Douglas production function for the Metal data from your textbook, available in the ‘Ecdat’ package.
[Hint: use Zellner-Ryu (1998), Journal of Applied Econometrics, vol. 13, pages 101-127. It is instructive to use many more of the functional forms discussed there.]
rm(list=ls()) #clean up R memory
library(Ecdat); #into current memory of R
data(Metal)#pulls the Metal data into memory of R
names(Metal)
summary(Metal)
met=as.matrix(Metal)
Ly=log(met[,1])#pull first column of met, take log, define Ly
LL=log(met[,2])#pull second col. of met, take log, define LL
LK=log(met[,3])#pull third col. of met, take log, define LK
reg=lm(Ly~ I(Ly^2)+LK+LL); summary(reg)
0.22 Exercise (Heteroscedasticity)
CDG-15) What is Heteroscedasticity? What type of problem does Heteroscedasticity cause? What is another way to refer to Heteroscedasticity, and what function in R is available to test for Heteroscedasticity?
Answer: See http://en.wikipedia.org/wiki/Heteroskedasticity for theory. The package ‘car’ has tests, and its function ‘hccm’ computes a heteroscedasticity consistent covariance matrix. Use sqrt of its diagonal to get the corrected standard errors.
The package called ‘lmtest’ [11] has a function ‘bptest’ for the Breusch-Pagan test and ‘gqtest’ for the Goldfeld-Quandt test. The function ‘vcovHC’ gives a variance covariance matrix adjusted for heteroscedasticity.
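A minimal sketch of these tests, assuming ‘reg1’ is some previously fitted lm() model (the name is only illustrative):
library(lmtest); library(car)
bptest(reg1) #studentized Breusch-Pagan test
gqtest(reg1) #Goldfeld-Quandt test
ncvTest(reg1) #non-constant variance score test from 'car'
sqrt(diag(hccm(reg1))) #heteroscedasticity consistent standard errors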
1 Chapter 1 and Using R for a Hands-On Study of Matrix Algebra
1.1 Exercises created by Steven Carlsen
The exercises below are intended to demonstrate some of the basic, but important, properties of matrix algebra. Note that they are examples of these
properties, not proofs. Please consult [7], Vinod’s companion book for a detailed study of matrix algebra using R. http://www.worldscibooks.com/
mathematics/7814.html
SC) The exercises in this set generally use a set of three matrices denoted
A, B and C throughout. Because of the computing power and accuracy of R,
the matrices will all have dimensions of five rows by five columns (larger than we could easily work by hand, but not so large as to make
comparisons difficult). Note that the output from R is often not included for
brevity.
In many of these exercises, we could demonstrate equality (or inequality)
by simply comparing the results on either side of an equation. An equivalent
way of showing that two sides of an equation are equal (or not equal) is to
show that one side MINUS the other side is equal to zero (or not equal to
zero). We will generally do both below.
For notation purposes, let’s use XpY as the name of the matrix formed
by adding X and Y (p will stand for ”plus”), let’s use XmY as the name of
the matrix formed by subtracting Y from X (m will stand for “minus”), let’s
use XY as the name of the matrix formed by post-multiplying X by Y and
let’s use Xt as the name of the transpose of matrix X.
Start by creating our three matrices, using random numbers rounded off to whole integers. Note that by using different seed numbers we create different matrices, even though we selected the random numbers from within the same range (1, 50).
set.seed(1)
A=round(matrix(runif(25,1,50),5,5),0)
set.seed(2)
B=round(matrix(runif(25,1,50),5,5),0)
set.seed(3)
C=round(matrix(runif(25,1,50),5,5),0)
A;B;C #outputs omitted for brevity
ApB = A+B; ApB
BpA = B+A; BpA
ApBmBpA = ApB - BpA; ApBmBpA
#outputs omitted for brevity
Note that ApB is identical to BpA and that ApBmBpA (which is equal to
(A+B)-(B+A)) confirms the equality by having all zero elements.
Verify that AB is not identical to BA and that subtracting BA from AB
produces a non-zero matrix.
AB=A %*%B; BA=B%*%A; AB-BA
Verify that (A’)’ = A. Verify that (AB)’ = B’A’.
At = t(A); At;
Att = t(At); Att; A-Att;
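A possible check of the second identity, using the matrices created above:
AB=A %*% B
max(abs(t(AB) - t(B)%*%t(A))) #zero, so (AB)' = B'A'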
Verify that determinant of A+B is not the sum of two determinants.
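One way to check this, using the A and B created above:
det(A+B); det(A)+det(B) #the two numbers differ in general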
Verify the following: If matrix D is obtained by interchanging a pair of rows (or columns) of matrix A, then det(D) = -det(A).
A1 = A[1,]
A2 = A[2,]
A3 = A[3,]
A4 = A[4,]
A5 = A[5,]
D = rbind(A2,A1,A3,A4,A5)
det(D)
#[1] 18606520
det(A)
#[1] -18606520
#Note that det(D) = -det(A)
Verify the following: If matrix E is obtained by multiplying a row (or column) of matrix A by a constant k, then det(E) = k det(A).
A1 = A[1,]
A2 = A[2,]
A3 = A[3,]
A4 = A[4,]
A5 = A[5,]
A1k = 2*A1
E = rbind(A1k,A2,A3,A4,A5)
det(E)
#[1] -37213040
2*det(A)
#[1] -37213040
#Note that det(E) = 2*det(A)
Verify the following: The determinant of a diagonal matrix is equal to the
product of the diagonal elements. Also that the determinant of a matrix is
equal to the product of its eigenvalues.
set.seed(40);d=sample(4:40)[1:5];d
D=diag(d) #this is how you create a diagonal matrix from vector in R
det(D)
cumprod(d) #cumulative products; the last entry is the product of all elements of d
ei=eigen(D)$val; ei
cumprod(ei) #the last entry equals the product of the eigenvalues, i.e. det(D)
1.2 Exercise (Create collinear data and study it)
HDV-13) Set the seed at 345 and create a sample from the set of integers from 2 to 17 and place it in a 4 by 4 matrix called A. Note that the determinant is negative even though all the numbers are positive. If B is obtained from A by multiplying the elements of the second row by the constant k=5, then det(B) = k det(A). Use the second row and the third column for verifying the above.
How is this property of determinants different from that of matrices? The matrix multiplication kA means multiplying each element of A by k. Use k=5 to check this.
Let n denote the number of rows. Show by using the above example that det(kA) = k^n det(A), whence det(-A) = (−1)^n det(A), by choosing k = -1.
set.seed(345);s1=sample(2:17);A=matrix(s1,4,4);A
det(A)
B=A #initialize multiply second row by 5
B[2,]=5*A[2,]; det(B); det(B)/5
(-1)^4; det(-A)
1.3 Exercise (Quadratic form)
HDV-14) What is a quadratic form. How can matrix algebra be used to
represent a quadratic form?
ANSWER: See an example of a matrix algebra quadratic form
http://en.wikipedia.org/wiki/Quadratic_form
where the quadratic form is ax² + bxy + cy² = v′Mv, where v is a 2 × 1 vector with elements (x, y) and where M is a 2 × 2 symmetric matrix with a and c along the diagonal and b/2 in the off-diagonal positions.
Expressions for expectations and variances and covariances involving quadratic forms are available at
http://en.wikipedia.org/wiki/Quadratic_form_(statistics)
where it is shown how the residual sum of squares can be written as a quadratic form RSS = y′(I − H)′(I − H)y (identity minus hat matrix). Under normality of regression errors, RSS is distributed as a Chi-square random variable with degrees of freedom k = trace[(I − H)′(I − H)]. If E(Hy) ≠ µ due to bias, the Chi-square variable is noncentral with a nonzero noncentrality parameter λ = 0.5 µ′(I − H)′(I − H)µ.
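A small numerical illustration of the identity ax² + bxy + cy² = v′Mv, with hypothetical values a=2, b=4, c=3 and (x, y) = (1, 2):
a=2; b=4; c=3; x=1; y=2
M=matrix(c(a, b/2, b/2, c), 2, 2) #symmetric 2 x 2 matrix
v=c(x, y)
c(a*x^2+b*x*y+c*y^2, t(v)%*%M%*%v) #both equal 22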
The following exercises are contributed by Chris Finley (CJF).
1.4 Exercise (Eigenvalues-eigenvectors)
CJF-1) Using a seed of 30 and numbers from 13 to 16 create a square matrix
X. What are the eigenvalue and eigenvectors?
set.seed(30); X=matrix(sample(13:16), 2); eigen(X)
R produces the following output:
$values
[1] 29 -1
$vectors
           [,1]       [,2]
[1,] -0.7071068 -0.7525767
[2,] -0.7071068  0.6585046
Using a seed of 30 and numbers from 1 to 25 create a square matrix. What
is the 3rd diagonal number?
set.seed(30)
X=matrix(sample(1:25),5)
X[3,3] # 8
1.5 Exercise (Pie-chart, histogram)
CJF-2) Using the house prices data in the AER library, determine the number
of properties that have a driveway. Using R, determine the percentage of
properties that do not have a driveway. Show using a pie chart. Create a
histogram of the number of stories.
library(AER)
data(HousePrices)
attach(HousePrices)
summary(driveway)
R produces the following output:
no yes
77 469
Now some R code to tabulate the data.
tab=table(driveway)
prop.table(tab)
R produces the following output:
driveway
       no       yes
0.1410256 0.8589744
Now the code for plotting.
par(mfrow=c(2,1)) #two graphs in one
pie(tab, main="Houses with Driveways")
hist(stories)
par(mfrow=c(1,1)) #reset
Using the house prices data in the AER library, find the correlation coefficient
and Spearman’s ρ. Comment on the result.
[Figure appears here: a pie chart titled “Houses with Driveways” (no/yes) above a histogram of stories.]
library(AER); data(HousePrices); attach(HousePrices)
cor(bedrooms, bathrooms)
R produces the following output:
[1] 0.3737688
cor(bedrooms, bathrooms, method = "spearman")
[1] 0.3769018
Both measures are very close and indicate very little correlation. For further
comments on how rank correlation is non-parametric measure see:
http://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient
Using the house prices data in the AER library, regress price on bathrooms. What are the 95% and 99% confidence intervals?
reg1=lm(price~bathrooms, data=HousePrices)
confint(reg1)
R produces the following output:
                2.5 %    97.5 %
(Intercept) 27502.25  38085.84
bathrooms   23642.71  31311.26
confint(reg1, level=0.99)
                0.5 %    99.5 %
(Intercept) 25830.49  39757.60
bathrooms   22431.41  32522.56
Using the house prices data in the AER library, separately regress price on lotsize and price on bedrooms. Find the R2 of each and comment.
rm(list=ls())
library(AER)
data(HousePrices)
reg1=lm(price~lotsize, data=HousePrices)
summary(reg1)$adj.r.squared # 0.287077
reg2=lm(price~bedrooms, data=HousePrices)
summary(reg2)$r.squared # 0.1342837
28% of the variance in price is explained by lotsize while only 13% of the
variance is explained by bedrooms.
Using the HousePrices data, regress price on lotsize, but include an additional squared term lotsize^2, and plot price against lotsize.
plot(price~lotsize, data=HousePrices, xlab="Lot Size", pch=16)
reg3=lm(price~lotsize, data=HousePrices)
summary(reg2)$adj.r.squared # 0.1326923
abline(reg3)
reg3b=lm(price~lotsize+I(lotsize^2), data=HousePrices)
summary(reg3b)$adj.r.squared # 0.3205318 is higher
xval=pretty(HousePrices$lotsize, 50)
hat2=predict(reg3b, newdata=list(lotsize=xval))
lines(xval, hat2, col="red", lty=2, lwd=2)
The model having the quadratic term obviously fits the data better than the model without it. However, this is always the case in terms of the sum of squared residuals. Hence, the correct comparison is between adjusted R2 values. In this example, adding the quadratic term increases the adjusted R2 from 0.13 to 0.32.
1.6 Exercise (Tabulation, Box-plot)
CJF-3) Using the Parade2005 data from the AER package, [4] find the mean
wages conditional on gender and create a box plot.
library(AER)
data(Parade2005)
attach(Parade2005)
tapply(log(earnings), gender, mean)
#   female     male
# 11.11900 11.30194
plot(log(earnings)~gender)
#simple plot gives a box plot here since gender is
#a categorical (factor) variable
boxplot(log(earnings)~gender) #equivalent explicit box-plot command
[Figure 3 appears here: box plots of log(earnings) for female and male.]
Figure 3: Box Plot: Earnings and Gender.
1.7 Exercise (Basic stats)
CJF-4) Summarize the BondYield data in the AER library. What is the
kurtosis? What is the skewness? Does the skewness value make sense given
the values of mean and the median?
library(AER)
data(BondYield)
library(fBasics)
basicStats(BondYield)
R produces the following output:
             BondYield
nobs         60.000000
NAs           0.000000
Minimum       6.660000
Maximum       9.720000
1. Quartile   7.902500
3. Quartile   8.945000
Mean          8.290833
Median        8.300000
Sum         497.450000
SE Mean       0.104351
LCL Mean      8.082027
UCL Mean      8.499640
Variance      0.653354
Stdev         0.808303
Skewness     -0.234511
Kurtosis     -0.823370
Since the mean is very close to the median, it may seem that skewness
should be very close to 0. However, the official measure is not that close
to zero. The point is that the simple comparison of mean and median to
indicate skewness is merely a crude approximation.
1.8 Exercise (Model comparison)
CJF-5) Using the Guns data in the AER library, regress the following 2
models:
Violent = b0 + b1(prisoners) + b2(income) + b3(density) + ε
Violent = b0 + b1(prisoners) + b2(income) + b3(density) + b4(law) + ε
Compare the 2 models. Which model is better? How can you tell?
rm(list=ls())
library(AER)
data(Guns)
reg1=lm(violent~prisoners+income+density,data=Guns)
reg2=lm(violent~prisoners+income+density+law,data=Guns)
summary(reg1);summary(reg2)
R produces the following output:
Call:
lm(formula = violent ~ prisoners + income + density, data = Guns)
Residuals:
     Min       1Q   Median       3Q      Max
-1173.95  -145.04   -35.04   119.28   671.42

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.605e+02  3.539e+01   4.536 6.33e-06 ***
prisoners   8.525e-01  4.420e-02  19.288  < 2e-16 ***
income      8.435e-03  2.732e-03   3.088  0.00206 **
density     9.554e+01  5.510e+00  17.341  < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 210.5 on 1169 degrees of freedom
Multiple R-squared: 0.6044,    Adjusted R-squared: 0.6034
F-statistic: 595.3 on 3 and 1169 DF, p-value: < 2.2e-16
>summary(reg2)
OUTPUT by R
Call:
lm(formula = violent ~ prisoners + income + density + law, data = Guns)
Residuals:
     Min       1Q   Median       3Q      Max
-1225.51  -137.98   -39.65   118.48   777.04

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.882e+02  3.393e+01   5.546 3.61e-08 ***
prisoners    9.069e-01  4.257e-02  21.303  < 2e-16 ***
income       8.350e-03  2.612e-03   3.197  0.00142 **
density      8.637e+01  5.339e+00  16.176  < 2e-16 ***
lawyes      -1.465e+02  1.390e+01 -10.535  < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 201.3 on 1168 degrees of freedom
Multiple R-squared: 0.6387,    Adjusted R-squared: 0.6375
F-statistic: 516.3 on 4 and 1168 DF, p-value: < 2.2e-16
anova(reg1, reg2)
Analysis of Variance Table
Model 1: violent ~ prisoners + income + density
Model 2: violent ~ prisoners + income + density + law
  Res.Df      RSS Df Sum of Sq      F    Pr(>F)
1   1169 51807994
2   1168 47312125  1   4495869 110.99 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The analysis of variance shows that the second model is better. It includes
the factor of whether the state has ‘law’ in effect for each year. The anova
summary indicates that law is significant at any reasonable level.
Using the Guns data, run a regression of income on murder and plot the
results and include a regression line in the plot.
plot(murder~income, data=Guns)
reg1=lm(murder~income, data=Guns)
abline(reg1)
Use the regression above, but this time include a quadratic term. Does the
quadratic term help the explanation? Which model is better?
reg2=lm(murder~income+I(income^2), data=Guns)
anova(reg1,reg2)
R produces the following output:
Analysis of Variance Table
Model 1: murder ~ income
Model 2: murder ~ income + I(income^2)
  Res.Df   RSS Df Sum of Sq      F    Pr(>F)
1   1171 63099
2   1170 54868  1      8231 175.52 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Including the square term gives a better model as indicated by the statistical significance of the F-test.
1.9 Exercise (Regression data plots)
CJF-6) Using the data from USMacroB from the AER package, plot gnp and mbase (include a legend), and regress gnp on mbase and lagged gnp. What is the residual sum of squares (RSS)? Run a Durbin-Watson test to test for autocorrelation. Comment on the results.
rm(list=ls())
library(AER)
data(USMacroB) #cannot access by name
#attach(data.frame(USMacroB))
plot(USMacroB[,c("gnp", "mbase")], lty=c(3,1),
plot.type="single", ylab="", lwd=1.5)
legend("topleft", legend = c("gnp", "money base"),
lty = c(3,1), bty="n")
library(dynlm)
reg1=dynlm(gnp~mbase + L(gnp), data=USMacroB)
summary(reg1)
deviance(reg1)
dwtest(reg1)
[Figure 4 appears here: time series plot of gnp and money base against Time.]
Figure 4: Data Time Series Plot with legend.
R produces the following output:
Time series regression with "ts" data:
Start = 1959(2), End = 1995(2)
Call:
dynlm(formula = gnp ~ mbase + L(gnp), data = USMacroB)
Residuals:
      Min        1Q    Median        3Q       Max
-120.2126  -15.9883    0.7727   16.8630   90.0175

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 43.951508  17.394776   2.527   0.0126 *
mbase        0.051049   0.025615   1.993   0.0482 *
L(gnp)       0.988237   0.007722 127.984   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 30.28 on 142 degrees of freedom
Multiple R-squared: 0.9991,    Adjusted R-squared: 0.9991
F-statistic: 7.894e+04 on 2 and 142 DF, p-value: < 2.2e-16
deviance(reg1)
[1] 130195.2
>dwtest(reg1)
Durbin-Watson test
data: reg1
DW = 1.3521, p-value = 2.067e-05
alternative hypothesis: true autocorrelation is greater than 0
There is highly significant positive autocorrelation among regression errors.
Exercises suggested by Adam R. Bragar (ARB)
1.10 Exercise (Basic stats)
ARB-1)
Load the CigarettesB dataset from the AER package. Find basic stats of
all variables. Are any variables negatively skewed? If so, which variable(s)?
What does a negatively skewed variable suggest?
library(AER)
data(CigarettesB); attach(CigarettesB) #both needed
library(fBasics)
basicStats(CigarettesB)
R produces the following output:
                  packs      price     income
nobs          46.000000  46.000000  46.000000
NAs            0.000000   0.000000   0.000000
Minimum        4.408590  -0.032600   4.529380
Maximum        5.379060   0.363990   5.102680
1. Quartile    4.711547   0.140543   4.679075
3. Quartile    4.984450   0.273485   4.852850
Mean           4.847844   0.205509   4.775455
Median         4.814950   0.200205   4.758505
Sum          223.000810   9.453400 219.670930
SE Mean        0.028229   0.012714   0.020975
LCL Mean       4.790988   0.179901   4.733209
UCL Mean       4.904700   0.231116   4.817701
Variance       0.036656   0.007436   0.020238
Stdev          0.191458   0.086230   0.142261
Skewness       0.185864  -0.126862   0.474103
Kurtosis      -0.124126  -0.301650  -0.378464
The price variable has negative skewness. This suggests that there are a few relatively small values of price in the long left tail of its probability distribution.
1.11 Exercise (Basic regression)
ARB-2)
Using the CigarettesB dataset from the AER package, regress packs on price and income. Interpret the coefficients. Are any of the variables statistically significant? How can you tell? What was the null hypothesis of the test you used for evaluation?
library(AER)
data(CigarettesB); attach(CigarettesB)
reg1= lm(packs~price+income); summary(reg1)
R produces the following output:
Call:
lm(formula = packs ~ price + income)
Residuals:
      Min        1Q    Median        3Q       Max
-0.418675 -0.106828  0.007568  0.117384  0.328677

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   4.2997     0.9089   4.730 2.43e-05 ***
price        -1.3383     0.3246  -4.123 0.000168 ***
income        0.1724     0.1968   0.876 0.385818
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1634 on 43 degrees of freedom
Multiple R-squared: 0.3037,    Adjusted R-squared: 0.2713
F-statistic: 9.378 on 2 and 43 DF, p-value: 0.0004168
The price coefficient suggests that a 1 unit increase in price will cause a 1.3383 unit decrease in the demand for packs. A 1 unit increase in income causes a 0.1724 unit increase in the demand for cigarette packs.
The price variable is statistically significant. Its t-value of -4.123 allows us to reject the null hypothesis, as evidenced by a p-value of 0.000168. H0: There is no linear relationship between price and packs.
1.12 Exercise (Returns from asset prices)
ARB-3)
Using the BondYield dataset found in the AER package, create a variable
r that denotes returns. Provide a summary of r and the standard deviation
of r. What does the standard deviation tell us about the volatility of the
bond yield?
library(AER)
data(BondYield); attach(BondYield)
r=diff(log(BondYield))
r; summary(r)
sd(r)
Now the output is:
              Jan          Feb          Mar          Apr          May
1990               0.078072770 -0.036672522  0.009559287  0.001056524
1991 -0.001105583 -0.023504160  0.011261380 -0.007869630  0.000000000
1992 -0.013325455  0.010915815  0.007211570 -0.002398083 -0.006020488
1993 -0.008810630 -0.025609594 -0.017004988 -0.015957785 -0.004029555
1994 -0.001444044  0.022858138  0.054958884  0.052095112  0.013862856
              Jun          Jul          Aug          Sep          Oct
1990 -0.022424859 -0.002162163  0.018231068  0.015814773 -0.003143009
1991  0.016788307 -0.001110494 -0.028170877 -0.016129382 -0.006993035
1992 -0.007272759 -0.018416727 -0.014981554 -0.003780723  0.008799554
1993 -0.013550343 -0.022069861 -0.045657002 -0.028129168  0.001500375
1994 -0.002506267  0.017413375 -0.004944386  0.032909734  0.027204516
              Nov          Dec
1990 -0.024430318 -0.027249642
1991 -0.008220833 -0.020250841
1992  0.013673302 -0.014925650
1993  0.038239953  0.000000000
1994  0.012753796 -0.025672355

     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
-0.045660 -0.016040 -0.003143 -0.001030  0.011090  0.078070

sd(r)
[1] 0.02265122
A small standard deviation tells us that most values are near the mean. The
mean is near 0. Therefore, a small standard deviation would tell us that
most returns are near 0. A larger standard deviation would suggest that the
values are widely dispersed, and while the mean return is 0, the returns are
volatile.
1.13 Exercise (Regression)
ARB-4)
Using the CigarettesB dataset, run the regression of packs on price and
income. Compute the heteroscedasticity consistent covariance matrix. Why
is the presence of heteroscedasticity dangerous to the results of a regression?
Compute the studentized Breusch-Pagan test, the non-constant variance score test, and the Goldfeld-Quandt F test for heteroscedasticity.
library(AER)
data(CigarettesB); attach(CigarettesB)
reg1= lm(packs~price+income);su1=summary(reg1)
su1$coef[,2] #Usual OLS standard errors from second column after $coef
vcv=hccm(reg1);vcv #gives the variance covariance matrix
se2=sqrt(diag(vcv)); se2 #new standard errors
R produces the following output:
Check the reg1 output above.
usual OLS standard errors
(Intercept)       price      income
  0.9089257   0.3246015   0.1967544

vcv, the variance covariance matrix from hccm:
            (Intercept)       price      income
(Intercept)   1.3486844  0.26842214 -0.29138366
price         0.2684221  0.14398409 -0.06218249
income       -0.2913837 -0.06218249  0.06316924

se2 #hetero consistent standard errors are larger
(Intercept)       price      income
  1.1613287   0.3794524   0.2513349

se2/su1$coef[,2] #ratio of standard errors
(Intercept)       price      income
   1.277694    1.168979    1.277404
When heteroscedasticity is present, the OLS estimators remain unbiased, but are inefficient. The estimates of the variance of the beta coefficients will not be correct. It appears that in this example the usual standard errors are underestimated. The heteroscedasticity consistent standard errors are all a bit larger, by 28%, 17% and 28% respectively, for the three coefficients.
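The three tests asked for in the question are not shown above; a minimal sketch using the reg1 fitted here (assuming the ‘lmtest’ and ‘car’ packages are loaded):
library(lmtest); library(car)
bptest(reg1) #studentized Breusch-Pagan test
gqtest(reg1) #Goldfeld-Quandt F test
ncvTest(reg1) #non-constant variance score test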
1.14 Exercise (Regression for counts data)
ARB-5)
Load the CreditCard data set from the AER package. Do a Poisson regression of reports on age, income and expenditure. Which of the variables are significant?
data(CreditCard)
attach(CreditCard)
cc_pois= glm(reports ~ age + income + expenditure, family = poisson)
summary(cc_pois)
R produces the following output:
Call:
glm(formula = reports ~ age + income + expenditure, family = poisson)
Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.7427  -1.0689  -0.8390  -0.3897   7.4991

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.819682   0.145272  -5.642 1.68e-08 ***
age          0.007181   0.003978   1.805  0.07105 .
income       0.077898   0.023940   3.254  0.00114 **
expenditure -0.004102   0.000374 -10.968  < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for the Poisson family is taken to be 1)
Income and expenditure are significant.
1.15 Exercise (Regression 2)
ARB-6)
Using the CreditCard data from the AER package, regress reports on active, age, card and owner. Print a summary. Which variables are significant at the 5 percent level? Are they all significant at the 1 percent level? Find 95 percent confidence intervals for each coefficient.
reg1=lm(reports~active+age+card+owner);
summary(reg1)
R produces the following output:
Call:
lm(formula = reports ~ active + age + card + owner)
Residuals:
     Min       1Q   Median       3Q      Max
-3.10658 -0.41350 -0.03806  0.21015 11.99671

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.213853   0.126452   9.599   <2e-16 ***
active       0.055222   0.005264  10.491   <2e-16 ***
age          0.002778   0.003386   0.821   0.4121
cardyes     -1.495554   0.077137 -19.388   <2e-16 ***
owneryes    -0.172956   0.071351  -2.424   0.0155 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.153 on 1314 degrees of freedom
Multiple R-squared: 0.2681,    Adjusted R-squared: 0.2659
F-statistic: 120.3 on 4 and 1314 DF, p-value: < 2.2e-16
Active, card and owner are significant at the 5 percent level. Owner is not significant at the 1 percent level, as exemplified by its 0.0155 p-value.
confint(reg1)
In response to above one-line code, R produces the following output:
                   2.5 %       97.5 %
(Intercept)  0.965783363  1.461921923
active       0.044896152  0.065548539
age         -0.003864662  0.009421637
cardyes     -1.646878701 -1.344228814
owneryes    -0.312929541 -0.032982226
1.16 Exercise (Logit)
ARB-7) Using the CreditCard data from the AER package, fit a logit model regressing card on age, income and selfemp to determine these factors’ influence on the probability that a person has a credit card or not. Which variables are significant?
summary(glm(card ~ age + income + selfemp,
family = binomial("logit")))
R produces the following output:
Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.3137   0.5100   0.6902   0.7419   1.0553

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)  0.923442   0.240541   3.839 0.000124 ***
age         -0.007033   0.006825  -1.031 0.302764
income       0.183021   0.048904   3.742 0.000182 ***
selfempyes  -0.567567   0.242977  -2.336 0.019497 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1404.6  on 1318  degrees of freedom
Residual deviance: 1385.0  on 1315  degrees of freedom
AIC: 1393.0

Number of Fisher Scoring iterations: 4
Income and selfemp are significant at the 5 percent level, selfemp is not
significant at the 1 percent level.
1.17 Exercise (Diagonal matrix)
ARB-8)
Use two methods to define an identity matrix of dimension 5.
ident5=diag(5) ; ident5
IM5= matrix(0, nr=5, nc= 5)
IM5[row(IM5) == col(IM5)] = 1 ; IM5
1.18 Exercise (‘sort.list’ or ‘which.max’ functions)
ARB-9)
Download the DM data set from the Ecdat package [2]. Create a variable that is the absolute value of the forward premium, call it FP. Find the date where FP is the greatest. In the following code ‘sort.list’ sorts the matrix named ‘MtxFP’ with reference to its fifth column, which contains the FP data.
FP= abs(((f-s)/s)*100)
MtxFP=cbind(date,s,f,s30,FP)
MtxFP[sort.list(MtxFP[,5],decreasing=T),] #rows ordered by FP, largest first
Instead of ‘sort.list’ one can also use the following code:
rm(list=ls()) #clean up memory of R
library(Ecdat); data(DM); attach(DM)
FP= abs(((f-s)/s)*100)#f=forward ask, s=spot rate
MtxFP=cbind(date*10000,s,f,s30,FP) #s30=bid price
which.max(FP)
n=which.max(FP) #location of the max of FP
round( MtxFP[(n-4):(n+4),],3) #nearby data
output is suppressed for brevity.
1.19 Exercise (Stock market beta)
ARB-10)
Write a function to estimate the beta of Nucor (NYSE: NUE) using weekly
returns and data loaded directly from Yahoo Finance. Note, you will need
the ‘tseries’ package [5] for this exercise.
ANSWER: We will first write a general function for getting historical stock
price data for any stock and for any stock index. We choose S&P 500 index
to define the overall market return. The so-called ‘beta’ compares the risk
of an individual stock with that of the market as a whole. We compute first
difference of log of the adjusted closing price of NUE to define the return
from it and similarly for S&P 500 index. Now the slope coefficient in this
regression may be called the ‘beta’ for NUE stock. It is estimated to be
1.423869 implying a higher than average risk associated with investing in
NUE. If we had chosen IBM as the stock, the beta becomes less than unity
at 0.9565073, suggesting less risk.
library(tseries)
stockquote= function(x) {
c(get.hist.quote(x, quote= "Adj",
start= "2003-01-01", compress= "w"))}
NUE=stockquote("NUE")
sp500= stockquote("^gspc")
dlnue=diff(log(NUE))
dlsp=diff(log(sp500))
reg1=lm(dlnue~dlsp)
summary(reg1)
R produces the following output:
Call:
lm(formula = NUE ~ ., data = r)
Residuals:
      Min        1Q    Median        3Q       Max
-0.167618 -0.026432  0.001302  0.026874  0.171358

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.004181   0.002628   1.591    0.113
sp500       1.366223   0.113219  12.067   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.04582 on 302 degrees of freedom
  (157 observations deleted due to missingness)
Multiple R-squared: 0.3253,    Adjusted R-squared: 0.3231
F-statistic: 145.6 on 1 and 302 DF, p-value: < 2.2e-16
[Figure 5 appears here: empirical fluctuation processes for the OLS-based CUSUM test and the Recursive CUSUM test, plotted against Time.]
Figure 5: NUE stock price against S&P 500 stock index cusum test for
structural change (OLS and Recursive residuals).
1.20 Exercise (Structural change cusum test)
Use the data from the previous exercise and compute structural change statistics for its return behavior by using the ‘strucchange’ package [1] and the function ‘efp’ for the empirical fluctuation process. Plot the OLS cumulative sum (CUSUM) and recursive CUSUM tests. When the CUSUM curve goes outside the confidence band, we reject the null hypothesis of no structural change at that point in time. In the attached figure, the recursive test bands are shown to be expanding in width as the number of observations increases.
#previous exercise code must be in memory
library(strucchange)
par(mfrow=c(2,1))
[Figure 6 appears here: empirical fluctuation processes for the OLS-based MOSUM test and the Recursive MOSUM test, plotted against Time.]
Figure 6: NUE stock price against S&P 500 stock index moving sum test for
structural change (OLS and Recursive residuals).
rs=efp(NUE~sp500, type="OLS-CUSUM")
plot(rs)
rs=efp(NUE~sp500, type="Rec-CUSUM")
plot(rs)
rs=efp(NUE~sp500, type="OLS-MOSUM")
plot(rs)
rs=efp(NUE~sp500, type="Rec-MOSUM")
plot(rs)
Note that moving sum test behaves differently from the cumulative sum test
in the earlier figure.
1.21 Exercise (Time series data manipulation and plots)
ARB-11)
Suppose we have a dataset of monthly returns from 1991 to 2005 that is in the raw form shown below:
rawdata= -0.21,-2.28,-2.71,2.26,-1.11,1.71,2.63,-0.45,-0.11,
4.79,5.07,-2.24,6.46, 3.82,4.29,-1.47,2.69,7.95,4.46,7.28,
3.43,-3.19,-3.14,-1.25,-0.50,2.25,2.77,6.72, 9.17,3.73,
6.72,6.04,10.62,9.89,8.23,5.37,-0.10,1.40,1.60,3.40,3.80,
3.60,4.90,9.60, 18.20,20.60,15.20,27.00,15.42,
13.31,11.22,12.77,12.43,15.83,11.44,12.32,12.10,
12.02,14.41,13.54,11.36,12.97,10.00,7.20,8.74,3.92,
8.73,2.19,3.85,1.48,2.28,2.98, 4.21,3.85,6.52,8.16,
5.36,8.58,7.00,10.57,7.12,7.95,7.05,3.84,4.93,4.30,5.44,
3.77, 4.71,3.18,0.00,5.25,4.27,5.14,3.53,
4.54,4.70,7.40,4.80,6.20,7.29,7.30,8.38,3.83,
8.07,4.88,8.17,8.25,6.46,5.96,5.88,5.03,4.99,5.87,
6.78,7.43,3.61,4.29,2.97,2.35,
2.49,1.56,2.65,2.49,2.85,1.89,3.05,2.27,2.91,3.94,
2.34,3.14,4.11,4.12,4.53,7.11, 6.17,6.25,7.03,4.13,
6.15,6.73,6.99,5.86,4.19,6.38,6.68,6.58,5.75,7.51,
6.22,8.22, 7.45,8.00,8.29,8.05,8.91,
6.83,7.33,8.52,8.62,9.80,10.63,7.70,8.91,7.50,5.88,9.82,
8.44,10.92,11.67
Convert this data into a time series for returns called r. After analyzing
this data, suppose we wanted to only analyze the data from Nov. 1994
onward. Truncate r so that it fits this specified time range. Do an ARMA
estimation of r. Use the tsdiag function to print diagnostic pictures of the
estimation. Print a graph of the residuals. Predict the monthly returns over
the next 6 months. Plot a graph that shows the predicted values of returns
in red and the actual returns in blue.
rawdata= c(-0.21,-2.28,-2.71,2.26,-1.11,1.71,2.63,-0.45,-0.11,
4.79,5.07,-2.24,6.46, 3.82,4.29,-1.47,2.69,7.95,4.46,7.28,
3.43,-3.19,-3.14,-1.25,-0.50,2.25,2.77,6.72, 9.17,3.73,
6.72,6.04,10.62,9.89,8.23,5.37,-0.10,1.40,1.60,3.40,3.80,
3.60,4.90,9.60, 18.20,20.60,15.20,27.00,15.42,
[Figure 7 appears here: time series plot of r growth (%) against Time, 1996-2006, with forecast intervals.]
Figure 7: Time Series of Asset Returns with Forecast Intervals.
13.31,11.22,12.77,12.43,15.83,11.44,12.32,12.10,
12.02,14.41,13.54,11.36,12.97,10.00,7.20,8.74,3.92,
8.73,2.19,3.85,1.48,2.28,2.98, 4.21,3.85,6.52,8.16,
5.36,8.58,7.00,10.57,7.12,7.95,7.05,3.84,4.93,4.30,5.44,
3.77, 4.71,3.18,0.00,5.25,4.27,5.14,3.53,
4.54,4.70,7.40,4.80,6.20,7.29,7.30,8.38,3.83,
8.07,4.88,8.17,8.25,6.46,5.96,5.88,5.03,4.99,5.87,
6.78,7.43,3.61,4.29,2.97,2.35,
2.49,1.56,2.65,2.49,2.85,1.89,3.05,2.27,2.91,3.94,
2.34,3.14,4.11,4.12,4.53,7.11, 6.17,6.25,7.03,4.13,
6.15,6.73,6.99,5.86,4.19,6.38,6.68,6.58,5.75,7.51,
6.22,8.22, 7.45,8.00,8.29,8.05,8.91,
6.83,7.33,8.52,8.62,9.80,10.63,7.70,8.91,7.50,5.88,9.82,
8.44,10.92,11.67)
summary(rawdata)
r <- ts(rawdata, frequency=12, start=c(1991, 4))
r <- window(r, start=c(1994, 11))
ar2 <- arima(r, order = c(2, 0, 0)) ; print(ar2)
tsdiag(ar2)
plot.ts(ar2$residual, ylab="residuals", col="blue", lwd=2)
p <- predict(ar2, n.ahead = 6) ; print(p)
ts.plot(r, p$pred, p$pred-1.96*p$se, p$pred+1.96*p$se,
gpars=list(lty=c(1, 1, 2, 2), lwd=c(2, 2, 1, 1),
ylab="r growth (%)", col=c("blue", "red", "red", "red")))
Exercises suggested by Jennifer Murray (JWM). The Answer /
Hint follows:
1.22 Exercise (Matrix Algebra: collinearity, eigenvalues, rank)
JWM-1) Demonstrate exact multicollinearity and eigenvalue decomposition.
To begin create a vector called x1 of values 1 to 5. Then create a vector
called x2 which is 2 times the vector x1. Create a vector of ones then bind
the three vectors into a matrix, called Z.
JWM-2) Next multiply the transpose of Z by Z itself and call it ZTZ. Compute the determinant of ZTZ, and then try to invert the ZTZ matrix.
JWM-3) Set a seed of 35 and create a random vector out of numbers between
3 and 22 by using the sample function. Select first 5 numbers from this list
and call it y. Regress y on x1 and x2. What is the coefficient on x2?
JWM-4) Load the package ‘fEcofin’ [10]. Solve for the eigenvectors and eigenvalues of ZTZ. Separately extract the eigenvectors and eigenvalues. Call the eigenvector matrix G and the diagonal matrix of eigenvalues Λ. Using these, create the eigenvector/eigenvalue decomposition.
JWM-5) What is the rank of the ZTZ matrix? Is it possible to invert (solve)
this matrix?
#R Code for ALL Answers (output suppressed for brevity)
x1=(1:5)
x2=2*x1
ones=rep(1,5)
Z=cbind(ones,x1,x2)#we are creating a singular matrix
Z
ZTZ=t(Z)%*%Z #this is the Z'Z matrix
ZTZ
det(ZTZ) #this is zero!
solve(ZTZ)#impossible because it is singular
set.seed(34);
y= sample(3:22)[1:5]
#this is a random vector that we will regress on x1 and x2
reg1=lm(y~x1+x2)
reg1
#The x2 coefficient is not available.
#This is because x2 is a simple transformation of x1.
library(fEcofin)
evd=eigen(ZTZ)
evd
G=evd$vec #in doing this we are extracting the matrix of eigenvectors
G
Lamda=evd$val #now this extracts only the eigenvalues
Lamda
diag(Lamda) #this creates a matrix with eigen values on the diagonals
ev.decompose=G%*%diag(Lamda)%*%t(G)
#this formula should give us the eigen value decomposition!
ev.decompose #this should equal ZTZ
ZTZ #this is why it is called a decomposition
rk(ZTZ) #this function rk for rank is from fEcofin package
#notice that the rank of the matrix is 2, not 3.
#This is due to the fact that
#the matrix is singular!
inv(ZTZ)
#because it is singular, it is impossible to invert or solve this matrix.
1.23 Additional Exercises by H.D. Vinod
HDV1> Fit a trade-off relation between job market vacancies and the unemployment rate (Beveridge Curve). Compute the elasticity of substitution using a non-homogeneous production function type method. Compare President George W. Bush era versus Obama era marginal elasticities and elasticities of substitution. Plot and compare the isoquant level curves for the two eras.
Hint: use the EOS discussion in the textbook section 1.8. Complete
solution with references is at: http://ssrn.com/abstract=2149316
HDV2> What are projection matrices? Give examples of at least 3 such
matrices. [Hint: I, H, M]
Analytically prove the properties for each of the 3 matrices. [Hint: Symmetric Idempotent]
Construct a numerical example of each and show that the properties hold
true numerically.
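A possible numerical sketch for the third part, with a made-up regressor matrix X (not from the text):
set.seed(5)
X=cbind(1, matrix(rnorm(20), 10, 2)) #10 x 3 matrix with an intercept column
H=X%*%solve(t(X)%*%X)%*%t(X) #hat matrix
M=diag(10)-H #residual-maker matrix; I is diag(10)
max(abs(H-t(H))); max(abs(H%*%H-H)) #symmetry and idempotency of H (both near zero)
max(abs(M-t(M))); max(abs(M%*%M-M)) #same properties for M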
HDV3> Define VIF, collinearity, and ridge regression. Provide theoretical discussion along with R examples.
HDV4) Derive the formula for the variance of ridge estimator. Describe
why shrinkage and declining deltas are desirable. Compare ridge regression
with principal components regression (PCR) as shrinkage estimators.
HDV5) Use R seed 10, create a long array of 100 uniform random numbers in the range 10 to 60, and make 4 columns (y1, x1, x2, x3) from these 100 numbers, using the first 1:25 for y1, the next 26:50 for x1, and so forth. Replace the second column x1 by the R command “x2+10*x3+rnorm(25)”. This is designed to make x1 almost linearly dependent on x2 and x3, injecting “near” collinearity.
(1) compute the OLS regression coefficient vector b when we regress y1 on x1, x2 and x3, including an intercept.
(2) compute the G matrix and c, the uncorrelated components vector for
this regression.
(3) Define cpc2 as keep-2 components of c and zero out last two. Similarly
define cpc3 as keep-3 components of c and zero out the last one.
(4) Compute the bpc2 as keep-2 principal components PCR and bpc3 as
keep-3 PCR and make a table comparing b with these two PCRs side by
side.
ANS: We provide the R code and some output below.
set.seed(10)
xx=runif(100, min=10, max=60)
yy=matrix(xx, nrow=25)
y1=yy[,1]; x1=yy[,2]; x2=yy[,3]; x3=yy[,4]
x1=x1+10*x2+rnorm(25)
reg=lm(y1~x1+x2+x3)
summary(reg)
ane=rep(1,25); X=cbind(ane,x1,x2,x3)
X #upper case X
xtx=t(X)%*%X
ei=eigen(xtx)
G=ei$ve
G #matrix of eigenvectors
Gt=t(G)
ei$va
b=coef(reg);b
c=Gt%*%b
c #vector of uncorrelated components of b
G%*%c #this should be b
cpc3=c
cpc3=c(c[1:3],0);cpc3
bpc3=G%*%cpc3; bpc3
cpc2=c(c[1:2],0,0);cpc2
bpc2=G%*%cpc2; bpc2
cb=cbind(b, bpc2,bpc3)
colnames(cb)[2]="bpc2"
colnames(cb)[3]="bpc3"
cb
Selected output from R is given below.
G #matrix of eigenvectors
             [,1]        [,2]        [,3]        [,4]
[1,] -0.002595253  0.01223696 -0.08445826  0.99634849
[2,] -0.992347544 -0.08130949 -0.09244637 -0.00942268
[3,] -0.090503971 -0.03282745  0.99178386  0.08423877
[4,] -0.083956231  0.99607298  0.02617727 -0.01023327
c #vector of uncorrelated components of b
[,1]
[1,] -0.08083593
[2,] 0.50432166
[3,] -3.84193890
[4,] 38.84746189
G%*%c #this should be b
[,1]
[1,] 39.03647482
[2,] 0.02833732
[3,] -0.54715023
[4,] 0.01101989
cpc3=c
cpc3=c(c[1:3],0);cpc3
[1] -0.08083593 0.50432166 -3.84193890 0.00000000
cb
                      b         bpc2       bpc3
(Intercept) 39.03647482  0.006381151  0.3308646
x1           0.02833732  0.039211195  0.3943845
x2          -0.54715023 -0.009239619 -3.8196126
x3           0.01101989  0.509127864  0.4085564
1.24 Further Exercises by H.D. Vinod
HDV-f1> Under what assumptions is there a relation between the coefficient of determination R2 and Pearson bivariate correlation coefficient r1,2 ?
Describe a formula stating the relation.
HDV-f2> Under what assumptions is there a relation between the coefficient of determination R2 and the regression coefficient when x1 is regressed
on x2 . Describe a formula stating the relation.
HDV-f3> How can one obtain the regression coefficients of the original
model regressing x1 on x2 from the coefficients of the model where the variables are standardized (e.g., x1s = (x1 − mean(x1))/sd(x1)). True or False:
Regression coefficient always equals the correlation coefficient of a standardized model?
HDV-f4> Under what assumptions is there a relation between the partial
correlation coefficient r1,2|3 and regression coefficient when x1 is regressed on
x2 and x3 . Describe a formula stating the relation.
The proposed answers to all questions in this set use R code with a common example. ANS-f1) In general, there is no relation between the R2 and the Pearson bivariate correlation coefficient r1,2, except in the (univariate) case when there is only one regressor. Consider a multivariate regression:
x1 = β1,0 + β1,2 x2 + β1,3 x3 + ε.   (2)
It can be verified that the R2 of this model is not the square of any bivariate
correlation coefficients with the help of a simple numerical example.
rm(list=ls()) #clean up R memory
options(prompt = " ", continue = " ", width = 68,
useFancyQuotes = FALSE)
set.seed(89);da=sample(1:100);x1=da[1:10];
x2=da[11:20];x3=da[21:30]
reg=lm(x1~x2+x3)
Rsq=summary(reg)$r.squared;Rsq
root.rsq=sqrt(Rsq);root.rsq
c1=cor(cbind(x1,x2,x3));c1
We note that √R2, extracted from the output of the summary function with the dollar command, is different from all off-diagonal correlation coefficients.
Rsq=summary(reg)$r.squared;Rsq
[1] 0.1026093
root.rsq=sqrt(Rsq);root.rsq
[1] 0.3203269
c1=cor(cbind(x1,x2,x3));c1
           x1           x2           x3
x1  1.0000000 -0.261317096 -0.183168895
x2 -0.2613171  1.000000000 -0.007994397
x3 -0.1831689 -0.007994397  1.000000000
If we consider a univariate version of eq. (2), assuming that β1,3 ≡ 0, we have a special case.
reg2=lm(x1~x2)
Rsq2=summary(reg2)$r.squared;Rsq2
root.rsq2=sqrt(Rsq2);root.rsq2
c2=cor(cbind(x1,x2));c2
Note that the univariate case output below has r12 = −0.2613171 = −√R2, where the minus sign is that of the ordinary correlation coefficient.
Rsq2=summary(reg2)$r.squared;Rsq2
[1] 0.06828662
root.rsq2=sqrt(Rsq2);root.rsq2
[1] 0.2613171
c2=cor(cbind(x1,x2));c2
           x1         x2
x1  1.0000000 -0.2613171
x2 -0.2613171  1.0000000
For the univariate special case, we have just verified that r12 equals the
signed square root of R2 :
r12 = sign(r12) √R2,   (3)
where the sign(w) function is (+1,0,–1) if w is (> 0, 0, < 0), respectively.
Eq. (3) represents our formula stating the relation between signed square
root of R2 and r12 .
Now we turn to answering the second question. ANS-f2) Consider the bivariate regression
x1 = β1,0 + β1,2 x2.   (4)
Now standardize the data for both variables. That is, replace x1 by (x1 −
x̄1 )/s1 , where we subtract the mean and divide by the standard deviation
and do the same for x2 .
In the following code we define a general function stdze(x) and apply it to
x1 and x2. The standardized version of x1 and x2 is denoted as x1s and x2s.
Since we are measuring the variables from respective means, the intercept
should not be computed. This is accomplished in R by using command
‘reg3 = lm(x1s ∼ x2s − 1)’ below where the option ‘–1’ forces the line of
regression through the origin.
stdze=function(x) (x-mean(x))/sd(x)
x1s=stdze(x1);x2s=stdze(x2)
reg3=lm(x1s~x2s -1)#force through the origin
coef(reg3)
cor(x1,x2)
sign(cor(x1,x2))*sqrt(summary(reg3)$r.squared)
We find that the bivariate regression coefficient given by the code ‘coef(reg3)’
is the simple correlation coefficient and also the signed square root of the R2 .
coef(reg3)
x2s
-0.2613171
cor(x1,x2)
[1] -0.2613171
sign(cor(x1,x2))*sqrt(summary(reg3)$r.squared)
[1] -0.2613171
Now we turn to answering the third question, HDV-f3, which asks, in the notation of the example: how can one obtain the regression coefficient b12 of the original model regressing x1 on x2 from the coefficient r12 of the standardized model regressing x1s on x2s? The answer is given by the formula mapping the correlation coefficient to the regression coefficient:

b12 = r12 ∗ sd(x1)/sd(x2),    (5)

numerically verified in the following code.
b12=coef(lm(x1~x2))[2];b12
r12=cor(x1,x2);r12
r12*sd(x1)/sd(x2)
b12=coef(lm(x1~x2))[2];b12
x2
-0.3255658
r12=cor(x1,x2);r12
[1] -0.2613171
r12*sd(x1)/sd(x2)
[1] -0.3255658
Note that both b12 and the correlation times the ratio of standard deviations equal -0.3255658.
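Conversely, inverting eq. (5) recovers the correlation coefficient from the regression coefficient. The one-line check below is an addition, reusing the object b12 from above.

b12*sd(x2)/sd(x1) #inverse mapping; should reproduce r12 = -0.2613171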
Now we answer the True or False question: does the regression coefficient always equal the correlation coefficient of a standardized model? This is true when there is only one regressor. However, it is not true when there are two or more regressors, as the brief check below illustrates.
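The following check is an addition (it reuses stdze and the data above and anticipates the fuller treatment in ANS-f4): with two regressors, the standardized slope no longer equals the bivariate correlation.

x3s=stdze(x3) #standardize the third regressor as well
coef(lm(x1s~x2s+x3s))[2] #slope of x2s: -0.2627982
cor(x1,x2) #bivariate correlation: -0.2613171, not equal to the slope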
ANS-f4) Now we turn to answering the fourth question, involving the partial correlation coefficients. Let us consider a multivariate regression of x1 on x2 and x3, where all variables are standardized.
x3s=stdze(x3)
reg4=lm(x1s~x2s+x3s)
coef(reg4)
c1=cor(cbind(x1,x2,x3));r12=c1[1,2]
r13=c1[1,3];r23=c1[2,3]
r12.3=(r12-r23*r13) /sqrt((1-r23^2)*(1-r13^2))
r12.3
x1.3s=resid(lm(x1s~x3s-1))
x2.3s=resid(lm(x2s~x3s-1))
r12.3*sd(x1.3s)/sd(x2.3s)
coef(reg4)[2]
In the above code the correlation matrix is denoted 'c1', with elements 'rij'. We use the standard formula for the partial correlation coefficient between x1 and x2 after removing the effect of x3, denoted 'r12.3' and written out below.
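For reference, the formula computed in the code line defining 'r12.3' is

r12.3 = (r12 − r13 r23)/√[(1 − r13^2)(1 − r23^2)],

the correlation between x1 and x2 after the linear effect of x3 has been removed from both.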
Let x1.3s denote the residual after removing the effect of x3s on x1s and
let x2.3s denote the residual after removing the effect of x3s on x2s. Recall the
way to obtain regression coefficients from (unit-free) correlation coefficients
is to multiply them by the ratio of standard deviations, sd(x1.3s)/sd(x2.3s),
analogous to the formula given in eq. (5) for mapping from correlation coefficient to regression coefficient. However, the need to include the ratio of
standard deviations shows that the answer to “True or False: Regression coefficient always equals the correlation coefficient of a standardized model?”
must be that it is NOT always true, though it is true when there is only one
regressor.
The output is
r12.3=(r12-r23*r13) /sqrt((1-r23^2)*(1-r13^2))
r12.3
[1] -0.2673124
x1.3s=resid(lm(x1s~x3s-1))
x2.3s=resid(lm(x2s~x3s-1))
r12.3*sd(x1.3s)/sd(x2.3s)
[1] -0.2627982
coef(reg4)[2]
x2s
-0.2627982
The R package 'generalCorr' [6] has a function to compute partial correlation coefficients when the correlation matrix has i, j = 1, . . . , p, with p ≥ 2 variables. If 'c1' holds the correlation matrix, the following commands yield the partial correlation coefficient between x1 and x2 after removing the effect of x3, x4, .., xp. Although there is no limit on p, for brevity we illustrate the computations for the example above, where p = 3.
library(generalCorr)
parcor_ijk(c1,1,2)
parcor_ijk(c1,1,3)
The output of 'parcor_ijk' has three parts (accessible with the dollar symbol): 'ouij' reports the partial correlation between the i-th and j-th variables after removing the effect of all others, whose column numbers are listed in the third part called 'myk'. R also reports 'ouji', the partial correlation between the j-th and i-th variables after removing the effect of all others. If the correlation matrix is symmetric, as is our matrix 'c1' in the code above, ouij and ouji must be equal to each other.
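For example, the returned object can be stored and its parts extracted with the dollar operator; the lines below are an added illustration using the matrix c1 from above.

p12=parcor_ijk(c1,1,2) #store the three-part output described above
p12$ouij #partial correlation of x1 with x2 after removing x3
p12$myk #column number(s) of the variable(s) removed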
parcor_ijk(c1,1,2)
$ouij
[1] -0.2673124
$ouji
[1] -0.2673124
$myk
[1] 3
parcor_ijk(c1,1,3)
$ouij
[1] 0.191933
$ouji
[1] 0.191933
$myk
[1] 2
If the correlation matrix is non-symmetric, as with the generalized correlation coefficients from [8], 'ouij' differs from 'ouji'. This difference provides an interesting new interpretation described in [8]. For example, if 'ouij' exceeds 'ouji', the nonparametric kernel regression model xi = f(xj) is superior to the flipped model xj = f′(xi), implying that xj is the better predictor, and hence that xj 'kernel causes' xi even after controlling for any number of confounding variables xk.
The function 'gmcmtx0' of the 'generalCorr' package reports the non-symmetric matrix R∗ of generalized correlation coefficients. The function 'parcor_ijk' uses the R∗ matrix as its argument, yielding distinct ouij and ouji, as illustrated below.
options(np.messages=FALSE)
c2=gmcmtx0(cbind(x1,x2,x3));c2
parcor_ijk(c2,1,2)
Note that the c2 matrix is asymmetric. Hence partial correlation coefficients
are distinct.
c2=gmcmtx0(cbind(x1,x2,x3));c2
           x1         x2         x3
x1  1.0000000 -0.5556230 -0.1159349
x2 -0.2308683  1.0000000  0.0000000
x3 -0.7299638 -0.8796286  1.0000000
parcor_ijk(c2,1,2)
$ouij
[1] -0.2413045
$ouji
[1] -0.6873292
$myk
[1] 3
Since the magnitude of 'ouji' exceeds that of 'ouij', x1 is likely the kernel cause of x2 after removing the effect of x3, i.e., after controlling for x3.
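The magnitude comparison can also be scripted. The sketch below is an addition; it uses the asymmetric matrix c2 computed above and simply compares the absolute values of the two partial correlations returned by 'parcor_ijk'.

p=parcor_ijk(c2,1,2) #list containing ouij and ouji for the c2 matrix
if(abs(p$ouji) > abs(p$ouij)) {
  print("x1 likely kernel-causes x2") #|ouji| is larger, as in the output above
} else {
  print("x2 likely kernel-causes x1")
}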
1.25 Further Advanced Exercises by H.D. Vinod
Can we view multiple regression as a series of regressions involving transformed variables with only one regressor at a time? For example, obtain the regression coefficients when we regress x1 on x2 and x3 from two bivariate regressions: one of 'transformed x1' on 'transformed x2' and a separate regression of 'transformed x1' on 'transformed x3'. (Hint: Use the Frisch-Waugh theorem described in Chapter 11, pp. 452–454.)
x1 = β1,0 + β1,2 x2 + β1,3 x3 + ε.    (6)
We want OLS estimates of the slope coefficient β1,2 from the transformed bivariate regression

x1^T = β1,0^T + β1,2^T x2^T + ε^T.    (7)
Let the transformation x1^T be defined as the residual of the auxiliary regression of x1 on x3. Similarly, let the transformation x2^T be defined as the residual of the auxiliary regression of x2 on x3. If this transformation is used, the Frisch-Waugh theorem described in Ch. 11, pages 452–454, assures us that the OLS estimates satisfy:

β1,2^T = β1,2.    (8)
We can check this using the example of the previous section.
co1=coef(lm(x1~x2+x3));co1[2]
r1=resid(lm(x1~x3))
r2=resid(lm(x2~x3))
coef(lm(r1~r2))[2]
The 'resid' function of R computes the residuals conveniently. We note that the slope coefficients from the two regressions agree.
co1=coef(lm(x1~x2+x3));co1[2]
x2
-0.3274111
r1=resid(lm(x1~x3))
r2=resid(lm(x2~x3))
coef(lm(r1~r2))[2]
r2
-0.3274111
Both slopes are seen to be exactly -0.3274111. This is no accident, but a
result arising from the Frisch-Waugh theorem.
A similarly defined transformed regression yields the other slope coefficient β1,3 from the residuals obtained when regressing on x2:
co1=coef(lm(x1~x2+x3));co1[3]
r1x=resid(lm(x1~x2))
r2x=resid(lm(x3~x2))
coef(lm(r1x~r2x))[2]
The output (omitted for brevity) shows that the second slope coefficient
-0.4124538 is also the regression coefficient when we regress the new residuals
r1x on r2x.
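The partialling-out steps can be collected in a small helper function. The sketch below is an addition; the function name fwslope is chosen here purely for illustration, and the checks reuse x1, x2, x3 from above.

fwslope=function(y, x, z){ #slope of y on x after removing z from both
  ry=resid(lm(y~z)) #residual of y after regressing on z
  rx=resid(lm(x~z)) #residual of x after regressing on z
  coef(lm(ry~rx))[2]} #bivariate slope of the two residual series
fwslope(x1, x2, x3) #reproduces -0.3274111
fwslope(x1, x3, x2) #reproduces -0.4124538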
References
[1] Achim Zeileis, Friedrich Leisch, Kurt Hornik, and Christian Kleiber,
strucchange: An R package for testing for structural change in linear
regression models, Journal of Statistical Software 7 (2002), no. 2, 1–38.
[2] Yves Croissant, Ecdat: Data sets for econometrics, 2006, R package
version 0.1-5.
[3] J. Fox, car: Companion to applied regression, 2009, R package version 1.2-14.
[4] Christian Kleiber and Achim Zeileis, Applied econometrics with R,
Springer-Verlag, New York, 2008, ISBN 978-0-387-77316-2.
[5] Adrian Trapletti and Kurt Hornik, tseries: Time series analysis and
computational finance, 2012, R package version 0.10-28.
[6] H. D. Vinod, generalCorr: Generalized correlations and initial causal path, 2016, Fordham University, New York, R package version 1.0.0, May 2, 2016.
[7] Hrishikesh D. Vinod, Hands-on matrix algebra using R: Active and motivated learning with applications, World Scientific, Hackensack, NJ, 2011,
ISBN 978-981-4313-68-1.
[8] Hrishikesh D. Vinod, Matrix algebra topics in statistics and economics
using R, Handbook of Statistics: Computational Statistics with R (M. B.
Rao and C. R. Rao, eds.), vol. 34, North Holland, Elsevier Science, New
York, 2014, pp. 143–176.
[9] Diethelm Wuertz and the Rmetrics core team, fBasics: Rmetrics - markets and basic statistics, 2010, R package version 2110.79.
[10] Diethelm Wuertz and many others (see the SOURCE file), fEcofin: Economic and financial data sets, 2008, R package version 270.75.
[11] Achim Zeileis and Torsten Hothorn, Diagnostic checking in regression
relationships, R News 2 (2002), no. 3, 7–10.