Statistics 511
Homework 8
Fall 2006
Due Friday Nov 10.
1. This is a continuation of Homework 7.
Body Density is an important health indicator in humans, but is difficult to measure. One direct measure is
to immerse the person entirely in water, and measure the amount of water displaced. It would be preferable
to accurately predict body density from measurements taken more readily in the doctor's office.
In a 1974 study, body density was determined for 252 volunteers using the immersion method. The
following measurements were taken:
DENSITY   Body density
FAT       Body fat determined from underwater weighing
AGE       Age (years)
WEIGHT    Weight (lbs)
HEIGHT    Height (inches)
NECK      Neck circumference (cm)
CHEST     Chest circumference (cm)
ABDOMEN   Abdomen circumference (cm)
HIP       Hip circumference (cm)
THIGH     Thigh circumference (cm)
KNEE      Knee circumference (cm)
ANKLE     Ankle circumference (cm)
BICEPS    Biceps (extended) circumference (cm)
FOREARM   Forearm circumference (cm)
WRIST     Wrist circumference (cm)
We are going to focus on 4 predictors of FAT, which are all easily measured: WEIGHT,
ABDOMEN, THIGH and WRIST.
a. Do a simultaneous test of whether THIGH and WRIST are significant regressors when
WEIGHT and ABDOMEN are in the model.
b. Compute the Variance Inflation Factors for all of the variables in the model. Is there
evidence of multicollinearity?
c. Take a look at the Partial Regression Plots. Note any interesting features such as curvature or
extremely outlying points.
d. Include the Partial Regression plot for WEIGHT with your homework. (Note that this can be
cut and pasted like regular output, as it is in the Output Window, not the Graphics Window.)
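SAS hint: the following options on the MODEL statement of PROC REG produce the output needed above.
MODEL Y=X1 X2 / PARTIAL VIF TOL;
PARTIAL creates the partial regression plots.
VIF prints the variance inflation factors.
TOL prints the tolerances.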
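The quantities behind parts (a) and (b) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `partial_f` and `vif` are made up for this sketch, and the sums of squares and the R^2 value are hypothetical numbers, not results from the body fat data. The simultaneous test compares the error sum of squares of the reduced model (WEIGHT, ABDOMEN) with that of the full model (all 4 predictors), and the variance inflation factor for a regressor is 1/(1 - R^2), where R^2 comes from regressing that regressor on the other regressors.

```python
def partial_f(sse_reduced, sse_full, n_dropped, n_obs, n_full_predictors):
    """F statistic for testing that all dropped regressors have zero slope."""
    # Error degrees of freedom of the full model (which includes an intercept).
    df_error = n_obs - n_full_predictors - 1
    numerator = (sse_reduced - sse_full) / n_dropped
    denominator = sse_full / df_error
    return numerator / denominator, n_dropped, df_error


def vif(r_squared_j):
    """Variance inflation factor from the R^2 of regressor j on the others."""
    return 1.0 / (1.0 - r_squared_j)


# Hypothetical sums of squares, NOT computed from the body fat data.
f_stat, df1, df2 = partial_f(sse_reduced=5000.0, sse_full=4750.0,
                             n_dropped=2, n_obs=252, n_full_predictors=4)
print("F =", f_stat, "on", df1, "and", df2, "degrees of freedom")

# Hypothetical R^2 of one regressor on the remaining regressors.
print("VIF =", vif(0.80), " tolerance =", 1.0 / vif(0.80))
```

The F statistic is compared with the F distribution on (2, 247) degrees of freedom for this data set, since 2 regressors are tested and the full model has 252 - 4 - 1 error degrees of freedom. The tolerance is simply the reciprocal of the VIF.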
2. Ecological analysis: an analysis in which the sampling unit is the mean over a subpopulation.
For example, the connection between birth control hormone use and heart disease in young women was
first proposed when it was noted that the rate of heart attacks in premenopausal women was
increasing in countries in which the use of birth control pills had become popular, and the
regression of “heart attack rate” on “rate of birth control use” was shown to have a positive
slope.
Consider the figure below (which is purely imaginary). (Note however, that some ecological
studies do show patterns somewhat similar to this figure.) Explain how an ecological
analysis can contradict results of a subgroup by subgroup analysis.
[Figure: imaginary scatterplot of y versus x, with country labels India, France, Canada, PRC and USA.]
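A small numerical sketch may help make the contradiction concrete. The three subpopulations below are made up for illustration (they are not read off the figure), and the helper `slope` is just the usual least squares slope. Within every subpopulation y decreases as x increases, yet the regression of the subpopulation means has a positive slope.

```python
def slope(xs, ys):
    """Least squares slope of ys on xs."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx


# Hypothetical subpopulations: (x values, y values) for each group.
groups = {
    "A": ([9, 10, 11], [1, 0, -1]),
    "B": ([19, 20, 21], [6, 5, 4]),
    "C": ([29, 30, 31], [11, 10, 9]),
}

# Subgroup-by-subgroup analysis: one slope per group.
within_slopes = {name: slope(xs, ys) for name, (xs, ys) in groups.items()}

# Ecological analysis: one point per group, namely the group means.
mean_x = [sum(xs) / len(xs) for xs, _ in groups.values()]
mean_y = [sum(ys) / len(ys) for _, ys in groups.values()]
ecological_slope = slope(mean_x, mean_y)

print("within-group slopes:", within_slopes)
print("slope of group means:", ecological_slope)
```

In this toy example every within-group slope is -1 while the slope through the group means is 0.5, because the differences between groups run in the opposite direction from the relationship inside each group.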
3. Regression to the mean: The term regression was coined by Francis Galton, who was studying
the relationship between the heights of fathers and the heights of their sons. He fitted the
relationship by what is now called linear regression. Galton noticed that the fathers who were
most extreme in height had sons who were less extreme, and considered this an example of
“regressing”, which literally means “returning to a less advanced state”.
a) For simplicity, let's assume that in our sample of fathers and sons, the data are normally
distributed and both the fathers and sons have mean height 70 inches and s.d. of height = 2
inches. The correlation between height of father (F) and height of son (S) is 0.7.
i. Suppose Y=21+0.7X. Invert the equation to express X as a function of Y. What is the slope?
ii. What is the fitted least squares regression equation (Y=S, X=F)?
iii. What is the predicted son’s height if the father’s height is 76 inches?
iv. What is the fitted least squares regression equation (Y=F, X=S)?
v. What is the predicted father’s height if the son’s height is 74.2 inches?
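The arithmetic in part (a) can be checked with a short Python sketch. It assumes the standard least squares identities: the fitted slope is r times s.d.(Y)/s.d.(X), and the fitted line passes through the point of means. The function name `ls_line` is made up for this illustration.

```python
def ls_line(mean_x, mean_y, sd_x, sd_y, r):
    """Intercept and slope of the least squares line of Y on X."""
    slope = r * sd_y / sd_x
    intercept = mean_y - slope * mean_x
    return intercept, slope


mean_h, sd_h, r = 70.0, 2.0, 0.7

# Regression of son (Y=S) on father (X=F), then a prediction at F = 76.
a_sf, b_sf = ls_line(mean_h, mean_h, sd_h, sd_h, r)
son_hat = a_sf + b_sf * 76.0

# Regression of father (Y=F) on son (X=S), evaluated at the predicted son.
a_fs, b_fs = ls_line(mean_h, mean_h, sd_h, sd_h, r)
father_hat = a_fs + b_fs * son_hat

# Algebraic inverse of Y = a + bX, i.e. X = -a/b + (1/b) Y.
inv_intercept, inv_slope = -a_sf / b_sf, 1.0 / b_sf

print("S on F:", a_sf, b_sf, " predicted son:", son_hat)
print("F on S:", a_fs, b_fs, " predicted father:", father_hat)
print("algebraic inverse:", inv_intercept, inv_slope)
```

Because the means and standard deviations are equal, both fitted lines have slope 0.7, whereas the algebraic inverse of the first line has slope 1/0.7. Predicting back from the predicted son therefore does not return the original 76 inches.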
b) Another way to think about this is to consider the factors that influence the height of both
father and son – shared genetic and environmental factors. Since the heights of identical
twins who grow up in the same household are almost identical
(http://serendip.brynmawr.edu/biology/b103/f00/web3/hayesconroya3.html), for simplicity let
us suppose that
F=W+u
S=W+v
where W is the component of height caused by the shared factors, and u and v are
components of height that are independent of W and of each other. W, u and v
are all Normal random variables.
i. Show that if E(F)=E(S) then E(u)=E(v).
ii. Show that if Var(F)=Var(S) then Var(u)=Var(v).
iii. If correlation(F,S)=0.7 and s.d.(F)=s.d.(S)=2.0, what is s.d.(W)?
iv. For simplicity in this model, we usually assume that E(u)=E(v)=0.
Now consider a man who is 76 inches tall. Is it more likely that his value of u is positive or
negative?
v. Consider the son of the man in iv. The son has the same value of W, but an
independent value of v which is Normally distributed with mean 0. Is the son more likely to
be taller or shorter than the father? Why?
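The model in part (b) can also be explored by simulation. The Python sketch below is one possible setup, not the only one: it assumes W carries the mean height of 70 inches, that u and v have mean 0 and equal variance, and it uses the fact that under independence Cov(F,S)=Var(W), so Var(W) = 0.7 x 4 and Var(u) = Var(v) = 4 - Var(W). The seed, the sample size and the half-inch window around 76 inches are arbitrary choices made for this illustration.

```python
import math
import random

random.seed(511)  # arbitrary seed so the run is repeatable

MEAN, SD, RHO = 70.0, 2.0, 0.7
sd_w = math.sqrt(RHO) * SD          # Var(W) = Cov(F, S) = RHO * SD**2
sd_u = math.sqrt(1.0 - RHO) * SD    # Var(u) = Var(v) = SD**2 - Var(W)

fathers, sons, u_parts = [], [], []
for _ in range(200_000):
    w = random.gauss(MEAN, sd_w)    # shared factor (assumed to carry the mean)
    u = random.gauss(0.0, sd_u)     # father's unique component
    v = random.gauss(0.0, sd_u)     # son's unique component
    fathers.append(w + u)
    sons.append(w + v)
    u_parts.append(u)


def corr(xs, ys):
    """Sample correlation coefficient."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


sample_corr = corr(fathers, sons)

# Fathers whose height is within half an inch of 76 inches.
tall = [i for i, f in enumerate(fathers) if abs(f - 76.0) < 0.5]
mean_father = sum(fathers[i] for i in tall) / len(tall)
mean_son = sum(sons[i] for i in tall) / len(tall)
mean_u = sum(u_parts[i] for i in tall) / len(tall)

print("correlation(F, S):", round(sample_corr, 3))
print("tall fathers:", len(tall), " mean height:", round(mean_father, 2))
print("mean u of tall fathers:", round(mean_u, 2))
print("mean height of their sons:", round(mean_son, 2))
```

In a run of this sketch the tall fathers have, on average, a positive u, and their sons are on average shorter than they are, since each son shares W but draws a fresh v with mean 0.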
Note: (a) and (b) give 2 alternative ways of thinking about regression to the mean. (a), the
regression model, shows that regression to the mean is due to the fact that the slope of the inverse
regression is not the inverse of the regression slope (which it would be if there were no error).
(b), the factor analysis model, shows that regression to the mean is due to the fact that when X and
Y are both random with a “common factor”, an outlying value of X is achieved by having BOTH an
outlying value of the common factor and an outlying value of the “unique” factor, while the
associated Y, on average, has only an outlying value of the common factor, and is therefore “less
outlying”.