Statistics 511 Homework 8 Fall 2006
Due Friday Nov 10.

1. This is a continuation of Homework 7.

Body density is an important health indicator in humans, but it is difficult to measure. One direct method is to immerse the person entirely in water and measure the amount of water displaced. It would be preferable to predict body density accurately from measurements taken more readily in the doctor's office. In a 1974 study, body density was determined for 252 volunteers using the immersion method. The following measurements were taken:

   DENSITY   body density
   FAT       body fat determined from underwater weighing
   AGE       Age (years)
   WEIGHT    Weight (lbs)
   HEIGHT    Height (inches)
   NECK      Neck circumference (cm)
   CHEST     Chest circumference (cm)
   ABDOMEN   Abdomen circumference (cm)
   HIP       Hip circumference (cm)
   THIGH     Thigh circumference (cm)
   KNEE      Knee circumference (cm)
   ANKLE     Ankle circumference (cm)
   BICEPS    Biceps (extended) circumference (cm)
   FOREARM   Forearm circumference (cm)
   WRIST     Wrist circumference (cm)

We are going to focus on 4 predictors of FAT, all of which are easily measured: WEIGHT, ABDOMEN, THIGH and WRIST.

a. Do a simultaneous test of whether THIGH and WRIST are significant regressors when WEIGHT and ABDOMEN are in the model.
b. Compute the Variance Inflation Factors for all of the variables in the model. Is there evidence of multicollinearity?
c. Take a look at the Partial Regression Plots. Note any interesting features such as curvature or extremely outlying points.
d. Include the Partial Regression Plot for WEIGHT with your homework. (Note that this can be cut and pasted like regular output, as it is in the Output Window, not the Graphics Window.)

   MODEL Y=X1 X2/PARTIAL VIF TOL;

PARTIAL creates the Partial Regression Plots. VIF prints the variance inflation factors. TOL prints the tolerance.

2. Ecological analysis: An analysis in which the sampling unit is the mean over a subpopulation. For example:
The connection between birth control hormone use and heart disease in young women was first proposed when it was noted that the rate of heart attacks in premenopausal women was increasing in countries in which the use of birth control pills had become popular, and the regression of “heart attack rate” on “rate of birth control use” was shown to have a positive slope.

Consider the figure below (which is purely imaginary). (Note, however, that some ecological studies do show patterns somewhat similar to this figure.) Explain how an ecological analysis can contradict the results of a subgroup-by-subgroup analysis.

[Figure: scatterplot of y against x, with x running from 0 to 50 and y from -10 to 15; the groups are labeled India, France, Canada, PRC and USA.]

3. Regression to the mean: The term regression was coined by Francis Galton, who was studying the relationship between the heights of fathers and the heights of their sons. He fitted the relationship by what is now called linear regression. Galton noticed that the fathers who were most extreme in height had sons who were less extreme, and considered this an example of “regressing”, which literally means “returning to a less advanced state”.

a) For simplicity, let's assume that in our sample of fathers and sons the data are normally distributed, and that both the fathers and the sons have mean height 70 inches and s.d. of height = 2 inches. The correlation between height of father (F) and height of son (S) is 0.7.

   i. Suppose Y=21+0.7X. Invert the equation to express X as a function of Y. What is the slope?
   ii. What is the fitted least squares regression equation (Y=S, X=F)?
   iii. What is the predicted son’s height if the father’s height is 76 inches?
   iv. What is the fitted least squares regression equation (Y=F, X=S)?
   v. What is the predicted father’s height if the son’s height is 74.2 inches?

b) Another way to think about this is to consider the factors that influence the height of both father and son: shared genetic and environmental factors.
Since the heights of identical twins who grow up in the same household are almost identical (http://serendip.brynmawr.edu/biology/b103/f00/web3/hayesconroya3.html), for simplicity let us suppose that

   F = W + u
   S = W + v

where W is the component of height caused by the shared factor, and u and v are components of height that are independent of W and also independent of each other. W, u and v are all random Normals.

   i. Show that if E(F)=E(S) then E(u)=E(v).
   ii. Show that if Var(F)=Var(S) then Var(u)=Var(v).
   iii. If correlation(F,S)=0.7 and s.d.(F)=s.d.(S)=2.0, what is s.d.(W)?
   iv. For simplicity in this model, we usually assume that E(u)=E(v)=0. Now consider a man who is 76 inches tall. Is it more likely that his value of u is positive or negative?
   v. Consider the son of the man in iv. The son has the same value of W, but an independent value of v which is Normally distributed with mean 0. Is the son more likely to be taller or shorter than the father? Why?

Note: (a) and (b) give 2 alternative ways of thinking about regression to the mean.

(a), the regression model, shows that regression to the mean is due to the fact that the slope of the inverse regression is not the inverse of the regression slope (which it would be if there were no error).

(b), the factor analysis model, shows that regression to the mean is due to the fact that when X and Y are both random with a “common factor”, an outlying value of X is achieved by having BOTH an outlying value of the common factor and an outlying value of the “unique” factor, while the associated Y, on average, has only an outlying value of the common factor, and is therefore “less outlying”.
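
The two views in the Note can also be checked numerically. The following is a minimal simulation sketch of the factor model F = W + u, S = W + v, not part of the assignment. The function names (simulate_heights, ols_slope, summarize), the sample size, the cutoff of 72 inches, and the standard deviations (s.d.(W) = s.d.(u) = s.d.(v) = 1) are all illustrative assumptions, deliberately different from the values in Problem 3.

```python
import random


def simulate_heights(n=20000, mean_w=70.0, sd_w=1.0, sd_u=1.0, seed=0):
    """Draw n (father, son) pairs from F = W + u, S = W + v.

    All parameter values are illustrative assumptions, not the homework's.
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        w = rng.gauss(mean_w, sd_w)   # shared (common) factor W
        f = w + rng.gauss(0.0, sd_u)  # father: W plus unique factor u
        s = w + rng.gauss(0.0, sd_u)  # son: same W plus independent v
        pairs.append((f, s))
    return pairs


def mean(values):
    return sum(values) / len(values)


def ols_slope(xs, ys):
    """Least squares slope of the regression of ys on xs."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


def summarize(pairs, cutoff=72.0):
    """Compare the two regression slopes and look at unusually tall fathers."""
    fathers = [f for f, _ in pairs]
    sons = [s for _, s in pairs]
    tall = [(f, s) for f, s in pairs if f > cutoff]  # hypothetical cutoff
    return {
        "slope_son_on_father": ols_slope(fathers, sons),
        "slope_father_on_son": ols_slope(sons, fathers),
        "mean_tall_fathers": mean([f for f, _ in tall]),
        "mean_sons_of_tall_fathers": mean([s for _, s in tall]),
    }


if __name__ == "__main__":
    for name, value in summarize(simulate_heights()).items():
        print(f"{name}: {value:.3f}")
```

Under these assumed values, Var(W)/(Var(W)+Var(u)) = 1/2, so both fitted slopes should come out near 0.5 rather than being reciprocals of each other, and the sons of the tall fathers should average closer to 70 inches than their fathers do.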