Answers to Homework #1

The dataset is Framinghamold.dta and the variable of interest is BMI (body mass index).

http://www.cdc.gov/nccdphp/dnpa/healthyweight/assessing/bmi/adult_BMI/about_adult_BMI.htm

I would like you to create a log file, which you will turn in as part of the homework. I want you to use Stata to find out the following things:

1. Do you have complete data for BMI and, if not, how many people do not have a value for BMI recorded?

The easiest way to determine whether there are missing values, and how many, is to use the command codebook. There are a total of 4434 people in the Framingham dataset and 19 of them have missing values for BMI (see the "missing .:" entry below).

. codebook bmi

-----------------------------------------------------------------------------
bmi                                                 body mass index (kr/(m*m)
-----------------------------------------------------------------------------

                  type:  numeric (float)

                 range:  [15.54,56.8]                 units:  .01
         unique values:  1393                     missing .:  19/4434

                  mean:   25.8462
              std. dev:   4.10182

           percentiles:        10%       25%       50%       75%       90%
                             21.08     23.09     25.45     28.09     30.85

2. If there are missing data, how many women don’t have a value for BMI recorded?

One way to answer this question is to use the tabulate command with sex while selecting only those who have a missing value for BMI. You can see below that 14 of the 19 people missing values of BMI are women.

. tab sex if bmi == .

        Sex |      Freq.     Percent        Cum.
------------+-----------------------------------
       Male |          5       26.32       26.32
     Female |         14       73.68      100.00
------------+-----------------------------------
      Total |         19      100.00

3. For the variable BMI, graph box-and-whisker plots for men and women. Describe the differences you see in the plots for men and women with respect to location and variability.

. graph box bmi, by(sex)

The height of the box for the women is larger than that for the men, indicating that the interquartile range for the women is larger than that for the men.
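The comparison of box heights can also be checked numerically, since the box runs from the 25th to the 75th percentile. The following is a minimal sketch, assuming Framinghamold.dta is already in memory; it uses tabstat, a command not used elsewhere in this answer, so treat it as one possible check rather than part of the required log.

```stata
* Quartiles, interquartile range and range of BMI, separately by sex.
* Assumes Framinghamold.dta has already been loaded with -use-.
tabstat bmi, by(sex) statistics(p25 p50 p75 iqr range) format(%6.2f)
```

If the plot has been read correctly, the iqr column should be larger for the women, in line with the quartiles from the detailed summary in question 4 (28.32 - 23.97 = 4.35 for men, 27.82 - 22.54 = 5.28 for women).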
Notice that for the men the distance from the point with the lowest BMI (which happens to be outside the lower whisker) to the point with the largest BMI (which is outside the upper whisker) is considerably shorter than the same distance for the women (the answer you get using the lower whisker for the smallest BMI shouldn’t be too far from the smallest value). This says the range for the women is larger than the range for the men. The two observations above both point to more variability in BMI for the women than for the men.

There is also a longer upper tail for the women than for the men (i.e. the skewness for women is larger than that for men). The fact that the median line for the women is not in the center of the box also indicates skewness. The median line for the men is much more centered than that of the women. The median line for the women is lower than that for the men, indicating that the median for the men is larger than the median for the women.

4. For the men and women separately, please give the values of the statistics listed in the table below (I didn’t ask you for skewness):

Body mass index                                   Men                        Women
------------------------------------------------------------------------------------------------
Mean (kg/m2)                                      26.2                       25.6
Median (kg/m2)                                    26.1                       24.8
  (50th percentile)
Variance ((kg/m2)2)                               11.61                      20.77
Standard deviation (kg/m2)                        3.41                       4.56
  (square root of the variance)
Standard error (kg/m2)                            0.08                       0.09
  (standard deviation divided by the
   square root of the sample size)
Range (kg/m2)                                     40.38 - 15.54 = 24.84      56.80 - 15.96 = 40.84
  (largest value of BMI - smallest value of BMI)
Interquartile range (kg/m2)                       28.32 - 23.97 = 4.35       27.82 - 22.54 = 5.28
  (75th percentile - 25th percentile)
Skewness                                          0.33                       1.24

Notice that all of the measures of location are larger for the men than for the women, but all of the measures of variability are smaller for the men than for the women. Also notice that the numbers in the table agree with our description of the box-and-whisker plots.

. bysort sex: sum(bmi),det

-----------------------------------------------------------------------------
-> sex = Male

                  body mass index (kr/(m*m)
-------------------------------------------------------------
      Percentiles      Smallest
 1%        18.88          15.54
 5%        20.56          16.59
10%        21.86          16.87       Obs                1939
25%        23.97          16.98       Sum of Wgt.        1939

50%        26.08                      Mean           26.16958
                        Largest       Std. Dev.      3.407115
75%        28.32          39.88
90%        30.41          40.08       Variance       11.60843
95%         31.8          40.11       Skewness       .3309721
99%        35.31          40.38       Kurtosis        3.69707

-----------------------------------------------------------------------------
-> sex = Female

                  body mass index (kr/(m*m)
-------------------------------------------------------------
      Percentiles      Smallest
 1%        17.93          15.96
 5%        19.68          16.48
10%        20.68          16.59       Obs                2476
25%        22.54          16.61       Sum of Wgt.        2476

50%        24.83                      Mean           25.59288
                        Largest       Std. Dev.      4.557443
75%        27.82          45.79
90%        31.37           45.8       Variance       20.77029
95%        34.25          51.28       Skewness       1.239948
99%        40.23           56.8       Kurtosis       5.861763

So as not to have to calculate the standard error by hand, I have used the mean command below. I also show you how to tell which numbers represent men and which represent women.

. label list
sexlbl:
           1 Male
           2 Female
bmigrplbl:
           1 < 18.5
           2 [18.5,25)
           3 [25,30)
           4 30+

. mean bmi if sex == 1

(From the label list above we know these are the results for the men.)

Mean estimation                     Number of obs    =    1939

--------------------------------------------------------------
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
         bmi |   26.16958   .0773745      26.01784    26.32133
--------------------------------------------------------------

. mean bmi if sex == 2

Mean estimation                     Number of obs    =    2476

--------------------------------------------------------------
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
         bmi |   25.59288   .0915896      25.41328    25.77248
--------------------------------------------------------------

5. At the bottom of the list of variables in this file you will find a variable called bmi_grp.
The variable allows you to quickly find out how many people are classified as underweight, normal weight, overweight and obese. For normal weight the CDC website uses a BMI greater than or equal to 18.5 and less than or equal to 24.9. The variable bmi_grp uses an interval written as [18.5, 25). The square bracket “[” means greater than or equal to 18.5. The round bracket “)” means less than 25. So the CDC categories and the categories of the variable bmi_grp are essentially the same. Describe the differences in the distributions of men and women across the 4 CDC categories.

Notice below that the number of women is somewhat larger than the number of men. This means that you shouldn’t compare the raw numbers, because that can be misleading; compare the column percentages instead. The most important feature in the table below is the fact that approximately 50% of the men are in the [25, 30) category whereas approximately 50% of the women are in the [18.5, 25) category.

. tab bmi_grp sex,col

+-------------------+
| Key               |
|-------------------|
|     frequency     |
| column percentage |
+-------------------+

       BMI |          Sex
categories |      Male     Female |     Total
-----------+----------------------+----------
    < 18.5 |        12         45 |        57
           |      0.62       1.82 |      1.29
-----------+----------------------+----------
 [18.5,25) |       703      1,233 |     1,936
           |     36.26      49.80 |     43.85
-----------+----------------------+----------
   [25,30) |       992        853 |     1,845
           |     51.16      34.45 |     41.79
-----------+----------------------+----------
       30+ |       232        345 |       577
           |     11.96      13.93 |     13.07
-----------+----------------------+----------
     Total |     1,939      2,476 |     4,415
           |    100.00     100.00 |    100.00
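Because the comparison rests on percentages rather than counts, it can help to suppress the frequencies and to confirm one percentage by hand. The following is a sketch, assuming Framinghamold.dta is still in memory; nofreq is an option of tabulate, and the display line simply divides the number of men in [25,30) by the total number of men with a recorded BMI, both taken from the table above.

```stata
* Show only the column percentages for the BMI categories by sex.
tabulate bmi_grp sex, column nofreq

* Hand check of one cell: men in [25,30) as a percentage of all men with a BMI.
display %5.2f 100*992/1939
```

The hand check should agree with the 51.16 reported for men in the [25,30) row. The same division applied to the women (1,233 out of 2,476) gives the roughly 50% figure for the [18.5,25) category.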