Answers to Homework #1
The dataset is the Framinghamold.dta and the variable of interest is BMI (body mass
index).
http://www.cdc.gov/nccdphp/dnpa/healthyweight/assessing/bmi/adult_BMI/about_adult_BMI.htm
I would like you to create a log file which you will turn in as part of the homework.
I want you to use Stata to find out the following things:
1. Do you have complete data for BMI and, if not, how many people do not have a
value for BMI recorded?
The easiest way to determine whether there are missing values, and how many,
is to use the command codebook. There are a total of 4434
people in the Framingham dataset and 19 of them have missing values for BMI
(see the "missing .:" entry in the output below).
. codebook bmi

-----------------------------------------------------------------------------
bmi                                                 body mass index (kr/(m*m)
-----------------------------------------------------------------------------

                  type:  numeric (float)

                 range:  [15.54,56.8]                 units:  .01
         unique values:  1393                     missing .:  19/4434

                  mean:   25.8462
              std. dev:   4.10182

           percentiles:        10%       25%       50%       75%       90%
                             21.08     23.09     25.45     28.09     30.85

2. If there are missing data, how many women don’t have a value for BMI
recorded?
One way to get the answer to this question is to use the tabulate command with
sex while selecting only those who have a missing value for BMI. You can see
below that 14 of the 19 people missing values of BMI are women.
. tab sex if bmi == .

        Sex |      Freq.     Percent        Cum.
------------+-----------------------------------
       Male |          5       26.32       26.32
     Female |         14       73.68      100.00
------------+-----------------------------------
      Total |         19      100.00
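
The counts for questions 1 and 2 can be cross-checked with a little arithmetic outside Stata. The following is a minimal Python sketch, not part of the Stata log; the variable names are illustrative and the numbers are simply copied from the codebook and tab output rather than recomputed from the dataset.

```python
# Counts copied from the Stata output (not recomputed from the dataset).
total_people = 4434                        # denominator in "missing .: 19/4434"
missing_by_sex = {"Male": 5, "Female": 14}

missing_total = sum(missing_by_sex.values())
complete_total = total_people - missing_total

# Share of the missing-BMI group in each sex, as in the Percent column.
missing_pct = {
    sex: round(100 * n / missing_total, 2) for sex, n in missing_by_sex.items()
}

print(missing_total, complete_total, missing_pct)
```

The number of complete cases obtained this way, 4415, agrees with the total in the bmi_grp table for question 5.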
3. For the variable BMI, graph box-and-whisker plots for men and women.
Describe the differences you see in the plots for men and women with respect to
location and variability.
. graph box bmi, by(sex)
The height of the box for the women is larger than that for the men, indicating
that the interquartile range of the women is larger than that for the men.
Notice that for the men the distance from the point with the lowest BMI (which
happens to be outside the lower whisker) to the point with the largest BMI
(which is outside the upper whisker) is considerably shorter than the same
distance for the women (if you measure from the end of the lower whisker
instead of from the smallest BMI, your answer shouldn’t be too far off). This
says the range for the women is larger than the range for the men.
The two observations above both point to more variability in the BMI for women
than that for men.
There is also a longer upper tail for the women than for the men (i.e. the
skewness for women is larger than that for men).
The fact that the median line for the women is not in the center of the box also
indicates skewness. The median line for the men is much more centered than
that of the women. The median line for the women is lower than that for the
men, indicating that the median for the men is larger than the median for the
women.
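
Which points fall outside the whiskers can be checked from the quartiles alone. The sketch below is illustrative Python, not part of the Stata log. It assumes the common rule that whiskers extend at most 1.5 times the interquartile range beyond the box; that rule is an assumption here, so check the graph box documentation for the exact definition. The quartiles and extremes are copied from the summarize output for question 4.

```python
# Quartiles and extremes copied from the `summarize, detail` output.
groups = {
    "Male":   {"q1": 23.97, "q3": 28.32, "min": 15.54, "max": 40.38},
    "Female": {"q1": 22.54, "q3": 27.82, "min": 15.96, "max": 56.80},
}

def fences(q1, q3, k=1.5):
    """Return (lower, upper) limits; points beyond them lie outside the whiskers.

    The multiplier k = 1.5 is an assumed convention, not taken from the log.
    """
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

for sex, g in groups.items():
    lo, hi = fences(g["q1"], g["q3"])
    print(f"{sex}: fences [{lo:.2f}, {hi:.2f}], "
          f"min outside: {g['min'] < lo}, max outside: {g['max'] > hi}")
```

Under this assumption the smallest male BMI (15.54) lies below the men's lower limit, consistent with the remark that it is outside the lower whisker, while the smallest female BMI (15.96) lies inside the women's lower limit. The largest BMI is outside the upper limit for both sexes.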
4. For the men and women separately, please give the values of the statistics
listed in the table below (I didn’t ask you for skewness):

Body mass index (kg/m^2)           Men                      Women
--------------------------------------------------------------------------------
Mean (kg/m^2)                      26.2                     25.6
Median (kg/m^2)                    26.1                     24.8
  (50th percentile)
Variance ((kg/m^2)^2)              11.61                    20.77
Standard deviation (kg/m^2)        3.41                     4.56
  (square root of the variance)
Standard error (kg/m^2)            0.08                     0.09
  (standard deviation divided by
  the square root of the sample
  size)
Range (kg/m^2)                     40.38 - 15.54 = 24.84    56.80 - 15.96 = 40.84
  (largest value of BMI -
  smallest value of BMI)
Interquartile range (kg/m^2)       28.32 - 23.97 = 4.35     27.82 - 22.54 = 5.28
  (75th percentile -
  25th percentile)
Skewness                           0.33                     1.24
Notice that all of the measures of location are larger for the men than the women but all
of the measures of variability are smaller for the men than the women.
Also notice that the numbers in the table agree with our description of the
box-and-whisker plots.
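
Several rows of the table follow directly from other reported values: the variance is the square of the standard deviation, and the range and interquartile range are simple differences. The following is a small Python sketch of that arithmetic, not part of the Stata log; the dictionary layout and function name are illustrative, and the inputs are copied from the summarize output for this question.

```python
# Values copied from the `bysort sex: sum(bmi), det` output.
summary = {
    "Male":   {"sd": 3.407115, "min": 15.54, "max": 40.38, "p25": 23.97, "p75": 28.32},
    "Female": {"sd": 4.557443, "min": 15.96, "max": 56.80, "p25": 22.54, "p75": 27.82},
}

def derived(stats):
    """Compute the table rows that follow from other reported values."""
    return {
        "variance": round(stats["sd"] ** 2, 2),            # (kg/m^2)^2
        "range": round(stats["max"] - stats["min"], 2),    # kg/m^2
        "iqr": round(stats["p75"] - stats["p25"], 2),      # kg/m^2
    }

table = {sex: derived(stats) for sex, stats in summary.items()}
for sex, row in table.items():
    print(sex, row)
```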
. bysort sex: sum(bmi),det

-----------------------------------------------------------------------------
-> sex = Male

                  body mass index (kr/(m*m)
-------------------------------------------------------------
      Percentiles      Smallest
 1%        18.88          15.54
 5%        20.56          16.59
10%        21.86          16.87       Obs                1939
25%        23.97          16.98       Sum of Wgt.        1939

50%        26.08                      Mean           26.16958
                        Largest       Std. Dev.      3.407115
75%        28.32          39.88
90%        30.41          40.08       Variance       11.60843
95%         31.8          40.11       Skewness       .3309721
99%        35.31          40.38       Kurtosis        3.69707

-----------------------------------------------------------------------------
-> sex = Female

                  body mass index (kr/(m*m)
-------------------------------------------------------------
      Percentiles      Smallest
 1%        17.93          15.96
 5%        19.68          16.48
10%        20.68          16.59       Obs                2476
25%        22.54          16.61       Sum of Wgt.        2476

50%        24.83                      Mean           25.59288
                        Largest       Std. Dev.      4.557443
75%        27.82          45.79
90%        31.37           45.8       Variance       20.77029
95%        34.25          51.28       Skewness       1.239948
99%        40.23           56.8       Kurtosis       5.861763
So as not to have to calculate the standard error by hand, I have used the mean
command below. I also show you how to tell which numeric codes represent men and women.
. label list
sexlbl:
           1 Male
           2 Female
bmigrplbl:
           1 < 18.5
           2 [18.5,25)
           3 [25,30)
           4 30+
. mean bmi if sex == 1
(From the label list above we know these are the results for the men)

Mean estimation                     Number of obs    =    1939

--------------------------------------------------------------
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
         bmi |   26.16958   .0773745      26.01784    26.32133
--------------------------------------------------------------

. mean bmi if sex == 2

Mean estimation                     Number of obs    =    2476

--------------------------------------------------------------
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
         bmi |   25.59288   .0915896      25.41328    25.77248
--------------------------------------------------------------
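
For comparison, the hand calculation that the mean command avoids is short: the standard error is the standard deviation divided by the square root of the sample size. The following Python sketch, which is not part of the Stata log and uses illustrative names, applies that formula to the standard deviations and sample sizes from the summarize output. Because those inputs are themselves rounded, the results should match the Std. Err. column only to about six decimal places.

```python
import math

# Standard deviations and sample sizes copied from the summarize output.
groups = {
    "Male":   {"sd": 3.407115, "n": 1939},
    "Female": {"sd": 4.557443, "n": 2476},
}

def standard_error(sd, n):
    """Standard error of the mean: standard deviation over sqrt(sample size)."""
    return sd / math.sqrt(n)

se = {sex: standard_error(g["sd"], g["n"]) for sex, g in groups.items()}
for sex, value in se.items():
    print(f"{sex}: {value:.7f}")
```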
5. At the bottom of the list of variables in this file you will find a variable called
bmi_grp. The variable allows you to quickly find out how many people are
classified as underweight, normal weight, overweight and obese. For normal
weight the CDC website uses a BMI of greater than or equal to 18.5 and less
than 24.9. The variable bmi_grp uses an interval written as [18.5, 25). The
square bracket “[” means greater than or equal to 18.5. The rounded bracket
“)” means less than 25. So the CDC categories and the categories of the
variable bmi_grp are essentially the same.
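
The half-open interval notation can be made concrete with a short sketch. The Python below is illustrative and not part of the Stata log: the function name bmi_group is hypothetical, and the cut points and labels are taken from the bmigrplbl value labels shown in the label list output. It assigns a BMI to the category whose left endpoint is included and whose right endpoint is excluded.

```python
from bisect import bisect_right

# Cut points and labels taken from the bmigrplbl value labels.
CUTS = [18.5, 25.0, 30.0]
LABELS = ["< 18.5", "[18.5,25)", "[25,30)", "30+"]

def bmi_group(bmi):
    """Return the half-open category containing bmi (left endpoint included)."""
    # bisect_right counts the cut points that are <= bmi, so a value equal
    # to a cut point is placed in the category that starts at that cut point.
    return LABELS[bisect_right(CUTS, bmi)]

for value in (18.49, 18.5, 24.95, 25.0, 30.0):
    print(value, "->", bmi_group(value))
```

A value such as 24.95 shows the small gap between the two schemes: it belongs to [18.5,25) here, although it is above the 24.9 quoted from the CDC website.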
Describe the differences in the distributions of men and women across the 4
CDC categories.
Notice below that the number of women (2,476) is somewhat larger than the number of
men (1,939). This means that you shouldn’t compare the raw numbers because that
can be misleading; compare the column percentages instead.
The most important feature in the table below is the fact that approximately 50%
of the men are in the [25, 30) category whereas approximately 50% of the
women are in the [18.5, 25) category.
. tab bmi_grp sex,col

+-------------------+
| Key               |
|-------------------|
|     frequency     |
| column percentage |
+-------------------+

       BMI |          Sex
categories |      Male     Female |     Total
-----------+----------------------+----------
    < 18.5 |        12         45 |        57
           |      0.62       1.82 |      1.29
-----------+----------------------+----------
 [18.5,25) |       703      1,233 |     1,936
           |     36.26      49.80 |     43.85
-----------+----------------------+----------
   [25,30) |       992        853 |     1,845
           |     51.16      34.45 |     41.79
-----------+----------------------+----------
       30+ |       232        345 |       577
           |     11.96      13.93 |     13.07
-----------+----------------------+----------
     Total |     1,939      2,476 |     4,415
           |    100.00     100.00 |    100.00
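
The column percentages are what make the two sexes comparable despite the unequal group sizes. As a sketch of how they arise, the Python below (illustrative names, not part of the Stata log) divides each frequency by its column total, using the frequencies copied from the table.

```python
# Frequencies copied from the `tab bmi_grp sex, col` output.
counts = {
    "< 18.5":    {"Male": 12,  "Female": 45},
    "[18.5,25)": {"Male": 703, "Female": 1233},
    "[25,30)":   {"Male": 992, "Female": 853},
    "30+":       {"Male": 232, "Female": 345},
}

def column_percentages(table):
    """Divide each cell by its column (sex) total and express it as a percentage."""
    sexes = list(next(iter(table.values())))
    totals = {s: sum(row[s] for row in table.values()) for s in sexes}
    return {
        group: {s: round(100 * row[s] / totals[s], 2) for s in sexes}
        for group, row in table.items()
    }

col_pct = column_percentages(counts)
for group, row in col_pct.items():
    print(f"{group:>10}  {row['Male']:6.2f}  {row['Female']:6.2f}")
```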