Chapter 2-99. Homework Problem Solutions

Chapter 2-1. Describing variables, levels of measurement, and choice of
descriptive statistics

Problem 1) Read in the data file

From the Stata menu bar, click on File, find the directory datasets &
do-files, which is a subdirectory of the course manual, and open the file
births_with_missing.dta. Your directory path will differ, but something like
the following will be displayed in the Stata Results window:

  use "C:\Documents and Settings\u0032770.SRVR\Desktop\Biostats & Epi With
      Stata\datasets & do-files\births_with_missing.dta", clear

Problem 2) Listing data

List the data for bweight.

  list bweight

a) At the first "--more--" prompt, with the cursor in the Command window, hit
   the Enter key a couple of times (notice this scrolls one line at a time).
b) With the cursor in the Command window, hit the space bar a couple of times
   (notice this scrolls a page at a time).
c) Click on the "--more--" prompt with the mouse (notice this scrolls a page
   at a time, as well).
d) We have seen enough. Hit the stop icon on the menu bar (the red dot with a
   white X in the middle of it). This terminates (breaks) the output.

       +---------+
       | bweight |
       |---------|
    1. |    2974 |
    2. |    3270 |
    3. |    2620 |
    4. |    3751 |
    5. |    3200 |
       |---------|
    6. |    3673 |
       ...

______________________
Source: Stoddard GJ. Biostatistics and Epidemiology Using Stata: A Course
Manual. Salt Lake City, UT: University of Utah School of Medicine. Chapter
2-99. (Accessed January 8, 2012, at
http://www.ccts.utah.edu/biostats/?pageId=5385).

Chapter 2-99 (revision 8 Jan 2012) p. 1

Problem 3) Frequency table

Create a frequency table for the variable lowbw.

  tabulate lowbw     * <or> tab lowbw

    low birth |
       weight |      Freq.     Percent        Cum.
  ------------+-----------------------------------
            0 |        420       87.87       87.87
            1 |         58       12.13      100.00
  ------------+-----------------------------------
        Total |        478      100.00

Problem 4) Histogram

Create a histogram for the variable gestwks, asking for percents on the
y-axis rather than proportions (density).

  histogram gestwks , percent

[Histogram: x-axis "gestation period" (25 to 45 weeks), y-axis "Percent"
(0 to 25).]

Problem 5) Kernel Density Plot

Create a kernel density plot for the variable gestwks, overlaying the graphs
for male and female newborns.

The following will work in the do-file editor. If you did this in the Command
window, you would exclude the command continuation symbol "///" and make the
command just one long command on the same line.

  twoway (kdensity gestwks if sexalph == "female", lcolor(pink)) ///
         (kdensity gestwks if sexalph == "male"  , lcolor(blue))

[Overlaid kernel density plot: x-axis "x" (25 to 45), y-axis
"kdensity gestwks" (0 to .3), one curve per sex.]

Problem 6) Box plot

Create a boxplot for the variable bweight, showing male and female newborns
on the same graph.

  graph box bweight, over(sexalph)

[Box plot: y-axis "birth weight in grams" (1,000 to 5,000), one box each for
female and male.]

Problem 7) Visualizing Distribution From Descriptive Statistics

A variable has the following descriptive statistics:

  Mean = 45
  Median = 50
  SD = 3

Is this distribution symmetrical, or is it skewed? If skewed, is it left or
right skewed?

The distribution is left skewed. If it were symmetrical, the mean and median
would be very close to the same number. You can see that the mean is 5 less
than the median, which is (mean - median)/SD = (45-50)/3 = -5/3 = -1.67 SDs
apart. Recall that a normal distribution, which is symmetrical, has
approximately six SDs from the minimum to the maximum (the middle 99.7% of
the distribution). So -1.67 SDs, which is nearly -2 SDs, is about 1/3 of the
distribution apart, which would be a very noticeable skewness if the
distribution were graphed.
Since the mean is to the left of the median, it is said to be "left" skewed:
the long tail is in the direction of the skewness.

Problem 8) Visualizing Distribution From Descriptive Statistics

A variable has the following descriptive statistics:

  Mean = 50
  Median = 49
  SD = 10

Is this distribution symmetrical, or is it skewed? If skewed, is it left or
right skewed?

Although it would technically be correct to say it is right skewed, since the
mean is greater than the median, the distribution is symmetrical for all
practical purposes. Even though the mean and median differ by 1 point, the
difference is only (mean - median)/SD = 1/10 SD, which would hardly be
noticeable if the histogram were displayed.

Munro (2001, p.42) provides Pearson's skewness coefficient as a way to assess
skewness, which is what was used in the preceding paragraph:

  Pearson's skewness coefficient: skewness = (mean - median)/SD

Notice that the sign of the Pearson skewness coefficient agrees with the
concept of "left" and "right" skewness, being negative or positive on the
number line (to the left or right on the number line).

Munro (2001, p.43) gives Hildebrand's rule of thumb for thinking about
skewness: "Hildebrand (1986) states that skewness values above 0.2 or below
-0.2 indicate severe skewness." The 1/10 SD, or 0.1 SD, is within the -0.2 to
0.2 range, so the skewness is not severe enough to be of practical concern.

There is also an official statistic called skewness, which is given by the
summarize command with the detail option.

NOTE: A discussion of the assessment of skewness was not even provided in
Chapter 2-1. It turns out that it is a relatively unimportant concept.
Statistical tests, such as the t test to be covered later in this course,
assume a normal, or symmetrical, distribution. However, the test is very
robust to violations of this assumption, giving correct results anyway.
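As a quick illustration (our own sketch, not part of the manual), Pearson's
coefficient for the two examples above can be checked in a couple of lines of
Python:

```python
# Pearson's skewness coefficient: (mean - median) / SD.
# Its sign matches the direction of the skew (negative = left, positive = right).
def pearson_skewness(mean, median, sd):
    return (mean - median) / sd

print(round(pearson_skewness(45, 50, 3), 2))   # Problem 7: -1.67, severe left skew
print(round(pearson_skewness(50, 49, 10), 2))  # Problem 8: 0.1, negligible
```

Both values fall out exactly as computed by hand in the two problems above,
with only the first exceeding Hildebrand's 0.2 rule-of-thumb threshold.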
Problem 9) Descriptive Statistics

Obtain the descriptive statistics, including the median (50th percentile),
for the variable bweight.

  summarize bweight , detail     * <or> sum bweight , detail

                        birth weight in grams
  -------------------------------------------------------------
        Percentiles      Smallest
   1%          924            628
   5%         1801            693
  10%         2399            708       Obs                 478
  25%         2878            864       Sum of Wgt.         478

  50%       3192.5                      Mean           3137.253
                          Largest       Std. Dev.       637.777
  75%         3551           4436
  90%         3804           4512       Variance       406759.5
  95%         4041           4516       Skewness      -1.039337
  99%         4423           4553       Kurtosis       5.094602

Problem 10) Descriptive Statistics by Group

Obtain the short list of descriptive statistics (N, mean, SD, min, max) for
the variable bweight, for both males and females.

  bysort sexalph: sum bweight

  -> sexalph =

      Variable |       Obs        Mean    Std. Dev.       Min        Max
  -------------+--------------------------------------------------------
       bweight |        41    2958.512    627.7393       1431       4226

  -> sexalph = female

      Variable |       Obs        Mean    Std. Dev.       Min        Max
  -------------+--------------------------------------------------------
       bweight |       212    3069.236     622.204        628       4300

  -> sexalph = male

      Variable |       Obs        Mean    Std. Dev.       Min        Max
  -------------+--------------------------------------------------------
       bweight |       225    3233.911    641.5076        693       4553

Notice the first table of descriptive statistics is for those infants with
missing data for the sexalph variable.

This could be done with a nicer format using the table or tabstat commands,
as shown at the end of Chapter 2-1, but the sum command with a bysort is much
easier to memorize.
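As a toy illustration of the core quantities summarize reports (n, mean, SD,
median), here is a short Python sketch of our own, using only the six bweight
values listed in Problem 2 rather than the full 478-observation course
dataset:

```python
import statistics

# Toy sketch (ours): n, mean, sample SD, and median for the six bweight
# values shown in the Problem 2 listing, mirroring what `summarize` reports.
weights = [2974, 3270, 2620, 3751, 3200, 3673]

n = len(weights)
mean = statistics.mean(weights)
sd = statistics.stdev(weights)      # sample SD, like Stata's Std. Dev.
median = statistics.median(weights)

print(n, mean, round(sd, 1), median)
```

On the full dataset the same quantities would match the summarize output
above (n = 478, mean = 3137.253, SD = 637.777, median = 3192.5).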
Problem 11) Best Choice of Descriptive Statistics to Describe a Variable's
Distribution

The variable race/ethnicity is coded as:

  1) Caucasian (White)
  2) African-American (Black)
  3) Asian
  4) Native American
  5) Pacific Islander

What is the level of measurement (measurement scale) of this variable? What
is the best way to describe it in a "Patient Characteristics" table of a
manuscript?

The scores simply represent labels or classifications, which have no natural
rank ordering. Thus, the level of measurement is "nominal," or an "unordered
categorical scale."

All that can be done for unordered categories is to report the count, or
frequency, for each category, along with the percent of the sample within
each category. For this variable, simply show the count and percent. It is
also popular to show just the percent, since the count can be derived from
the percent and sample size for each group if the reader chooses to. Thus,
the entire distribution is put in the table, which, with only five
categories, the reader should be able to hold in his or her head and
visualize correctly.

For the categories with very small percentages, another approach is to
combine those categories into an "other" category, which simplifies the
presentation.

Problem 12) Best Choice of Descriptive Statistics to Describe a Variable's
Distribution

The variable systolic blood pressure is coded as actual values of the
measurement. What is the variable's level of measurement? What is the best
way to describe it in a "Patient Characteristics" table of a manuscript?

The scores look like the integer number system, with equal intervals between
the values. The starting value is atmospheric pressure. If the blood pressure
increases by 10%, there is 10% more force being exerted, so ratios can be
computed with this variable. Thus, the variable is a ratio scale.
For statistical analysis, it is generally sufficient to just think of it as
an "interval scale," since the approach to the statistical analysis will
almost always be the same as with an interval scale.

To describe it, use the mean and standard deviation. If the sample of
patients is such that the variable is extremely skewed, use the median and
interquartile range instead.

Problem 13) Best Choice of Descriptive Statistics to Describe a Variable's
Distribution

The variable sex is scored as

  1) male
  2) female

What is the best way to describe it in a "Patient Characteristics" table of a
manuscript?

The variable has two categories, so it is a dichotomous, or binary, scale. It
can also be referred to as "unordered categorical."

To describe it, simply use the count and percent, or just the percent. This
actually only needs to be done for one category, either male or female, since
the reader can compute the other percent in his or her head without too much
trouble.

Problem 14) Best Choice of Descriptive Statistics to Describe a Variable's
Distribution

The variable New York Heart Association class (NYHA class)(Miller-Davis et
al, 2006) is a simple scale that classifies a patient according to how
cardiac symptoms impinge on day-to-day activities. It is scored as

  Class I)   No limitations of physical activity (ordinary physical activity
             does not cause symptoms)
  Class II)  Slight limitation of physical activity (ordinary physical
             activity does cause symptoms)
  Class III) Moderate limitation of activity (comfortable at rest but less
             than ordinary activities cause symptoms)
  Class IV)  Unable to perform any physical activity without discomfort (may
             be symptomatic even at rest); therefore severe limitation

What is the variable's level of measurement? What is the best way to describe
it in a "Patient Characteristics" table of a manuscript?
The variable is an "ordinal level of measurement" or "ordered categorical
scale."

Since there are only four categories, it should be reported as counts with
percents, or just percents. This ignores the rank ordering, but the reader
would be able to hold the distribution in his or her head just fine. If the
percents were shown in side-by-side columns, the reader could even see if the
percents were lumping up at the low end for one group versus lumping up at
the high end for the other group. If this was not obvious, also (or instead)
reporting the median and interquartile range would be helpful.

Problem 15)

Open up the file births_with_missing.dta in Stata. Compute the frequency
tables or descriptive statistics, separately for mothers with and without
hypertension, and fill in the following table with the appropriate row labels
in column one and the best choice of descriptive statistics in columns two
and three.

Table 1. Patient Characteristics

                      Maternal Hypertension    Maternal Hypertension
                            Present                   Absent
                            [N =   ]                  [N =   ]
  Maternal age, yrs
  Sex of Newborn

The descriptive statistics could be generated using,

  . tab hyp

    hypertens |      Freq.     Percent        Cum.
  ------------+-----------------------------------
            0 |        411       85.98       85.98
            1 |         67       14.02      100.00
  ------------+-----------------------------------
        Total |        478      100.00

  . bysort hyp: tab sexalph

  -> hyp = 0

    sex coded |
    as string |      Freq.     Percent        Cum.
  ------------+-----------------------------------
       female |        190       49.10       49.10
         male |        197       50.90      100.00
  ------------+-----------------------------------
        Total |        387      100.00

  -> hyp = 1

    sex coded |
    as string |      Freq.     Percent        Cum.
  ------------+-----------------------------------
       female |         26       40.63       40.63
         male |         38       59.38      100.00
  ------------+-----------------------------------
        Total |         64      100.00

  . bysort hyp: sum matage

  -> hyp = 0

      Variable |       Obs        Mean    Std. Dev.       Min        Max
  -------------+--------------------------------------------------------
        matage |       397    34.08816    3.861201         23         43

  -> hyp = 1

      Variable |       Obs        Mean    Std. Dev.       Min        Max
  -------------+--------------------------------------------------------
        matage |        66    33.57576    4.278967         24         43

The missing values make reporting the counts problematic, so just showing
percents would be the easiest approach. Here, we assume that the missing
values follow the same distribution as the nonmissing values.

Here is one format. Other formats are also fine; it is just a matter of
personal reporting style.

Table 1. Patient Characteristics

                      Maternal Hypertension    Maternal Hypertension
                            Present                   Absent
                            [N = 67]                  [N = 411]
  Maternal age, yrs
    Mean (SD)               34 (4)                    34 (4)
  Sex of Newborn, %
    Male                    59                        51

Chapter 2-2. Logic of significance tests
Chapter 2-3. Choice of significance test

It was shown in Chapter 2-1 that the decision of which descriptive statistic
to use was based on the level of measurement of the data. The most
informative measure of average and dispersion (such as mean and standard
deviation) was selected after determining the level of measurement of the
variable.

The choice of a test statistic, also called a significance test, is made in a
similar way. You choose the test that makes the best use of the information
in the variable; that is, it depends on the level of measurement of the
variable, and on whether the groups being compared are independent (different
study subjects) or related (same person measured at least twice).

Problem 1) Practice Selecting a Significance Test Using Chapter 2-3.

Selecting a significance test before these tests have been introduced in
later chapters is in some sense jumping ahead. Still, it is useful at this
point in the course to see that the decision is actually quite simple, which
removes a lot of mystery about the subject of statistics. You do not even
have to know what the tests are to be able to do this.
This problem is an exercise to illustrate how easy it is.

In this problem, the study is comparing an active treatment (intervention
group) to an untreated control group (control group). These groups are
different subjects (different people, animals, or specimens). The outcome is
an interval scale, and for this analysis, no control for other variables is
desired. What is the best significance test to use?

Answer: Looking at the table in Chapter 2-3, on page 3, we find the
"continuous" row, since "continuous" is another name for "interval scale."
Then we find the "two independent groups" column. The "best" test, or at
least an excellent one that is widely accepted as the best choice, is shown,
which is the independent groups t-test.

This can also be found in Chapter 2-3, on page 7. First find "Two Unrelated
Samples", then find "Interval Scale", then find "Tests for Location
(average)". There we find the independent groups t-test listed first. The
test listed first is the most popular.

Problem 2) Practice Selecting a Significance Test Using Chapter 2-3.

In this problem, the study is comparing a baseline, or pre-intervention,
measurement to a post-intervention measurement on the same study subjects.
There is no control group in the experiment. The outcome is an ordinal scale
variable, and for this analysis, no control for other variables is desired.
What is the best significance test to use?

Answer: Looking at the table in Chapter 2-3, on page 3, we find the "ordered
categorical" row, since "ordered categorical" is another name for "ordinal
scale." Then we find the "two correlated groups" column. The "best" test, or
at least an excellent one that is widely accepted as the best choice, is
shown, which is the Wilcoxon signed rank test.

This can also be found in Chapter 2-3, on page 6. First find "Two Related
Samples", then find "Ordinal Scale". There we find the Wilcoxon signed rank
test listed first.
The test listed first is the most popular.

Chapter 2-4. Comparison of two independent groups

Problem 1) Crosstabulation analysis

Open the file births_with_missing.dta inside Stata. In preparing to test for
an association between hypertension and preterm delivery in the subgroup of
females (sex equal to 2), first check the minimum expected cell frequency
assumption. Do this using:

  tabulate preterm hyp if sex==2, expect
  * <or abbreviate to>
  tab preterm hyp if sex==2, expect

Comparing the results to the minimum expected cell frequency rule, should a
chi-square test be used to test the association, or should a Fisher's exact
test be used?

  +--------------------+
  | Key                |
  |--------------------|
  | frequency          |
  | expected frequency |
  +--------------------+

             |      hypertens
    pre-term |         0          1 |     Total
  -----------+----------------------+----------
           0 |       160         18 |       178
             |     155.4       22.6 |     178.0
  -----------+----------------------+----------
           1 |        19          8 |        27
             |      23.6        3.4 |      27.0
  -----------+----------------------+----------
       Total |       179         26 |       205
             |     179.0       26.0 |     205.0

The minimum expected cell frequency rule is found on page 19 of Chapter 2-4.
Specifically, it states the following:

  Daniel (1995, pp.524-526) in his statistics textbook cites a rule
  attributable to Cochran (1954):

  2 × 2 table: the chi-square test should not be used if n < 20. If
  20 < n < 40, the chi-square test should not be used if any expected
  frequency is less than 5. When n ≥ 40, three of the expected cell
  frequencies should be at least 5, and one expected frequency can be as
  small as 1.

In our problem, we have a 2 × 2 table with a sample size n = 205, which is
greater than 40. We see that we have three cells with expected cell
frequencies greater than 5, and one cell with an expected cell frequency less
than 5 but greater than 1. Thus we satisfied the minimum expected cell
frequency rule, so a chi-square test can be used.
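The rule above is driven entirely by the expected counts, each of which is
just (row total × column total) / grand total. As a rough cross-check outside
Stata (our own sketch, not part of the manual), the expected counts and the
Pearson chi-square statistic for this table can be computed in a few lines of
Python:

```python
# Expected cell counts and Pearson chi-square for the preterm-by-hypertension
# 2 x 2 table above: expected = (row total * column total) / grand total.
table = [[160, 18],
         [19,   8]]

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
n = sum(row_totals)

expected = [[r * c / n for c in col_totals] for r in row_totals]

# Pearson chi-square: sum of (observed - expected)^2 / expected over cells
chi2 = sum((table[i][j] - expected[i][j]) ** 2 / expected[i][j]
           for i in range(2) for j in range(2))

print([[round(e, 1) for e in row] for row in expected])  # 155.4, 22.6, 23.6, 3.4
print(round(chi2, 4))
```

The rounded expected counts reproduce the 155.4, 22.6, 23.6, and 3.4 shown in
the tabulate output, and the statistic reproduces the Pearson chi2(1) = 8.0640
that Stata reports for this table in Problem 2 below.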
Problem 2) Crosstabulation analysis

Compute the appropriate test statistic for the crosstabulation table in
Problem 1. Ask for row or column percents, depending on which we would want
to report in our manuscript.

We would like to report column percents, which provide the percent of mothers
with and without hypertension who have the outcome of a preterm delivery. We
want the chi-square test, which is justified by meeting the minimum expected
cell frequency rule, since it provides a more powerful test than a Fisher's
exact test. To get this, we use,

  tab preterm hyp if sex==2, col chi2

  +-------------------+
  | Key               |
  |-------------------|
  | frequency         |
  | column percentage |
  +-------------------+

             |      hypertens
    pre-term |         0          1 |     Total
  -----------+----------------------+----------
           0 |       160         18 |       178
             |     89.39      69.23 |     86.83
  -----------+----------------------+----------
           1 |        19          8 |        27
             |     10.61      30.77 |     13.17
  -----------+----------------------+----------
       Total |       179         26 |       205
             |    100.00     100.00 |    100.00

          Pearson chi2(1) =   8.0640   Pr = 0.005

We could report this result as, "Mothers with hypertension during pregnancy
delivered preterm 31% of the time, while mothers without hypertension had
only 11% preterm deliveries (p = 0.005)."

Problem 3) Crosstabulation analysis

Use the following "immediate" (the data follow immediately after the command
name) version of the tabulate command,

  tabi 5 4 \ 3 10 , expect

Should a chi-square test be used, or should a Fisher's exact test be used?
  +--------------------+
  | Key                |
  |--------------------|
  | frequency          |
  | expected frequency |
  +--------------------+

             |          col
         row |         1          2 |     Total
  -----------+----------------------+----------
           1 |         5          4 |         9
             |       3.3        5.7 |       9.0
  -----------+----------------------+----------
           2 |         3         10 |        13
             |       4.7        8.3 |      13.0
  -----------+----------------------+----------
       Total |         8         14 |        22
             |       8.0       14.0 |      22.0

Applying "If 20 < n < 40, the chi-square test should not be used if any
expected frequency is less than 5," we see that we have two cells with an
expected frequency less than 5. We are not allowed any, for this sample size,
so a Fisher's exact test must be used.

Problem 4) Comparison of a nominal outcome

In Sulkowski (2000) Table 1, the following distribution of race is provided
for the two study groups.

           Protease Inhibitor   Dual Nucleoside Analog
  Race     Regimen (n = 211)    Regimen (n = 87)         P value
  Black        151 (72)              71 (82)              0.02
  White         57 (27)              13 (15)
  Other          3 ( 1)               3 ( 3)

The problem is to verify the percents and the p value.

Use the tabi command to add the three rows of data as part of the command,
with each row separated by the carriage return, or new line, symbol "\".
(See an example for two rows of data in the previous problem above.) First
check the expected frequencies, and then use a chi-square test or Fisher's
exact test (more correctly called a Fisher-Freeman-Halton test when the table
is larger than 2 × 2), as appropriate.
Using

  tabi 151 71\57 13\3 3, expect

we get

  +--------------------+
  | Key                |
  |--------------------|
  | frequency          |
  | expected frequency |
  +--------------------+

             |          col
         row |         1          2 |     Total
  -----------+----------------------+----------
           1 |       151         71 |       222
             |     157.2       64.8 |     222.0
  -----------+----------------------+----------
           2 |        57         13 |        70
             |      49.6       20.4 |      70.0
  -----------+----------------------+----------
           3 |         3          3 |         6
             |       4.2        1.8 |       6.0
  -----------+----------------------+----------
       Total |       211         87 |       298
             |     211.0       87.0 |     298.0

We discover 2 cells, or 2/6 = 33%, have expected frequencies < 5. Applying
the minimum expected cell frequency rule-of-thumb (Daniel, 1995, pp.524-526),
quoted in Chapter 2-4,

  larger than 2 × 2 table (r × c table): the chi-square test can be used if
  no more than 20% of the cells have expected frequencies < 5 and no cell
  has an expected frequency < 1.

we see that we did not meet the minimum expected cell frequency criteria,
since we have 33% of cells with an expected frequency < 5, which is larger
than 20%. Thus, we next ask for the Fisher-Freeman-Halton test. Also, we
notice that we need column percents to check the percents in Sulkowski's
table, so we specify the "col" option.

  tabi 151 71\57 13\3 3, col exact

We get,

             |          col
         row |         1          2 |     Total
  -----------+----------------------+----------
           1 |       151         71 |       222
             |     71.56      81.61 |     74.50
  -----------+----------------------+----------
           2 |        57         13 |        70
             |     27.01      14.94 |     23.49
  -----------+----------------------+----------
           3 |         3          3 |         6
             |      1.42       3.45 |      2.01
  -----------+----------------------+----------
       Total |       211         87 |       298
             |    100.00     100.00 |    100.00

           Fisher's exact =                 0.038

The column percents agree with what Sulkowski reported, which was

           Protease Inhibitor   Dual Nucleoside Analog
  Race     Regimen (n = 211)    Regimen (n = 87)         P value
  Black        151 (72)              71 (82)              0.02
  White         57 (27)              13 (15)
  Other          3 ( 1)               3 ( 3)

The difference in p values is a mystery, but not a critical issue, since the
conclusion of a significant difference does not change.
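For the 2 × 2 case, the Fisher's exact p value has a simple hand computation:
sum the hypergeometric probabilities of all tables (with the same margins)
that are no more probable than the observed one. The following Python sketch
of ours illustrates this for the 2 × 2 table of Problem 3 above; it is an
illustration of the method, not code from the manual:

```python
from math import comb

# Two-sided Fisher's exact test for the 2 x 2 table of Problem 3
# (tabi 5 4 \ 3 10), holding the row and column margins fixed.
a, b, c, d = 5, 4, 3, 10
r1, r2 = a + b, c + d          # row totals: 9, 13
c1, n = a + c, a + b + c + d   # first column total 8, grand total 22

def p_table(x):
    """Hypergeometric probability that the top-left cell equals x."""
    return comb(r1, x) * comb(r2, c1 - x) / comb(n, c1)

p_obs = p_table(a)
# Sum over all feasible tables no more probable than the observed one
p_value = sum(p_table(x)
              for x in range(max(0, c1 - r2), min(r1, c1) + 1)
              if p_table(x) <= p_obs + 1e-12)
print(round(p_value, 3))
```

This is the same "sum of point probabilities no larger than the observed
table's" definition of the two-sided exact p value that tabulate's exact
option uses for 2 × 2 tables; here it comes out well above 0.05, consistent
with the small cell counts.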
Problem 5) Comparison of an ordinal outcome

Body mass index (BMI) is computed using the equation:

  body mass index (BMI) = weight/height^2, in kg/m^2

BMI is frequently recoded into four BMI categories recommended by the
National Heart, Lung, and Blood Institute (1998)(Onyike et al., 2003):

  underweight   (BMI <18.5)
  normal weight (BMI 18.5-24.9)
  overweight    (BMI 25.0-29.9)
  obese         (BMI ≥30)

(How to compute BMI and recode it into these four categories is explained in
Chapter 111, if you ever need to do this in your own research.)

This recoding converts the data from an interval scale into an ordinal scale,
since the categories have order but not equal intervals. To compare two
groups on BMI as an ordinal scale, a Wilcoxon-Mann-Whitney test is
appropriate. Suppose the data are:

  BMI, count (%)     Active Drug (n = 100)    Placebo Drug (n = 100)
  Underweight               4 ( 4)                    0 ( 0)
  Normal weight            30 (30)                   20 (20)
  Overweight               50 (50)                   45 (45)
  Obese                    16 (16)                   35 (35)

The ranksum command, which is the Wilcoxon-Mann-Whitney test, does not have
an "immediate" form, so we have to convert these data into "individual level"
data, where each row of the dataset represents an individual subject. To do
this, copy the following into the Stata do-file editor and run it.

  * --- wrong way to do it ---
  clear
  input active bmicat count
  1 1 4
  1 2 30
  1 3 50
  1 4 16
  0 1 0
  0 2 20
  0 3 45
  0 4 35
  end
  expand count
  tab bmicat active

The result is,

             |        active
      bmicat |         0          1 |     Total
  -----------+----------------------+----------
           1 |         1          4 |         5
           2 |        20         30 |        50
           3 |        45         50 |        95
           4 |        35         16 |        51
  -----------+----------------------+----------
       Total |       101        100 |       201

We discover that we have a subject in the placebo group (1 = Active,
0 = Placebo) that does not belong there. (Stata's expand treats a count less
than 1 as if it were 1, so the zero-count cell is retained as a single
observation rather than dropped.) To avoid this situation, we must drop any
cell count equal to 0 before we expand the data.
Here is the correct way to do it:

  * --- correct way to do it ---
  clear
  input active bmicat count
  1 1 4
  1 2 30
  1 3 50
  1 4 16
  0 1 0
  0 2 20
  0 3 45
  0 4 35
  end
  drop if count==0   // always a good idea to add this line
  expand count
  tab bmicat active

The result is,

             |        active
      bmicat |         0          1 |     Total
  -----------+----------------------+----------
           1 |         0          4 |         4
           2 |        20         30 |        50
           3 |        45         50 |        95
           4 |        35         16 |        51
  -----------+----------------------+----------
       Total |       100        100 |       200

Now that we have the dataset correctly created, to run the
Wilcoxon-Mann-Whitney test we use,

  ranksum bmicat , by(active)

  Two-sample Wilcoxon rank-sum (Mann-Whitney) test

        active |      obs    rank sum    expected
  -------------+---------------------------------
             0 |      100       11305       10050
             1 |      100        8795       10050
  -------------+---------------------------------
      combined |      200       20100       20100

  unadjusted variance    167500.00
  adjustment for ties    -23343.59
                         ----------
  adjusted variance      144156.41

  Ho: bmicat(active==0) = bmicat(active==1)
           z =   3.305
  Prob > |z| =   0.0009    <- report this two-tailed p value

From just visualizing the data, we notice that the placebo group tends to
have greater BMI values. To report this result, we could say something like:

  BMI was significantly higher in the placebo group compared to the active
  drug group (p < 0.001) [Table 1].

Problem 6) Comparison of an interval outcome

Cut and paste the following into the do-file editor and execute it to set up
the dataset. These data represent two groups (1 = patients with coronary
heart disease (CHD), 0 = patients without CHD) on an outcome of systolic
blood pressure (SBP).

  clear
  input chd sbp
  1 225
  1 190
  1 162
  1 178
  1 158
  0 154
  0 124
  0 128
  0 165
  0 162
  end

  graph box sbp ,over(chd)

[Box plot: y-axis "sbp" (120 to 220), one box each for chd = 0 and chd = 1.]

Just by looking at the boxplot, would you guess a two-sample t-test would
have a smaller p value (more significant) than a Wilcoxon-Mann-Whitney test?
Looking at the graph, we see the non-CHD patients are skewed to the left, so
the mean is pulled in the direction of being smaller than the median. For the
CHD patients, the skewness is to the right, so the mean is pulled in the
direction of being larger than the median. A comparison of means (t-test)
will be more powerful than a comparison of medians (Wilcoxon-Mann-Whitney
test), since the means are more separated than the medians. One might wonder,
though, if the extra variability created by the skewness will offset this
advantage, creating a larger denominator in the t-test statistic, so maybe
the Wilcoxon-Mann-Whitney test will win out.

The four tests can be run using the following,

  ttest sbp , by(chd)
  ttest sbp , by(chd) unequal
  ranksum sbp , by(chd)
  permtest2 sbp, by(chd)

  . ttest sbp , by(chd)

  Two-sample t test with equal variances
  ------------------------------------------------------------------------------
     Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
  ---------+--------------------------------------------------------------------
         0 |       5       146.6    8.623224    19.28212    122.6581    170.5419
         1 |       5       182.6    12.04824    26.94068    149.1487    216.0513
  ---------+--------------------------------------------------------------------
  combined |      10       164.6    9.207726    29.11739    143.7707    185.4293
  ---------+--------------------------------------------------------------------
      diff |                 -36    14.81621               -70.16624   -1.833765
  ------------------------------------------------------------------------------
      diff = mean(0) - mean(1)                                    t =   -2.4298
  Ho: diff = 0                                     degrees of freedom =        8

      Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
   Pr(T < t) = 0.0206         Pr(|T| > |t|) = 0.0412          Pr(T > t) = 0.9794

  . ttest sbp , by(chd) unequal

  Two-sample t test with unequal variances
  ------------------------------------------------------------------------------
     Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
  ---------+--------------------------------------------------------------------
         0 |       5       146.6    8.623224    19.28212    122.6581    170.5419
         1 |       5       182.6    12.04824    26.94068    149.1487    216.0513
  ---------+--------------------------------------------------------------------
  combined |      10       164.6    9.207726    29.11739    143.7707    185.4293
  ---------+--------------------------------------------------------------------
      diff |                 -36    14.81621               -70.79485   -1.205146
  ------------------------------------------------------------------------------
      diff = mean(0) - mean(1)                                    t =   -2.4298
  Ho: diff = 0                     Satterthwaite's degrees of freedom =  7.24624

      Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
   Pr(T < t) = 0.0221         Pr(|T| > |t|) = 0.0443          Pr(T > t) = 0.9779

  . ranksum sbp , by(chd)

  Two-sample Wilcoxon rank-sum (Mann-Whitney) test

           chd |      obs    rank sum    expected
  -------------+---------------------------------
             0 |        5        18.5        27.5
             1 |        5        36.5        27.5
  -------------+---------------------------------
      combined |       10          55          55

  unadjusted variance        22.92
  adjustment for ties        -0.14
                         ----------
  adjusted variance          22.78

  Ho: sbp(chd==0) = sbp(chd==1)
           z =  -1.886
  Prob > |z| =   0.0593

  . permtest2 sbp, by(chd)

  Fisher-Pitman permutation test for two independent samples

           chd |      obs        mean     std.dev.
  -------------+------------------------------------
             0 |        5       146.6    19.282116
             1 |        5       182.6    26.940676
  -------------+------------------------------------
      combined |       10       164.6    29.117387

  mode of operation: exact (complete permutation)

  Test of hypothesis Ho: sbp(chd==0) >= sbp(chd==1) : p=.02380952 (one-tailed)
  Test of hypothesis Ho: sbp(chd==0) <= sbp(chd==1) : p=.98412698 (one-tailed)
  Test of hypothesis Ho: sbp(chd==0) == sbp(chd==1) : p=.04761905 (two-tailed)

We find that the two-sample t-test which assumes equal variances has the
smallest p value.
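As a cross-check on the ranksum output above, the rank sums and the
tie-adjusted z statistic can be recomputed by hand. The following Python
sketch (ours, not the manual's) mirrors that arithmetic for the ten SBP
values:

```python
# Hand computation of the Wilcoxon rank-sum z statistic (with the tie
# correction) for the SBP data above, to check Stata's z = -1.886.
chd0 = [154, 124, 128, 165, 162]   # patients without CHD
chd1 = [225, 190, 162, 178, 158]   # patients with CHD
combined = sorted(chd0 + chd1)

def midrank(x):
    """Average rank of value x in the combined sample (handles ties)."""
    lo = combined.index(x) + 1
    hi = lo + combined.count(x) - 1
    return (lo + hi) / 2

n0, n1, n = len(chd0), len(chd1), len(combined)
rank_sum0 = sum(midrank(x) for x in chd0)
expected0 = n0 * (n + 1) / 2

var = n0 * n1 * (n + 1) / 12           # unadjusted variance
for v in set(combined):                # subtract the tie correction
    t = combined.count(v)
    var -= n0 * n1 * (t**3 - t) / (12 * n * (n - 1))

z = (rank_sum0 - expected0) / var ** 0.5
print(rank_sum0, round(z, 3))  # 18.5 -1.886
```

The rank sum of 18.5 against an expected 27.5, the adjusted variance of
22.78, and z = -1.886 all match the ranksum output.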
If you wonder if the skewness and differences in variances (or standard deviation differences) are severe enough to invalid the two-sample t-test which assumes equal variances, it is comforting to see the significance is confirmed by the Fisher-Pitman permutation test which has neither the normality or homogeneity of variance assumptions. This result illustrates the robustness of the t-test to these two assumptions, giving a reasonable p value even though the assumptions might be called into question. Chapter 2-5. Basics of power analysis Problem 1) sample size determination for a comparison of two means You have pilot data taken from Chapter 2-4, problem 6 above. clear input chd sbp 1 225 1 190 1 162 1 178 1 158 0 154 0 124 0 128 0 165 0 162 end Designing a new study to test this difference in mean SPB, between patients with CHD and without CHD, given these standard deviations, what sample size do you need to have 80% power, using a two-sided alpha 0.05 level comparison? After creating this dataset in Stata, to obtain the means and standard deviations, you can use, Chapter 2-99 (revision 8 Jan 2012) p. 21 ttest sbp , by(chd) * <or> bysort chd: sum sbp . ttest sbp , by(chd) Two-sample t test with equal variances -----------------------------------------------------------------------------Group | Obs Mean Std. Err. Std. Dev. [95% Conf. 
Interval]
---------+--------------------------------------------------------------------
       0 |       5     146.6    8.623224    19.28212    122.6581    170.5419
       1 |       5     182.6    12.04824    26.94068    149.1487    216.0513
---------+--------------------------------------------------------------------
combined |      10     164.6    9.207726    29.11739    143.7707    185.4293
---------+--------------------------------------------------------------------
    diff |             -36     14.81621               -70.16624   -1.833765
------------------------------------------------------------------------------
    diff = mean(0) - mean(1)                                      t = -2.4298
Ho: diff = 0                                    degrees of freedom =        8

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.0206         Pr(|T| > |t|) = 0.0412          Pr(T > t) = 0.9794

. * <or>
. bysort chd: sum sbp

-------------------------------------------------------------------------------
-> chd = 0

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
         sbp |         5       146.6    19.28212        124        165

-------------------------------------------------------------------------------
-> chd = 1

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
         sbp |         5       182.6    26.94068        158        225

To obtain the required sample size, use,

sampsi 146.6 182.6 , sd1(19.3) sd2(26.9) power(.8)

Estimated sample size for two-sample comparison of means

Test Ho: m1 = m2, where m1 is the mean in population 1
                    and m2 is the mean in population 2

Assumptions:

         alpha =   0.0500  (two-sided)
         power =   0.8000
            m1 =    146.6
            m2 =    182.6
           sd1 =     19.3
           sd2 =     26.9
         n2/n1 =     1.00

Estimated required sample sizes:

            n1 =        7
            n2 =        7

We see that we require n=7 subjects per group.

Problem 2) z score (effect size) approach to power analysis

Cuellar and Ratcliffe (2009) include the following sample size determination paragraph in their article,

"The study was powered to include 40 participants, 20 randomized to each group. Randomization would include an equal number of men and women in each group.
Twenty participants per group were needed to detect the required moderate effect size of 0.9 assuming 80% power, alpha of 0.05, and using a Student's t-test."

Come up with the "sampsi" command that verifies the sample size computation described in their paragraph. (hint: review the "What to do if you don't know anything" section of Chapter 2-5).

The authors were stating that their effect size was a 0.9 SD difference in means. Given that z-scores have a mean of 0 and an SD of 1, you use SD = 1 for both groups and a mean of 0 for one of the groups. You then assume the distribution is shifted by a mean difference of 0.9 SD = 0.9(1) = 0.9 for the other group, so use a mean of 0.9 for the other group. The sampsi command is thus,

sampsi 0 .9 , sd1(1) sd2(1) power(.8)

Estimated sample size for two-sample comparison of means

Test Ho: m1 = m2, where m1 is the mean in population 1
                    and m2 is the mean in population 2

Assumptions:

         alpha =   0.0500  (two-sided)
         power =   0.8000
            m1 =        0
            m2 =       .9
           sd1 =        1
           sd2 =        1
         n2/n1 =     1.00

Estimated required sample sizes:

            n1 =       20
            n2 =       20

Problem 3) Verifying the z score approach to sample size determination and power analysis is legitimate

In Chapter 2-5, the z score approach was presented, but without a demonstration that it actually gives the same answer as when the data are expressed in their original measurement scale. To verify the approach is legitimate, we will first a) verify that a z-score transformed variable has a mean of 0 and a standard deviation of 1. Second, b) we will verify that a t-test on data transformed into z-scores gives the same p value as when the original scale is used. Third, c) we will verify that the sample size and power calculations are identical for a z-score transformed variable and the variable in its original scale.

Note: Verifying with an example is not a formal mathematical proof; but if it is true, it should work for any example we try.
We can think of it as a demonstration, or verification, rather than a proof. That is good enough for our purpose.

We will use the same dataset used above in Ch 2-5 problem 1, which is duplicated here. Cut-and-paste this dataset into the do-file editor, highlight it with the mouse, and then hit the last icon on the right to execute it.

clear
input chd sbp
1 225
1 190
1 162
1 178
1 158
0 154
0 124
0 128
0 165
0 162
end
sum sbp

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
         sbp |        10       164.6    29.11739        124        225

a) verify that a z-score transformed variable has a mean of 0 and standard deviation of 1.

Using this mean and standard deviation (SD), generate a variable that contains the z scores for systolic blood pressure (sbp), using the formula

    z = (X - Xbar) / SD

[Hint: to compute z=(a-b)/c, use: generate z = (a-b)/c ]

Then, use the command, summarize, or abbreviated to sum, to obtain the means and SDs for sbp and your new variable z. If you do this correctly, the mean of z will be 0, and the SD of z will be 1.

Using,

generate z=(sbp-164.6)/29.11739
sum sbp z

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
         sbp |        10       164.6    29.11739        124        225
           z |        10    1.02e-09    .9999999   -1.394356   2.074362

Or, using the "extensions to generate" command, egen, along with the function std to get the standardized scores, or z-scores, where the egen command does exactly the same computation,

drop z   // use this only if you have already generated z
egen z = std(sbp)
sum sbp z

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
         sbp |        10       164.6    29.11739        124        225
           z |        10    1.02e-08           1   -1.394356   2.074362

(Note: the in-line comment "//" only works in the do-file editor—it will give an error message if used in the Command window.)
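The same standardization can be double-checked outside Stata. A minimal Python sketch (a hypothetical cross-check, not part of the course manual), standardizing the same ten sbp values:

```python
from statistics import mean, stdev

# the ten sbp values from the dataset above
sbp = [225, 190, 162, 178, 158, 154, 124, 128, 165, 162]
m, s = mean(sbp), stdev(sbp)        # 164.6 and about 29.11739
z = [(x - m) / s for x in sbp]      # z = (X - Xbar) / SD

print(abs(round(mean(z), 9)))       # 0.0 (zero up to rounding error)
print(round(stdev(z), 6))           # 1.0
print(round(min(z), 4), round(max(z), 4))  # -1.3944 2.0744, as in the Stata sum output
```

As in Stata, the tiny nonzero mean you would see without rounding is pure floating-point error.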
We see that the variable z has a mean of 0, except for rounding error, and a SD of 1, which is a known property of the z-score. Chapter 2-99 (revision 8 Jan 2012) p. 25 b) verify a t-test on data transformed into z-scores gives the same p value as when the original scale is used Using an independent sample t-test, compare the coronary heart disease, chd, group to the healthy group on the outcome systolic blood pressure, sbp. Then, repeat the t-test for the z-score transformed systolic blood pressure. Notice that the p values are identical for both t-tests. Using, ttest sbp , by(chd) ttest z , by(chd) . ttest sbp , by(chd) Two-sample t test with equal variances -----------------------------------------------------------------------------Group | Obs Mean Std. Err. Std. Dev. [95% Conf. Interval] ---------+-------------------------------------------------------------------0 | 5 146.6 8.623224 19.28212 122.6581 170.5419 1 | 5 182.6 12.04824 26.94068 149.1487 216.0513 ---------+-------------------------------------------------------------------combined | 10 164.6 9.207726 29.11739 143.7707 185.4293 ---------+-------------------------------------------------------------------diff | -36 14.81621 -70.16624 -1.833765 -----------------------------------------------------------------------------diff = mean(0) - mean(1) t = -2.4298 Ho: diff = 0 degrees of freedom = 8 Ha: diff < 0 Pr(T < t) = 0.0206 Ha: diff != 0 Pr(|T| > |t|) = 0.0412 Ha: diff > 0 Pr(T > t) = 0.9794 . ttest z , by(chd) Two-sample t test with equal variances -----------------------------------------------------------------------------Group | Obs Mean Std. Err. Std. Dev. [95% Conf. 
Interval] ---------+-------------------------------------------------------------------0 | 5 -.6181873 .2961538 .66222 -1.440442 .2040674 1 | 5 .6181874 .4137815 .9252436 -.5306543 1.767029 ---------+-------------------------------------------------------------------combined | 10 1.02e-08 .3162278 1 -.7153569 .7153569 ---------+-------------------------------------------------------------------diff | -1.236375 .508844 -2.409771 -.0629783 -----------------------------------------------------------------------------diff = mean(0) - mean(1) t = -2.4298 Ho: diff = 0 degrees of freedom = 8 Ha: diff < 0 Pr(T < t) = 0.0206 Ha: diff != 0 Pr(|T| > |t|) = 0.0412 Ha: diff > 0 Pr(T > t) = 0.9794 We see that the p values are identical for the two t-tests. This verifies that the power and sample size determination will not be affected by a z-score transformation, since both can be thought of as a function of the p value. Chapter 2-99 (revision 8 Jan 2012) p. 26 c) verify the sample size and power calculations are identical for a z-score transformed variable and the variable in its original scale Using the means and SDs from the first t-test, compute the required sample size for power of 0.80. Do the same for the second t-test. Then, using a sample size of n=7 per group, compute the power for these same means and SDs. You should get the same result for both measurement scales. Using, sampsi sampsi * sampsi sampsi 146.6 182.6 ,sd1(19.28212) sd2(26.94068) power(.80) -.6181873 .6181874 , sd1(.66222) sd2(.9252436) power(.80) 146.6 182.6 ,sd1(19.28212) sd2(26.94068) n1(7) n2(7) -.6181873 .6181874 , sd1(.66222) sd2(.9252436) n1(7) n2(7) . 
sampsi 146.6 182.6 ,sd1(19.28212) sd2(26.94068) power(.80) Estimated sample size for two-sample comparison of means Test Ho: m1 = m2, where m1 is the mean in population 1 and m2 is the mean in population 2 Assumptions: alpha power m1 m2 sd1 sd2 n2/n1 = = = = = = = 0.0500 0.8000 146.6 182.6 19.2821 26.9407 1.00 (two-sided) Estimated required sample sizes: n1 = n2 = 7 7 . sampsi -.6181873 .6181874 , sd1(.66222) sd2(.9252436) power(.80) Estimated sample size for two-sample comparison of means Test Ho: m1 = m2, where m1 is the mean in population 1 and m2 is the mean in population 2 Assumptions: alpha power m1 m2 sd1 sd2 n2/n1 = 0.0500 = 0.8000 = -.618187 = .618187 = .66222 = .925244 = 1.00 (two-sided) Estimated required sample sizes: n1 = n2 = 7 7 We see we got the same result for the sample size determination. Chapter 2-99 (revision 8 Jan 2012) p. 27 . sampsi 146.6 182.6 ,sd1(19.28212) sd2(26.94068) n1(7) n2(7) Estimated power for two-sample comparison of means Test Ho: m1 = m2, where m1 is the mean in population 1 and m2 is the mean in population 2 Assumptions: alpha m1 m2 sd1 sd2 sample size n1 n2 n2/n1 = = = = = = = = 0.0500 146.6 182.6 19.2821 26.9407 7 7 1.00 (two-sided) Estimated power: power = 0.8199 . sampsi -.6181873 .6181874 , sd1(.66222) sd2(.9252436) n1(7) n2(7) Estimated power for two-sample comparison of means Test Ho: m1 = m2, where m1 is the mean in population 1 and m2 is the mean in population 2 Assumptions: alpha m1 m2 sd1 sd2 sample size n1 n2 n2/n1 = 0.0500 = -.618187 = .618187 = .66222 = .925244 = 7 = 7 = 1.00 (two-sided) Estimated power: power = 0.8199 We see we got the same result for the power analysis. Clarification of the z-score approach This is not quite how we did it in the chapter, however. In the chapter, we used a mean of 0 and SD of 1 for both groups, and then we used some fraction or multiple of the SD=1 for the effect size. In that approach, we assume that both groups have the same SD and only differ in their means. 
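The sampsi answers can also be cross-checked from the usual normal-approximation formulas for two means, n per group = (z_{1-alpha/2} + z_{1-beta})^2 (sd1^2 + sd2^2) / delta^2 and power = Phi(delta/SE - z_{1-alpha/2}). These appear to be what sampsi computes, since they reproduce n=7 and power=0.8199 on both scales; a hypothetical Python sketch (stdlib only):

```python
import math
from statistics import NormalDist

def n_per_group(m1, m2, sd1, sd2, alpha=0.05, power=0.80):
    """Required n per group, two-sample comparison of means (two-sided,
    normal approximation)."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return math.ceil(z**2 * (sd1**2 + sd2**2) / (m2 - m1)**2)

def power_two_means(m1, m2, sd1, sd2, n, alpha=0.05):
    """Power with n subjects per group (two-sided, normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    se = math.sqrt((sd1**2 + sd2**2) / n)
    return NormalDist().cdf(abs(m2 - m1) / se - z_alpha)

# original scale vs. z-score scale: identical answers
print(n_per_group(146.6, 182.6, 19.28212, 26.94068))        # 7
print(n_per_group(-.6181873, .6181874, .66222, .9252436))   # 7
print(round(power_two_means(146.6, 182.6, 19.28212, 26.94068, 7), 4))       # 0.8199
print(round(power_two_means(-.6181873, .6181874, .66222, .9252436, 7), 4))  # 0.8199
```

The ratio delta/SE is unchanged by the z-score transformation, which is another way of seeing why the two scales must agree.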
Suppose that the data we used above come from pilot data, or previously published data, so we can use these means and SDs in our sample size determination for a larger study.

------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
       0 |       5     146.6    8.623224    19.28212    122.6581    170.5419
       1 |       5     182.6    12.04824    26.94068    149.1487    216.0513
---------+--------------------------------------------------------------------
combined |      10     164.6    9.207726    29.11739    143.7707    185.4293
---------+--------------------------------------------------------------------
    diff |             -36     14.81621               -70.16624   -1.833765
------------------------------------------------------------------------------
    diff = mean(0) - mean(1)                                      t = -2.4298
Ho: diff = 0                                    degrees of freedom =        8

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.0206         Pr(|T| > |t|) = 0.0412          Pr(T > t) = 0.9794

We see for our control group, we have n=5, mean=146.6, SD=19.28. For our treatment group, we have n=5, mean=182.6, and SD=26.94. For an estimate of a common, or shared, SD, what should we use? Conservatively, we could use the larger of the two, or common SD = 26.94. We do not use the combined SD from the t-test output, which is SD=29.12, as the t-test does not use that value and it is unnecessarily large. Even the larger SD of the two groups, SD=26.94, is unnecessarily large. The power of the t-test will depend on the SD that the t-test will use, which is a weighted average of the two SDs.
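This weighted average is the pooled SD, and the whole chain—pooled SD, effect size in SD units, and the resulting sample size—can be sketched in a few lines. A hypothetical Python cross-check (stdlib only; the normal-approximation sample-size formula in the last step is my assumption, chosen because it reproduces sampsi's answer of n = 7):

```python
import math
from statistics import NormalDist

# pooled SD for the two-sample t-test with equal variances
n1 = n2 = 5
s1, s2 = 19.28212, 26.94068
s_pooled = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
print(round(s_pooled, 4))   # 23.4265

# the 36 mmHg mean difference expressed in SD units (the effect size)
d = 36 / s_pooled
print(round(d, 4))          # 1.5367

# required n per group: two-sided alpha 0.05, power 0.80, both SDs set to 1
z = NormalDist().inv_cdf(0.975) + NormalDist().inv_cdf(0.80)
print(math.ceil(z**2 * 2 / d**2))  # 7
```

The pooled-SD formula here is the Rosner formula the manual cites; only the sample-size step is a substitute for running sampsi.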
The formula for this, for the two-sample t-test with equal variances, called the pooled standard deviation, is (Rosner, 2006, p.305),

    s = sqrt[ ((n1 - 1)s1^2 + (n2 - 1)s2^2) / (n1 + n2 - 2) ]

Using this formula, we calculate the pooled SD using,

display sqrt(((5-1)*19.28212^2+(5-1)*26.94068^2)/(5+5-2))
23.426485

The mean difference from the t-test output was -36, or 36 if you subtract in the opposite direction. Just using 36, then, is okay, since you get an identical result for the two-sided comparison, and a two-sided comparison is almost universally what you use. This difference expressed in standard deviation units is

display 36/23.426485
1.5367222

Consistent with how it was done in Chapter 2-5, we use the fact that z-scores have means of 0 and SDs of 1. So, we specify one mean as 0, the other mean as the difference in SD units, which is 1.5367222, and use SD=1 for both groups. We then compute the required sample size for power = 0.80 using,

sampsi 0 1.5367222 , sd1(1) sd2(1) power(.80)

Estimated sample size for two-sample comparison of means

Test Ho: m1 = m2, where m1 is the mean in population 1
                    and m2 is the mean in population 2

Assumptions:

         alpha =   0.0500  (two-sided)
         power =   0.8000
            m1 =        0
            m2 =  1.53672
           sd1 =        1
           sd2 =        1
         n2/n1 =     1.00

Estimated required sample sizes:

            n1 =        7
            n2 =        7

We see this is identical to the n=7 in each group calculated using the original scale of the variable above. We then verify the power calculation, using,

sampsi 0 1.5367222 , sd1(1) sd2(1) n1(7) n2(7)

Estimated power for two-sample comparison of means

Test Ho: m1 = m2, where m1 is the mean in population 1
                    and m2 is the mean in population 2

Assumptions:

         alpha =   0.0500  (two-sided)
            m1 =        0
            m2 =  1.53672
           sd1 =        1
           sd2 =        1
sample size n1 =        7
            n2 =        7
         n2/n1 =     1.00

Estimated power:

         power =   0.8199

We see that the power = 0.8199 is identical to the power computed above using the original scale of the variable.

Chapter 2-6. More on levels of measurement

Chapter 2-7.
Comparison of two paired groups

Chapter 2-8. Multiplicity and the Comparison of 3+ Groups

Problem 1) working with the formulas for the Bonferroni procedure, Holm procedure, and Hochberg procedure for multiplicity adjustment

These are three popular procedures with formulas that are simple enough that you can do them by hand. The procedures are described in Ch 2-8, pp. 9-10.

For Bonferroni adjusted p values, you simply multiply each p value by the number of comparisons made. If this results in a p value greater than 1, an anomaly, you set the adjusted p value to 1.

For Holm adjusted p values, you first sort the p values from smallest to largest. Then you multiply the smallest p value by the number of comparisons, the next smallest by the number of comparisons minus 1, and so on. Do the same thing for Hochberg adjusted p values. If a Holm adjusted p value becomes larger than the next adjusted p value, an anomaly since this conflicts with the rank ordering of the unadjusted p values, you carry the previous adjusted p value forward. For Hochberg, the anomaly adjustment is to carry the subsequent adjusted p value backward. If the Holm adjusted p value exceeds 1, you set it to 1.

Fill in the following table, doing the computations and adjustments in your head.

Sorted      Bonferroni  Holm         Holm         Hochberg     Hochberg
Unadjusted  Adjusted    Adjusted     Adjusted     Adjusted     Adjusted
P value     P value     P value      P value      P value      P value
                        (before      (after       (before      (after
                        anomaly      anomaly      anomaly      anomaly
                        correction)  correction)  correction)  correction)
0.020       0.060       0.060        0.060        0.060        0.040
0.025       0.075       0.050        0.060        0.050        0.040
0.040       0.120       0.040        0.060        0.040        0.040

Use the mcpi command, after installing it if necessary as described in Ch 2-8, to check your answers.
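These by-hand rules translate directly into code. A minimal Python sketch (hypothetical; this is not the mcpi program) applying the three procedures to the already-sorted p values above:

```python
def bonferroni(ps):
    """Multiply each p by the number of comparisons; cap at 1."""
    m = len(ps)
    return [min(p * m, 1.0) for p in ps]

def holm(ps):
    """Sorted ascending: multiply by m, m-1, ...; then carry the previous
    adjusted p forward whenever the ordering would be violated."""
    m = len(ps)
    adj = [p * (m - i) for i, p in enumerate(ps)]
    for i in range(1, m):
        adj[i] = max(adj[i], adj[i - 1])   # anomaly correction
    return [min(p, 1.0) for p in adj]

def hochberg(ps):
    """Same multipliers as Holm, but working from largest to smallest,
    carry the subsequent adjusted p backward."""
    m = len(ps)
    adj = [p * (m - i) for i, p in enumerate(ps)]
    for i in range(m - 2, -1, -1):
        adj[i] = min(adj[i], adj[i + 1])   # anomaly correction
    return adj

ps = [0.020, 0.025, 0.040]
print([round(p, 3) for p in bonferroni(ps)])  # [0.06, 0.075, 0.12]
print([round(p, 3) for p in holm(ps)])        # [0.06, 0.06, 0.06]
print([round(p, 3) for p in hochberg(ps)])    # [0.04, 0.04, 0.04]
```

The printed values match the completed table, including the anomaly corrections.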
The mcpi command will have the form, mcpi .020 .025 .040 SORTED ORDER: before anomaly Unadj ---------------------P Val TCH Homml Finnr 0.0200 0.034 0.040 0.059 0.0250 0.043 0.040 0.037 0.0400 0.068 0.040 0.040 corrected Adjusted --------------------------Hochb Ho-Si Holm Sidak Bonfr 0.060 0.059 0.060 0.059 0.060 0.050 0.049 0.050 0.073 0.075 0.040 0.040 0.040 0.115 0.120 SORTED ORDER: anomaly corrected (1) If Finner or Holm or Bonfer P > 1 (undefined) then set to 1 (2) If Finner or Hol-Sid or Holm P < preceding smaller P (illogical) then set to preceding P (3) Working from largest to smallest, if Hochberg preceding smaller P > P then set preceding smaller P to P Unadj ---------------------- Adjusted --------------------------P Val TCH Homml Finnr Hochb Ho-Si Holm Sidak Bonfr 0.0200 0.034 0.040 0.059 0.040 0.059 0.060 0.059 0.060 0.0250 0.043 0.040 0.059 0.040 0.059 0.060 0.073 0.075 0.0400 0.068 0.040 0.059 0.040 0.059 0.060 0.115 0.120 ORIGINAL ORDER: anomaly corrected Chapter 2-99 (revision 8 Jan 2012) p. 31 Unadj ---------------------- Adjusted --------------------------P Val TCH Homml Finnr Hochb Ho-Si Holm Sidak Bonfr 0.0200 0.034 0.040 0.059 0.040 0.059 0.060 0.059 0.060 0.0250 0.043 0.040 0.059 0.040 0.059 0.060 0.073 0.075 0.0400 0.068 0.040 0.059 0.040 0.059 0.060 0.115 0.120 ----------------------------------------------------------------*Adjusted for 3 multiple comparisons KEY: TCH Homml Finnr Hochb Ho-Si Holm Sidak Bonfr = Tukey-Ciminera-Heyse procedure (use TCH only with highly correlated comparisons) = Hommel procedure = Finner procedure = Hochberg procedure = Holm-Sidak procedure = Holm procedure = Sidak procedure = Bonferroni procedure This exercise illustrates the conservativeness of the Bonferroni procedure, which lost significance for all three p values. 
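For comparison, the Sidak column in the mcpi output follows the closed form 1 - (1 - p)^m, which is slightly less conservative than Bonferroni's p x m. A quick hypothetical Python check against the table above:

```python
# Sidak adjustment for m = 3 comparisons: 1 - (1 - p)^m
ps = [0.020, 0.025, 0.040]
sidak = [round(1 - (1 - p) ** 3, 3) for p in ps]
print(sidak)  # [0.059, 0.073, 0.115], matching mcpi's Sidak column
```

Like Bonferroni, Sidak here leaves none of the three p values significant, which is why the stepwise procedures are preferred.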
It also illustrates why Hochberg is more popular, and more powerful, than the Holm procedure—the Holm procedure lost all of the significance as well, while Hochberg kept all three p values significant.

Problem 2) a published example of using the Bonferroni procedure

Kumara et al. (2009) compared five protein assays against a preop baseline. In their Statistical Methods, they state,

"In regards to the protein assays, 5 comparisons (all vs. preop baseline) were carried out for each parameter, thus, a Bonferroni adjustment was made…."

These authors had several parameters, with five comparisons made for each parameter. The adjustment for the five comparisons was made separately for each parameter, which is the popular way to do it. One might consider whether an adjustment needs to be made for all of the parameters simultaneously, so k parameters × 5 comparisons, which would be a large number. Alternatively, you could adjust for all p values reported in the paper. A line has to be drawn somewhere, or all significance would be lost in the paper. Statisticians have drawn the line at a "family" of comparisons, hence the term "familywise error rate" or FWER. Each parameter represents a family of five related comparisons, so an adjustment is made for those five comparisons to control the FWER. This adjustment is done separately for each parameter, which makes sense, since each parameter is a separate question to study.

In the last paragraph of Kumara et al. (2009), we find,

"The mean preop plasma level for the 105 patients was 164 ± 146 pg/mL. Significant elevations, as per the Bonferroni criteria, were noted on POD 5 (355 ± 275 pg/mL, P = 0.002) and for the POD 7 to 13 time period (371 ± 428 pg/mL, P = 0.001) versus the preop results (Fig. 1). Although VEGF levels were elevated for the POD 14 to 20 (289 ± 297 pg/mL; vs. preop, P = 0.036) and POD 21 to 27 (244 ± 297 pg/mL; vs. preop, P = 0.048), as per the Bonferroni correction, these differences were not significant.
By the second month after surgery the mean VEGF level was near baseline…."

This is an example of where two significant p values were lost after adjusting for multiple comparisons. The authors reported the results this way, showing the unadjusted p values while mentioning that the adjustment declared them nonsignificant, apparently because they thought the effects were real and they wanted to lead the reader in that direction. Even in their Figure 1 they denote the results as significant. This is a good illustration of investigators being frustrated by "I had significance but lost it due to the stupid multiple comparison adjustment." It could easily be argued that the authors took the right approach, since not informing the reader might produce a Type II error (false negative conclusion). There is no universal consensus on this point.

The exercise is to see if applying some other multiple comparison adjustment would have saved the significance, since we know the Bonferroni procedure is too conservative. For the fifth p value, the last quoted sentence, it was clearly greater than 0.05, so just use 0.50. It makes no difference what value >0.05 we choose, since multiple comparison adjustments cannot create significance that was not there before adjustment. Using 0.50, then, along with the four other p values in the quoted paragraph, use the mcpi command to see if significance would have been saved by one of the other procedures.

mcpi .002 .001 .036 .048 0.500
ORIGINAL ORDER: anomaly corrected

 Unadj  ----------------------- Adjusted ----------------------------
 P Val     TCH  Homml  Finnr  Hochb  Ho-Si   Holm  Sidak  Bonfr
0.0020   0.004  0.008  0.005  0.008  0.008  0.008  0.010  0.010
0.0010   0.002  0.005  0.005  0.005  0.005  0.005  0.005  0.005
0.0360   0.079  0.072  0.059  0.096  0.104  0.108  0.167  0.180
0.0480   0.104  0.096  0.060  0.096  0.104  0.108  0.218  0.240
0.5000   0.788  0.500  0.500  0.500  0.500  0.500  0.969  1.000
-----------------------------------------------------------------
*Adjusted for 5 multiple comparisons

We discover that none of the FWER procedures saved the day in the strict sense of adjusted p values < 0.05. However, we observe that Finner's procedure came close, with 0.059 and 0.060, in place of the 0.180 and 0.240 that Bonferroni gave. This would have allowed the reader to argue a "marginally significant" result, which is usually reported as a "trend toward significance."

This example also illustrates how different multiple comparison procedures can be the winner from situation to situation. Which procedure will be the winner depends on the pattern of the p values. Even statisticians are at a loss to predict a priori which procedure will be the winner, making a "pre-specified" analysis a nerve-racking practice to follow.

Problem 3) Analyzing Data with Multiple Treatments

An animal experiment was performed to test the effectiveness of a new drug. The researcher was convinced the drug was effective, but he was not sure which carrier solution would enhance drug delivery. (The drug is dissolved into the carrier solution so it can be delivered intravenously.) Three candidate carriers were considered, carrier A, B, and C.
The experiment then involved four groups:

treat: 1 = inert carrier only (control group)
       2 = active drug in carrier A
       3 = active drug in carrier B
       4 = active drug in carrier C

The researcher wanted to conclude the drug was effective if any of the active drug groups were significantly greater on the response variable than the control group. That is, the decision rule was: Conclude effectiveness if (group 2 > control) or (group 3 > control) or (group 4 > control).

The decision rule, or "win strategy", fits the multiple comparison situation where you want to control the family-wise error rate (FWER), which is the typical situation that researchers learn about in statistics courses. That is, you want to use multiple comparisons to arrive at a single conclusion, and you want to keep the Type I error at alpha ≤ 0.05.

To create the dataset, copy the following into the Stata do-file editor and execute it.

clear
set seed 999
set obs 24
gen treat = 1 in 1/6
replace treat = 2 in 7/12
replace treat = 3 in 13/18
replace treat = 4 in 19/24
gen response = invnorm(uniform())*4+1*treat
bysort treat: sum response

-> treat = 1

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
    response |         6    .2875054    3.057285  -5.066414   2.869501

-> treat = 2

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
    response |         6    1.772157     2.48272  -2.487267   4.081223

-> treat = 3

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
    response |         6    4.119424    7.105336  -3.477194   15.84198

-> treat = 4

    Variable |       Obs        Mean    Std. Dev.
      Min        Max
-------------+--------------------------------------------------------
    response |         6    6.501425    2.198203   2.682323   9.369428

For the sake of illustration, let's now perform a oneway analysis of variance, which many statistics instructors still erroneously teach is a necessary first step.

oneway response treat, tabulate

                        Analysis of Variance
    Source              SS         df      MS            F     Prob > F
------------------------------------------------------------------------
Between groups      133.575257      3   44.5250857      2.51     0.0876
 Within groups      354.143942     20   17.7071971
------------------------------------------------------------------------
    Total           487.719199     23   21.2051826

The oneway ANOVA is not significant (p = 0.088), so the statistics instructors who advocate this as a necessary first step would say you cannot go any further. You would then conclude that the drug was not effective.

As was pointed out in Chapter 2-8, in the section called "Common Misconception of Thinking Analysis of Variance (ANOVA) Must Precede Pairwise Comparisons," there is no reason to do this. Many statisticians are aware that the ANOVA test is ultra-conservative, and so they stay away from it when testing treatment effects.

A more correct and more powerful approach is to bypass the ANOVA test, going straight to making the three specific individual comparisons of interest, which are each of the three active drug groups (2, 3, and 4) with the control group (1). A multiple comparison procedure is applied to these three comparisons to control the family-wise error rate.

The homework problem is to perform an independent sample t-test between groups 1 and 2, 1 and 3, and 1 and 4, and then adjust the three obtained two-sided p values using the mcpi command. You will need to include an "if" statement in the ttest command, as was shown in Chapter 2-8 (see pp. 44-45). Apply Hommel's procedure, which is one of the methods used by the mcpi command.
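The "win strategy" decision rule itself is easy to mirror in code once the three two-sided p values are in hand. A hypothetical Python sketch, using Holm's adjustment as a simpler stand-in for Hommel's procedure (it reaches the same conclusion here):

```python
def holm_adjust(pvals):
    """Holm step-down adjusted p values, returned in the original order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adj, running_max = [0.0] * m, 0.0
    for rank, i in enumerate(order):
        running_max = max(running_max, pvals[i] * (m - rank))
        adj[i] = min(running_max, 1.0)
    return adj

# two-sided p values: group 2 vs 1, group 3 vs 1, group 4 vs 1
pvals = [0.3775, 0.2528, 0.0024]
adj = holm_adjust(pvals)
effective = min(adj) < 0.05   # the "win strategy": any adjusted p < 0.05
print([round(p, 4) for p in adj], effective)  # [0.5056, 0.5056, 0.0072] True
```

These agree with the Holm column of the mcpi output (0.506, 0.506, 0.007), and the smallest adjusted p value drives the conclusion.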
From the obtained results, should you conclude the drug is effective?

ttest response if treat==1 | treat==2, by(treat)
ttest response if treat==1 | treat==3, by(treat)
ttest response if treat==1 | treat==4, by(treat)
mcpi .3775 .2528 .0024

. ttest response if treat==1 | treat==2, by(treat)

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.1888         Pr(|T| > |t|) = 0.3775          Pr(T > t) = 0.8112

. ttest response if treat==1 | treat==3, by(treat)

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.1264         Pr(|T| > |t|) = 0.2528          Pr(T > t) = 0.8736

. ttest response if treat==1 | treat==4, by(treat)

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.0012         Pr(|T| > |t|) = 0.0024          Pr(T > t) = 0.9988

. mcpi .3775 .2528 .0024

ORIGINAL ORDER: anomaly corrected

 Unadj  ----------------------- Adjusted ----------------------------
 P Val     TCH  Homml  Finnr  Hochb  Ho-Si   Holm  Sidak  Bonfr
0.3775   0.560  0.377  0.377  0.377  0.442  0.506  0.759  1.000
0.2528   0.396  0.377  0.354  0.377  0.442  0.506  0.583  0.758
0.0024   0.004  0.007  0.007  0.007  0.007  0.007  0.007  0.007
-----------------------------------------------------------------
*Adjusted for 3 multiple comparisons

The p value for the hypothesis test of effectiveness is the smallest adjusted p value from the three comparisons. In this dataset, you would conclude: The drug was demonstrated to be effective (p = 0.007).

Chapter 2-9. Correlation

Chapter 2-10. Linear regression

Problem 1) regression equation

The regression output, among other things, shows the equation of the regression line. From simple algebra, we know the equation of a straight line is:

    y = a + bx

where y is the outcome, or dependent variable, x is the predictor, or independent variable, a is the y-intercept, and b is the slope of the line.
Fitting this equation is called "simple linear regression." Extending this equation to three predictor variables, the regression equation is:

    y = a + b1*x1 + b2*x2 + b3*x3

Fitting this equation is called "multivariable linear regression" to signify that more than one predictor variable is included in the equation.

Using the FEV dataset, fev.dta, predicting FEV by height,

regress fev height

------------------------------------------------------------------------------
         fev |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      height |   .1319756     .002955    44.66   0.000     .1261732     .137778
       _cons |  -5.432679    .1814599   -29.94   0.000    -5.788995   -5.076363
------------------------------------------------------------------------------

We can find the intercept and slope from the coefficient column of the regression table and write the linear equation as,

    fev = -5.432679 + 0.1319756(height)

Listing these variables for the first subject,

list fev height in 1

     +----------------+
     |   fev   height |
     |----------------|
  1. | 1.708       57 |
     +----------------+

To apply the prediction equation to predict FEV for this first subject, we can use the display command, where "*" denotes multiplication,

display -5.432679 + 0.1319756*57
2.0899302

We can get Stata to provide the predicted values from applying the regression equation, using the predict command. In the following example, we use "pred_fev" as the variable we choose to store the predicted values in. The predict command applies the equation from the last fitted model.

predict pred_fev
list fev height pred_fev in 1

     +---------------------------+
     |   fev   height   pred_fev |
     |---------------------------|
  1. | 1.708       57   2.089929 |
     +---------------------------+

The predicted value that Stata came up with is slightly different from what we got with the display command because it is using more decimal places of accuracy.
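Since the fitted equation is plain arithmetic, the prediction can be reproduced in any language. A minimal Python sketch (hypothetical, hard-coding the coefficients shown in the Stata output):

```python
# coefficients from: regress fev height
slope, intercept = 0.1319756, -5.432679

def pred_fev(height):
    """Predicted FEV from the simple linear regression on height."""
    return intercept + slope * height

# first subject: height 57 inches
print(round(pred_fev(57), 7))  # 2.0899302, matching the display command
```

As with Stata's display versus predict, any small discrepancy against a stored predicted value comes only from the number of decimal places carried.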
Now that we see how it works, the homework problem is to fit a multivariable model for FEV with height and age as the predictor variables. Use the display command to predict FEV for the first subject. Then, use the predict command to check your answer. In other words, take all of the Stata commands we used above,

regress fev height
list fev height in 1
display -5.432679 + 0.1319756*57
capture drop pred_fev
predict pred_fev
list fev height pred_fev in 1

and modify them for the two predictors, height and age, used together.

The homework solution is the following commands:

regress fev height age
list fev height age in 1
display -4.610466 + 0.1097118*57 + 0.0542807*9
capture drop pred_fev
predict pred_fev
list fev height age pred_fev in 1

. regress fev height age

      Source |       SS       df       MS              Number of obs =     654
-------------+------------------------------           F(  2,   651) = 1067.96
       Model |  376.244941     2  188.122471           Prob > F      =  0.0000
    Residual |  114.674892   651  .176151908           R-squared     =  0.7664
-------------+------------------------------           Adj R-squared =  0.7657
       Total |  490.919833   653  .751791475           Root MSE      =   .4197

------------------------------------------------------------------------------
         fev |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      height |   .1097118   .0047162    23.26   0.000      .100451    .1189726
         age |   .0542807   .0091061     5.96   0.000     .0363998    .0721616
       _cons |  -4.610466   .2242706   -20.56   0.000    -5.050847   -4.170085
------------------------------------------------------------------------------

. list fev height age in 1

     +----------------------+
     |   fev   height   age |
     |----------------------|
  1. | 1.708       57     9 |
     +----------------------+

. display -4.610466 + 0.1097118*57 + 0.0542807*9
2.1316329

. capture drop pred_fev

. predict pred_fev
(option xb assumed; fitted values)

.
list fev height age pred_fev in 1

     +---------------------------------+
     |   fev   height   age   pred_fev |
     |---------------------------------|
  1. | 1.708       57     9   2.131634 |
     +---------------------------------+

Problem 2) Reporting the results

Using the regression model output from Problem 1, which is

      Source |       SS       df       MS              Number of obs =     654
-------------+------------------------------           F(  2,   651) = 1067.96
       Model |  376.244941     2  188.122471           Prob > F      =  0.0000
    Residual |  114.674892   651  .176151908           R-squared     =  0.7664
-------------+------------------------------           Adj R-squared =  0.7657
       Total |  490.919833   653  .751791475           Root MSE      =   .4197

------------------------------------------------------------------------------
         fev |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      height |   .1097118   .0047162    23.26   0.000      .100451    .1189726
         age |   .0542807   .0091061     5.96   0.000     .0363998    .0721616
       _cons |  -4.610466   .2242706   -20.56   0.000    -5.050847   -4.170085
------------------------------------------------------------------------------

write a sentence that could be used to interpret the association of height with FEV.

Solution: There are many ways to write this, so there are many correct answers to this problem. Here are two possibilities:

a) brief statement: Height was associated with FEV (p<0.001), controlling for age.

b) detailed statement: Height was significantly associated with FEV, after controlling for age [adjusted mean increase in FEV per one inch of height, 0.11, 95% CI (0.10, 0.12), p<0.001].

Problem 3) Are the signs (positive or negative) of correlations transitive?

When we speak of correlation, referring to the regression coefficient or referring to the correlation coefficient is analogous, since the correlation coefficient is simply the regression coefficient computed after transforming the data to standardized scores, or z-scores. Both approaches test for a correlation (an association).
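The equivalence just stated (Pearson’s r equals the least-squares slope after both variables are converted to z-scores) is easy to verify numerically. Here is an illustrative sketch outside Stata, in Python; the small height and FEV vectors are made-up data for the demonstration:

```python
# Illustrative check (data made up): Pearson's r equals the least-squares
# regression slope computed after z-scoring both variables.
from math import sqrt
from statistics import mean, pstdev

def zscores(v):
    """Standardized scores: (x - mean) / SD."""
    m, s = mean(v), pstdev(v)
    return [(x - m) / s for x in v]

def slope(x, y):
    """Least-squares regression slope of y on x."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return num / sum((a - mx) ** 2 for a in x)

def pearson_r(x, y):
    """Pearson correlation coefficient."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den

height = [57, 60, 62, 65, 70]     # hypothetical data
fev = [1.7, 2.1, 2.0, 2.6, 3.0]
print(round(pearson_r(height, fev), 6))
print(round(slope(zscores(height), zscores(fev)), 6))  # same value as above
```

The two printed values agree because z-scoring makes both variables have mean 0 and SD 1, which reduces the slope formula to the correlation formula.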
When the regression coefficient or the correlation coefficient is positive, we say the variables are “positively correlated”, so as you increase on one variable, you also increase on the other. When they are negative, we say the variables are “negatively correlated”, so as you increase on one variable, you decrease on the other. It does not matter which variable is the dependent or independent variable when you test for a correlation or when you interpret whether it is negatively or positively correlated.

In mathematics, the transitive property of equality is: If A=B and B=C, then A=C. Similarly, the transitive property of inequality is: If A≥B and B≥C, then A≥C.

Suppose you analyze your data, computing correlation coefficients between variables A, B, and C, and you get the following result:

A vs B: Pearson r = +0.75
B vs C: Pearson r = +0.70
A vs C: Pearson r = -0.50

Your co-investigator looks at these results and says to you, “This cannot be right. If B increases with increasing A, and C increases with increasing B, then it is impossible for C to decrease with increasing A. Think about it. If I put a beaker of water on a flame, the temperature of the water increases when the temperature of the beaker increases (A increase → B increase). If I put a stone in the water, the temperature of the stone goes up when the temperature of the water goes up (B increase → C increase). Now you are trying to tell me that when the temperature of the beaker increases the temperature of the stone decreases? (A increase → C decrease).”

What your co-investigator just did was assume that the sign of the correlation coefficient exhibits the transitive property. Is the co-investigator correct that you must have made a mistake? (Hint: You will not find this in the course manual chapters on regression or correlation. It is just an exercise in reasoning.
It is included here as a homework problem because this type of “reasoning” is quite common when researchers look for associations.)

Solution: The answer is that the co-investigator is making a logical mistake. The transitive property that we see in mathematics for equality and inequality does not hold in other settings, in general. The sign of the correlation coefficient does not have the transitive property. Here is a simple real-life example of the transitive property not holding: Bill loves Jane. Jane loves Bob. Does this imply Bill loves Bob? No, actually Bill hates Bob because he is jealous of Jane’s love for Bob.

Chapter 2-11. Logistic regression and dummy variables

Chapter 2-12. Survival analysis: Kaplan-Meier graphs, log-rank test, and Cox regression

Chapter 2-13. Confidence intervals versus p values and trends toward significance

Chapter 2-14. Pearson correlation coefficient with clustered data

Chapter 2-15. Equivalence and noninferiority tests

Problem 1) Difference in two proportions noninferiority test

Reboli et al. (N Engl J Med, 2007) conducted a randomized, double-blind, noninferiority trial to test their hypothesis that anidulafungin, a new echinocandin, is noninferior to fluconazole for the treatment of invasive candidiasis. In their article, they stated that their statistical method was, “The primary analysis in this noninferiority trial was a two-step comparison of the rate of global success between the two study groups at the end of intravenous therapy. A two-sided 95% confidence interval was calculated for the true difference in efficacy (the success rate with anidulafungin minus that with fluconazole). In the first step, noninferiority was considered to be shown if the lower limit of the two-sided 95% confidence interval was greater than -20 percentage points.
In the second step, if the lower limit was greater than 0, then anidulafungin was considered to be superior in the strict sense to fluconazole.”

To set up the data reported by Reboli et al., cut and paste the following into the Stata do-file editor, highlight it, and double-click on the last icon on the do-file menu bar to execute it.

clear all
set obs 245
gen anidulafungin=1 in 1/127
replace anidulafungin=0 in 128/245
recode anidulafungin 0=1 1=0 , gen(fluconazole)
label variable fluconazole // turn off variable label
gen globalresponse=1 in 1/96
replace globalresponse=0 in 97/127
replace globalresponse=1 in 128/198
replace globalresponse=0 in 199/245
label define anidulafunginlab 1 "1. anidulafungin" ///
    0 "0. fluconazole"
label values anidulafungin anidulafunginlab
label define fluconazolelab 1 "1. fluconazole" ///
    0 "0. anidulafungin"
label values fluconazole fluconazolelab
label define globalresponselab 1 "1. Success" 0 "0. Failure"
label values globalresponse globalresponselab
tab globalresponse anidulafungin, col
tab globalresponse fluconazole, col

Part a) Using the noninferiority margin of -20%, where anidulafungin could have an absolute percent success of 20 points (20%) less than fluconazole, test the noninferiority hypothesis with the appropriate confidence interval. Hint: Use the prtest command. Depending on which direction you want to compute the difference, use either the anidulafungin group variable or the fluconazole group variable.

Part b) If justified following the noninferiority analysis, test for superiority of anidulafungin to fluconazole.

Solution: In Stata, an easy way to get a confidence interval for the difference in two proportions is the prtest command. Normally in a data analysis, you use an indicator variable for the new therapy, which is anidulafungin, making the standard therapy, fluconazole, the referent group.
So, most likely, first you would try,

prtest globalresponse , by(anidulafungin)

Two-sample test of proportions             0. fluconazo: Number of obs =  118
                                           1. anidulafu: Number of obs =  127
------------------------------------------------------------------------------
    Variable |       Mean   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
0. fluconazo |   .6016949   .0450666                      .5133659    .6900239
1. anidulafu |   .7559055   .0381163                      .6811989    .8306121
-------------+----------------------------------------------------------------
        diff |  -.1542106   .0590242                     -.2698959   -.0385253
             |  under Ho:   .0595634    -2.59   0.010
------------------------------------------------------------------------------
        diff = prop(0. fluconazo) - prop(1. anidulafu)           z =  -2.5890
    Ho: diff = 0

    Ha: diff < 0                Ha: diff != 0                 Ha: diff > 0
 Pr(Z < z) = 0.0048         Pr(|Z| < |z|) = 0.0096         Pr(Z > z) = 0.9952

If we report this difference, -0.15, or -15%, it would be confusing to the reader, because we want to make a statement about anidulafungin relative to fluconazole, and this difference makes anidulafungin look like it had 15% less success than fluconazole. Actually, anidulafungin had 15% more success. The problem is that the prtest command lists the groups, or creates table rows, in numerical order and then subtracts the second row from the first. We want the subtraction in the opposite order. (The ttest command in Stata does the same thing for comparing means.)

We can still use this output, by 1) changing the sign on the difference to make it 0.154 rather than -0.154, and then 2) changing the signs of the two confidence limits and reversing their order, so (-0.270, -0.039) becomes (0.039, 0.270). Finally, converting to percentages makes this result 15.4% and (3.9%, 27.0%).

Alternatively, we can use the indicator variable that is coded in the opposite order, so that the subtraction is in the desired direction,

prtest globalresponse , by(fluconazole)
Two-sample test of proportions             0. anidulafu: Number of obs =  127
                                           1. fluconazo: Number of obs =  118
------------------------------------------------------------------------------
    Variable |       Mean   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
0. anidulafu |   .7559055   .0381163                      .6811989    .8306121
1. fluconazo |   .6016949   .0450666                      .5133659    .6900239
-------------+----------------------------------------------------------------
        diff |   .1542106   .0590242                      .0385253    .2698959
             |  under Ho:   .0595634     2.59   0.010
------------------------------------------------------------------------------
        diff = prop(0. anidulafu) - prop(1. fluconazo)           z =   2.5890
    Ho: diff = 0

    Ha: diff < 0                Ha: diff != 0                 Ha: diff > 0
 Pr(Z < z) = 0.9952         Pr(|Z| < |z|) = 0.0096         Pr(Z > z) = 0.0048

Converting to percentages, the result is 15.4% and (3.9%, 27.0%).

We see that noninferiority of anidulafungin relative to fluconazole was demonstrated, since the lower bound of the two-sided 95% confidence interval does not cross -20%. We are now justified to test for superiority, either by comparing the lower bound of this same confidence interval to zero, or by an ordinary significance test of the difference. Since we are testing for superiority only after first establishing noninferiority, no adjustment for multiplicity (multiple comparisons) is required. Superiority of anidulafungin to fluconazole is demonstrated since the lower bound, 3.9%, is greater than 0%. Likewise, we can use the two-sided p value, which is statistically significant (p=0.0096, which we report as p=0.01).
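The confidence interval and test statistic that prtest reports can also be reproduced by hand from the 2×2 counts (96/127 successes for anidulafungin, 71/118 for fluconazole). Here is an illustrative cross-check outside Stata, in Python, using the standard large-sample (Wald) formulas; the critical value 1.959964 is the usual two-sided 95% normal quantile:

```python
# Reproduce the prtest results from the trial's 2x2 counts
# (success counts taken from the Reboli et al. data setup above).
import math

x1, n1 = 96, 127   # anidulafungin: successes, total
x2, n2 = 71, 118   # fluconazole: successes, total
p1, p2 = x1 / n1, x2 / n2
diff = p1 - p2     # anidulafungin minus fluconazole

# 95% CI uses the unpooled standard error; z_0.975 = 1.959964
se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
lo, hi = diff - 1.959964 * se, diff + 1.959964 * se
print(f"diff = {diff:.4f}, 95% CI ({lo:.4f}, {hi:.4f})")
# diff = 0.1542, 95% CI (0.0385, 0.2699)

# Noninferiority: lower limit must exceed the -20 percentage-point margin
print("noninferior:", lo > -0.20)   # True
# Superiority (strict sense): lower limit must also exceed 0
print("superior:", lo > 0)          # True

# Two-sided z test of Ho: diff = 0 uses the pooled standard error
p_pool = (x1 + x2) / (n1 + n2)
se0 = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
z = diff / se0
p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p value
print(f"z = {z:.4f}, p = {p_value:.4f}")
# z = 2.5890, p = 0.0096
```

Note that the confidence interval uses the unpooled standard error (.0590242 in the Stata output) while the test statistic uses the pooled standard error under Ho (.0595634), which is why prtest prints both.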
In their article, Reboli et al. did just this, reporting this result in their Results section: “For the primary end point of global response at the end of intravenous therapy, a successful outcome was achieved in 96 of 127 patients in the anidulafungin group (75.6%), as compared with 71 of 118 patients in the fluconazole group (60.2%) (difference, 15.4 percentage points; 95% confidence interval [CI], 3.9 to 27.0); therefore, anidulafungin met the prespecified criteria for noninferiority to fluconazole. Since the confidence interval for the difference excluded 0, there was a significantly greater response rate in the anidulafungin group (P=0.01).”

Chapter 2-16. Validity and reliability

Chapter 2-17. Methods comparison studies

References

Cuellar NG, Ratcliffe SJ. (2009). Does valerian improve sleepiness and symptom severity in people with restless legs syndrome? Alternative Therapies 15(2):22-28.

Daniel WW. (1995). Biostatistics: A Foundation for Analysis in the Health Sciences. 6th ed. New York, John Wiley & Sons.

Hildebrand DK. (1986). Statistical Thinking for Behavioral Scientists. Boston, Duxbury Press.

Kumara HMCS, Feingold D, Kalady M, et al. (2009). Colorectal resection is associated with persistent proangiogenic plasma protein changes: postoperative plasma stimulates in vitro endothelial cell growth, migration, and invasion. Ann Surg 249(6):973-977.

Miller-Davis C, Marden S, Leidy NK. (2006). The New York Heart Association Classes and functional status: what are we really measuring? Heart Lung 35(4):217-24.

Munro BH. (2001). Statistical Methods for Health Care Research. 4th ed. Philadelphia, Lippincott.

Onyike CU, Crum RM, Lee HB, Lyketsos CG, Eaton WW. (2003). Is obesity associated with major depression? Results from the Third National Health and Nutrition Examination Survey. Am J Epidemiol 158(12):1139-1153.

Reboli AC, Rotstein C, Pappas PG, et al. (2007).
Anidulafungin versus fluconazole for invasive candidiasis. N Engl J Med 356(24):2472-2482.

Rosner B. (2006). Fundamentals of Biostatistics. 6th ed. Belmont, CA, Duxbury Press.

Sulkowski MS, Thomas DL, Chaisson RE, Moore RD. (2000). Hepatotoxicity associated with antiretroviral therapy in adults infected with human immunodeficiency virus and the role of hepatitis C or B virus infection. JAMA 283(1):74-80. {cited in: Ch 2-1, 3-5, 3-9}