Sample-Based Epidemiology Concepts: Conditional Probabilities, Relative Risk, Attributable Risk, Excess Risk, & Linear Regression

            Deaths    Alive        Total
Unmarried   16,712    1,197,142    1,213,854
Married     18,784    2,878,421    2,897,205
Total       35,496    4,075,563    4,111,059

Getting back to the original population data, we can now extend the probability concept. The previous examples used total population data without looking at any of the sub-categories of data. Remember:

P(D) "probability of infant death" = 35,496 total deaths / 4,111,059 total live births = 0.0086 (8.6 infant deaths / 1000 live births)

If we are interested in the probability of infant death for a birth associated with the unmarried status of the mother, then we have to look at the conditional probability of the outcome, producing a new formula:

P(A | B) = probability of A "infant death", conditional on B "unmarried mother"

P(A | B) = f(A & B) / f(B) = 16,712 / (16,712 + 1,197,142) = 16,712 / 1,213,854 = 0.014, or 14 deaths / 1000 births

An equivalent formula would be:

P(A | B) = P(A & B) / P(B) = [f(A & B) / total f] / [f(B) / total f]

P(A & B) = 16,712 / 4,111,059 = 0.0041
P(B) = 1,213,854 / 4,111,059 = 0.295
P(A | B) = 0.0041 / 0.295 = 0.014, or 14 deaths / 1000 births

Notice that P(A | B) is simply the joint probability of A & B (the incidence of unmarried-mother births that do not survive) divided by the marginal probability of B (the incidence of unmarried-mother births).

Now, all we have to do is take this probability approach to the concepts of relative risk, odds ratio, attributable risk, excess risk, and linear regression of P relative to levels of exposure. To do this we need to go back to those probability expressions, P(D), P(D̄), P(E), and P(Ē), and a brief review of the concepts of Joint Probabilities, Marginal Probabilities, and Conditional Probabilities.

Let's now use a sample (n = 200) to illustrate these associations between low-birthweight infants and marital status of the mother:

Birthweight   Unmarried   Married   Total
Low           7           7         14
Normal        52          134       186
Total         59          141       200

Joint Probability (P within the population):
P(unmarried mother AND low birthweight) = 7 / 200 = 0.035
P(not unmarried AND not low birthweight) = 134 / 200 = 0.67

Marginal Probability (P within the population):
P(low birthweight infant) = 14 / 200 = 0.07
P(not low birthweight infant) = 186 / 200 = 0.93

Conditional Probability (P within the conditioned variable):
P(low birthweight | unmarried) = 7 / 59 = 0.119
P(low birthweight | not unmarried) = 7 / 141 = 0.050

Relative Risk is the ratio of 2 conditional probabilities; you simply take the probability of the disease in question conditional on the presence of the risk factor and divide that probability by the probability of disease conditional on the absence of the risk factor.

RR = P(D | E) / P(D | Ē)

Using the infant data with low birthweight as the "disease" and unmarried mother as the "risk":

RR = P(D | E) / P(D | Ē) = (7 / 59) / (7 / 141) = 0.118644 / 0.049645 = 2.39

Indicating there is a 2.39-fold increase in risk for low birthweight if the mother is unmarried, relative to the risk of low birthweight if the mother is married.

Odds Ratio is a slightly different concept; it compares the odds of D in the exposed and unexposed subgroups.
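The joint, marginal, and conditional probabilities above (and the relative risk built from them) can be checked with a few lines of arithmetic; a minimal sketch in Python using the n = 200 sample table:

```python
# 2x2 sample table: low birthweight vs. marital status (n = 200)
low_unmarried, low_married = 7, 7
normal_unmarried, normal_married = 52, 134
n = low_unmarried + low_married + normal_unmarried + normal_married  # 200

# Joint probability: cell count / grand total
p_joint = low_unmarried / n                               # P(unmarried AND low)

# Marginal probability: row total / grand total
p_low = (low_unmarried + low_married) / n                 # P(low birthweight)

# Conditional probability: cell count / column total
p_low_given_unmarried = low_unmarried / (low_unmarried + normal_unmarried)  # 7/59
p_low_given_married = low_married / (low_married + normal_married)          # 7/141

# Relative risk: ratio of the two conditional probabilities
rr = p_low_given_unmarried / p_low_given_married

print(round(p_joint, 3), round(p_low, 2))   # 0.035 0.07
print(round(rr, 2))                         # 2.39
```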
OR = [P(D | E) / P(D̄ | E)] ÷ [P(D | Ē) / P(D̄ | Ē)]

Using the same data:

Birthweight   Unmarried   Married   Total
Low           7           7         14
Normal        52          134       186
Total         59          141       200

OR = [(7 / 59) / (52 / 59)] ÷ [(7 / 141) / (134 / 141)] = 0.13461 / 0.05223 = 2.58

Indicating that the odds of a low-birthweight infant for an unmarried mother are 2.58-fold greater than the odds of a low-birthweight infant if the mother is married.

Excess Risk is a different concept altogether; it relates to absolute differences in risk rather than relative (or ratios of) risks.

ER = P(D | E) - P(D | Ē)

Using the current example:

ER = 7 / 59 - 7 / 141 = 0.069

Indicating that there would be an increase of ~ 7% in low birthweights if exposure status changed from Ē to E. One use of ER is to evaluate the possible impact of public health programs. For example, the RR of lung cancer associated with smoking is ~ 9.0 while that of CHD associated with smoking is ~ 2.3. The ER of CHD with smoking exposure is, however, larger (0.0024 vs. 0.0016) than that of lung cancer, indicating that cigarette intervention programs may be more important in terms of CHD.

Attributable Risk is a slightly different concept. It relates to absolute differences in risk but also incorporates the association between E and D as well as the prevalence of E. It attempts to address the question: what fraction of the total disease in the population can be explained by the presence of E?

AR = [P(D) - P(D | Ē)] / P(D)

Using the current example:

AR = (14 / 200 - 7 / 141) / (14 / 200) = 0.29

Indicating that 29% of low birthweights in the population (ok, really in the sample) are attributed to the marital status of the mother.
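The OR, ER, and AR calculations follow the same pattern; a quick sketch using the same 2x2 counts, labeled a, b, c, d as the notes do later:

```python
# Cells of the 2x2 table: exposed = unmarried mother, disease = low birthweight
a, b = 7, 52       # exposed: low, normal
c, d = 7, 134      # unexposed: low, normal
n = a + b + c + d  # 200

p_d_given_e = a / (a + b)        # P(D | exposed)   = 7/59
p_d_given_not_e = c / (c + d)    # P(D | unexposed) = 7/141
p_d = (a + c) / n                # P(D)             = 14/200

# Odds ratio: odds of D in exposed vs. unexposed
odds_ratio = (a / b) / (c / d)

# Excess risk: absolute difference in conditional risks
excess_risk = p_d_given_e - p_d_given_not_e

# Attributable risk: fraction of total disease explained by exposure
attributable_risk = (p_d - p_d_given_not_e) / p_d

print(round(odds_ratio, 2))         # 2.58
print(round(excess_risk, 3))        # 0.069
print(round(attributable_risk, 2))  # 0.29
```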
Of course, being unmarried does not cause low birthweights; rather, there are many factors associated with being an unmarried mother that may be causal. As before, the use of samples precludes certainty as to whether your calculated indices of risk truly represent the population from which the sample was taken. In order to deal with this uncertainty, confidence limits can be calculated for these risk indices in much the same way as they were calculated for the marginal probabilities explained earlier.
- The calculated RR (or OR) from the sample is assumed to be a reasonable estimate of the true population risk.
- Based on the sample, a frequency distribution of RR's (or OR's) from an infinite number of samples with the same N as used in your sample is predicted.
- The calculated RR is assumed to be in the middle of the frequency distribution, and 90% or 95% CI's are calculated.
- The lower and upper limits of the CI reveal the range of values that the population RR (or OR) might fall within (with 90% or 95% probability).
Calculating confidence limits for the "ratios" is a slightly different affair than for marginal probabilities because you are looking at associations between two (or more) different probabilities, which somehow have to be combined.
Because OR is very commonly used in epidemiology, we'll start with that. To make the arithmetic easier, the OR formula is rearranged. Using some new data on coffee consumption and pancreatic cancer:

                 E = Coffee     Ē = Not Coffee   Total
D = Cancer       (a) 347        (c) 20           (a + c) 367
D̄ = Not Cancer   (b) 555        (d) 88           (b + d) 643
Total            (a + b) 902    (c + d) 108      (a + b + c + d) 1010

From before:

OR = [P(D | E) / P(D̄ | E)] ÷ [P(D | Ē) / P(D̄ | Ē)]
   = (347 / 902) / (555 / 902) ÷ (20 / 108) / (88 / 108)
   = 0.625224 / 0.227272 = 2.75

Rearranging the formula (the 902's and 108's cancel, then cross-multiply) gives:

OR = ad / bc = (347 x 88) / (555 x 20) = 30536 / 11100 = 2.75

As before, in estimating confidence limits you use a theoretical distribution of an infinite number of sample statistics. In this case (OR), however, the frequency distribution of an infinite number of OR values would be skewed to the right and therefore not very useful. In order to deal with this, you make some sort of transformation of the data. In the case of OR, a frequency distribution of an infinite number of OR values that have been transformed to their natural logarithms would yield a symmetrical curve that is normally distributed. You then can use the appropriate z-score x √variance to obtain the desired confidence limits (90%? 95%?). For a 95% CI:

ln OR ± 1.96 √V

You then take the exponent of the calculated values to convert back into our normal language.
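That the conditional-probability form and the rearranged ad/bc form give the same OR is easy to verify numerically with the coffee data; a small sketch:

```python
# Coffee / pancreatic cancer cells: a, b exposed; c, d unexposed
a, b, c, d = 347, 555, 20, 88

# Conditional-probability form: the column totals appear in every term
n_e, n_not_e = a + b, c + d   # 902 and 108
or_long = ((a / n_e) / (b / n_e)) / ((c / n_not_e) / (d / n_not_e))

# Rearranged form: the 902's and 108's cancel, leaving ad/bc
or_short = (a * d) / (b * c)

print(round(or_long, 2), round(or_short, 2))   # 2.75 2.75
```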
EXAMPLE (coffee consumption and pancreatic cancer):

                 E = Coffee     Ē = Not Coffee   Total
D = Cancer       (a) 347        (c) 20           (a + c) 367
D̄ = Not Cancer   (b) 555        (d) 88           (b + d) 643
Total            (a + b) 902    (c + d) 108      (a + b + c + d) 1010

OR = ad / bc = (347 x 88) / (555 x 20) = 30536 / 11100 = 2.75

To calculate confidence limits:

Take the natural log of the OR (2.75): ln OR = 1.0116

Calculate the variance of the ln OR:

V ln OR = 1/a + 1/b + 1/c + 1/d = 1/347 + 1/555 + 1/20 + 1/88 = 0.066

Now that you have the ln OR and V ln OR you just choose the z-score that corresponds to your desired CI and calculate away ...

ln OR = 1.0116
V ln OR = 0.066

If you decide to pick a 95% CI then you use the z-score of 1.96. (If you wanted a 90% CI then you would use a z-score of 1.64.)

95% CI of ln OR = 1.0116 ± 1.96 √0.066 = 1.0116 ± 0.5035 = 1.0116 (0.508, 1.516)

Remember that these are natural logs, so we have to take the exponent of the ln's to convert back to "normal arithmetic" and we get:

exp 1.0116 = 2.75    exp 0.508 = 1.66    exp 1.516 = 4.55

95% CI OR = 2.75 (1.66, 4.55)

Indicating that the odds of cancer with coffee exposure are anywhere from one-and-a-half to four-and-a-half times greater than the odds of cancer without coffee exposure.

For RR, a different example will be used:

EXAMPLE (Personality Type and CHD):

             E = Type A      Ē = Type B      Total
D = CHD      (a) 178         (c) 79          (a + c) 257
D̄ = Not CHD  (b) 1411        (d) 1486        (b + d) 2897
Total        (a + b) 1589    (c + d) 1565    (a + b + c + d) 3154

Using the rearranged formula:

RR = [a / (a + b)] / [c / (c + d)] = (178 / 1589) / (79 / 1565) = 0.112 / 0.0505 = 2.22

To calculate confidence limits:

Take the natural log of the RR (2.22): ln RR = 0.797

Calculate the V of the ln RR:

V ln RR = b / [a (a + b)] + d / [c (c + d)] = 1411 / (178 x 1589) + 1486 / (79 x 1565) = 0.017

Now that you have the ln RR and V ln RR you just choose the z-score that corresponds to your desired CI and calculate away (sound familiar?)
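The log-transform confidence-limit recipe is the same for OR and RR (only the variance formula differs), so it can be wrapped in one small helper; a sketch reproducing both worked examples:

```python
import math

def ci_from_log(estimate, variance, z=1.96):
    """95% CI for a ratio via the log transform: exp(ln(est) +/- z*sqrt(var))."""
    half = z * math.sqrt(variance)
    ln_est = math.log(estimate)
    return math.exp(ln_est - half), math.exp(ln_est + half)

# OR example (coffee / pancreatic cancer): a=347, b=555, c=20, d=88
a, b, c, d = 347, 555, 20, 88
odds_ratio = (a * d) / (b * c)
v_ln_or = 1/a + 1/b + 1/c + 1/d
lo, hi = ci_from_log(odds_ratio, v_ln_or)
print(round(odds_ratio, 2), round(lo, 2), round(hi, 2))   # 2.75 1.66 4.55

# RR example (Type A / CHD): a=178, b=1411, c=79, d=1486
a, b, c, d = 178, 1411, 79, 1486
rr = (a / (a + b)) / (c / (c + d))
v_ln_rr = b / (a * (a + b)) + d / (c * (c + d))
lo, hi = ci_from_log(rr, v_ln_rr)
print(round(rr, 2), round(lo, 2), round(hi, 2))           # 2.22 1.72 2.87
```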
... ln RR = 0.797, V ln RR = 0.017

If you decide to pick a 95% CI then you use the z-score of 1.96. (If you wanted a 90% CI then you would use a z-score of 1.64.)

95% CI of ln RR = 0.797 ± 1.96 √0.017 = 0.797 ± 0.2555 = 0.797 (0.542, 1.053)

Remember that these are natural logs, so we have to take the exponent of the ln's to convert back to "normal arithmetic" and we get:

exp 0.797 = 2.22    exp 0.542 = 1.72    exp 1.053 = 2.87

95% CI RR = 2.22 (1.72, 2.87)

Indicating anywhere from about 2 to almost 3 times the risk for CHD in Type A individuals.

But what about that statistical analysis stuff which "figures out" if your data are "significant"? And all that probability stuff so you can say your results are significant at the 0.05 level, or the 0.1 level, or the 0.01 level?

This basic statistical concept is related to the question: Does your calculated OR and/or RR, which reveals observed associations between D and E in your sample, reflect a population in which these same associations are actual? (Or are the calculated associations simply the result of some random happenings resulting from your sampling process?)

One way to surmise the answer is to simply take into consideration the 95% CI of your ratios: as long as the range of possible values is not very "large," you can be reasonably sure that your calculated RR (or OR) is a reasonably accurate measure of the population RR or OR. And if the lower limit is not close to 1, there is a reasonable likelihood that the presence of E entails more risk for "getting" D than the absence of E does. If the CI includes 1, then you can't really preclude the possibility of NO effect with 95% confidence.

There is a trend in modern epidemiology to rely less and less on tests of significance and to rely more on using the calculated CI's as a direct indication of the true significance of your ratios' values.
On the other hand, there are still a lot of statisticians (and journals) who insist on using tests of significance to give "statistical relevance" to any conclusions made on the basis of the calculated risk indices. Because all of the analyses discussed so far used frequency data arranged in a 2 x 2 table of E, Ē, D, and D̄, analyzing for statistical significance will be based on these frequencies. To illustrate, we will use some case-control data on smoking and fatal / non-fatal automobile accidents, from which we obtain the following 2 x 2 contingency table:

        E      Ē      Total
D       68     32     100
D̄       104    96     200
Total   172    128    300

From this data we might calculate the OR:

OR = (68 x 96) / (32 x 104) = 6528 / 3328 = 1.96
ln OR = 0.6729
V ln OR = 1/68 + 1/104 + 1/32 + 1/96 = 0.06599
95% CI of ln OR = 0.6729 ± 1.96 √0.06599 = 0.6729 (0.16941, 1.17639)

Convert to real numbers and you get: OR (95% CI) = 1.96 (1.185, 3.243)

The fundamental analysis to be used is the chi-square test for independence. It is based on the premise that if the exposure variable were completely independent of the disease variable, then we would expect that P(D | E) = P(D | Ē), which would mean that the null hypothesis for our test is:

H0: RR = 1 and OR = 1 (and ER = 0)

In tests of statistical significance it is customary to formulate the test statement in the form of a null hypothesis. The null hypothesis assumes that there is no treatment effect or no association between the E and D variables, and based on the results of the statistical analysis used (an analysis which is appropriate for the type of data collected and the research design used) we will take one of two possible actions:
- We will reject the null hypothesis if our data indicate a reasonable probability of differences or associations, or
- We will fail to reject the null hypothesis if our data do not indicate a reasonable probability of differences or associations.
By using the chi-square test for independence we will get a statistic that represents just how different the observed frequencies are from those that would have occurred if there was no association between the variables. Using the case-control data on smoking and accidents we might OBSERVE the following contingency table:

        E      Ē      Total
D       68     32     100
D̄       104    96     200
Total   172    128    300

If there was absolutely no association between E and D we would expect identical incidences of D in both the E and Ē groups, giving the following EXPECTED* results:

        E       Ē      Total
D       57.3    42.7   100
D̄       114.7   85.3   200
Total   172     128    300

*100 fatality cases out of 300 total subjects = 33.3%, which means that 33.3% of the E group (57.3) would have D and 33.3% of the Ē group (42.7) would have D ... if no association between E and D existed.

The basic formula for the chi-square test for independence is as follows:

χ² = Σ (O - E)² / E

or, with the Yates correction for 2 x 2 tables:

χ² = Σ (|O - E| - ½)² / E

... and all we have to do is plug the observed frequencies and the expected frequencies into the formula to obtain our chi-square statistic:

χ² = (68 - 57.3)²/57.3 + (104 - 114.7)²/114.7 + (32 - 42.7)²/42.7 + (96 - 85.3)²/85.3
   = 1.99 + 0.99 + 2.67 + 1.33
   = 6.98

So, what does this mean? Before we can figure that out, one of the first things to do is imagine a frequency distribution of an infinite number of chi-square values that were calculated from an infinite number of samples randomly drawn from a population in which there was NO association between D and E. You can imagine that most of the samples would have values close to 0, some would be close to 1 or 2, while a few would, through random chance, be much, much larger than 2.
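The expected-frequency and chi-square arithmetic above can be sketched in a few lines (Yates correction omitted, as in the worked example):

```python
# Observed 2x2 table: smoking (E) vs. fatal accident (D)
observed = [[68, 32],     # D row:  E, not-E
            [104, 96]]    # not-D row: E, not-E

row_totals = [sum(row) for row in observed]          # [100, 200]
col_totals = [sum(col) for col in zip(*observed)]    # [172, 128]
n = sum(row_totals)                                  # 300

# Expected count for each cell = row total x column total / n
chi2 = 0.0
for i in range(2):
    for j in range(2):
        expected = row_totals[i] * col_totals[j] / n
        chi2 += (observed[i][j] - expected) ** 2 / expected

print(round(chi2, 2))   # 6.98
```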
The following table illustrates the probability of randomly selecting a sample with a given χ² value from a population where P(D | E) = P(D | Ē):

                                        Probability
df    0.995    0.99     0.975    0.95     0.90     0.10     0.05     0.025    0.01     0.005
1     -        -        0.001    0.004    0.016    2.706    3.841    5.024    6.635    7.879
2     0.010    0.020    0.051    0.103    0.211    4.605    5.991    7.378    9.210    10.597
3     0.072    0.115    0.216    0.352    0.584    6.251    7.815    9.348    11.345   12.838
4     0.207    0.297    0.484    0.711    1.064    7.779    9.488    11.143   13.277   14.860
5     0.412    0.554    0.831    1.145    1.610    9.236    11.070   12.833   15.086   16.750

- and a whole lot more numbers at higher values of df ...

df refers to (number of rows - 1) x (number of columns - 1). In the case of a 2 x 2 table, df = 1.

From the example of smoking and fatal automobile accidents we got a calculated χ² value of 6.98. From the table with df = 1, we can see that there is less than a 1% chance of randomly selecting such a sample from a population where P(D | E) = P(D | Ē). The relevant part of the table is reproduced here:

df    0.995    0.99     0.975    0.95     0.90     0.10     0.05     0.025    0.01     0.005
1     -        -        0.001    0.004    0.016    2.706    3.841    5.024    6.635    7.879

Based on the statistical analysis, the observed frequencies of fatal accidents (and non-fatal accidents) in the exposed and unexposed groups have less than a 1% chance of being derived from a population where P(D | E) = P(D | Ē), and therefore we reject the null hypothesis of No Association between D and E.

- KEEP THIS CONCEPT IN MIND -

There appears to be an association between smoking and fatal automobile accidents such that smoking confers an increased risk for dying in an accident. Of course, without the OR you can't tell the degree of the association, so just doing the chi-square test for "significance" isn't enough. Without the CI, the "accuracy" of the prediction also isn't apparent, so you need that too.

From the original analysis, we obtained OR (95% CI) = 1.96 (1.185, 3.243).
From the χ² analysis, we rejected the null hypothesis with a probability of < 0.01.
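For df = 1 the table lookup can also be done exactly: a chi-square variable with 1 df is the square of a standard normal, so P(χ² > x) = erfc(√(x/2)). A sketch using only the standard library:

```python
import math

def chi2_sf_df1(x):
    # P(chi-square with 1 df > x); chi-square(1) is Z^2 for standard normal Z,
    # so P(Z^2 > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2))
    return math.erfc(math.sqrt(x / 2))

# Reproduce two critical values from the df = 1 row of the table:
print(round(chi2_sf_df1(3.841), 3))   # ~0.05
print(round(chi2_sf_df1(6.635), 3))   # ~0.01

# The smoking / fatal-accident statistic:
p = chi2_sf_df1(6.98)
print(0.005 < p < 0.01)               # True, i.e. P < 0.01
```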
We would illustrate the results of our analysis of smoking and fatal accidents in a manner something like this:

OR (95% CI) = 1.96 (1.185, 3.243), P < 0.01

or, to put things in proper perspective: there is less than a 1% chance of being wrong by concluding that, with 95% confidence, the odds of a smoker being involved in a fatal automobile accident appear to be anywhere from ~ 20% to ~ 225% greater than the odds for non-smokers to be involved in a fatal automobile accident.

Regression Analysis (several different variations, but only two will be illustrated) is used when one or more of the variables are stratified or continuous in nature. These analyses can illustrate how risk (or probability) of disease may change when the degree of exposure changes.

Linear model:

Px = P(D | X = x) = a + bx

Plotting the probabilities of D conditional on exposure level X (at each level x determined) on a graph produces a straight line with intercept = a and slope = b (the change in P for each unit of x). The intercept (a) illustrates the risk of D as a probability when exposure = 0. The slope of the line illustrates the excess risk for each increase in E of one x unit (whatever unit E was measured in) ...

Logistic Regression Analysis (and multiple logistic regression analysis) is used extensively in epidemiology research because associations between the calculated probabilities and exposure variables as measured are rarely perfectly linear.

log [ Px / (1 - Px) ] = log (odds for D | X = x) = a + bx

Plotting the log of the odds of D conditional on exposure level X (at each level x determined) on a graph produces a straight line with intercept = a and slope = b, while plotting Px itself against x produces a curved (S-shaped) line. Again, the intercept illustrates risk with exposure = 0 and the slope is the change in the log of the odds for each change in the level of E.
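To see how the two models behave differently, here is a sketch with made-up coefficients (the values of a and b are hypothetical, not fitted to any data in these notes): the linear model adds a fixed excess risk b per unit of exposure, while the logistic model keeps Px between 0 and 1.

```python
import math

# Hypothetical coefficients, chosen only for illustration
a_lin, b_lin = 0.02, 0.010    # linear model: intercept is a probability
a_log, b_log = -3.89, 0.20    # logistic model: intercept is a log-odds
                              # (-3.89 is roughly the log-odds of P = 0.02)

def p_linear(x):
    # Px = a + b*x : risk rises by the excess risk b per unit of exposure
    return a_lin + b_lin * x

def p_logistic(x):
    # log(Px / (1 - Px)) = a + b*x, solved for Px
    return 1 / (1 + math.exp(-(a_log + b_log * x)))

for x in (0, 5, 10, 20):
    print(x, round(p_linear(x), 3), round(p_logistic(x), 3))
```

Both models start near a 2% risk at x = 0; in the logistic model the slope b is the change in the log odds per unit of x, so exp(b) is the odds ratio for a one-unit increase in exposure.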
Multiple logistic regression is used when more than one exposure variable is measured; the log odds is then a function that takes into account all of the measured variables associated with risk. (Notice there were no formulas presented with which a and b are calculated.)

RR, OR, AR, ER, and Logistic Regression Analysis are all used in epidemiology research to locate and characterize possible D:E associations. Confidence limits are always calculated for each risk relationship to illustrate the potential error in predicting population parameters from sample statistics. OR, RR, ER, and AR are commonly used with binary data, while logistic regression analysis is used when one or more of the E variables are either stratified or continuous in nature ...

The next couple of slides are here in an attempt to go back to the original course discussion on the ideal vs. the political reality ... Many standard assumptions about disease risks were established on the basis of linear regressions or relative risks in conjunction with an inaccurate assumption of cause. It is entirely possible to calculate a linear regression between any variable and any other to see what associations exist ... For example, it is possible to calculate a linear regression between age-adjusted disease-specific mortality data (D) and fat consumption (E):

[Figure: Linear regression of incidence of death due to breast cancer on animal fat intake - Ken Carroll, 1975]
[Figure: Linear regression of incidence of death due to breast cancer on vegetable fat intake - Ken Carroll, 1975]

Based on the strong associations between animal fat and breast cancer, various hypotheses of animal fat being causal for breast cancer were constructed ... Similar associations between serum cholesterol and/or dietary fat and CHD were also observed in other studies, leading to many hypotheses regarding cholesterol and dietary fat being causal for CHD ...