Uses of Probabilities in Epidemiology Research
C. Murray Ardies, Ph.D.
Professor of Health Sciences, Northeastern Illinois University

Textbook Definition of Epidemiology
"... combination of knowledge and research methods concerned with the distribution of determinants of health and illness in populations and with contributors to health and control of health problems ... comprises an analytic, descriptive component termed classical epidemiology and a component concerned with critical appraisal of the research literature and diagnosis and management of illness, which is termed clinical epidemiology."
- PDQ EPIDEMIOLOGY by David Streiner and Geoffrey Norman, 1996, Mosby

My Definition of Epidemiology (actually a composite definition from a dozen or more texts and websites on epidemiology):

Classical Epidemiology: the study of the incidence and distribution of determinants and deterrents of morbidity and mortality in human populations.

Modern Epidemiology: the study of the incidence and distribution of determinants and deterrents of morbidity and mortality in manipulated and nonmanipulated human populations. (It may not be a better definition, but it is shorter!)

The research process often (but not necessarily) starts with an epidemiologic approach, either to determine the cause of something new that has appeared, or to figure out the cause of something that has been well characterized by its symptoms but not yet understood by etiology. A change in the incidence of "something" can very easily indicate that something is going on, and recognition of that change will initiate a series of investigations (how's that for vague?).
The job of the CDC (USA) is to monitor the incidence of diseases in the US and to investigate any sudden changes, such as an increase in the number of cases of an existing disease, or simply an increase in mortality due to unknown causes.

Reports of Cancer in Plumcoulee, MB (Population 200, 1987)
[Figure: month-by-month (J-D) timeline for 12 patients, tracking diagnoses (C), deaths (D), and follow-up (X).]
4 cases to start; 12 cases total; 8 new cases; 6 deaths.

Incidence - how many people get the disease:
number of new cases divided by the number of people at risk

Annual incidence in Plumcoulee, MB, 1987:
8 new cases / 196 at risk = 0.0408 cases/person/year = 40.8 cases / 1000 people / year

Some Incidence Data for Cancer (USA)
404.9 cancers / 100,000 females in 2001* (437**)
544.8 cancers / 100,000 males in 2001 (482)
0.3 lip cancers / 100,000 females in 2001 (0.4)
1.4 lip cancers / 100,000 males in 2001 (1.2)
127.2 breast cancers / 100,000 females in 2001 (135.1)
1.4 breast cancers / 100,000 males in 2001 (1.2)
45.8 colon & rectum cancers / 100,000 females in 2001 (51.4)
62.7 colon & rectum cancers / 100,000 males in 2001 (54.3)
53.2 lung & bronchus cancers / 100,000 females in 2001 (57.9)
87.7 lung & bronchus cancers / 100,000 males in 2001 (76.7)
*Age adjusted, CDC  **(crude)

Some more entertaining ways to play with the incidence data: Relative Risk and Odds Ratio.

Relative Risk
Using data from the Cholestyramine Study (Coronary Primary Prevention Trial):

                 Cardiac deaths   Still alive   Total
Cholestyramine       30 (A)        1870 (B)     1900
Placebo              38 (C)        1868 (D)     1906

RR = [A / (A + B)] / [C / (C + D)] = (30 / 1900) / (38 / 1906) = 0.792

The relative risk of cardiac death while on the drug is 79.2% of the risk when not on the drug - a risk reduction of ~21% by taking the drug.

Odds Ratio (also called relative odds) is an approximation of RR
and is often used when disease incidence is very low and there is a long latency period; it is commonly used in case-control studies.

Using data from Wynder & Graham, JAMA, 1950:

             Cases     Controls   Total
Smoker      659 (A)    984 (B)    1643
Nonsmoker    25 (C)    348 (D)     373

OR = (A / C) / (B / D) = (659 / 25) / (984 / 348) = 26.4 / 2.83 = 9.33

Another way to look at this:
The odds of a lung-cancer subject having been exposed to smoke are 659 / 25 = 26.4.
The odds of a non-cancer subject having been exposed to smoke are 984 / 348 = 2.83.
The relative odds of lung cancer from smoke exposure are therefore 26.4 / 2.83 = 9.33.

Some probability stuff that is directly related to the incidence stuff:
All of the prior examples looked at the incidence of something in a total population (ok, except for the Wynder data); i.e., the incidence of new cancer cases in the population of Plumcoulee, the incidence of death due to heart disease in the USA in 2001, the incidence of death due to heart disease in a population of people who participated in an experiment. Appropriate standardized incidence rates or relative risks were then calculated. Another way to look at the same incidence (or counting) data is to consider the numbers as illustrating probabilities rather than fractions or ratios - of course this means the same thing; probability is just another, more useful, term to get used to.
Let's use the data (ok, a very tiny selected piece of data) from the 1991 USA census.

Infant Mortality in the USA (1991)
              Deaths       Alive        Total
Unmarried     16,712     1,197,142    1,213,854
Married       18,784     2,878,421    2,897,205
Total         35,496     4,075,563    4,111,059

The probability (or the marginal probability) of having a particular disease or condition can be illustrated using the following formula: P(D) (remember this one for later).

Using the above data and the condition of infant mortality:
P(D) "probability of infant death" = 35,496 total deaths / 4,111,059 total live births = 0.0086 (8.6 infant deaths / 1000 live births)

Notice that this result is the total incidence of infant mortality for the population, just presented as a probability (0.0086) rather than an incidence.

The probability of NOT having a particular disease or condition can be illustrated using the following formula: P(D̄) (remember this one for later).

Using the above data and the condition of "infant living":
P(D̄) "probability infant alive" = 4,075,563 still alive @ 1 yr. / 4,111,059 total live births = 0.9914 (991.4 still alive @ 1 yr.
/ 1000 live births). Notice that this result is the total incidence of infant non-mortality for the population, just presented as a probability rather than an incidence.

Sample-Based Epidemiology Concepts

We rarely have the luxury of having the entire population at our disposal, so we usually take a small (or large, if you have the money and time, and even larger if you also have lots of post-docs to collate data) random sample from our selected population and estimate the population incidence (probabilities) based on the sample. This means that we will have errors in estimation: big errors if we use small numbers of people in our samples, and smaller errors if we use bigger numbers. Because of the error in estimating the population parameter, we have to calculate confidence limits for our estimate. Our sample predicts a parameter, but the parameter could be smaller or larger than the predicted value, so we need to know the range of possible values for the predicted parameter. To see how this works we have to delve into the incredibly cool universe of statistical analysis.

The terms confidence limits and estimates of population parameters are highly relevant to research in the health sciences because they are statistical concepts. Statistics and statistical analysis are nothing more than calculating measures of probability, association, central tendency, and variance of sample data (statistics), and the probabilities that the calculated statistics relate to the target population (statistical analysis). Of course statistical probabilities are not exactly the same as the actual population probabilities of infant mortality (0.0086) and infant non-mortality (0.9914) for the USA in 1991; these are two separate population parameters.
A parameter is any measure from a population, while a statistic is any measure from a sample. If we test entire populations then we do not need statistical analysis. For example: if another population (let's say, another country) were measured in its entirety and that country's infant mortality and infant non-mortality were calculated as 0.0085 and 0.9915, respectively [compared to 0.0086 and 0.9914 for the USA in 1991], we could conclude with absolute certainty (100% confidence) that the two populations were different with regard to these two parameters, because we would be absolutely certain that the calculated numbers exactly describe the respective populations (even though there is just a tiny difference between them). Different numbers mean different populations! However, because samples are not necessarily exactly representative of the population from which they came, differing numbers from two (or more) different samples do not necessarily guarantee that the samples came from two (or more) different populations.

As previously mentioned, we simply NEVER (well, not very often anyway) have the luxury of being able to measure the entire population, so we have to suffer with a (usually) small sample selected from the population. We then measure whatever it is we are interested in (let's say "infant mortality" or "height"), assume that our sample represents our population, and assume that whatever the sample number is, that same number applies to the entire population from which the sample was selected. Because such an assumption may not be absolutely true (i.e., the sample doesn't perfectly represent the population), we need to have some idea of the probability that the sample does represent the population. In other words: if there is a low probability the sample is like the population, then we won't have much confidence in our numbers ...
If there is a high probability our sample represents our population, then we can have a higher level of confidence (but never 100% certainty) in our numbers.

To understand how these statistical calculations are made we need to start with a frequency distribution of the entire set of population data:

[Figure: every individual data point plotted along a number line - an extremely accurate, but rather cumbersome, way to describe data, especially if there were hundreds or thousands of people in the population.]

[Figure: the same distribution drawn as a smooth curve - a little less accurate a description, but a whole lot easier, because only the shape of the line is being described, not each of the individual data points.]

Note that the shape of the line still accurately describes how the data are distributed on the number line; we just need a more precise way to describe the line. And there even is a way to calculate the two parts of the curve. (If you look at the right and left halves of the curve separately, you may recognize them as sigmoid curves.)

The measure of central tendency most often used to describe the peak of the data curve is called mu (µ), and the measure of variability most often used to describe the dispersion of the data along the number line is called the standard deviation (σ), which is equal to the square root of the variance (σ²).

µ = ∑x / n
(commonly called the average - add up all the scores and divide by the total number of scores)

σ² = ∑(x − µ)² / n
(subtract the mean from each score, square each result, add up all the squares, and then divide by n; take the square root of σ² to get σ)

The µ corresponds to the exact point on the number line where the central peak of the frequency distribution curve sits, and the σ corresponds to the exact point on the number line where the data start to spread out faster away from the mid-point.
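The two formulas above can be sketched directly in Python. These are population-level calculations (divide by n, not n − 1), matching the slide's definitions; the tiny data set is made up for illustration.

```python
from math import sqrt

def pop_mean(scores):
    """mu = sum(x) / n"""
    return sum(scores) / len(scores)

def pop_variance(scores):
    """sigma^2 = sum((x - mu)^2) / n"""
    mu = pop_mean(scores)
    return sum((x - mu) ** 2 for x in scores) / len(scores)

def pop_sd(scores):
    """sigma = sqrt(sigma^2)"""
    return sqrt(pop_variance(scores))

# A tiny made-up "population" of 8 scores:
data = [2, 4, 4, 4, 5, 5, 7, 9]
print(pop_mean(data))      # 5.0
print(pop_variance(data))  # 4.0
print(pop_sd(data))        # 2.0
```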
An advantage of describing your population in terms of how the data are distributed on a number line using µ and σ is that any population can be represented by this exact same kind of curved line, a line often called a normal curve. An important property of these curves is that they are very easy to describe in terms of mathematical probabilities. For example, we know that 50% of all the heights (data points) in the population are greater than the center point (µ = 5' 6.25"), which means there is a 0.50 probability that a randomly selected individual is taller than 5' 6.25". We also know that 68.26% of all the data points fall between the two ±1σ limits (4' 1.75" to 6' 10.75"), which means there is a 0.6826 probability that a randomly selected individual will be between 4' 1.75" and 6' 10.75" tall.

[Figure: normal curve showing the percentage of the data in each section, based on how far along the number line you go in σ units.]

Again, using percentages as probabilities, there is a 0.3413 probability that a randomly selected individual would fall between the mean and one standard deviation above the mean; to put it differently, we would be 34.13% confident that a randomly selected individual would be somewhere between the mean and +1 sd, or 2.28% confident that a randomly selected individual would be more than +2 sd above the mean. Note that the z-score number corresponds to the sd unit.

Now, to figure out where the confidence limits actually come from in all those epidemiology papers, the "baby" data illustrate this fairly well:

Marital status of samples of new mothers in the USA (1991)
             Sample1   Sample2   Sample3   Sample4
Unmarried       35        29        33        41
Married         65        71        67        59
Total          100       100       100       100

If we randomly sampled 100 live births from all of the 4,111,059 live births in the USA in 1991, we might find that 35 births were associated with unmarried mothers.
This would give a sample probability (statistic) of 35 unwed mothers / 100 live births = 0.35 - an estimate of the population probability (parameter) that a birth is associated with an unmarried mother. The sample probability (statistic) is not the correct probability for the entire population, just the correct probability for the sample. If we took three more (different) random samples from the same population, each of 100 live births, we would probably find a different probability for each sample that was randomly selected: we might get 29 / 100 = 0.29, 33 / 100 = 0.33, 41 / 100 = 0.41, and so on. And we would never be 100% certain (confident) that any one sample probability exactly represents the population parameter. We need some way to deal with this uncertainty, so we construct confidence limits, or a confidence interval.

If we could keep taking samples (of n = 100) and calculating probabilities forever, we would end up with an infinite number of sample probabilities. Sample probabilities close to the true population probability would appear numerous times, while those far away would appear less frequently; the most frequently occurring sample probability (from the infinite number of samples) would correspond to the population probability, while the least frequent probabilities would correspond to the extreme values. This infinite number of theoretical sample probabilities would obviously fit into some kind of frequency distribution curve that is normally distributed.
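The "infinite sampling" thought experiment can be approximated by simulation. This sketch draws many samples of n = 100 from a population with true probability 0.295 (the actual 1991 unmarried-birth proportion) and shows that the sample proportions pile up around the parameter; the rep count and seed are arbitrary choices.

```python
import random
from math import sqrt

random.seed(42)          # fixed seed so the sketch is reproducible
P_TRUE = 0.295           # true population probability (1,213,854 / 4,111,059)
N = 100                  # size of each sample
REPS = 10_000            # number of repeated samples (a stand-in for "infinite")

# Each sample proportion = (number of unmarried-mother births) / 100
props = [sum(random.random() < P_TRUE for _ in range(N)) / N
         for _ in range(REPS)]

mean_prop = sum(props) / REPS
sd_prop = sqrt(sum((p - mean_prop) ** 2 for p in props) / REPS)

print(round(mean_prop, 3))   # very close to 0.295
print(round(sd_prop, 3))     # close to sqrt(0.295 * 0.705 / 100) ~ 0.046
```

The spread of the simulated proportions matches the theoretical standard error √(p(1 − p)/n), which is exactly the quantity the confidence-interval formulas below are built on.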
From this theoretical normal distribution we can construct a confidence interval using standard percentile scores (actually the same sd units, called z-scores, illustrated in previous slides), which are then related to just how confident we want to be: 95% confident? 90% confident? 99% confident? Just plug the sample values you are interested in, and the z-score value that corresponds to your chosen confidence level, into the formula and voila: confidence intervals.

[Figure: the same normal curve with z-scores and percentages; the actual z-scores that correspond to the middle 95% and 90% of the data have been added.]

Just imagine that this curve illustrates the distribution of an infinite number of probabilities calculated from an infinite number of samples (n = 100) randomly selected from the same population. We already have some idea where the middle of this "population curve" sits on the number line, because we have the (ONE) sample estimate of that point; we are just not 100% confident that the sample statistic is exactly the same as the population parameter. What we need to know is the range of possible values within which the actual population center-point might lie, so we calculate that range using the above theoretical curve.

Marital status of a sample of new mothers in the USA (1991)
             Births
Unmarried      35
Married        65
Total         100
Probability (unmarried) = 0.35

Confidence interval, 95% (use z-score of 1.96):
0.35 ± 1.96 × √(0.35 × 0.65 / 100) = 0.35 ± 1.96 × √0.002275 = 0.35 (0.257, 0.443)

Confidence interval, 90% (use z-score of 1.645):
0.35 ± 1.645 × √0.002275 = 0.35 (0.272, 0.428)

*True population probability = 0.295 (1,213,854 / 4,111,059)

The confidence interval is simply the range of values, in a frequency distribution of values from all possible samples of the same size, between which you might expect to find the true population value (parameter); i.e.,
The sample statistic predicts that the parameter is 0.35, but it is 90% probable that the true parameter is somewhere between 0.272 and 0.428, and 95% probable that it is between 0.257 and 0.443.

[Figure: two graphs illustrating the previous calculations and the effect of sample size on the "accuracy" of using sample statistics to predict the population value.]

In the previous formula, the z-score values (1.96 or 1.645) set the confidence limits between which we will look for our predicted population value, and the term √(0.35 × 0.65 / 100) is a calculation of the standard error of the sample proportion; note that the sample n is part of the equation. The larger the n, the narrower the interval in predicting the population value (n = 1000: ≈ 0.285 - 0.305 vs. n = 100: ≈ 0.3 - 0.4, as illustrated in the graphs).

With smaller sample sizes, or with highly variable data, or with p near 0 or 1, it is problematic to accurately predict the population value this way, so this next formula (the score-based, or Wilson, interval) is actually used a lot more:

(2 × 100 × 0.35) + 1.96²  ±  1.96 × √(1.96² + (4 × 100 × 0.35 × 0.65))
———————————————————————————————————————
                     2 × (100 + 1.96²)

= (0.264, 0.447)

[previous calculation = 0.35 ± 1.96 × √0.002275 = (0.257, 0.443)]
True population probability = 0.295

*You will notice that all epidemiology publications give the confidence intervals associated with each variable measured.
**And since computers do all the work nowadays, and they can calculate exact intervals based on the sampling distribution of P (using the binomial distribution), we don't have to bother memorizing these formulas; we just need an idea of what the formulas are actually calculating.

Infant Mortality in the USA (1991)
              Deaths       Alive        Total
Unmarried     16,712     1,197,142    1,213,854
Married       18,784     2,878,421    2,897,205
Total         35,496     4,075,563    4,111,059

Getting back to the original population data, we can now extend the probability concept. The previous examples used total population data without looking at any of the sub-categories of data.
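Both interval formulas above can be sketched in Python: `wald_ci` is the simple normal-approximation interval and `wilson_ci` is the score-based formula. The function names are my own; this is a teaching sketch, not production statistics code.

```python
from math import sqrt

def wald_ci(p_hat, n, z=1.96):
    """Simple normal-approximation interval: p ± z * sqrt(p(1-p)/n)."""
    se = sqrt(p_hat * (1 - p_hat) / n)
    return (p_hat - z * se, p_hat + z * se)

def wilson_ci(p_hat, n, z=1.96):
    """Score-based (Wilson) interval, matching the formula above."""
    center = 2 * n * p_hat + z * z
    half = z * sqrt(z * z + 4 * n * p_hat * (1 - p_hat))
    denom = 2 * (n + z * z)
    return ((center - half) / denom, (center + half) / denom)

lo, hi = wald_ci(0.35, 100)            # 95% CI
print(round(lo, 3), round(hi, 3))      # 0.257 0.443

lo, hi = wald_ci(0.35, 100, z=1.645)   # 90% CI
print(round(lo, 3), round(hi, 3))      # 0.272 0.428

lo, hi = wilson_ci(0.35, 100)          # 95% score-based CI
print(round(lo, 3), round(hi, 3))      # 0.264 0.447
```

Note that the Wilson interval is not centered exactly on 0.35; it pulls the interval slightly toward 0.5, which is why it behaves better near p = 0 or 1.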
Remember: P(D) "probability of infant death" = 35,496 total deaths / 4,111,059 total live births = 0.0086 (8.6 infant deaths / 1000 live births).

If we are interested in the probability of infant death for a birth associated with the unmarried status of the mother, then we have to look at the conditional probability of the outcome, producing a new formula:

P(A | B) = "probability of" A "infant death" "conditional on" B "unmarried mother"

P(A | B) = f(A & B) / [f(A & B) + f(Ā & B)] = 16,712 / (16,712 + 1,197,142) = 16,712 / 1,213,854 = 0.014, or 14 deaths / 1000 births

An equivalent formula would be:

P(A | B) = P(A & B) / P(B) = [f(A & B) / total f] / [f(B) / total f]

P(A & B) = 16,712 / 4,111,059 = 0.0041
P(B) = 1,213,854 / 4,111,059 = 0.295
P(A | B) = 0.0041 / 0.295 = 0.014, or 14 deaths / 1000 births

Notice that P(A | B) is simply the joint probability of A & B (the incidence of unmarried-mother births that do not survive) divided by the marginal probability of B (the incidence of unmarried-mother births).

Now all we have to do is take this probability approach to the concepts of relative risk, odds ratio, attributable risk, excess risk, and linear regression of P relative to levels of exposure. To do this we need to go back to those probability expressions - P(D), P(D̄), P(E), P(Ē) - and a brief review of the concepts of joint probabilities, marginal probabilities, and conditional probabilities.
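The equivalence of the two conditional-probability formulas is easy to verify numerically; this sketch just plugs in the census counts from the table above.

```python
# Counts from the 1991 infant-mortality table
deaths_unmarried = 16_712
alive_unmarried = 1_197_142
total_births = 4_111_059

# Formula 1: frequency of (A & B) over total frequency of B
f_b = deaths_unmarried + alive_unmarried    # 1,213,854 unmarried-mother births
p1 = deaths_unmarried / f_b

# Formula 2: joint probability divided by marginal probability
p_a_and_b = deaths_unmarried / total_births  # P(A & B)
p_b = f_b / total_births                     # P(B)
p2 = p_a_and_b / p_b

print(round(p1, 3))   # 0.014  (14 deaths / 1000 unmarried-mother births)
print(round(p2, 3))   # 0.014  (identical, as the algebra guarantees)
```

The total-births denominator cancels in formula 2, which is why the two routes always agree.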
Let's now use a sample (n = 200) to illustrate these associations between low-birthweight infants and marital status of the mother:

Birthweight   Unmarried   Married   Total
Low                7          7        14
Normal            52        134       186
Total             59        141       200

Joint probability (P within population):
P(unmarried mother AND low birthweight) = 7 / 200 = 0.035
P(not unmarried AND not low birthweight) = 134 / 200 = 0.67

Marginal probability (P within population):
P(low-birthweight infant) = 14 / 200 = 0.07
P(not low-birthweight infant) = 186 / 200 = 0.93

Conditional probability (P within conditioned variable):
P(low birthweight | unmarried) = 7 / 59 = 0.119
P(low birthweight | not unmarried) = 7 / 141 = 0.050

Relative Risk is the ratio of two conditional probabilities: you simply take the probability of the disease in question conditional on the presence of the risk factor and divide that probability by the probability of the disease conditional on the absence of the risk factor.

RR = P(D | E) / P(D | Ē)

Using the infant data, with low birthweight as the "disease" and unmarried mother as the "risk":

RR = P(D | E) / P(D | Ē) = (7 / 59) / (7 / 141) = 0.118644 / 0.049645 = 2.39

indicating a 2.39-fold increase in risk for low birthweight if the mother is unmarried, relative to the risk of low birthweight if the mother is married.

Thune, I. & Lund, E., The Influence of Physical Activity on Lung-Cancer Risk: A Prospective Study of 81,516 Men and Women. Int. J.
Cancer 70: 57-62, 1997.

RELATIVE RISK* FOR LUNG CANCER IN MALES

Occupational Physical Activity    RR     95% CI
Sedentary                        1.00    reference
Walking                          1.15    0.90 - 1.47
Lifting                          1.13    0.87 - 1.47
Heavy Labor                      0.99    0.70 - 1.41
                                         p (trend) = 0.71

Recreational Physical Activity    RR     95% CI
Sedentary                        1.00    reference
Moderate                         0.75    0.60 - 0.94
Regular Training                 0.71    0.52 - 0.97
                                         p (trend) = 0.01

*Adjusted for age, BMI, region, and smoking habits (amount and duration)

RELATIVE RISK FOR LUNG CANCER IN MALE SMOKERS (>15 cig/day)

Recreational Physical Activity    RR     95% CI
Sedentary                        1.00    reference
Moderate                         0.77    0.52 - 0.96
Regular Exercise                 0.59    0.35 - 0.97
                                         p (trend) = 0.01

Odds Ratio is a slightly different concept; it compares the odds of D in the exposed and unexposed subgroups.

OR = [P(D | E) / P(D̄ | E)] ÷ [P(D | Ē) / P(D̄ | Ē)]

Using the same data:

Birthweight   Unmarried   Married   Total
Low                7          7        14
Normal            52        134       186
Total             59        141       200

OR = [(7 / 59) / (52 / 59)] ÷ [(7 / 141) / (134 / 141)] = 0.13461 / 0.05224 = 2.58

indicating that the odds of a low-birthweight infant for an unmarried mother are 2.58-fold greater than the odds of a low-birthweight infant if the mother is married.

Friedenreich, C.M., Bryant, H.E., and Courneya, K.S., Case-Control Study of Lifetime Physical Activity and Breast Cancer Risk. Am. J. Epidemiol. 154(4): 336-347, 2001.

ODDS RATIOS FOR BREAST CANCER IN POSTMENOPAUSAL WOMEN

Lifetime Total Physical Activity    OR     95% CI
0 - 104 MET-hours/week/year        1.00
104 - 128                          0.74    0.56 - 0.98
128 - 160                          0.82    0.62 - 1.08
160+                               0.80    0.61 - 1.06
                                           p (trend) = 0.04

ODDS RATIOS FOR BREAST CANCER RISK IN NULLIPAROUS WOMEN

Lifetime Total Physical Activity    OR     95% CI
0 - 104 MET-hours/week/year        1.00
104 - 128                          0.36    0.15 - 0.85
128 - 160                          0.88    0.36 - 2.13
160+                               0.34    0.12 - 0.94
                                           p (trend) = 0.02

Attributable Risk is a different concept altogether; it relates to absolute differences in risk rather than relative risks (ratios of risks).
AR = [P(D) − P(D | Ē)] / P(D)

Birthweight   Unmarried   Married   Total
Low                7          7        14
Normal            52        134       186
Total             59        141       200

AR = [P(D) − P(D | Ē)] / P(D) = (14 / 200 − 7 / 141) / (14 / 200) = 0.29

indicating that 29% of low birthweights in the population (ok, really in the sample) are attributable to the marital status of the mother. Of course, being unmarried does not cause low birthweights; rather, there are many factors associated with being an unmarried mother that may be causal.

Excess Risk is also a concept relating to absolute differences in risk rather than relative risks (ratios of risks).

ER = P(D | E) − P(D | Ē) = 7 / 59 − 7 / 141 = 0.069

indicating that there would be an increase of ~7% in low birthweights in the population (ok, really in the sample) if the marital status of the mother changed from married to unmarried. Again, being unmarried does not cause low birthweights; rather, there are many factors associated with being an unmarried mother that may be causal.

Regression Analysis (several different variations, but only two will be illustrated) is used when one or more of the variables are stratified or continuous in nature. These analyses can illustrate how risk (or probability) of disease may change when the degree of exposure changes.

Linear model: Px = P(D | X = x) = a + bx

Plotting the probabilities of D conditional on exposure level X (at each level x measured) on a graph produces a straight line with intercept a and slope b (the change in P for each unit of x). The intercept (a) illustrates the risk of D, as a probability, when exposure = 0. The slope of the line illustrates the excess risk for each increase in E of one x unit (whatever unit E was measured in).
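The four binary-data measures above (RR, OR, AR, ER) can be collected into one sketch that works from the 2x2 counts. The function name and the cell layout (a = exposed & diseased, etc.) are my own conventions, chosen to match the birthweight table.

```python
def risk_measures(a, b, c, d):
    """2x2 table:  a = exposed & diseased,   b = exposed & not diseased,
                   c = unexposed & diseased, d = unexposed & not diseased."""
    n = a + b + c + d
    p_d = (a + c) / n            # marginal P(D)
    p_d_e = a / (a + b)          # P(D | E)
    p_d_not_e = c / (c + d)      # P(D | not E)
    return {
        "RR": p_d_e / p_d_not_e,          # relative risk
        "OR": (a / b) / (c / d),          # odds ratio
        "AR": (p_d - p_d_not_e) / p_d,    # attributable risk
        "ER": p_d_e - p_d_not_e,          # excess risk
    }

# Low birthweight (disease) vs. unmarried mother (exposure):
m = risk_measures(7, 52, 7, 134)
print(round(m["RR"], 2))    # 2.39
print(round(m["OR"], 2))    # 2.58
print(round(m["AR"], 2))    # 0.29
print(round(m["ER"], 3))    # 0.069
```

Running all four measures on the same table makes the contrast clear: RR and OR are ratios, while AR and ER are absolute differences.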
Logistic Regression Analysis (and multiple logistic regression analysis) is used extensively in epidemiology research because the associations between the calculated probabilities and the exposure variables as measured are rarely perfectly linear.

log[Px / (1 − Px)] = log(odds of D | X = x) = a + bx

Plotting the log of the odds of D conditional on exposure level X (at each level x measured) on a graph often produces a curved line [on the probability scale] with intercept a and slope b (the change in the log odds for each unit of x). Again, the intercept illustrates risk when exposure = 0, and the slope is the change in the log of the OR for each change in the level of E. Multiple logistic regression is used when more than one exposure variable is measured; the log OR is then a function that takes into account all of the measured variables associated with risk. (Notice that no formulas were presented for calculating a and bx.)

RR, OR, AR, ER, and logistic regression analysis are all used in epidemiology research to locate and characterize possible D:E associations. Confidence limits are always calculated for each risk relationship to illustrate the potential error in predicting population parameters from sample statistics. RR, OR, AR, and ER are commonly used with binary data, while logistic regression analysis is used when one or more of the E variables are stratified or continuous in nature.
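No fitting formulas were given above, but the logistic model itself is easy to sketch. The coefficients here are hypothetical, purely for illustration; in practice a and b would be estimated from the data by maximum likelihood, which is what statistical packages do.

```python
from math import exp, log

def logistic_prob(a, b, x):
    """Invert log(p / (1 - p)) = a + b*x to recover P(D | X = x)."""
    return 1.0 / (1.0 + exp(-(a + b * x)))

# Hypothetical coefficients, for illustration only
a, b = -3.0, 0.5

# The model is linear on the log-odds scale ...
p = logistic_prob(a, b, 2.0)
print(round(log(p / (1 - p)), 6))     # -2.0, i.e. a + b*2

# ... and exp(b) is the odds ratio per one-unit increase in exposure
print(round(exp(b), 3))               # 1.649
```

This is why the slope b is described above as the change in the log of the OR per unit of exposure: exponentiating it gives the OR itself.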