The binomial applied: absolute and relative risks, chi-square

Probability speak (just shorthand!)

P(X) = “the probability of event X”
P(D) = “the probability of disease”
P(E) = “the probability of exposure”
P(~D) = “the probability of not getting the disease”
P(~E) = “the probability of not being exposed”
P(D/E) = “the probability of disease given exposure” or “the probability of disease among the exposed”
P(D/~E) = “the probability of disease given no exposure” or “the probability of disease among the unexposed”

Things that follow a binomial distribution

Cohort study (or cross-sectional):
  The number of exposed individuals in your sample that develop the disease
  The number of unexposed individuals in your sample that develop the disease

Case-control study:
  The number of cases that have had the exposure
  The number of controls that have had the exposure

Cohort study example

You sample 100 smokers and 100 nonsmokers and follow them for 5 years to see who develops heart disease.

Seeing it as a binomial

The number of smokers that develop heart disease in your study follows a binomial distribution with N=100, p=P(D/E).
The number of non-smokers that develop heart disease in your study follows a binomial distribution with N=100, p=P(D/~E).

A possible outcome:

                      Smoker (E)   Non-smoker (~E)
  Heart disease (D)       21             13
  No disease (~D)         79             87
  Total                  100            100

Statistics for these data

1. Risk ratio (relative risk)
2. Difference in proportions (absolute risk)
3. Chi-square test of independence. For 2x2 tables, this is mathematically equivalent to the difference in proportions Z test.

1. Risk ratio (relative risk)

                      Exposure (E)   No exposure (~E)
  Disease (D)              a               b
  No disease (~D)          c               d
  Total                   a+c             b+d

$$RR = \frac{a/(a+c)}{b/(b+d)} = \frac{\text{risk to the exposed}}{\text{risk to the unexposed}}$$

In probability terms:

$$RR = \frac{P(D/E)}{P(D/{\sim}E)} = \frac{a/(a+c)}{b/(b+d)} = \frac{\text{risk of disease in the exposed}}{\text{risk of disease in the unexposed}}$$

Risk ratio calculation, using the smoking and heart disease table above:

$$RR = \frac{21/100}{13/100} = \frac{21}{13} \approx 1.62$$

Interpretation: there is about a 62% increase in risk of heart disease in smokers vs. nonsmokers.

Inferences about risk ratio

Is our observed risk ratio statistically different from 1.0? What is the p-value?
I’m going to present statistical inference for the odds ratio; the risk ratio is similar. So, for now, just get the answer from SAS:
  95% confidence interval: 0.86 to 3.04
  P-value > .05

2. Difference in proportions

$$P(D/E) - P(D/{\sim}E) = \frac{a}{a+c} - \frac{b}{b+d}$$

For the smoking and heart disease table:

$$P(D/E) - P(D/{\sim}E) = 21\% - 13\% = 8\%$$

This is an absolute, rather than relative, risk difference!

Difference in proportions test

Null hypothesis: the difference in proportions = 0.
Under the null, the groups have the same risk of heart disease (= the overall risk in the study, 34/200 = .17):
  The number of smokers that develop heart disease in your study follows a binomial distribution with N=100, p=.17
  The number of non-smokers that develop heart disease in your study follows a binomial distribution with N=100, p=.17
The difference in proportions approximately follows a normal distribution, because the binomial can be approximated by a normal.
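The risk ratio, the risk difference, and the pooled risk used under the null can all be reproduced from the four counts in the 2x2 table. The following is a minimal Python sketch, not the SAS procedure the slides refer to: the function name `risk_summary` is hypothetical, and the confidence interval assumes the usual large-sample standard error of ln(RR), which the slides do not spell out.

```python
import math

def risk_summary(d_exposed, n_exposed, d_unexposed, n_unexposed, z_crit=1.96):
    """Summarize a cohort 2x2 table (hypothetical helper, for illustration only)."""
    risk_e = d_exposed / n_exposed        # P(D/E): risk among the exposed
    risk_u = d_unexposed / n_unexposed    # P(D/~E): risk among the unexposed
    rr = risk_e / risk_u                  # relative risk
    diff = risk_e - risk_u                # absolute risk difference
    # Pooled risk: the common p assumed for both groups under the null
    pooled = (d_exposed + d_unexposed) / (n_exposed + n_unexposed)
    # Assumption: large-sample standard error of ln(RR)
    se_ln_rr = math.sqrt(1 / d_exposed - 1 / n_exposed
                         + 1 / d_unexposed - 1 / n_unexposed)
    ci = (math.exp(math.log(rr) - z_crit * se_ln_rr),
          math.exp(math.log(rr) + z_crit * se_ln_rr))
    return {"rr": rr, "diff": diff, "pooled": pooled, "rr_ci": ci}

# 21 of 100 smokers and 13 of 100 non-smokers developed heart disease
summary = risk_summary(21, 100, 13, 100)
print(summary)
```

Under these assumptions the interval comes out close to the 0.86 to 3.04 quoted from SAS, which suggests (but does not prove) that SAS used the same log-scale method.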
Test statistic for the difference in proportions:

$$Z = \frac{p_1 - p_2}{\sqrt{\dfrac{\bar{p}(1-\bar{p})}{n_1} + \dfrac{\bar{p}(1-\bar{p})}{n_2}}}$$

Recall, the variance of a proportion is p(1-p)/n.

$$\bar{p} = \frac{n_1 p_1 + n_2 p_2}{n_1 + n_2} \quad \text{(just the average proportion)}$$

  p_1 = proportion in group 1
  p_2 = proportion in group 2
  n_1 = number in group 1
  n_2 = number in group 2

Use the average (or pooled) proportion in the standard error formula, because under the null hypothesis the groups have equal proportions.

Z-test applied here

$$\bar{p} = \frac{21 + 13}{200} = .17$$

$$Z = \frac{.08}{\sqrt{\dfrac{.17 \times .83}{100} + \dfrac{.17 \times .83}{100}}} = \frac{.08}{.053} = 1.51$$

The corresponding two-sided p-value is about .13.

Corresponding 95% confidence interval:

$$.08 \pm 1.96 \times \text{stderror} = .08 \pm 1.96 \times (.053) = -.02 \text{ to } .18$$

If the 95% confidence interval crosses the null value (here = 0), then p > .05.

OR, use computer simulation to make inferences

1. In SAS, assume an infinite population of smokers and non-smokers with equal disease risk, p=.17 (UNDER THE NULL!)
2. Use the random binomial function to randomly select n=100 smokers and n=100 non-smokers, each with p=.17
3. Calculate the observed difference in proportions.
4. Repeat this 1000 times (or some large number of times).
5. Observe the distribution of differences under the null hypothesis.

Computer simulation results

The empirical standard error is about 5.3%.

P-value from our simulation

When we ran this study 1000 times, by chance, we got 72 results as big or bigger than 8%.
We also got 82 results as small or smaller than -8%.
From our simulation, we estimate the p-value to be 154/1000, or .154.

3. Chi-square test of independence

                      Smoker (E)   Non-smoker (~E)
  Heart disease (D)       21             13
  No disease (~D)         79             87
  Total                  100            100

Null hypothesis: smoking and heart disease are independent.

What does it mean to be “independent” in stats?

Under independence, P(A&B) = P(A)*P(B).
In words, the “joint probability” equals the product of the “marginal probabilities.”
OR: the probability of both A and B happening is equal to the probability of A times the probability of B.
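The pooled Z-test and the five simulation steps for the difference in proportions can be sketched in Python rather than SAS. This is an illustration under stated assumptions, not the code behind the slides: the function names `two_proportion_z` and `simulate_null_differences` are hypothetical, each binomial draw is built from independent Bernoulli trials, and the seed is arbitrary, so the simulated counts will differ from the 72 and 82 reported above.

```python
import math
import random
import statistics
from statistics import NormalDist

def two_proportion_z(x1, n1, x2, n2):
    """Pooled Z-test for a difference in proportions (hypothetical helper)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)   # average proportion under the null
    se = math.sqrt(pooled * (1 - pooled) / n1 + pooled * (1 - pooled) / n2)
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))   # two-sided
    return z, se, p_value

def simulate_null_differences(n1, n2, p, reps=1000, seed=0):
    """Steps 1-5: draw both groups with the same risk p, record the difference."""
    rng = random.Random(seed)   # arbitrary seed, only for reproducibility
    diffs = []
    for _ in range(reps):
        x1 = sum(rng.random() < p for _ in range(n1))   # binomial(n1, p) draw
        x2 = sum(rng.random() < p for _ in range(n2))   # binomial(n2, p) draw
        diffs.append(x1 / n1 - x2 / n2)
    return diffs

z, se, p_value = two_proportion_z(21, 100, 13, 100)
print(f"Z = {z:.2f}, standard error = {se:.3f}, p = {p_value:.3f}")

diffs = simulate_null_differences(100, 100, 0.17)
observed = 0.08
# Count simulated differences at least as extreme as the observed one (either
# direction); the small tolerance guards against floating-point rounding.
extreme = sum(abs(d) >= observed - 1e-9 for d in diffs)
print(f"empirical standard error = {statistics.stdev(diffs):.3f}")
print(f"simulated p-value = {extreme / len(diffs):.3f}")
```

The simulated p-value tends to be somewhat larger than the normal-approximation one because the simulated differences are discrete (multiples of 1%), so "as big or bigger than 8%" includes the whole probability mass at exactly 8%.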
If smoking and heart disease are independent, then P(smoker & heart disease) = P(smoker)*P(heart disease).

Calculate expected counts under independence

                      Smoker (E)   Non-smoker (~E)
  Heart disease (D)       21             13
  No disease (~D)         79             87
  Total                  100            100

IF smoking and heart disease are independent, THEN:
  P(HeartDisease & Smoker) = P(HeartDisease)*P(Smoker)
  P(HeartDisease) = 34/200 = 17%
  P(Smoker) = 100/200 = 50%
IF INDEPENDENT, then P(HeartDisease & Smoker) should be 8.5%; 8.5% of 200 = 17.

Fill in the expected table (marginals are fixed!):

                      Smoker (E)   Non-smoker (~E)   Total
  Heart disease (D)       17             17            34
  No disease (~D)         83             83           166
  Total                  100            100           200

Notice that the rest of the table is determined after you fill in 17 for cell A. There are no degrees of freedom left! (This table has only 1 degree of freedom.)

Compare expected and observed counts

Expected:
                      Smoker (E)   Non-smoker (~E)
  Heart disease (D)       17             17
  No disease (~D)         83             83

Observed:
                      Smoker (E)   Non-smoker (~E)
  Heart disease (D)       21             13
  No disease (~D)         79             87

Chi-square test

$$\chi^2_1 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}}$$

$$\chi^2_1 = \frac{(21-17)^2}{17} + \frac{(13-17)^2}{17} + \frac{(79-83)^2}{83} + \frac{(87-83)^2}{83} = 2.27$$

2.27 is 1.51 squared: the chi-square test produces exactly the square of the Z-test statistic and the same p-value.

Degrees of freedom = (rows-1)*(columns-1) = (2-1)*(2-1) = 1

Rule of thumb: a chi-square statistic much greater than its degrees of freedom indicates statistical significance. Here 2.27 is not quite big enough; p is about .13.

Bonus material: the chi-square distribution is a sum of squared normal deviates

$$\chi^2_{df} = \sum_{i=1}^{df} Z_i^2, \quad \text{where } Z \sim \text{Normal}(0,1)$$

The expected value and variance of a chi-square: E(x) = df, Var(x) = 2(df)

Case-control study example

You sample 50 stroke patients and 50 controls without stroke and ask about their smoking in the past.

Possible study results:

                    Smoker (E)   Non-smoker (~E)   Total
  Stroke (D)            15             35            50
  No stroke (~D)         8             42            50

Statistics for these data

1. Odds ratio (relative measure)
2. Difference in proportions exposed (absolute measure)
3. Chi-square

What’s the risk ratio here?
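Looking back at the cohort chi-square test above, the expected counts and the test statistic follow mechanically from the observed table and its margins. A minimal Python sketch, with the hypothetical function name `chi_square_2x2`; the p-value relies on the fact stated above that a 1-degree-of-freedom chi-square is a squared standard normal, so its upper tail equals a two-sided normal tail.

```python
import math

def chi_square_2x2(table):
    """Chi-square test of independence for a 2x2 table [[a, b], [c, d]]
    (hypothetical helper, for illustration only)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Expected count under independence: row total * column total / grand total
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    stat = sum((obs - exp) ** 2 / exp
               for obs_row, exp_row in zip(table, expected)
               for obs, exp in zip(obs_row, exp_row))
    # With 1 df, P(chi-square > stat) = P(|Z| > sqrt(stat)) = erfc(sqrt(stat / 2))
    p_value = math.erfc(math.sqrt(stat / 2))
    return expected, stat, p_value

# Rows: heart disease, no disease; columns: smoker, non-smoker
expected, stat, p_value = chi_square_2x2([[21, 13], [79, 87]])
print(expected)
print(f"chi-square = {stat:.2f}, p = {p_value:.3f}")
```

The same function applies unchanged to any other 2x2 table of counts, since the test only uses the four cells and their margins.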
                    Smoker (E)   Non-smoker (~E)   Total
  Stroke (D)            15             35            50
  No stroke (~D)         8             42            50

Tricky: there is no risk ratio, because we cannot calculate the risk of disease!!

The odds ratio

We cannot calculate a risk ratio from a case-control study. BUT, we can calculate a measure called the odds ratio.

Odds vs. Risk

  If the risk is…      Then the odds are…
  ½ (50%)              1:1
  ¾ (75%)              3:1
  1/10 (10%)           1:9
  1/100 (1%)           1:99

Note: An odds is always higher than its corresponding probability, unless the probability is 0 (when both are 0). At a probability of 100% the odds are undefined.

The Odds Ratio (OR)

                     Exposure (E)   No exposure (~E)
  Disease (D)             a               b             a+b = cases
  No disease (~D)         c               d             c+d = controls

The proportion of cases to controls is set by the investigator; therefore, these counts do not represent the risk (probability) of developing disease.

$$OR = \frac{\text{odds of exposure in the cases}}{\text{odds of exposure in the controls}} = \frac{P(E/D)\,/\,P({\sim}E/D)}{P(E/{\sim}D)\,/\,P({\sim}E/{\sim}D)} = \frac{a/b}{c/d} = \frac{ad}{bc}$$

The odds of exposure in the cases relative to the controls is backward from what we want. But this expression is mathematically equivalent to the direction of interest:

$$\frac{P(E/D)\,/\,P({\sim}E/D)}{P(E/{\sim}D)\,/\,P({\sim}E/{\sim}D)} = \frac{P(D/E)\,/\,P({\sim}D/E)}{P(D/{\sim}E)\,/\,P({\sim}D/{\sim}E)} = \frac{\text{odds of disease in the exposed}}{\text{odds of disease in the unexposed}}$$

Proof via Bayes’ Rule (optional)

Start from the odds of exposure in the cases over the odds of exposure in the controls, and apply Bayes’ Rule to each term:

$$\frac{P(E/D)\,/\,P({\sim}E/D)}{P(E/{\sim}D)\,/\,P({\sim}E/{\sim}D)} = \frac{\dfrac{P(D/E)P(E)}{P(D)} \Big/ \dfrac{P(D/{\sim}E)P({\sim}E)}{P(D)}}{\dfrac{P({\sim}D/E)P(E)}{P({\sim}D)} \Big/ \dfrac{P({\sim}D/{\sim}E)P({\sim}E)}{P({\sim}D)}} = \frac{P(D/E)\,/\,P({\sim}D/E)}{P(D/{\sim}E)\,/\,P({\sim}D/{\sim}E)}$$

The P(D), P(~D), P(E), and P(~E) terms cancel, leaving the odds of disease in the exposed over the odds of disease in the unexposed. What we want!

The odds ratio

                    Smoker (E)   Non-smoker (~E)   Total
  Stroke (D)            15             35            50
  No stroke (~D)         8             42            50

$$OR = \frac{ad}{bc} = \frac{15 \times 42}{35 \times 8} = 2.25$$

Interpretation: there is a 2.25-fold higher odds of stroke in smokers vs. non-smokers.

Inferences about the odds ratio

Does the sampling distribution follow a normal distribution? What is the standard error?

Simulation

1. In SAS, assume an infinite population of cases and controls with equal proportion of smokers (exposure), p=.23 (UNDER THE NULL!)
2. Use the random binomial function to randomly select n=50 cases and n=50 controls, each with p=.23 chance of being a smoker.
3. Calculate the observed odds ratio for the resulting 2x2 table.
4. Repeat this 1000 times (or some large number of times).
5. Observe the distribution of odds ratios under the null hypothesis.

Properties of the OR (simulation) (50 cases/50 controls/23% exposed)

Under the null, this is the expected variability of the sample OR; note the right skew.

Properties of the lnOR

Normal!
From the simulation, we can get the empirical standard error (~0.5) and p-value (~.10).
Or, in general:

$$\text{standard error of } \ln(OR) = \sqrt{\frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}}$$

Inferences about the ln(OR)

OR = 2.25, ln(OR) = 0.81

$$Z = \frac{\ln(2.25) - 0}{\sqrt{\dfrac{1}{8} + \dfrac{1}{15} + \dfrac{1}{35} + \dfrac{1}{42}}} = \frac{0.81}{0.494} = 1.64; \quad p = .10$$

Confidence interval

$$95\%\ \text{CI for } \ln(OR) = 0.81 \pm 1.96 \times 0.494 = (-0.16,\ 1.78)$$

$$95\%\ \text{CI for } OR = (e^{-0.16},\ e^{1.78}) = (0.85,\ 5.92)$$

Final answer: 2.25 (0.85, 5.92)

2. Difference in proportions exposed

$$P(E/D) - P(E/{\sim}D) = 15/50 - 8/50 = 30\% - 16\% = 14\%$$

$$Z = \frac{14\% - 0\%}{\sqrt{\dfrac{.23 \times .77}{50} + \dfrac{.23 \times .77}{50}}} = \frac{.14}{.084} = 1.66$$

$$95\%\ \text{CI}: 0.14 \pm 1.96 \times .084 = -0.02 \text{ to } 0.30$$

3. Chi-square test of independence

Expected count for cell A:
  proportion: 0.5*.23 = .115
  count: .115*100 = 11.5

Expected and observed counts:

Expected:
                    Smoker (E)   Non-smoker (~E)
  Stroke (D)           11.5           38.5
  No stroke (~D)       11.5           38.5

Observed:
                    Smoker (E)   Non-smoker (~E)
  Stroke (D)            15             35
  No stroke (~D)         8             42

Chi-square test

$$\chi^2_1 = \frac{(15-11.5)^2}{11.5} + \frac{(8-11.5)^2}{11.5} + \frac{(35-38.5)^2}{38.5} + \frac{(42-38.5)^2}{38.5} = 2.77$$

2.77 is 1.66 squared. Not quite sufficient evidence to reject the null.
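The odds ratio, its standard error on the log scale, the Z statistic, and the confidence interval for the stroke example can be reproduced with a few lines of Python. This is a minimal sketch of the formulas above, not the SAS output; the function name `odds_ratio_inference` is hypothetical, and the cell labels a, b, c, d follow the odds ratio table (a = exposed cases, b = unexposed cases, c = exposed controls, d = unexposed controls).

```python
import math
from statistics import NormalDist

def odds_ratio_inference(a, b, c, d, z_crit=1.96):
    """Odds ratio with a Z-test and CI on the log scale
    (hypothetical helper, for illustration only)."""
    odds_ratio = (a * d) / (b * c)
    ln_or = math.log(odds_ratio)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)   # standard error of ln(OR)
    z = (ln_or - 0) / se                            # null value of ln(OR) is 0
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))    # two-sided
    # Build the interval on the log scale, then exponentiate the endpoints
    ci = (math.exp(ln_or - z_crit * se), math.exp(ln_or + z_crit * se))
    return {"or": odds_ratio, "ln_or": ln_or, "se": se,
            "z": z, "p": p_value, "ci": ci}

# 15 of 50 stroke cases and 8 of 50 controls were smokers
result = odds_ratio_inference(a=15, b=35, c=8, d=42)
print(result)
```

Working on the log scale is the design choice that matters here: as the simulation slides note, the sample OR is right-skewed while ln(OR) is approximately normal, so the interval is symmetric in ln(OR) but asymmetric around 2.25 after exponentiating. The sketch assumes no cell count is zero, since ln(OR) and its standard error are undefined in that case.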