HSRP 734: Advanced Statistical Methods
May 29, 2008

Finish talking about Association Measures: Odds Ratio
• OR = 2 of Disease for Exposed vs. Not exposed
• What is the interpretation?
• “Exposed patients have twice the odds of disease versus patients that were not exposed.”

Finish talking about Association Measures: Relative Risk
• RR = 2.5 of Disease for Exposed vs. Not exposed
• What is the interpretation?
• “Exposed patients are 2.5 times as likely to have the disease versus patients that were not exposed.”

Finish talking about Association Measures
• OR is not close to RR
• Unless Pr(disease) for Exposed, Not exposed low
• “Rare” disease

Finish talking about Association Measures
• Confidence intervals for Odds Ratio
• Confidence intervals for Relative Risk

Measures of Disease Association

                        Disease ($D$)   No Disease ($\bar{D}$)   Total
Exposed ($E$)                a                  b                 n1
Not Exposed ($\bar{E}$)      c                  d                 n0
Total                        m1                 m0                N

Confidence Interval for Odds Ratio
Confidence limits are based on the sampling distribution of

$$\ln\left[\frac{\hat{p}_1/(1-\hat{p}_1)}{\hat{p}_2/(1-\hat{p}_2)}\right] = \ln\left(\frac{ad}{bc}\right)$$

which is normal or approximately normal with

$$\text{Mean} = \ln\left(\frac{ad}{bc}\right) \quad\text{and}\quad \text{Variance} = \frac{1}{a}+\frac{1}{b}+\frac{1}{c}+\frac{1}{d}$$

Confidence Interval for Odds Ratio
1. Calculate $M = \ln(OR) = \ln\left(\dfrac{ad}{bc}\right)$
2. Calculate $S = \sqrt{\dfrac{1}{a}+\dfrac{1}{b}+\dfrac{1}{c}+\dfrac{1}{d}}$
3. Calculate 95% CI $= e^{M \pm 1.96\,S}$
*Use if N > 25

Confidence Interval for Risk Ratio
Confidence limits are based on the sampling distribution of

$$\ln\left(\frac{\hat{p}_1}{\hat{p}_2}\right) = \ln\left[\frac{a/(a+b)}{c/(c+d)}\right]$$

which is normal or approximately normal with

$$\text{Mean} = \ln\left(\frac{p_1}{p_2}\right) \quad\text{and}\quad \text{Variance} = \frac{b}{a(a+b)}+\frac{d}{c(c+d)}$$

Confidence Interval for Risk Ratio
1. Calculate $M = \ln(RR) = \ln\left[\dfrac{a/(a+b)}{c/(c+d)}\right]$
2. Calculate $S = \sqrt{\dfrac{b}{a(a+b)}+\dfrac{d}{c(c+d)}}$
3. Calculate 95% CI $= e^{M \pm 1.96\,S}$
*Use if N > 25

SAS Enterprise: or_rr.sas7bdat

SAS websites
• Online help: http://support.sas.com/onlinedoc/913/docMainpage.jsp
• UCLA: http://www.ats.ucla.edu/stat/SAS/
• SAS SUGI: http://support.sas.com/events/sasglobalforum/previous/index.html

Categorical Data Analysis
1. Understand the Multinomial probability mass function
2. Compute Goodness-of-fit tests and chi-squared tests for association
3. Test for association in the presence of a possibly confounding third factor (e.g., disease versus exposure from 3 sites)

Categorical Data Analysis
• Motivation
  – How do we estimate and test the magnitude of posited relationship when the outcome of interest is categorical?
  – e.g., An international study examines the relationship between age at first birth and the development of breast cancer
• Age = categorized into age groups

Categorical Data Analysis

              <20    20-24   25-29   30-34   >=35    Total
Cancer         320    1206    1011     463     220    3220
No cancer     1422    4432    2893    1092     406   10245
Total         1742    5638    3904    1555     626   13465

Categorical Data Analysis
• Research question
  – Is there a relationship between age at first birth and Cancer status?
• Better to convert the table into percentages (easier to see)
• Turns out that there is a significant relationship (p<0.001)

Categorical Data Analysis
• Statistical techniques involve
  – Probability distribution for categorical data
  – Tests for relationship in a RxC table
    R = # of Rows in Table
    C = # of Columns in Table

Probability Distribution for Categorical Outcomes
• Fun for Friday night:
  – Go home and flip a quarter 10,000 times. Determine if there is evidence that one side is falling down more.

Probability Distributions for Categorical Data
  – Bernoulli (1 toss of a coin, outcome=H,T)
  – Binomial (10 tosses of a coin, outcome=0,1,2..,10 heads)
  – Multinomial (throw 10 balls into 4 pigeon holes ABCD, outcome= (3A,2B,1C,4D))

Why use multinomial for testing?
• Relationship between 2 categorical variables
  – RxC table analysis
  – Based on multinomial distribution

Why use multinomial for testing?
• Example: 2 level exposure status (Exposed, Not exposed), 3 level outcome (severe, mild, no disease)
  – Treat 2x3=6 outcomes as categorical or a multinomial distribution with 6 pigeon holes
  – The expected probabilities of the pigeon holes are specified under some kind of assumption (e.g., independence)

Level of Measurement
  – Categorical response
    • dichotomous
    • ordinal (>2 categories, ordered)
    • nominal (>2 categories, not ordered)
  – Dichotomous: use Binomial distribution
  – Ordinal, Nominal: use Multinomial distribution

Multinomial Distribution
• Multinomial experiment:
  1. Experiment consists of n identical and independent trials
  2. Each trial results in one of K outcomes
  3. Let $p_i$ be the probability of outcome i
     a. Each $p_i$ remains constant across trials
     b. $\sum_{i=1}^{K} p_i = 1$
• The pmf for k outcomes is:

$$P(n_1, n_2, \ldots, n_k) = \frac{n!}{n_1!\, n_2! \cdots n_k!}\; p_1^{n_1} p_2^{n_2} \cdots p_k^{n_k}$$

• Notes: $\sum_{i=1}^{k} p_i = 1$; $\sum_{i=1}^{k} n_i = n$; $E(n_i) = n\,p_i$

Example of a Multinomial Experiment
Consider an unfair die and 6 tosses. Let

i       1     2     3     4     5     6
p_i    0.3   0.1   0.1   0.1   0.0   0.4
n_i     2     0     1     1     0     2

Find the probability of this outcome:

$$\Pr(2,0,1,1,0,2) = \frac{6!}{2!\,0!\,1!\,1!\,0!\,2!}\,(0.3^2)(0.1^0)(0.1^1)(0.1^1)(0.0^0)(0.4^2) = \frac{720}{4}\,(0.000144) = 0.02592$$

Simple Multinomial Experiments
Classical example: Mendel
Sample from the second generation of seeds resulting from crossing yellow round peas and green wrinkled peas (N=556)

        Yellow                Green
   Round   Wrinkled      Round   Wrinkled
    315      101          108       32

Mendel’s Laws of Inheritance suggest that we should expect the following ratios: 9/16, 3/16, 3/16, 1/16
For N = 556, the expected number of each outcome is:
E(YR) = 556 x 9/16 = 312.75
E(YW) = 556 x 3/16 = 104.25
E(GR) = 556 x 3/16 = 104.25
E(GW) = 556 x 1/16 = 34.75

             Yellow                             Green
   Round          Wrinkled           Round          Wrinkled
315 (312.75)    101 (104.25)      108 (104.25)     32 (34.75)
(Expected counts in parentheses)

Multinomial distribution
• The observed cell counts are not identical to the expected cell counts
• Under the assumption of a multinomial model with the stated probabilities, how might we determine how unlikely it is to observe these data?

Chi-square GOF Test
• Hypothesis: observed cell counts are consistent with the multinomial probabilities
• Theoretical result

$$\chi^2 = \sum_{i=1}^{k} \frac{(\text{Observed}_i - \text{Expected}_i)^2}{\text{Expected}_i} = \sum_{i=1}^{k} \frac{(n_i - N p_i)^2}{N p_i} \;\overset{\text{dist}}{\sim}\; \chi^2_{k-1}$$

• Require that expected cell counts are not too small: expected counts > 5.

Chi-square distribution
• Remarks about Chi-squared distribution:
  1. Nonsymmetric
  2. Strictly positive
  3. Different chi-squared distribution for each df.

Chi-square GOF Test
• Applying this test to Mendel’s peas example yields

                        Yellow                Green
                   Round   Wrinkled      Round   Wrinkled
Observed (n_i)      315      101          108       32
Expected (Np_i)    312.75   104.25       104.25    34.75

• H0: $p_{YR} = 9/16$, $p_{YW} = 3/16$, $p_{GR} = 3/16$, $p_{GW} = 1/16$
• H1: at least one $p_i$ differs from hypothesized value

Chi-square GOF Test

$$\chi^2 = \sum_{i=1}^{k} \frac{(\text{Observed}_i - \text{Expected}_i)^2}{\text{Expected}_i} = \sum_{i=1}^{k} \frac{(n_i - N p_i)^2}{N p_i}$$

$$= \frac{(315-312.75)^2}{312.75} + \frac{(101-104.25)^2}{104.25} + \frac{(108-104.25)^2}{104.25} + \frac{(32-34.75)^2}{34.75} = 0.47$$

Chi-square GOF Test
• Therefore, we observed $\chi^2 = 0.47$ from a multinomial experiment with k = 4. Thus, df = k-1 = 3.
For α = 0.05, $\chi^2_{1-\alpha,\,k-1} = \chi^2_{1-0.05,\,4-1} = \chi^2_{0.95,\,3} = 7.81$
• Thus, the observed chi-squared statistic is not greater than the critical value for α = 0.05 and df = 3.
• We fail to find evidence that these data depart from the hypothesized probabilities, i.e., the model fits the data well.

Testing association in 2x2 table
• This method translates to testing cross-tabulation tables for RxC cases
• Here the cells are formed by cross-classification of 2 variables
• Null hypothesis is that the 2 variables are independent
• Simplest case: 2x2 table

Testing association in 2x2 table
• Testing for independence or no association
• Similar idea to checking goodness-of-fit
  – Compare what you see to what you hypothesized to be true
  – You did, in fact, hypothesize “independence”

Basic Inference for 2x2 Tables
• 2x2 Contingency Table

                 Column Levels
Row Levels       1        2       Total
1               n11      n12      n1+
2               n21      n22      n2+
Total           n+1      n+2      N

Chi-square GOF Test for 2x2 Tables
• H0: There is no association between rows and columns
• Under H0, the expected cell counts are the product of the marginal probabilities and the sample size. Why?

$$\text{EXPECTED}_{ij} = N \cdot \frac{n_{i+}}{N} \cdot \frac{n_{+j}}{N} = \frac{\text{Total}_{ROW} \times \text{Total}_{COL}}{\text{Total}_{OVERALL}}$$

• The classic Pearson’s chi-squared test of independence

$$\chi^2 = \sum_{i=1}^{2} \sum_{j=1}^{2} \frac{(\text{Observed}_{ij} - \text{Expected}_{ij})^2}{\text{Expected}_{ij}} \;\overset{\text{dist}}{\sim}\; \chi^2_{1}$$

• df = (2-1) x (2-1) = 1
• Conservatively, we require $\text{EXPECTED}_{ij} \ge 5$ for all i, j

Other Tests for 2x2 Tables
• Two alternative tests
  – Yates’ continuity corrected chi-square statistic
  – Mantel-Haenszel chi-square statistic
• For sufficiently large sample size, all three Chi-squared statistics are approximately equal and all have a Chi-squared distribution with 1 df

When to use Chi-square vs. Fisher’s Exact
• When the expected cell counts are less than 5, it is better to use Fisher’s exact test.
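The multinomial probability, the goodness-of-fit statistic, and the expected counts under independence described above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library. The function names (`multinomial_pmf`, `pearson_chi_square`, `expected_counts`) are made up for this sketch and are not part of the course's SAS material. The critical value 7.81 is copied from the slides, not computed.

```python
from math import factorial, prod


def multinomial_pmf(counts, probs):
    """P(n1,...,nk) = n!/(n1!...nk!) * p1^n1 * ... * pk^nk."""
    coef = factorial(sum(counts))
    for c in counts:
        coef //= factorial(c)  # exact integer division at every step
    # Python evaluates 0.0 ** 0 as 1.0, matching the usual convention
    return coef * prod(p ** c for p, c in zip(probs, counts))


def pearson_chi_square(observed, expected):
    """Sum of (Observed - Expected)^2 / Expected over all cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def expected_counts(table):
    """Expected cell counts under independence: row total * column total / N."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


# Unfair die example from the slides: 6 tosses, outcome (2,0,1,1,0,2)
die_prob = multinomial_pmf([2, 0, 1, 1, 0, 2], [0.3, 0.1, 0.1, 0.1, 0.0, 0.4])

# Mendel's peas: goodness-of-fit against the 9:3:3:1 ratios
observed = [315, 101, 108, 32]
ratios = [9 / 16, 3 / 16, 3 / 16, 1 / 16]
expected = [sum(observed) * p for p in ratios]
gof = pearson_chi_square(observed, expected)
CRITICAL_95_DF3 = 7.81  # value quoted on the slide for alpha = 0.05, df = 3
reject_h0 = gof > CRITICAL_95_DF3

# Age at first birth vs. cancer status (2x5 table from earlier in the lecture)
age_table = [
    [320, 1206, 1011, 463, 220],    # Cancer
    [1422, 4432, 2893, 1092, 406],  # No cancer
]
age_expected = expected_counts(age_table)
assoc = pearson_chi_square(
    [o for row in age_table for o in row],
    [e for row in age_expected for e in row],
)
assoc_df = (len(age_table) - 1) * (len(age_table[0]) - 1)

print(f"die outcome probability: {die_prob:.5f}")
print(f"Mendel GOF chi-square:   {gof:.2f} (reject H0: {reject_h0})")
print(f"age table chi-square:    {assoc:.2f} on {assoc_df} df")
```

The same `pearson_chi_square` helper serves both tests. Only the source of the expected counts changes: hypothesized probabilities for goodness-of-fit, products of margins for independence.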
Summary of the Use of χ² test
• Test of goodness-of-fit
  Determine whether or not a sample of observed values of some random variable is compatible with the hypothesis that the sample was drawn from a population with a specified distributional form (e.g., specified probabilities of certain events)

Summary of the Use of χ² test
• Test of independence
  Test the null hypothesis that two criteria of classification (variables) are independent

Summary of the Use of χ² test
• Test of homogeneity
  Test the null hypothesis that the samples are drawn from populations that are homogeneous with respect to some factor (i.e., no association between group and factor)

Summary of the Use of χ² test
• Could consider this test as answering: “Are the Row factor and Column factor associated?”

Categorical Data Analysis
• Ideas of multinomial and chi-squared test generalize to testing RxC association and RxCxK association
• Example:
  – 2 exposure status, 2 disease status, 3 sites
  – 2x2x3 association analysis

Test of General Association (R x C Table)
• Consider a study designed to test whether there exists an association between political party affiliation and residency within specific counties

                          County
Party          Buncombe   Transylvania   Halifax
Democrat          221          160          360
Independent       200          291          160
Republican        208          106          316

• Notation for general RxC table

            Response Variable Categories
Group        1      2      …      c      Total
1           n11    n12     …     n1c     n1+
2           n21    n22     …     n2c     n2+
…            …      …      …      …       …
r           nr1    nr2     …     nrc     nr+
Total       n+1    n+2     …     n+c     N

Test of General Association
• H0: There is no association between rows and columns
  H1: There exists a dependence between rows and columns
• Under H0, the expected cell counts are the product of the corresponding marginal probabilities and the sample size.
$$\text{Expected}_{ij} = N \cdot \frac{n_{i+}}{N} \cdot \frac{n_{+j}}{N} = \frac{\text{Total}_{ROW} \times \text{Total}_{COL}}{\text{Total}_{OVERALL}}$$

• The classic Pearson’s chi-squared test of independence

$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(\text{Observed}_{ij} - \text{Expected}_{ij})^2}{\text{Expected}_{ij}} \;\overset{\text{dist}}{\sim}\; \chi^2_{(r-1)(c-1)}$$

SAS Enterprise: chisq.sas7bdat

Mantel-Haenszel test
• Often, there are other factors in a RxC test
• Mantel-Haenszel test (or Cochran-Mantel-Haenszel, CMH) can be used for controlling for “nuisance” factors
• Typically used for r x c x 2 table
  – e.g., 2x2x2 cross classification
  – e.g., Association between disease status and exposure controlling for age group (strata)

Stratified Analysis
• Examples of commonly used strata
  • Age group
  • Gender
  • Study site (hospital, country)
  • Ethnic group

Stratified Analysis
• Myocardial infarction (MI) and anticoagulant (AC) use by Coronary Care Unit (CCU)
  Stratum 1 (CCU+): AC use (No, Yes) by MI status (MI, No MI); cell counts 43, 56, 20, 90; Total 209
  Stratum 2 (CCU−): AC use (No, Yes) by MI status (MI, No MI); cell counts 137, 32, 437, 341; Total 947

Stratified Analysis
• Idea: test for an association while controlling for CCU effects
• Denote the count from the first cell within the hth subtable as $n_{h11}$
• Construct the CMH test of association controlling for CCU

Stratified Analysis
• Test assumes the direction of effect within each table is the same
• The Cochran-Mantel-Haenszel approach partially removes the confounding influences of the explanatory variable (e.g., CCU)
• May improve power

Mantel-Haenszel Test
• The expected value of $n_{h11}$ for h = 1,2,…,g is

$$E(n_{h11}) = m_{h11} = \frac{n_{h1+}\, n_{h+1}}{n_h}$$

and the variance of $n_{h11}$ is

$$\operatorname{Var}(n_{h11}) = \frac{n_{h1+}\, n_{h2+}\, n_{h+1}\, n_{h+2}}{n_h^2\,(n_h - 1)}$$

This leads to the Cochran-Mantel-Haenszel test

$$\frac{\left(\sum_{h=1}^{g} n_{h11} - \sum_{h=1}^{g} m_{h11}\right)^2}{\sum_{h=1}^{g} \operatorname{Var}(n_{h11})} \;\overset{\text{dist}}{\sim}\; \chi^2_{1}$$

Direction of effects across Strata
• Note that if directions of conditional ORs are not the same, discrepancies between observed and expected from different strata may cancel out one another
• This leads to poor power and a biased result

MH “Pooled” Odds Ratio

$$OR_{MH} = \frac{\dfrac{n_{111} n_{122}}{n_1} + \dfrac{n_{211} n_{222}}{n_2} + \cdots + \dfrac{n_{g11} n_{g22}}{n_g}}{\dfrac{n_{121} n_{112}}{n_1} + \dfrac{n_{221} n_{212}}{n_2} + \cdots + \dfrac{n_{g21} n_{g12}}{n_g}} = \frac{\sum_{h=1}^{g} n_{h11} n_{h22} / n_h}{\sum_{h=1}^{g} n_{h21} n_{h12} / n_h}$$

MH test decision list
• Z = strata of potential confounder
  -> If $OR_c \approx (OR_{Z=1} \approx OR_{Z=2} \approx \ldots)$: Z is not a confounder, report crude OR ($OR_c$)
  -> If $OR_c \ne (OR_{Z=1} \approx OR_{Z=2} \approx \ldots)$: Z is a confounder, report adjusted OR ($OR_{MH}$)
  -> If $OR_{Z=1} \ne OR_{Z=2} \ne \ldots$: Z is an effect modifier, report strata specific OR’s (don’t adjust!)

Breslow Day test
• (More formal approach) Can also test for homogeneity of odds ratio across strata
• If Breslow Day test is significant => odds ratios within strata are not homogeneous. Thus,
  => $OR_{MH}$ would be inappropriate!

SAS Enterprise: cmh.sas7bdat

Results from cmh.sas7bdat
OR_crude   = 3.76 (2.01, 7.05)
OR_center1 = 4.01 (1.67, 9.66)
OR_center2 = 4.05 (1.55, 10.60)
OR_MH      = 4.03 (2.11, 7.71)
Breslow-Day p-value = 0.99
MH Chi-square = 18.41, p-value < 0.0001

Take home messages
• Multinomial and the Chi-square test are the “workhorse” for testing of goodness-of-fit
• Idea is to compare expected counts (calculated from a pre-determined set of probabilities) and the observed counts
• The same idea can be applied to testing statistical assumptions such as no association
• CMH test is for testing association when a confounding effect (strata) may be present

For Next Class 6/5
• HW #1 key posted
• HW #2 will be due
• Read Kleinbaum Ch. 1,2
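
For review, the CMH statistic and the Mantel-Haenszel pooled odds ratio from this lecture can be sketched in Python. This is a minimal illustration, not the SAS procedure used in class. The function names and the two strata below are hypothetical and are not taken from cmh.sas7bdat or the CCU example. The confidence interval uses the log-odds-ratio formula given earlier in the lecture.

```python
from math import exp, log, sqrt


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio ad/bc with CI exp(M +/- z*S), S = sqrt(1/a + 1/b + 1/c + 1/d)."""
    m = log(a * d / (b * c))
    s = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return exp(m), exp(m - z * s), exp(m + z * s)


def cmh(strata):
    """CMH chi-square (1 df, no continuity correction) and pooled OR_MH.

    strata: list of 2x2 tables written as ((n11, n12), (n21, n22)).
    """
    observed = expected = variance = 0.0
    or_num = or_den = 0.0
    for (n11, n12), (n21, n22) in strata:
        n = n11 + n12 + n21 + n22
        row1, row2 = n11 + n12, n21 + n22
        col1, col2 = n11 + n21, n12 + n22
        observed += n11
        expected += row1 * col1 / n                             # m_h11
        variance += row1 * row2 * col1 * col2 / (n ** 2 * (n - 1))
        or_num += n11 * n22 / n
        or_den += n21 * n12 / n
    statistic = (observed - expected) ** 2 / variance
    return statistic, or_num / or_den


# Hypothetical counts for two strata (illustration only)
strata = [
    ((10, 20), (5, 40)),
    ((15, 10), (10, 25)),
]
q_cmh, or_mh = cmh(strata)
print(f"CMH chi-square = {q_cmh:.2f}, OR_MH = {or_mh:.2f}")

for h, ((a, b), (c, d)) in enumerate(strata, start=1):
    or_h, lower, upper = odds_ratio_ci(a, b, c, d)
    print(f"stratum {h}: OR = {or_h:.2f} ({lower:.2f}, {upper:.2f})")
```

Comparing the stratum-specific odds ratios with `or_mh` mirrors the decision list above: similar stratum ORs that differ from the crude OR point to confounding, while clearly different stratum ORs point to effect modification.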