Introduction to Categorical Data Analysis
July 22, 2004

Categorical data
The t-test, ANOVA, and linear regression all assumed outcome variables that were continuous (normally distributed). Even their non-parametric equivalents assumed at least many levels of the outcome (discrete quantitative or ordinal). We haven't yet discussed the case where the outcome variable is categorical.

Types of Variables: a taxonomy
• Categorical
  - binary: 2 categories
  - nominal: more categories
  - ordinal: order matters
• Quantitative
  - discrete: numerical
  - continuous: uninterrupted

Overview of statistical tests
Independent variable = predictor; dependent variable = outcome.
E.g., BMD = f(pounds, age, amenorrheic (1/0)): BMD is the continuous outcome; pounds and age are continuous predictors; amenorrheic (1/0) is a binary predictor.

Types of variables to be analyzed

Predictor (independent) variable(s)   Outcome (dependent) variable   Statistical procedure or measure of association
Categorical                           Continuous                     ANOVA
Dichotomous                           Continuous                     T-test
Continuous                            Continuous                     Simple linear regression
Multivariate                          Continuous                     Multiple linear regression
Categorical                           Categorical                    Chi-square test
Dichotomous                           Dichotomous                    Odds ratio, Mantel-Haenszel OR, relative risk, difference in proportions
Multivariate                          Dichotomous                    Logistic regression
Categorical                           Time-to-event                  Kaplan-Meier curve / logrank test
Multivariate                          Time-to-event                  Cox proportional hazards model

(The continuous-outcome rows are done; the chi-square test through logistic regression are the topic of today and next week; the time-to-event methods are the last part of the course.)

Difference in proportions
Example: You poll 50 people from random districts in Florida as they exit the polls on election day 2004. You also poll 50 people from random districts in Massachusetts. 49% of pollees in Florida say that they voted for Kerry, and 53% of pollees in Massachusetts say they voted for Kerry. Is there enough evidence to reject the null hypothesis that the states voted for Kerry in equal proportions?

Null distribution of a difference in proportions
Standard error of a proportion = $\sqrt{\frac{p(1-p)}{n}}$, which can be estimated by $\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$ (still normally distributed).
Standard error of the difference of two proportions:
$$\sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1}+\frac{\hat{p}_2(1-\hat{p}_2)}{n_2}} \quad \text{or} \quad \sqrt{\frac{p(1-p)}{n_1}+\frac{p(1-p)}{n_2}}, \ \text{where } p=\frac{n_1\hat{p}_1+n_2\hat{p}_2}{n_1+n_2}$$
The variance of a difference is the sum of the variances (as with a difference in means). The pooled estimate $p$ is analogous to the pooled variance in the t-test.

The difference of proportions is distributed
$$\hat{p}_1-\hat{p}_2 \sim N\left(p_1-p_2,\ \sqrt{\frac{p(1-p)}{n_1}+\frac{p(1-p)}{n_2}}\right)$$
For our example, the null distribution is
$$\hat{p}_1-\hat{p}_2 \sim N\left(0,\ \sqrt{\frac{2\times .51(1-.51)}{50}}=.10\right)$$

Answer to Example
We saw a difference of 4% between Florida and Massachusetts. The null distribution predicts chance variation between the two states of 10%. P(our data | null distribution) = P(Z > .04/.10 = .4) ≈ .34 > .05. Not enough evidence to reject the null.
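A minimal SAS sketch of this two-proportion z-test (the dataset and variable names are ours, not from the original lecture):

    * Two-sample z-test for the Florida vs. Massachusetts poll;
    data poll_ztest;
      p1 = 0.49; n1 = 50;                    * Florida;
      p2 = 0.53; n2 = 50;                    * Massachusetts;
      p  = (n1*p1 + n2*p2) / (n1 + n2);      * pooled proportion = .51;
      se = sqrt(p*(1-p)/n1 + p*(1-p)/n2);    * null standard error = .10;
      z  = (p1 - p2) / se;                   * = -.4;
      pvalue = 2*(1 - probnorm(abs(z)));     * two-sided p-value = .69;
      put z= pvalue=;
    run;

The two-sided p-value of about .69 matches the conclusion above: no evidence against equal proportions.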
Chi-square test for comparing proportions (of a categorical variable) between groups

Chi-Square Test of Independence
When both your predictor and outcome variables are categorical, they may be cross-classified in a contingency table and compared using a chi-square test of independence. A contingency table with R rows and C columns is an R x C contingency table.

Example
Asch, S.E. (1955). Opinions and social pressure. Scientific American, 193, 31-35.

The Experiment
A subject volunteers to participate in a "visual perception study." Everyone else in the room is actually a conspirator in the study (unbeknownst to the subject). The "experimenter" reveals a pair of cards: one showing a standard line, the other showing comparison lines A, B, and C. Everyone goes around the room and says which comparison line (A, B, or C) matches the standard; the true subject always answers last, after hearing all the others' answers. The first few times, the 7 "conspirators" give the correct answer. Then they start purposely giving the (obviously) wrong answer. 75% of subjects tested went along with the group's consensus at least once.

Further Results
In a further experiment, group size (the number of conspirators) was varied from 2 to 10. Does the group size alter the proportion of subjects who conform?

The Chi-Square test

              Number of group members
Conformed?     2     4     6     8    10
Yes           20    50    75    60    30
No            80    50    25    40    70

Apparently, conformity is less likely with fewer or with more group members.

20 + 50 + 75 + 60 + 30 = 235 conformed out of 500 experiments, so the overall likelihood of conforming = 235/500 = .47. Expected frequencies if there is no association between group size and conformity:

              Number of group members
Conformed?     2     4     6     8    10
Yes           47    47    47    47    47
No            53    53    53    53    53

Do observed and expected differ more than expected due to chance?

Chi-Square test
$$\chi^2 = \sum \frac{(\text{observed}-\text{expected})^2}{\text{expected}}$$
$$\chi^2_4 = \frac{(20-47)^2}{47}+\frac{(50-47)^2}{47}+\frac{(75-47)^2}{47}+\frac{(60-47)^2}{47}+\frac{(30-47)^2}{47}+\frac{(80-53)^2}{53}+\frac{(50-53)^2}{53}+\frac{(25-53)^2}{53}+\frac{(40-53)^2}{53}+\frac{(70-53)^2}{53}\approx 79.5$$
Degrees of freedom = (rows − 1)(columns − 1) = (2 − 1)(5 − 1) = 4.
Rule of thumb: a chi-square statistic much greater than its degrees of freedom indicates statistical significance. Here 79.5 >> 4.

The Chi-Square distribution
The chi-square statistic is a sum of squared standard normal deviates:
$$\chi^2_{df}=\sum_{i=1}^{df} Z_i^2, \quad \text{where } Z \sim N(0,1)$$
The expected value and variance of a chi-square random variable: E(X) = df; Var(X) = 2(df).
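As a cross-check, the conformity table can be run through PROC FREQ, the same procedure used for the tea-tasting example below (a minimal sketch; the dataset and variable names are ours):

    * Asch conformity data: 2x5 table of Conformed by GroupSize;
    data asch;
      input Conformed $ GroupSize Count;
      datalines;
    Yes 2 20
    Yes 4 50
    Yes 6 75
    Yes 8 60
    Yes 10 30
    No 2 80
    No 4 50
    No 6 25
    No 8 40
    No 10 70
    ;
    run;

    proc freq data=asch;
      tables Conformed*GroupSize / chisq expected;  * Pearson chi-square + expected counts;
      weight Count;                                 * Count holds the cell frequencies;
    run;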
Caveat
**When the expected count in any cell is very small (<5), Fisher's exact test is used as an alternative to the chi-square test.

Example of Fisher's Exact Test
Fisher's "tea-tasting experiment"
Claim: Fisher's colleague (call her "Cathy") claimed that, when drinking tea, she could distinguish whether milk or tea was added to the cup first. To test her claim, Fisher designed an experiment in which she tasted 8 cups of tea (4 cups had milk poured first, 4 had tea poured first).
Null hypothesis: Cathy's guessing abilities are no better than chance.
Alternative hypotheses:
• Right-tail: She guesses right more than expected by chance.
• Left-tail: She guesses wrong more than expected by chance.

Experimental Results:

                 Guess poured first
Poured first      Milk    Tea
Milk                3      1      4
Tea                 1      3      4

Fisher's Exact Test
Step 1: Identify tables that are as extreme or more extreme than what actually happened. Here she identified 3 out of 4 of the milk-poured-first teas correctly. Is that good luck or real talent? The only way she could have done better is if she had identified 4 of 4 correctly:

                 Guess poured first                    Guess poured first
Poured first      Milk    Tea          Poured first     Milk    Tea
Milk                3      1      4    Milk               4      0      4
Tea                 1      3      4    Tea                0      4      4

Step 2: Calculate the probability of the tables (assuming fixed marginals):
$$P(3)=\frac{\binom{4}{3}\binom{4}{1}}{\binom{8}{4}}=.229 \qquad P(4)=\frac{\binom{4}{4}\binom{4}{0}}{\binom{8}{4}}=.014$$

Step 3: To get the left-tail and right-tail p-values, consider the probability mass function of X, the number of correct identifications of the cups with milk poured first:
$$P(0)=\frac{\binom{4}{0}\binom{4}{4}}{\binom{8}{4}}=.014 \quad P(1)=\frac{\binom{4}{1}\binom{4}{3}}{\binom{8}{4}}=.229 \quad P(2)=\frac{\binom{4}{2}\binom{4}{2}}{\binom{8}{4}}=.514 \quad P(3)=.229 \quad P(4)=.014$$
"Right-hand tail probability": p = P(3) + P(4) = .243
"Left-hand tail probability" (testing the null hypothesis that she's systematically wrong): p = P(0) + P(1) + P(2) + P(3) = .986

SAS code and output for generating Fisher's Exact statistics for the 2x2 table:

    data tea;
      input MilkFirst GuessedMilk Freq;
      datalines;
    1 1 3
    1 0 1
    0 1 1
    0 0 3
    ;
    run;

    data tea;  *Fix quirky reversal of SAS 2x2 tables;
      set tea;
      MilkFirst = 1 - MilkFirst;
      GuessedMilk = 1 - GuessedMilk;
    run;

    proc freq data=tea;
      tables MilkFirst*GuessedMilk / exact;
      weight Freq;
    run;

SAS output:

    Statistics for Table of MilkFirst by GuessedMilk

    Statistic                      DF     Value     Prob
    ----------------------------------------------------
    Chi-Square                      1    2.0000   0.1573
    Likelihood Ratio Chi-Square     1    2.0930   0.1480
    Continuity Adj. Chi-Square      1    0.5000   0.4795
    Mantel-Haenszel Chi-Square      1    1.7500   0.1859
    Phi Coefficient                      0.5000
    Contingency Coefficient              0.4472
    Cramer's V                           0.5000

    WARNING: 100% of the cells have expected counts less
    than 5. Chi-Square may not be a valid test.

    Fisher's Exact Test
    ----------------------------------
    Cell (1,1) Frequency (F)         3
    Left-sided Pr <= F          0.9857
    Right-sided Pr >= F         0.2429
    Table Probability (P)       0.2286
    Two-sided Pr <= P           0.4857

    Sample Size = 8

Introduction to the 2x2 Table

                   Exposure (E)    No Exposure (~E)
Disease (D)             a                 b           a+b = P(D)
No Disease (~D)         c                 d           c+d = P(~D)
                   a+c = P(E)      b+d = P(~E)

The column totals a+c = P(E) and b+d = P(~E) are the marginal probabilities of exposure; the row totals a+b = P(D) and c+d = P(~D) are the marginal probabilities of disease.

Cohort Studies
From the target population, a disease-free cohort is assembled and followed over TIME: the exposed either develop disease or stay disease-free, and the not exposed likewise either develop disease or stay disease-free.

The Risk Ratio, or Relative Risk (RR)
$$RR=\frac{P(D|E)}{P(D|\sim E)}=\frac{a/(a+c)}{b/(b+d)}=\frac{\text{risk to the exposed}}{\text{risk to the unexposed}}$$

Hypothetical Data

                           High Systolic BP    Normal BP
Congestive Heart Failure         400              400
No CHF                          1100             2600
                                1500             3000

$$RR=\frac{400/1500}{400/3000}=2.0$$
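The RR arithmetic can be reproduced in a short SAS data step (a sketch; the dataset and variable names are ours):

    * Risk ratio for the CHF / systolic BP table above;
    data rr_chf;
      a = 400;  b = 400;                   * CHF: high BP, normal BP;
      c = 1100; d = 2600;                  * no CHF: high BP, normal BP;
      risk_exposed   = a/(a + c);          * 400/1500 = .267;
      risk_unexposed = b/(b + d);          * 400/3000 = .133;
      rr = risk_exposed/risk_unexposed;    * = 2.0;
      put rr=;
    run;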
Case-Control Studies
Sample on disease status and ask retrospectively about exposures (for rare diseases). The marginal probabilities of exposure for cases and controls are valid.
• Doesn't require knowledge of the absolute risks of disease
• For rare diseases, can approximate the relative risk

From the target population, cases (disease) and controls (no disease) are sampled, and each group is classified as exposed in the past or not exposed.

The Odds Ratio (OR)

                   Exposure (E)    No Exposure (~E)
Disease (D)        a = P(D&E)      b = P(D&~E)
No Disease (~D)    c = P(~D&E)     d = P(~D&~E)

$$OR=\frac{a/c}{b/d}=\frac{ad}{bc}$$

The Odds Ratio
$$OR=\frac{P(E|D)/P(\sim E|D)}{P(E|\sim D)/P(\sim E|\sim D)}=\frac{P(D\,\&\,E)\,P(\sim D\,\&\sim E)}{P(D\,\&\sim E)\,P(\sim D\,\&\,E)}=\frac{P(D|E)/P(\sim D|E)}{P(D|\sim E)/P(\sim D|\sim E)}$$
(via Bayes' rule: each conditional probability is the corresponding joint probability divided by a marginal, and the marginals cancel). When the disease is rare, $P(\sim D|E)\approx P(\sim D|\sim E)\approx 1$, so
$$OR\approx\frac{P(D|E)}{P(D|\sim E)}=RR$$
"The Rare Disease Assumption."

Properties of the OR (simulation)
[Figure: histogram of simulated odds ratios; x-axis: Simulated Odds Ratio, y-axis: Percent.]

Properties of the lnOR
[Figure: histogram of simulated lnOR values; x-axis: lnOR, y-axis: Percent.]
$$\text{Standard deviation of the lnOR}=\sqrt{\frac{1}{a}+\frac{1}{b}+\frac{1}{c}+\frac{1}{d}}$$

Hypothetical Data

                 Smoker    Non-smoker
Lung Cancer        20          10         30
No lung cancer      6          24         30

$$OR=\frac{(20)(24)}{(6)(10)}=8.0$$
$$95\%\ \text{CI}:\ \left((8.0)\,e^{-1.96\sqrt{\frac{1}{20}+\frac{1}{6}+\frac{1}{10}+\frac{1}{24}}},\ (8.0)\,e^{+1.96\sqrt{\frac{1}{20}+\frac{1}{6}+\frac{1}{10}+\frac{1}{24}}}\right)=(2.47,\ 25.8)$$
Note that the size of the smallest cell of the 2x2 table determines the magnitude of the variance.

Example: Cell phones and brain tumors (cross-sectional data)

                          Brain tumor    No brain tumor
Own a cell phone               5              347           352
Don't own a cell phone         3               88            91
                               8              435           453

Difference in proportions:
$$\hat{p}_{tumor/cellphone}=\frac{5}{352}=.014; \qquad \hat{p}_{tumor/nophone}=\frac{3}{91}=.033; \qquad \hat{p}=\frac{8}{453}=.018$$
$$Z=\frac{(\hat{p}_1-\hat{p}_2)-0}{\sqrt{\frac{\hat{p}(1-\hat{p})}{n_1}+\frac{\hat{p}(1-\hat{p})}{n_2}}}=\frac{.014-.033}{\sqrt{\frac{(.018)(.982)}{352}+\frac{(.018)(.982)}{91}}}=\frac{-.019}{.0156}=-1.22$$

Same data, but use the chi-square test (or Fisher's exact):
$$\hat{p}_{tumor}=\frac{8}{453}=.018; \qquad \hat{p}_{cellphone}=\frac{352}{453}=.777$$
$$\hat{p}_{tumor}\times\hat{p}_{cellphone}=.018\times .777=.014$$
Expected count in cell a = .014 × 453 = 6.3; 1.7 in cell c; 345.7 in cell b; 89.3 in cell d.
df = (R − 1)(C − 1) = 1 × 1 = 1
$$\chi^2_1=\frac{(5-6.3)^2}{6.3}+\frac{(3-1.7)^2}{1.7}+\frac{(347-345.7)^2}{345.7}+\frac{(88-89.3)^2}{89.3}\approx 1.48 \quad \text{(NS)}$$
Note: $Z^2=(-1.22)^2\approx 1.48$.

Same data, but use the Odds Ratio:
$$OR=\frac{5\times 88}{3\times 347}=.423$$
$$Z=\frac{\ln(.423)-0}{\sqrt{\frac{1}{5}+\frac{1}{347}+\frac{1}{3}+\frac{1}{88}}}=\frac{-.86}{.74}=-1.16; \qquad p>.05$$
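A minimal SAS sketch tying these OR pieces together for the cell-phone table: the point estimate, the lnOR-based z-test, and the 95% confidence interval (the dataset and variable names are ours):

    * OR, z-test on the log scale, and 95% CI for the cell phone / brain tumor table;
    data or_test;
      a = 5; b = 3; c = 347; d = 88;             * a,b = tumor; c,d = no tumor;
      oddsratio = (a*d)/(b*c);                   * = .423;
      se_lnor = sqrt(1/a + 1/b + 1/c + 1/d);     * = .74;
      z = log(oddsratio)/se_lnor;                * = -1.16 (log is natural log in SAS);
      pvalue = 2*(1 - probnorm(abs(z)));
      lower = exp(log(oddsratio) - 1.96*se_lnor);
      upper = exp(log(oddsratio) + 1.96*se_lnor);
      put oddsratio= z= pvalue= lower= upper=;
    run;

The interval (roughly .10 to 1.80) includes 1, consistent with the non-significant z-test above.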