Categorical Data Analysis
PGRM 14
Statistics in Science

What is categorical data?
The measurement scale for the response consists of a number of categories.

Variable       Measurement scale
Farm system    Dairy, Beef, Tillage etc.
Mortality      Dead, Alive
Food texture   Very soft, Soft, Hard, Very hard
Litter size    0, 1, 2, 3 and >3

Data analysis considered:
• Response variable(s) is categorical
• Explanatory variable(s) may be categorical or continuous

Example: Does post-operative survival (categorical response) depend on the explanatory variables?
  Sex (categorical)
  Age (continuous)

Example: In a random sample of Irish farmers, is there a relationship between attitudes to the EU and farm system?
  Farm system (categorical)
  Attitude to EU (categorical/ordinal)
(Two response variables, no explanatory variables.) Could one of these be regarded as explanatory?

Measurement scales for categorical data

Nominal - no underlying order
Variable       Measurement scale
Farm system    Dairy, Beef, Tillage etc.
Weed species   Stellaria media, Poa annua, etc.

Ordinal - underlying order in the scale
Variable            Measurement scale
Food texture        Very soft, Soft, Hard, Very hard
Disease diagnosis   Very likely, Likely, Unlikely
Education           Primary, Secondary, Tertiary

Interval - underlying numerical distance between scale points
Variable      Measurement scale
Litter size   0, 1, 2, 3 and >3
Age class     <1, 1-2, 2-3.5, 3.5-5, >5
Education     years in education

Tables reporting categorical data: 1-, 2- & 3-way

Tables reporting count data: single level
Example: A geneticist carries out a crossing experiment between F1 hybrids of a wild type and a mutant genotype and obtains an F2 progeny of 90 offspring with the following characteristics.

Wild type   Mutant   Total
80          10       90

Evidence that the wild type is dominant, giving on average 8:1 offspring phenotype in its favour?
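A minimal sketch of the arithmetic behind that question, written in Python (an assumption of this illustration; the course itself works in Excel and SAS). It only splits the 90 offspring in the hypothesised 8:1 ratio and sets the result beside the observed counts; the variable names are illustrative.

```python
# Expected F2 counts under the hypothesised 8:1 wild type : mutant ratio.
observed = {"wild type": 80, "mutant": 10}
ratio = {"wild type": 8, "mutant": 1}      # ratio quoted in the example

n = sum(observed.values())                 # 90 offspring in total
parts = sum(ratio.values())                # 9 ratio parts
expected = {k: n * ratio[k] / parts for k in observed}

for k in observed:
    print(f"{k:10s} observed {observed[k]:3d}   expected {expected[k]:5.1f}")
```

Here the observed counts coincide with the counts expected under 8:1, so this sample gives no evidence against that ratio.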
Tables for count data: two-way
Example: A sample of 124 mice was divided into two groups, 84 receiving a standard dose of pathogenic bacteria followed by an antiserum and a control group of 40 not receiving the antiserum. After 3 weeks the numbers dead and alive in each group were counted.

Outcome   antiserum   control   Total
Dead      19          18        37
Alive     65          22        87
Total     84          40        124
% dead    23          45

Association between mortality and treatment?

Tables for count data: two-way
Example (Snedecor & Cochran): The table below shows the number of aphids alive and dead after spraying with four concentrations of solutions of sodium oleate.

Concentration of
sodium oleate (%)   Dead   Alive   Total   % Dead
0.65                55     22      77      71.4
1.10                62     13      75      82.7
1.6                 100    12      112     89.3
2.1                 72     5       77      93.5
Total               289    52      341     84.8

• Has the higher concentration given a significantly different percentage kill?
• Is there a relationship between concentration and mortality?

Is this the relationship? [Figure not reproduced.]
Note: categorical response; interval categorical explanatory variable?

Tables for count data: two-way
Example (Cornfield 1962): Blood pressure (BP) was measured on a sample of males aged 40-59, who were also classified by whether they developed coronary heart disease (CHD) in a 6-year follow-up period.
BP: interval categorical variable in 8 classes
CHD: CHD or No-CHD

BP          CHD   No CHD   Total   % CHD
<117        3     153      156     1.9
117 - 126   17    235      252     6.7
127 - 136   12    272      284     4.2
137 - 146   16    255      271     5.9
147 - 156   12    127      139     8.6
157 - 166   8     77       85      9.4
167 - 186   16    83       99      16.2
>186        8     35       43      18.6
Total       92    1237     1329

1. Is the incidence of CHD independent of BP?
2. Is there a simple relationship between the probability of CHD and the level of BP?
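As a first look at those two questions, the % CHD column can be recomputed from the counts. The sketch below is in Python (an assumption of this illustration, not part of the course material) and uses only the counts in the Cornfield table; the names `rows` and `pct_chd` are illustrative.

```python
# (BP class, CHD cases, no-CHD cases) taken from the Cornfield (1962) table.
rows = [
    ("<117", 3, 153), ("117-126", 17, 235), ("127-136", 12, 272),
    ("137-146", 16, 255), ("147-156", 12, 127), ("157-166", 8, 77),
    ("167-186", 16, 83), (">186", 8, 35),
]

pct_chd = {}
for bp, chd, no_chd in rows:
    total = chd + no_chd
    pct_chd[bp] = round(100 * chd / total, 1)   # observed % developing CHD
    print(f"{bp:>8s}  {chd:3d}/{total:<4d} = {pct_chd[bp]:4.1f}% CHD")

total_chd = sum(chd for _, chd, _ in rows)
total_n = sum(chd + no_chd for _, chd, no_chd in rows)
overall = round(100 * total_chd / total_n, 1)
print(f" overall  {total_chd}/{total_n} = {overall}% CHD")
```

The percentages rise from under 2% in the lowest BP class to over 18% in the highest, which is the pattern the two questions ask about.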
CHD v BP relationship
[Figure not reproduced.]

3-way table
Example: Grouped binomial (response has 2 categories) data - patterns of psychotropic drug consumption in a sample from West London (Murray et al 1981, Psy Med 11, 551-60)

Sex   Age group   Psych. case   On drugs   Total
M     1           No            9          531
M     2           No            16         500
M     3           No            38         644
M     4           No            26         275
M     5           No            9          90
M     1           Yes           12         171
M     2           Yes           16         125
M     3           Yes           31         121
M     4           Yes           16         56
M     5           Yes           10         26
F     1           No            12         588
F     2           No            42         596
F     3           No            96         765
F     4           No            52         327
F     5           No            30         179
F     1           Yes           33         210
F     2           Yes           47         189
F     3           Yes           71         242
F     4           Yes           45         98
F     5           Yes           21         60

Non-tabulated data
Example: Individual Legousia plants were monitored in an experiment to see whether they survived after 3 months. Survived = yes is scored 1; survived = no is scored 0.
Also recorded were:
  CO2 treatment - 2 levels, low and high
  Density of Legousia
  Density of companion species
  Height of the plant (mm) two weeks after planting
Most individuals will have a unique profile in these three additional variables and so tabulation of the data by them is not feasible. The individual data are presented instead.

Non-tabulated data
                              Density
Subject   Surv   CO2   Ht   Leg.   Comp
1         0      L     35   20     30
2         1      L     68   22     27
3         1      H     43   16     33
4         0      L     27   4      16
…         …      …     …    …      …
(Surv is the response.)

1. Is survival related to the explanatory variables: CO2, height, density-self, density-companions?
2. Can the probability of survival be predicted from the subject's profile?

Fixed and non-fixed margins
• One margin fixed: Samples of fixed size are selected for one or more categories and individuals are classified by the other category(s).
• No margin fixed: Individuals in a single sample are simultaneously classified by several categorical variables.
The difference between these depends on the experimental design and how it specified that the data should be collected. The method of analysis is the same.
Asking the right question
• Data summarized by counts
• Questions usually relate to %s (equivalently proportions)

Hypotheses for Categorical Data
• Categorical data is summarised by counting individuals falling into the various combinations of categories
• Hypotheses relate to the probability of an individual being in a particular category
• These probabilities are estimated by the observed proportions in the data
• Using a sample proportion, p, from a sample of size n, to estimate a population proportion, the standard error is √(p(1 - p)/n)
  eg with p = 0.5, n = 1100, 2×SE = 0.03: the often mentioned 3% margin of error

Example

Outcome   antiserum   control   Total
Dead      19          18        37
Alive     65          22        87
Total     84          40        124
% dead    23          45

Does % dead depend on antiserum? Equivalently:
1. Is there an association between mortality and antiserum?
2. Is mortality independent of antiserum?

• As usual we set up a null hypothesis and measure the extent to which the data conflicts with this
• Here H0: prob of death for antiserum = prob of death for control
• Equivalently H0:
  - no association between mortality and antiserum
  - mortality and antiserum are independent

Expected counts when H0 is true:
The overall % dead (37/124) would apply to both antiserum and control.
For the 84 antiserum this would give (84×37)/124 dead and (84×87)/124 alive.
For the 40 control this would give (40×37)/124 dead and (40×87)/124 alive.
E = (row total)(column total)/(table total)

Observed and expected counts

Observed
Outcome   antiserum   control   Total
Dead      19          18        37
Alive     65          22        87
Total     84          40        124
% dead    23          45

Expected
Outcome   antiserum   control   Total
Dead      25.1        11.9      37
Alive     58.9        28.1      87
Total     84          40        124
% dead    29.9        29.8

Note: the two expected % dead values (29.9 and 29.8) differ only through rounding of the expected counts.
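The rule E = (row total)(column total)/(table total) can be sketched in a few lines of Python (an assumption of this illustration; the course exercises use Excel and SAS). The sketch also sums (O - E)²/E over the cells, which is the X2 statistic these notes define next, and turns it into a p-value using the standard identity for 1 degree of freedom, P(X2 > x) = erfc(√(x/2)). The function names are illustrative.

```python
import math

def expected_counts(table):
    """E = (row total)(column total)/(table total) for every cell."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    return [[rt * ct / n for ct in col_tot] for rt in row_tot]

def chi_squared(table):
    """Sum of (O - E)^2 / E over all cells of an r x c table."""
    exp = expected_counts(table)
    return sum((o - e) ** 2 / e
               for o_row, e_row in zip(table, exp)
               for o, e in zip(o_row, e_row))

# Rows: Dead, Alive. Columns: antiserum, control.
observed = [[19, 18], [65, 22]]
expected = expected_counts(observed)
x2 = chi_squared(observed)

# This closed form for the upper tail holds for 1 degree of freedom only.
p_value = math.erfc(math.sqrt(x2 / 2))

for label, row in zip(("Dead", "Alive"), expected):
    print(label, [round(e, 3) for e in row])
print(f"X2 = {x2:.4f}, p = {p_value:.4f}")
```

The same two functions work for larger tables such as the 2×4 aphid table, but the p-value line would then need the chi-squared tail for 3 degrees of freedom.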
Chi-squared statistic: X²
• X² measures the difference between observed counts, O, and expected (when H0 holds) counts, E
• If LARGE, it provides evidence against H0, ie evidence for an association (dependence) of mortality on antiserum
• X² = ∑(O - E)²/E
• Here SAS/FREQ gives: X² = 6.48, p = Prob(X² > 6.48 when H0 is true) = 0.0109
• Conclusion: there is evidence (p < 0.05) that mortality depends on antiserum

Practical Exercise
Use Excel to calculate X² and p
Lab Session 5 exercise 5.1 (a)

SAS/FREQ OUTPUT
Description of cell contents: O = Frequency, E = Expected; X² = ∑(O - E)²/E
Row percents make most sense here (% alive/dead in each antiserum group)

Table of antiserum by dead
(cell contents: Frequency, Expected, Row Pct)

antiserum    dead = 0   dead = 1   Total
antiserum    65         19         84
             58.935     25.065
             77.38      22.62
control      22         18         40
             28.065     11.935
             55.00      45.00
Total        87         37         124

SAS/FREQ OUTPUT
DF = (r - 1)×(c - 1); X² = ∑(O - E)²/E

Statistic                      DF   Value    Prob
Chi-Square                     1    6.4833   0.0109
Likelihood Ratio Chi-Square    1    6.2846   0.0122
Continuity Adj. Chi-Square     1    5.4583   0.0195
Mantel-Haenszel Chi-Square     1    6.4310   0.0112
Phi Coefficient                     0.2287
Contingency Coefficient             0.2229
Cramer's V                          0.2287

Ignore! (annotation pointing at part of this output)

[Figure: chi-squared distribution with 1 df, marking X² = 6.48 (p ≈ 0.01) and tail areas labelled 0.05 and 0.001; 68% of values are < 1 (not shown).]

Aphid example (SAS/FREQ OUTPUT)
Table of status by conc
(cell contents: Frequency, Expected, Cell Chi-Square, Col Pct)

status      conc (Sodium oleate concentration (%))
(Outcome)   0.65     1.1      1.6      2.1      Total
Alive       22       13       12       5        52
            11.742   11.437   17.079   11.742
            8.9617   0.2136   1.5105   3.8711
            28.57    17.33    10.71    6.49
Dead        55       62       100      72       289
            65.258   63.563   94.921   65.258
            1.6125   0.0384   0.2718   0.6965
            71.43    82.67    89.29    93.51
Total       77       75       112      77       341

X² = 17.18, p = 0.0007 (3 df)
Note the largest contributions (O - E)²/E to X² (8.96 & 3.87) are in the top corners.

Locating the concentration effect
Table of Outcome by Sodium oleate (%), low concentrations
(cell contents: Frequency, Col Pct)

Outcome   0.65    1.1     Total
Alive     22      13      35
          28.57   17.33
Dead      55      62      117
          71.43   82.67
Total     77      75      152

X² = 2.71, p = 0.10

Table of Outcome by Sodium oleate (%), high concentrations
(cell contents: Frequency, Col Pct)

Outcome   1.6     2.1     Total
Alive     12      5       17
          10.71   6.49
Dead      100     72      172
          89.29   93.51
Total     112     77      189

X² = 0.99, p = 0.32

Locating the concentration effect
Table of Outcome by Sodium oleate (%), low v high
(cell contents: Frequency, Col Pct)

Outcome   <1.5%   >1.5%   Total
Alive     35      17      52
          23.03   8.99
Dead      117     172     289
          76.97   91.01
Total     152     189     341

X² = 12.83, p = 0.0003

SAS - data format for FREQ procedure

Concentration of
sodium oleate (%)   Dead   Alive   Total   % Dead
0.65                55     22      77      71.4
1.10                62     13      75      82.7
1.6                 100    12      112     89.3
2.1                 72     5       77      93.5
Total               289    52      341     84.8

Conc   status   number
0.65   d        55
0.65   a        22
1.10   d        62
1.10   a        13
1.60   d        100
1.60   a        12
2.10   d        72
2.10   a        5

The first 2 cols identify the cell. The final column is the 'response': the frequency count for the cell.

Validity of chi-squared (χ²) test
• Test is based on an approximation leading to use of the χ² distribution to calculate p-values
• With several DF and E ≥ 5 the approximation is OK
• If E < 1 in any cell the approximation may be bad
• With a number of cells in the table perhaps a third or quarter can have E between 1 & 5 without serious departures from χ²-based p-values (PGRM pg 14-11)
• In cases where good approximation is in doubt use Fisher's exact test (SAS/FREQ tables option exact)

Code: SAS/FREQ

  proc freq data = conc;
    weight number;
    tables status*conc / chisq cellchi2 expected norow nopercent nocum;
  quit;

Option            To do
chisq             Test statistics (chi-squared etc)
cellchi2          Contribution to X² from each cell
expected          Expected values for each cell
norow nopercent   Omit row/overall percentages
nocum             Omit cumulative frequencies

Practical Exercise
SAS/FREQ procedure
Lab Session 5 exercise 5.1 (b) - (d)

Logistic Regression

Is this the relationship? [Figure not reproduced.]
Note: categorical response; interval categorical explanatory variable?

Why logistic and not just χ²?
• For sparse data (eg where individuals will have unique profiles)
• With many categorical explanatory variables
• With quantitative explanatory variables
In the case of a continuous response we have looked to see if the mean, μ, can be expressed as μ = a + bx.
With categorical data we want an expression for p (the probability of the response in one of the 2 response categories), but p = a + bx may give values outside the range 0 to 1!
eg p = 0.1 + 0.2x gives p = 1.1 for x = 5

A solution: TRANSFORM
• Use the transformation: p = exp(a + bx)/(1 + exp(a + bx))
• ie log(p/(1 - p)) = a + bx, or log(Odds) = a + bx where Odds = p/(1 - p)
Note: exp(x) = e^x
LOGIT: logit(p) = log(p/(1 - p))
[Figure: plot of p against x for a = 0, b = 1.]

SAS/GPLOT
logit(p) = −0.119 + 1.25 conc
[Figure: Logistic Estimate of Death Probability; p (0.6 to 1.0) against Sodium oleate (%) (0.6 to 2.2).]

LD50 - lethal dose for 50%
p = 0.5, so p/(1 - p) = 1 and logit(p) = 0 (since log(1) = 0, WNF!)
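The transformation and the fitted line logit(p) = −0.119 + 1.25 conc can be checked numerically. The sketch below is in Python (an assumption of this illustration; the course fits the model in SAS). It uses only the two quoted coefficients, back-transforms the fitted logit to a probability at the four sprayed concentrations, and solves logit(p) = 0 for the concentration. The function names are illustrative.

```python
import math

A, B = -0.119, 1.25          # intercept and slope of the fitted line quoted above

def logit(p):
    """log(p / (1 - p)): the log odds."""
    return math.log(p / (1 - p))

def inv_logit(eta):
    """p = exp(eta) / (1 + exp(eta)); always strictly between 0 and 1."""
    return math.exp(eta) / (1 + math.exp(eta))

# Estimated probability of death at each sprayed concentration (%).
fitted = {conc: inv_logit(A + B * conc) for conc in (0.65, 1.10, 1.60, 2.10)}
for conc, p in fitted.items():
    print(f"conc {conc:4.2f}%  logit = {A + B * conc:6.3f}  p = {p:.3f}")

# LD50: the concentration at which logit(p) = 0, ie p = 0.5.
ld50 = -A / B
print(f"LD50 = {ld50:.3f}%  (p there = {inv_logit(A + B * ld50):.2f})")
```

Unlike the straight line p = a + bx, the back-transformed values stay inside (0, 1) for any concentration.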
0 = −0.119 + 1.25 conc, so conc = 0.119/1.25 = 0.095

Odds Ratio (OR)
log(a) - log(b) = log(a/b)
Increasing conc by 1% increases logit(p) by 1.25:
log(Odds2) - log(Odds1) = 1.25
log(OR) = 1.25
OR = exp(1.25) = 3.49

SAS/GENMOD

conc   dead   total
0.65   53     77
1.10   57     75
1.60   95     112
2.10   73     77

  proc genmod data = log;
    model dead/total = conc / pred link = logit dist = binomial;
    output out = p predicted = p;
  run;

Term              Function
dead/total        the proportion to be estimated
conc              the explanatory variable
pred              include predicted p's in OUTPUT
link = logit      for modelling log(p/(1 - p)), the log(ODDS)
dist = binomial   the data consists of counts out of a total
out = p           output will also go to a data set work.p
predicted = p     in work.p a column named p will contain predicted values

Practical Exercise
SAS/GENMOD of Logistic Regression
Lab Session 5 exercise 5.2 (a) - (g)

Modelling needs biological insight!

Stability analysis (Ex 2 pg 14-15)
Height, diameter and whether the plant fell over were recorded for 545 plants.
Aim: model the probability of stability (not falling over) as a function of height and diameter.

diameter   height   stable   n
.0016      0.057    1        1
.0018      0.084    0        1
.0018      0.221    0        1
.0018      0.038    1        1
.0019      0.058    1        1
.0019      0.067    1        1
…          …        …        …

Explanatory terms
Model 1: h, d, h², d², hd (hopefully high order terms will not be needed!)
Model 2: h/d² (biologist suggests this!)

Model 1: h, d, h², d², hd
Analysis Of Parameter Estimates

                             Standard   Wald 95%
Parameter   DF   Estimate    Error      Confidence Limits     Chi-Square   Pr > ChiSq
Intercept   1    -5.3801     0.9402     -7.2228    -3.5374    32.75        <.0001
height      1    -39.1639    4.1510     -47.2998   -31.0280   89.01        <.0001
diameter    1    4958.358    654.0395   3676.464   6240.252   57.47        <.0001
h2          1    10.0396     5.0747     0.0934     19.9859    3.91         0.0479
d2          1    -560913     120280.4   -796659    -325168    21.75        <.0001
hd          1    4206.787    1502.453   1262.033   7151.540   7.84         0.0051
Scale       0    1.0000      0.0000     1.0000     1.0000

How can I describe this!

Model 2: h/d²
Analysis Of Parameter Estimates

                             Standard   Wald 95%
Parameter   DF   Estimate    Error      Confidence Limits     Chi-Square   Pr > ChiSq
Intercept   1    3.3235      0.3212     2.6940     3.9529     107.09       <.0001
h_d2        1    -1.7884     0.1583     -2.0987    -1.4780    127.56       <.0001
Scale       0    1.0000      0.0000     1.0000     1.0000

Can understand & even plot this!

SAS/GRAPH
[Figure not reproduced.]

But! Linear v Quadratic in x = h/d²?
[Figure not reproduced.]

Finally! Modelling counts

Poisson Regression
For count data, where eg we count all, not a subset out of a total.
To estimate the mean, μ, and its relationship with an explanatory variable x, use a log link (usually):
log(μ) = a + bx
ie μ = exp(a + bx) = e^a e^(bx) (which will be > 0)

SAS/GENMOD

  model count = x / link = log distribution = poisson;

Example: Horseshoe crabs & satellites
Each female crab had an attached male (in her nest) & other males (satellites) residing nearby.
• Data recorded
  - No. of satellites (response)
  - Color (light medium, medium, dark medium, dark)
  - Spine condition (both good, one worn/broken, both worn/broken)
  - Carapace width (cm)
  - Weight (kg)
• Poisson models:
  - Log link: log(μ) = a + bx
  - Identity link: μ = a + bx

[Figures not reproduced: Effect of width and colour; Grouping weight & number values; Variation in no. satellites.]

Practical exercise
SAS/GENMOD for Poisson Regression
Lab Session 5 Exercise 5.3 (a) - (e)
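To see what the choice of link implies, here is a small Python sketch (Python is an assumption of this illustration; the course fits these models in SAS/GENMOD). The coefficients a and b and the x values are hypothetical, chosen only to show the behaviour of the two links; they are not estimates from the horseshoe crab data.

```python
import math

# Hypothetical coefficients for illustration only (not fitted to any data).
a, b = -3.0, 0.15

def mean_log_link(x):
    """Log link: log(mu) = a + b*x, so mu = exp(a + b*x) > 0."""
    return math.exp(a + b * x)

def mean_identity_link(x):
    """Identity link: mu = a + b*x, which can fall below 0."""
    return a + b * x

for x in (10, 20, 25, 30):   # hypothetical values of the explanatory variable
    print(f"x = {x:2d}  log link mu = {mean_log_link(x):6.3f}  "
          f"identity link mu = {mean_identity_link(x):6.2f}")

# Under the log link each unit increase in x multiplies the mean by exp(b).
ratio = mean_log_link(21) / mean_log_link(20)
print(f"mu(21)/mu(20) = {ratio:.4f}, exp(b) = {math.exp(b):.4f}")
```

With these illustrative values the identity link gives a negative mean count at x = 10, while the log link keeps every fitted mean positive, which is why the log link is the usual choice.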