Transcript
hwu
F73DB3 CATEGORICAL DATA ANALYSIS
Workbook
Contents page
Preface
Aims
Summary
Content/structure/syllabus
plus other information
Background – computing (R)
Examples
Single classifications (1-13)
Two-way classifications (14-27)
Three-way classifications (28-32)
Example 1 Eye colours

  Colour:              A    B    C    D
  Frequency observed:  89   66   60   85

Example 2 Prussian cavalry deaths
(a) Numbers killed in each unit in each year – frequency table

  Number killed:       0     1    2    3    4   5   Total
  Frequency observed:  144   91   32   11   2   0   280

Example 2 Prussian cavalry deaths
(b) Numbers killed in each unit in each year – raw data

  0 0 1 0 0 2 0 0 0 0 .................. ..... 0
  0 0 2 0 1 0 1 2 0 1 ........................ 0
  …..
  …..
  3 0 0 1 0 0 2 1 0 0 1 0 0 1 0 0 1 1 2 0 1 0 1 1

Example 2 Prussian cavalry deaths
(c) Total numbers killed each year

  Year:    1875 ’76 ’77 ’78 ’79 ’80 ’81 ’82 ’83 ’84 ’85 ’86 ’87 ’88 ’89 ’90 ’91 ’92 ’93 ’94
  Killed:  3    5   7   9   10  18  6   14  11  9   5   11  15  6   11  17  12  15  8   4

Example 4 Political views

  View:       1 (very L)  2    3    4 (centre)  5    6    7 (very R)  Don’t Know  Total
  Frequency:  46          179  196  559         232  150  35          93          1490

Example 7 Vehicle repair visits

  Number of visits:    0     1     2    3   4   5   6   Total
  Frequency observed:  295   190   53   5   5   2   0   550

Example 15 Patients in clinical trial

                    Drug   Placebo   Total
  Side-effects       15       4        19
  No side-effects    35      46        81
  Total              50      50       100

§1 INTRODUCTION
Data are counts/frequencies (not measurements)
Categories (explanatory variable)
Distribution in the cells (response)
Frequency distribution
Single classifications
Two-way classifications
Illustration 1.1

  A: Smoking      B: Cause of death
  status          Cancer   Other
  Smoker            30       20
  Not smoker        15       35

Data may arise as
Bernoulli/binomial data (2 outcomes)
Multinomial data (more than 2 outcomes)
Poisson data
[+ Negative binomial data – the version with range x = 0, 1, 2, …]

§2 POISSON PROCESS AND
ASSOCIATED DISTRIBUTIONS
2.1 Bernoulli trials and related distributions
Number of successes – binomial distribution
[Time before kth success – negative binomial distribution
Time to first success – geometric distribution]
Conditional distribution of success times

2.2 Poisson process and related distributions
[diagram: events occurring along a time axis]

Poisson process with rate λ
Number of events in a time interval of length t, Nt, has a Poisson distribution with mean λt:
P(Nt = n) = e^(−λt) (λt)^n / n! ,  n = 0, 1, 2, …

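The pmf above can be checked numerically in R against the built-in dpois; the rate λ and interval length t below are illustrative values, not from the notes:

```r
# check P(N_t = n) = exp(-lambda*t) * (lambda*t)^n / n!  against dpois
lambda <- 0.7; t <- 2               # illustrative rate and interval length
n <- 0:5
p.formula <- exp(-lambda * t) * (lambda * t)^n / factorial(n)
p.builtin <- dpois(n, lambda * t)   # built-in Poisson pmf with mean lambda*t
all.equal(p.formula, p.builtin)     # TRUE
```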
Poisson process with rate λ
Inter-event time, T, has an exponential distribution with parameter λ (mean 1/λ):
f(t) = λ e^(−λt) ,  t > 0

Conditional distribution of number of events
given n events in time (0, t): how many in time (0, s) (s < t)?
Answer
Ns | Nt = n ~ B(n, s/t)

Splitting into subprocesses
[diagram: events on a time axis, allocated between subprocesses]

[Figure: realisation of a Poisson process – number of events N (0–100) against time t (0–50)]

X ~ Pn(λ), Y ~ Pn(μ), X, Y independent
then we know X + Y ~ Pn(λ + μ)
Given X + Y = n, what is the distribution of X?
Answer
X | X + Y = n ~ B(n, p) where p = λ/(λ + μ)

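The binomial answer follows directly from the definition of conditional probability; a short derivation:

```latex
P(X = k \mid X + Y = n)
  = \frac{P(X = k)\,P(Y = n - k)}{P(X + Y = n)}
  = \frac{e^{-\lambda}\lambda^{k}/k!\;\cdot\;e^{-\mu}\mu^{n-k}/(n-k)!}
         {e^{-(\lambda+\mu)}(\lambda+\mu)^{n}/n!}
  = \binom{n}{k}\left(\frac{\lambda}{\lambda+\mu}\right)^{k}
                \left(\frac{\mu}{\lambda+\mu}\right)^{n-k},
  \qquad k = 0, 1, \ldots, n
```

i.e. B(n, p) with p = λ/(λ + μ), as stated.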
2.3 Inference for the Poisson distribution
Ni , i = 1, 2, …, r, i.i.d. Pn(λ), N = ΣNi
λ̂ = N/r
E[λ̂] = λ ,  s.e.(λ̂) = √(λ/r)
λ̂ ≈ N(λ, λ/r) approximately

CI for λ
λ̂ ± z √(λ̂/r)

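The point estimate and the crude CI take a couple of lines in R; the counts below are made up for illustration:

```r
# MLE and approximate 95% CI for a Poisson mean lambda
# (illustrative counts, not from the notes)
N.i <- c(3, 1, 4, 2, 0, 3, 2, 1, 2, 2)
r <- length(N.i)
lambda.hat <- sum(N.i) / r                # MLE: N / r  (here 2)
se <- sqrt(lambda.hat / r)                # estimated standard error
ci <- lambda.hat + c(-1, 1) * qnorm(0.975) * se
ci                                        # approximate 95% CI
```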
2.4 Dispersion and LR tests for Poisson data
Homogeneity hypothesis
H0: the Ni s are i.i.d. Pn(λ) (for some unknown λ)
Dispersion statistic
X² = Σ_{i=1}^{r} (Ni − M)² / M  ≈  χ²_{r−1}
(M = sample mean)

Likelihood ratio statistic
Y² = 2 Σ Ni log(Ni / M)  ≈  χ²_{r−1}
form for calculation – see p18 ◄◄

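Both statistics are one-liners in R; the counts are illustrative, and a zero count contributes 0 to the LR sum (the limit of x log x as x → 0):

```r
# dispersion (X^2) and likelihood-ratio (Y^2) statistics for
# H0: the N_i are i.i.d. Poisson; compare with chi-square on r-1 df
N.i <- c(3, 1, 4, 2, 0, 3, 2, 1, 2, 2)      # illustrative counts
r <- length(N.i); M <- mean(N.i)
X2 <- sum((N.i - M)^2 / M)                        # dispersion statistic
Y2 <- 2 * sum(N.i * log(N.i / M), na.rm = TRUE)   # LR statistic, 0*log 0 -> 0
pchisq(c(X2, Y2), df = r - 1, lower.tail = FALSE) # approximate P-values
```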
§3 SINGLE CLASSIFICATIONS
Binary classifications
(a) N1, N2 independent Poisson, with Ni ~ Pn(λi)
or
(b) fixed sample size, N1 + N2 = n, with N1 ~ B(n, p1)
where p1 = λ1/(λ1 + λ2)

Qualitative categories
(a) N1 , N2, … , Nr independent Poisson, with
Ni ~ Pn(λi)
or
(b) fixed sample size n, with joint multinomial
distribution Mn(n;p)
Testing goodness of fit
H0: pi = πi ,  i = 1, 2, …, r
X² = Σ_{i=1}^{r} (Ni − Mi)² / Mi
   = Σ_{all cells} (observed frequency − expected frequency)² / expected frequency
This is the (Pearson) chi-square statistic

The statistic often appears as
Σ (O − E)² / E  =  Σ (observed frequency − expected frequency)² / expected frequency

It is distributed (approximately) χ²_{r−1}
or χ²_{r−k−1} when k parameters have been estimated in order to fit the model and calculate the expected frequencies

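A sketch of the test in R, using the Example 1 eye-colour counts; the H0 of equal probabilities (πi = 1/4) is chosen here purely for illustration:

```r
# Pearson goodness-of-fit test: H0: p_i = 1/4 for the four eye colours
n.eyes <- c(89, 66, 60, 85)                  # Example 1 frequencies
p0 <- rep(1/4, 4)
m <- sum(n.eyes) * p0                        # expected frequencies: 75 each
X2 <- sum((n.eyes - m)^2 / m)                # Pearson chi-square statistic
pchisq(X2, df = length(n.eyes) - 1, lower.tail = FALSE)
# chisq.test(n.eyes, p = p0) reproduces the same statistic and P-value
```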
An alternative statistic is the LR statistic
Y² = 2 Σ_{i=1}^{r} Ni log(Ni / Mi)  ≈  χ²_{r−1}  or  χ²_{r−k−1}

Sparse data/small expected frequencies
ensure mi ≥ 1 for all cells, and mi ≥ 5 for at least about 80% of the cells
if not – combine adjacent cells sensibly

Goodness-of-fit tests for frequency distributions
– very well-known application of the
Σ_{all cells} (observed frequency − expected frequency)² / expected frequency
statistic (see Illustration 3.4 p 22/23)

Residuals (standardised)
ri = (Ni − nπi) / √(nπi(1 − πi)) = (Ni − Mi) / √(Mi(n − Mi)/n)  ≈  N(0, 1)
simpler version
ri = (Ni − Mi) / √Mi  ≈  N(0, 1)

MAJOR ILLUSTRATION 1
Publish and be modelled

  Number of papers per author:  1     2    3    4   5   6  7  8  9  10  11
  Number of authors:            1062  263  120  50  22  7  6  2  0  1   1

Model
P(X = x) = c θ^x / x! ,  x = 1, 2, 3, …

MAJOR ILLUSTRATION 2
Birds in hedges

  Hedge type i:          A     B     C     D     E     F     G
  Hedge length (m) li:   2320  2460  2455  2805  2335  2645  2099
  Number of pairs ni:    14    16    14    26    15    40    71

Model
Ni ~ Pn(θi li)

§4 TWO-WAY CLASSIFICATIONS
Example 14 Numbers of mice bearing tumours in treated and control groups

                Treated   Control   Total
  Tumours          4         5        9
  No tumours      12        74       86
  Total           16        79       95

Example 15 Patients in clinical trial

                    Drug   Placebo   Total
  Side-effects       15       4        19
  No side-effects    35      46        81
  Total              50      50       100

Patients in clinical trial – take 2

                    Drug   Placebo   Total
  Side-effects       15      15        30
  No side-effects    35      35        70
  Total              50      50       100

4.1 Factors and responses
F × R tables
R×F , R×R
(F × F ?)
Qualitative, ordered, quantitative
Analysis the same - interpretation may be
different
A two-way table is often called a “contingency table” (especially in the R × R case).

Notation (2 × 2 case, easily extended)

               Exposed   Not exposed   Total
  Disease        n11        n12         n1●
  No disease     n21        n22         n2●
  Total          n●1        n●2         n●● = n

Three possibilities
One overall sample, each subject
classified according to 2 attributes
- this is R × R
Retrospective study
Prospective study (use of treated and
control groups; drug and placebo etc)
4.2 Distribution theory and tests for r × s tables
(a) R × R case
(a1) Nij ~ Pn(λij) , independent
or, with fixed table total
(a2) Condition on n = ΣΣ nij :
N | n ~ Mn(n ; p)
where N = {Nij} , p = {pij}.

(b) F × R case
Condition on the observed marginal totals n•j = Σi nij for the s categories of F (≡ condition on n and n•1)
⇒ s independent multinomials

Usual hypotheses
(a1) Nij ~ Pn(λij) , independent
H0: variables/responses are independent
λij = λi• λ•j / λ••
(a2) Multinomial data (table total fixed)
H0: variables/responses are independent
P(row i and column j) = P(row i) P(column j)

(b) Condition on n and n•j (fixed column totals)
Nij ~ Bi(n•j , pij) j = 1,2, …, s ; independent
H0: response is homogeneous (pij = pi• for all j)
i.e. response has the same distribution for
all levels of the factor
Tests of H0
The χ² (Pearson) statistic:
Σ (Nij − mij)² / mij  ≈  χ²_{(r−1)(s−1)}
where mij = ni• n•j / n as before

OR: test based on the LR statistic Y²
Illustration: tonsils data – see p27
In R
Pearson/X²: read the data in using “matrix”, then use “chisq.test”
LR Y²: calculate it directly (or get it from the results of fitting a “log-linear model” – see later)

4.3 The 2 × 2 table
Statistical tests
(a) Using Pearson’s χ²

                    Drug   Placebo   Total
  Side-effects       15       4        19
  No side-effects    35      46        81
  Total              50      50       100

Σ (Nij − mij)² / mij  ≈  χ²_{1}
where mij = ni• n•j / n
i.e. mij = (row total × column total) / grand total

Yates (continuity) correction
Subtract 0.5 from |O – E| before squaring it
Performing the test in R
n.pat = matrix(c(15, 35, 4, 46), 2, 2)
chisq.test(n.pat)

(b) Using deviance/LR statistic Y2
(c) Comparing binomial probabilities
(d) Fisher’s exact test
                    Drug   Placebo    Total
  Side-effects       15      4 (= N)    19
  No side-effects    35     46          81
  Total              50     50         100

Under a random allocation
P(N = 4) = (50 choose 4)(50 choose 15) / (100 choose 19)
         = 50! 50! 19! 81! / (4! 46! 15! 35! 100!)
         = 0.0039
one-sided P-value = P(N ≤ 4) = 0.0047
Note: probability = (product of marginal factorials) / (n! × product of cell factorials)

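The same numbers come out of R’s hypergeometric functions, or directly from fisher.test:

```r
# Fisher's exact test for the clinical-trial table
# N = number of the 19 side-effect patients who received placebo
n.pat <- matrix(c(15, 35, 4, 46), 2, 2)
dhyper(4, m = 19, n = 81, k = 50)    # P(N = 4), about 0.0039
phyper(4, m = 19, n = 81, k = 50)    # one-sided P(N <= 4), about 0.0047
fisher.test(n.pat, alternative = "greater")$p.value  # same one-sided P-value
```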
4.4 Log odds, combining and collapsing tables, interactions
In the 2 × 2 table, the
H0 : independence
condition is equivalent to
λ11 λ22 = λ12 λ21
Let λ = log(λ11 λ22 / λ12 λ21)
Then we have H0: λ = 0
λ is the “log odds ratio”

The “λ = 0” hypothesis is often called the “no association” hypothesis.

The odds ratio is
λ11 λ22 / λ12 λ21
Sample equivalent is
n11 n22 / (n12 n21) = (n11/n21) / (n12/n22)
                    = [(n11/n●1) / (n21/n●1)] / [(n12/n●2) / (n22/n●2)]
= odds on for column 1 / odds on for column 2
= odds ratio (observed / sample version)

The odds ratio (or log odds ratio) provides a measure of association for the factors in the table.
no association ⇔ odds ratio = 1 ⇔ log odds ratio = 0

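For the Example 15 patients table the observed odds ratio is a quick R check (numbers taken from the table):

```r
# sample odds ratio for the drug/placebo table
#                 Drug  Placebo
# Side-effects     15       4
# No side-effects  35      46
or  <- (15 * 46) / (4 * 35)   # n11 n22 / (n12 n21), about 4.93
lor <- log(or)                # log odds ratio; 0 would indicate no association
c(or, lor)
```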
Don’t combine heterogeneous tables!

Interaction
An interaction exists between two factors when the effect of one factor is different at different levels of another factor.
[Figures: two plots of d.rate (0.000–0.012) against age (45–60)]

§5 INTRODUCTION TO GENERALISED LINEAR MODELS (GLMs)
Normal linear model
Y | x ~ N with
E[Y|x] = α + βx
or
E[Y|x] = β0 + β1x1 + β2x2 + … + βrxr = βᵀx
i.e. E[Y|x] = μ(x) = βᵀx

We are explaining μ(x) using a linear predictor (a linear function of the explanatory data)
Generalised linear model
Now we set g(μ(x)) = βᵀx for some function g
We explain g(μ(x)) using a linear function of the explanatory data, where g is called the link function

e.g. modelling a Poisson mean λ we use a log link g(λ) = log λ
We use a linear predictor to explain log λ rather than λ itself: the model is
Y | x ~ Pn with mean λx
with log λx = α + βx
or
log λx = βᵀx
This is a log-linear model

An example is a trend model in which we use
log λi = α + βi
Another example is a cyclic model in which we use
log λi = β0 + β1 cos θi + β2 sin θi

§6 MODELS FOR SINGLE CLASSIFICATIONS
6.1 Single classifications - trend models
Data: numbers in r categories
Model: Ni , i = 1, 2, …, r,
independent Pn(λi)
Basic case
H0: λi’s equal  v  H1: λi’s follow a trend
Let Xj be the category of observation j
P(Xj = i) = 1/r
Test based on X̄
see Illustration 6.1

A more general model
Ni independent Pn(λi) with
λi = e^(α + βi)
Log-linear model
log λi = α + βi

It is a linear regression model for log λi and a non-linear regression model for λi.
It is a generalised linear model.
Here the link between the parameter we are estimating and the linear predictor is the log function – it is a “log link”.

Fitting in R
Example 13: stressful events data
> n = c(15, 11, …, 1, 4)      # response vector
> r = length(n)
> i = 1:r                     # explanatory vector
> stress = glm(n ~ i, family = poisson)      # model

> summary(stress)
Call:
glm(formula = n ~ i, family = poisson)      ← model being fitted

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.9886  -0.9631   0.1737   0.5131   2.0362
← summary information on the residuals

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)   2.80316    0.14816  18.920  < 2e-16 ***
i            -0.08377    0.01680  -4.986 6.15e-07 ***
← information on the fitted parameters

Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 50.843 on 17 degrees of freedom
Residual deviance: 24.570 on 16 degrees of freedom      ← deviances (Y² statistics)
AIC: 95.825
Number of Fisher Scoring iterations: 4

Fitted mean is
λ̂i = exp(2.80316 − 0.08377 i)
e.g. for date 6, i = 6 and fitted mean is exp(2.30054) = 9.980

Fitted model
[Figure: log-linear trend model for stress data – number (0–15) against Date, observed counts with fitted curve]

Test of H0: no trend
⇒ the null fit, all fitted values equal (to the observed mean)
Y² = 50.84 (~ χ² on 17 df)
The trend model
⇒ fitted values exp(2.80316 − 0.08377 i)
Y² = 24.57 (~ χ² on 16 df)
Crude 95% CI for slope is −0.084 ± 2(0.0168), i.e. −0.084 ± 0.034

The lower the value of the residual
deviance, the better in general is
the fit of the model.
[Figure: basic residuals (−6 to 4) against i (1–15)]

6.2 Taking into account a deterministic denominator – using an “offset” for the “exposure”
See the Gompertz model example (p 40, data in Example 26)
Model: Nx ~ Pn(λx) where
E[Nx] = λx = Ex b θ^x
log λx = log Ex + c + dx

We include a term “offset(logE)” in the formula for the linear predictor: in R
model = glm(n.deaths ~ age + offset(log(exposure)), family = poisson)
Fitted value is the estimate of the expected response per unit of exposure (i.e. per unit of the offset E)

§7 LOGISTIC REGRESSION
• for modelling proportions
• we have a binary response for each item
and a quantitative explanatory variable
for example: dependence of the
proportion of insects killed in a chamber
on the concentration of a chemical
present – we want to predict the
proportion killed from the concentration
for example: dependence of the proportion of
• women who smoke – on age
• metal bars on test which fail – on pressure applied
• policies which give rise to claims – on sum insured
Model: # successes at value xi of explanatory variable: Ni ~ B(ni , πi)

We use a glm – we do not predict πi directly; we predict a function of πi called the logit of πi.
The logit function is given by:
logit(π) = log(π / (1 − π))
It is the “log odds”.
See Illustration 7.1 p 43:
[Figure: proportion v dose – observed proportions (about 0.2 to 1.0) against dose (3.8–4.8)]
[Figure: logit(proportion) v dose – roughly linear, from about −2 to 3 over dose 3.8–4.8]

This leads to the “logistic regression” model
log(πi / (1 − πi)) = a + b xi
[c.f. log linear model Ni ~ Poisson(λi) with log λi = a + b xi]

We are using a logit link
g(π) = log(π / (1 − π))
We use a linear predictor to explain log(π / (1 − π)) rather than π itself

The method based on the use
of this model is called
logistic regression
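A minimal sketch of the logit link and its inverse in R (the helper names are our own, not from the notes):

```r
# logit link and its inverse (the logistic function)
logit     <- function(p) log(p / (1 - p))
inv.logit <- function(x) 1 / (1 + exp(-x))
logit(0.5)               # 0: probability 1/2 means even odds
inv.logit(logit(0.9))    # recovers 0.9
# in a fitted model the probability is inv.logit(a + b * x)
```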
Data:

  explanatory       # successes   group   observed
  variable value                  size    proportion
  x1                n11           n1      n11/n1
  x2                n21           n2      n21/n2
  …….
  xs                ns1           ns      ns1/ns

In R we declare the proportion of successes as the response and include the group sizes as a set of weights
drug.mod1 = glm(propdead ~ dose, weights = groupsize, family = binomial)
explanatory vector is dose; note the family declaration

RHS of model can be extended if required to
include additional explanatory variables and
factors
e.g. mod3 = glm(mat3 ~ age+socialclass+gender)
drug.mod – see output p44
Coefficients very highly significant (***)
Null deviance 298 on 9df
Residual deviance 17.2 on 8df
But … residual v fitted plot
and … fitted v observed proportions plot
[Figure: Residuals vs Fitted for glm(formula = num.mat ~ dose, family = binomial)]
[Figure: drug.mod1$fit against observed proportions prop]
[Figure: drug.mod2$fit against observed proportions prop – model with a quadratic term (dose^2)]

§8 MODELS FOR TWO-WAY AND THREE-WAY CLASSIFICATIONS
8.1 Log-linear models for two-way classifications
Nij ~ Pn(λij) , i = 1, 2, …, r ; j = 1, 2, …, s
H0: variables are independent
λij = λi• λ•j / λ••

log λij = log λi• + log λ•j − log λ••
i.e. row effect + column effect − overall effect

We “explain” log λij in terms of additive effects:
log λij = μ + αi + βj
Fitted values are the expected frequencies
λ̂ij = exp(μ̂ + α̂i + β̂j)
Fitting process gives us the value of Y² = −2 log λ

Fitting a log-linear model
Nij ~ Pn(λij) , independent, with
log λij = μ + αi + βj
Declare the response vector (the cell frequencies) and the row/column codes as factors
then use > name = glm(…)

Tonsils data (Example 16)
n.tonsils = c(19, 497, 29, 560, 24, 269)
rc = factor(c(1, 2, 1, 2, 1, 2))
cc = factor(c(1, 1, 2, 2, 3, 3))
tonsils.mod1 = glm(n.tonsils ~ rc + cc, family = poisson)

Call:
glm(formula = n.tonsils ~ rc + cc, family = poisson)

Deviance Residuals:
       1         2         3         4         5         6
-1.54915   0.34153  -0.24416   0.05645   2.11018  -0.53736

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)   3.27998    0.12287  26.696  < 2e-16 ***
rc2           2.91326    0.12094  24.087  < 2e-16 ***
cc2           0.13232    0.06030   2.195   0.0282 *
cc3          -0.56593    0.07315  -7.737 1.02e-14 ***
---
Null deviance: 1487.217 on 5 degrees of freedom
Residual deviance: 7.321 on 2 degrees of freedom      ← Y² = −2 log λ

The fit of the “independent attributes”
model is not good
Patients data (Example 15)
> n.patients = c(15, 4, 35, 46)
> rc = factor(c(1, 1, 2, 2))
> cc = factor(c(1, 2, 1, 2))
> pat.mod1 = glm(n.patients ~ rc + cc, family = poisson)

Call:
glm(formula = n.patients ~ rc + cc, family = poisson)

Deviance Residuals:
      1        2        3        4
 1.6440  -2.0199  -0.8850   0.8457

Coefficients:
             Estimate Std. Error  z value Pr(>|z|)
(Intercept) 2.251e+00  2.502e-01    8.996  < 2e-16 ***
rc2         1.450e+00  2.549e-01    5.689 1.28e-08 ***
cc2         2.184e-10  2.000e-01 1.09e-09        1
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 49.6661 on 3 degrees of freedom
Residual deviance: 8.2812 on 1 degrees of freedom
AIC: 33.172

fitted coefficients: coef(pat.mod1)
 (Intercept)          rc2          cc2
2.251292e+00 1.450010e+00 2.183513e-10

fitted values: fitted(pat.mod1)
   1    2    3    4
 9.5  9.5 40.5 40.5

Estimates are
μ̂ = 2.251292 , α̂1 = 0 , α̂2 = 1.450010 , β̂1 = 0 , β̂2 ≈ 0
Predictors for cells 1,1 and 1,2 are 2.251292 :
λ̂1j = exp(2.251292) = 9.5
Predictors for cells 2,1 and 2,2 are 2.251292 + 1.450010 = 3.701302 :
λ̂2j = exp(3.701302) = 40.5

Residual deviance: 8.2812 on 1 degree of freedom
⇒ Y² for testing the model
i.e. for testing H0:
response is homogeneous / column distributions are the same / no association between response and treatment group
The lower the value of the residual deviance, the better in general is the fit of the model.
Here the fit of the additive model is very poor (we have of course already concluded that there is an association – P-value about 1%).

8.2 Two-way classifications – taking into account a deterministic denominator
See the grouse data (Illustration 8.3 p50, data in Example 25)
Model: Nij ~ Pn(λij) where
E[Nij] = λij = Eij exp(μ + αi + βj)
log E[Nij/Eij] = μ + αi + βj
i.e. log λij = log Eij + μ + αi + βj

We include a term “offset(logE)” in the formula
for the linear predictor
Fitted value is the estimate of the expected
response per unit of exposure (i.e. per unit of
the offset E)
8.3 Log-linear models for three-way classifications
Each subject classified according to 3 factors/variables with r, s, t levels respectively
Nijk ~ Pn(λijk) with
log λijk = μ + αi + βj + γk + (αβ)ij + (αγ)ik + (βγ)jk + (αβγ)ijk
r × s × t parameters

Recall “interaction”
Model with two factors and an interaction (no longer additive) is
log λij = μ + αi + βj + (αβ)ij

8.4 Hierarchic log-linear models
Interpretation!
Range of possible models/dependencies
From
1 Complete independence
model formula: A + B + C
link: log λijk = μ + αi + βj + γk
notation: [A][B][C]
df: rst – r – s – t + 2

…. through
2 One interaction (B and C say)
model formula: A + B*C
link: log λijk = μ + αi + βj + γk + (βγ)jk
notation: [A][BC]
df: rst – r – st + 1

…. to
5 All possible interactions
model formula: A*B*C
notation: [ABC]
df: 0
Model selection: by
backward elimination or
forward selection
through the hierarchy of models
containing all 3 variables
The hierarchy, from saturated down to independence:

  [ABC]                               saturated
  [AB][AC]    [AB][BC]    [AC][BC]
  [AB][C]     [AC][B]     [A][BC]
  [A][B][C]                           independence

Our models can include
mean (intercept)
+ factor effects
+ 2-way interactions
+ 3-way interaction
Illustration 8.4 Models for lizards data (Example 29)
liz = array(c(32, 86, 11, 35, 61, 73, 41, 70), dim = c(2, 2, 2))
n.liz = as.vector(liz)
s = factor(c(1, 1, 1, 1, 2, 2, 2, 2))    # species
d = factor(c(1, 1, 2, 2, 1, 1, 2, 2))    # diameter of perch
h = factor(c(1, 2, 1, 2, 1, 2, 1, 2))    # height of perch

Forward selection
liz.mod1 = glm(n.liz ~ s + d + h, family = poisson)      25.04 on 4 df
liz.mod2 = glm(n.liz ~ s*d + h, family = poisson) †      12.43 on 3 df
liz.mod3 = glm(n.liz ~ s + d*h, family = poisson)
liz.mod4 = glm(n.liz ~ s*h + d, family = poisson)
liz.mod5 = glm(n.liz ~ s*d + s*h, family = poisson) †     2.03 on 2 df
liz.mod6 = glm(n.liz ~ s*d + d*h, family = poisson)

> summary(liz.mod5)
Call:
glm(formula = n.liz ~ s * d + s * h, family = poisson)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   3.4320     0.1601  21.436  < 2e-16 ***
s2            0.5895     0.1970   2.992 0.002769 **
d2           -0.9420     0.1738  -5.420 5.97e-08 ***
h2            1.0346     0.1775   5.827 5.63e-09 ***
s2:d2         0.7537     0.2161   3.488 0.000486 ***
s2:h2        -0.6967     0.2198  -3.170 0.001526 **

Null deviance: 98.5830 on 7 degrees of freedom
Residual deviance: 2.0256 on 2 degrees of freedom

[Figure: fitted values liz.mod5$fit against observed counts n.liz (20–80)]
[Figure: residuals liz.mod5$res (−0.10 to 0.10) against fitted values liz.mod5$fit (20–80)]

FIN
MAJOR ILLUSTRATION 1

  Number of papers per author:  1     2    3    4   5   6  7  8  9  10  11
  Number of authors:            1062  263  120  50  22  7  6  2  0  1   1

Model
P(X = x) = c θ^x / x! ,  x = 1, 2, 3, …
[Figure: log-likelihood logl2 (4.5–5.5) against th (0.90–1.00)]
[Figure: bar chart of frequencies (0–1000) for 1 to 11+ papers per author]

MAJOR ILLUSTRATION 2

  Hedge type i:          A     B     C     D     E     F     G
  Hedge length (m) li:   2320  2460  2455  2805  2335  2645  2099
  Number of pairs ni:    14    16    14    26    15    40    71

Model
Ni ~ Pn(θi li)
[Figure: density (0–50) against hedge type (1–7), observed values marked ×]

Cyclic models
leukaemia data
[Figure: cases (30–60) by Month (J–D)]

Model
Ni independent Pn(λi) with
λi = λ0 exp(κ cos(θi − φ))
   = λ0 exp(a cos θi + b sin θi) ,  i = 1, …, r
   = exp(c + a cos θi + b sin θi)
Explanatory variable: the category/month i has been transformed into an angle θi

It is another example of a
non-linear regression model
for Poisson responses.
It is a
generalised linear model.
Fitting in R
> n = c(40, 34, …, 33, 38)     # response vector
> r = length(n)
> i = 1:r
> th = 2*pi*i/r                # explanatory vector
> leuk = glm(n ~ cos(th) + sin(th), family = poisson)   # model

Fitted mean is
λ̂i = exp(3.73069 + 0.17177 cos θi + 0.11982 sin θi)

Fitted model
[Figure: cyclic model for leukaemia data – cases (30–60) by Month (J–D), with fitted curve]

F73DB3 CDA Data from class

                 Male   Female
  Cinema often    22      21
  Not often       20      12

                 Male   Female   Total
  Cinema often    22      21       43
  Not often       20      12       32
  Total           42      33       75

P(often | male) = 22/42 = 0.524
P(often | female) = 21/33 = 0.636
significant difference (on these numbers)?
is there an association between gender and cinema attendance?

Null hypothesis H0: no association between gender and cinema attendance
Alternative: not H0
Under H0 we expect 42 × 43/75 = 24.08 in cell 1,1 etc.

> matcinema = matrix(c(22, 20, 21, 12), 2, 2)
> chisq.test(matcinema)
Pearson's Chi-squared test with Yates' continuity correction
data: matcinema
X-squared = 0.5522, df = 1, p-value = 0.4574
> chisq.test(matcinema)$expected
      [,1]  [,2]
[1,] 24.08 18.92
[2,] 17.92 14.08
null hypothesis can stand – no association between gender and cinema attendance

more students, same proportions

                 Male   Female   Total
  Cinema often   110     105      215
  Not often      100      60      160
  Total          210     165      375

P(often | male) = 110/210 = 0.524
P(often | female) = 105/165 = 0.636
significant difference (on these numbers)?

> matcinema2 = matrix(c(110, 100, 105, 60), 2, 2)
> chisq.test(matcinema2)
Pearson's Chi-squared test with Yates' continuity correction
data: matcinema2
X-squared = 4.3361, df = 1, p-value = 0.03731
> chisq.test(matcinema2)$expected
      [,1] [,2]
[1,] 120.4 94.6
[2,]  89.6 70.4
null hypothesis is rejected – there IS an association between gender and cinema attendance

FIN