Transcript
hwu
F73DB3 CATEGORICAL DATA ANALYSIS
Workbook
Contents page
Preface
Aims
Summary
Content/structure/syllabus
plus other information
Background – computing (R)
Examples
Single classifications (1-13)
Two-way classifications (14-27)
Three-way classifications (28-32)
Example 1 Eye colours

  Colour:              A    B    C    D
  Frequency observed:  89   66   60   85

Example 2 Prussian cavalry deaths
(a) Numbers killed in each unit in each year – frequency table

  Number killed:       0     1    2    3    4   5   Total
  Frequency observed:  144   91   32   11   2   0   280

Example 2 Prussian cavalry deaths
(b) Numbers killed in each unit in each year – raw data

  0 0 1 0 0 2 0 0 0 0 .................. ..... 0
  0 0 2 0 1 0 1 2 0 1 ........................ 0
  …..
  …..
  3 0 0 1 0 0 2 1 0 0 1 0 0 1 0 0 1 1 2 0 1 0 1 1

Example 2 Prussian cavalry deaths
(c) Total numbers killed each year

  Year:    1875 ’76 ’77 ’78 ’79 ’80 ’81 ’82 ’83 ’84 ’85 ’86 ’87 ’88 ’89 ’90 ’91 ’92 ’93 ’94
  Killed:  3    5   7   9   10  18  6   14  11  9   5   11  15  6   11  17  12  15  8   4

Example 4 Political views

  View:       1 (very L)  2    3    4 (centre)  5    6    7 (very R)  Don’t Know  Total
  Frequency:  46          179  196  559         232  150  35          93          1490

Example 7 Vehicle repair visits

  Number of visits:    0     1     2    3   4   5   6   Total
  Frequency observed:  295   190   53   5   5   2   0   550

Example 15 Patients in clinical trial

                    Drug   Placebo   Total
  Side-effects       15       4        19
  No side-effects    35      46        81
  Total              50      50       100

§1 INTRODUCTION
Data are counts/frequencies (not measurements)
Categories (explanatory variable)
Distribution in the cells (response)
Frequency distribution
Single classifications
Two-way classifications
Illustration 1.1

  A: Smoking      B: Cause of death
  status          Cancer   Other
  Smoker            30       20
  Not smoker        15       35

Data may arise as
Bernoulli/binomial data (2 outcomes)
Multinomial data (more than 2 outcomes)
Poisson data
[+ Negative binomial data – the version with range x = 0, 1, 2, …]

§2 POISSON PROCESS AND
ASSOCIATED DISTRIBUTIONS
2.1 Bernoulli trials and related distributions
Number of successes – binomial distribution
[Time before kth success – negative binomial distribution
Time to first success – geometric distribution]
Conditional distribution of success times

2.2 Poisson process and related distributions
[diagram: events occurring along a time axis]

Poisson process with rate λ
Number of events in a time interval of length t, Nt, has a Poisson distribution with mean λt:
P(Nt = n) = e^(−λt) (λt)^n / n! ,  n = 0, 1, 2, …

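The pmf above can be checked numerically in R against the built-in dpois; the rate λ and interval length t below are illustrative values, not from the notes:

```r
# check P(N_t = n) = exp(-lambda*t) * (lambda*t)^n / n!  against dpois
lambda <- 0.7; t <- 2               # illustrative rate and interval length
n <- 0:5
p.formula <- exp(-lambda * t) * (lambda * t)^n / factorial(n)
p.builtin <- dpois(n, lambda * t)   # built-in Poisson pmf with mean lambda*t
all.equal(p.formula, p.builtin)     # TRUE
```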
Poisson process with rate λ
Inter-event time, T, has an exponential distribution with parameter λ (mean 1/λ):
f(t) = λ e^(−λt) ,  t > 0

Conditional distribution of number of events
given n events in time (0, t): how many in time (0, s) (s < t)?
Answer
Ns | Nt = n ~ B(n, s/t)

Splitting into subprocesses
[diagram: events on a time axis, allocated between subprocesses]

[Figure: realisation of a Poisson process – number of events N (0–100) against time t (0–50)]

X ~ Pn(λ), Y ~ Pn(μ), X, Y independent
then we know X + Y ~ Pn(λ + μ)
Given X + Y = n, what is the distribution of X?
Answer
X | X + Y = n ~ B(n, p) where p = λ/(λ + μ)

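The binomial answer follows directly from the definition of conditional probability; a short derivation:

```latex
P(X = k \mid X + Y = n)
  = \frac{P(X = k)\,P(Y = n - k)}{P(X + Y = n)}
  = \frac{e^{-\lambda}\lambda^{k}/k!\;\cdot\;e^{-\mu}\mu^{n-k}/(n-k)!}
         {e^{-(\lambda+\mu)}(\lambda+\mu)^{n}/n!}
  = \binom{n}{k}\left(\frac{\lambda}{\lambda+\mu}\right)^{k}
                \left(\frac{\mu}{\lambda+\mu}\right)^{n-k},
  \qquad k = 0, 1, \ldots, n
```

i.e. B(n, p) with p = λ/(λ + μ), as stated.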
2.3 Inference for the Poisson distribution
Ni , i = 1, 2, …, r, i.i.d. Pn(λ), N = ΣNi
λ̂ = N/r
E[λ̂] = λ ,  s.e.(λ̂) = √(λ/r)
λ̂ ≈ N(λ, λ/r) approximately

CI for λ
λ̂ ± z √(λ̂/r)

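The point estimate and the crude CI take a couple of lines in R; the counts below are made up for illustration:

```r
# MLE and approximate 95% CI for a Poisson mean lambda
# (illustrative counts, not from the notes)
N.i <- c(3, 1, 4, 2, 0, 3, 2, 1, 2, 2)
r <- length(N.i)
lambda.hat <- sum(N.i) / r                # MLE: N / r  (here 2)
se <- sqrt(lambda.hat / r)                # estimated standard error
ci <- lambda.hat + c(-1, 1) * qnorm(0.975) * se
ci                                        # approximate 95% CI
```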
2.4 Dispersion and LR tests for Poisson data
Homogeneity hypothesis
H0: the Ni s are i.i.d. Pn(λ) (for some unknown λ)
Dispersion statistic
X² = Σ_{i=1}^{r} (Ni − M)² / M  ≈  χ²_{r−1}
(M = sample mean)

Likelihood ratio statistic
Y² = 2 Σ Ni log(Ni / M)  ≈  χ²_{r−1}
form for calculation – see p18 ◄◄

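Both statistics are one-liners in R; the counts are illustrative, and a zero count contributes 0 to the LR sum (the limit of x log x as x → 0):

```r
# dispersion (X^2) and likelihood-ratio (Y^2) statistics for
# H0: the N_i are i.i.d. Poisson; compare with chi-square on r-1 df
N.i <- c(3, 1, 4, 2, 0, 3, 2, 1, 2, 2)      # illustrative counts
r <- length(N.i); M <- mean(N.i)
X2 <- sum((N.i - M)^2 / M)                        # dispersion statistic
Y2 <- 2 * sum(N.i * log(N.i / M), na.rm = TRUE)   # LR statistic, 0*log 0 -> 0
pchisq(c(X2, Y2), df = r - 1, lower.tail = FALSE) # approximate P-values
```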
§3 SINGLE CLASSIFICATIONS
Binary classifications
(a) N1, N2 independent Poisson, with Ni ~ Pn(λi)
or
(b) fixed sample size, N1 + N2 = n, with N1 ~ B(n, p1)
where p1 = λ1/(λ1 + λ2)

Qualitative categories
(a) N1 , N2, … , Nr independent Poisson, with
Ni ~ Pn(λi)
or
(b) fixed sample size n, with joint multinomial
distribution Mn(n;p)
Testing goodness of fit
H0: pi = πi ,  i = 1, 2, …, r
X² = Σ_{i=1}^{r} (Ni − Mi)² / Mi
   = Σ_{all cells} (observed frequency − expected frequency)² / expected frequency
This is the (Pearson) chi-square statistic

The statistic often appears as
Σ (O − E)² / E  =  Σ (observed frequency − expected frequency)² / expected frequency

It is distributed (approximately) χ²_{r−1}
or χ²_{r−k−1} when k parameters have been estimated in order to fit the model and calculate the expected frequencies

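A sketch of the test in R, using the Example 1 eye-colour counts; the H0 of equal probabilities (πi = 1/4) is chosen here purely for illustration:

```r
# Pearson goodness-of-fit test: H0: p_i = 1/4 for the four eye colours
n.eyes <- c(89, 66, 60, 85)                  # Example 1 frequencies
p0 <- rep(1/4, 4)
m <- sum(n.eyes) * p0                        # expected frequencies: 75 each
X2 <- sum((n.eyes - m)^2 / m)                # Pearson chi-square statistic
pchisq(X2, df = length(n.eyes) - 1, lower.tail = FALSE)
# chisq.test(n.eyes, p = p0) reproduces the same statistic and P-value
```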
An alternative statistic is the LR statistic
Y² = 2 Σ_{i=1}^{r} Ni log(Ni / Mi)  ≈  χ²_{r−1}  or  χ²_{r−k−1}

Sparse data/small expected frequencies
ensure mi ≥ 1 for all cells, and mi ≥ 5 for at least about 80% of the cells
if not – combine adjacent cells sensibly

Goodness-of-fit tests for frequency distributions
– very well-known application of the
Σ_{all cells} (observed frequency − expected frequency)² / expected frequency
statistic (see Illustration 3.4 p 22/23)

Residuals (standardised)
ri = (Ni − nπi) / √(nπi(1 − πi)) = (Ni − Mi) / √(Mi(n − Mi)/n)  ≈  N(0, 1)
simpler version
ri = (Ni − Mi) / √Mi  ≈  N(0, 1)

MAJOR ILLUSTRATION 1
Publish and be modelled

  Number of papers per author:  1     2    3    4   5   6  7  8  9  10  11
  Number of authors:            1062  263  120  50  22  7  6  2  0  1   1

Model
P(X = x) = c θ^x / x! ,  x = 1, 2, 3, …

MAJOR ILLUSTRATION 2
Birds in hedges

  Hedge type i:          A     B     C     D     E     F     G
  Hedge length (m) li:   2320  2460  2455  2805  2335  2645  2099
  Number of pairs ni:    14    16    14    26    15    40    71

Model
Ni ~ Pn(θi li)

§4 TWO-WAY CLASSIFICATIONS
Example 14 Numbers of mice bearing tumours in treated and control groups

                Treated   Control   Total
  Tumours          4         5        9
  No tumours      12        74       86
  Total           16        79       95

Example 15 Patients in clinical trial

                    Drug   Placebo   Total
  Side-effects       15       4        19
  No side-effects    35      46        81
  Total              50      50       100

Patients in clinical trial – take 2

                    Drug   Placebo   Total
  Side-effects       15      15        30
  No side-effects    35      35        70
  Total              50      50       100

4.1 Factors and responses
F × R tables
R×F , R×R
(F × F ?)
Qualitative, ordered, quantitative
Analysis the same - interpretation may be
different
A two-way table is often called a “contingency table” (especially in the R × R case).

Notation (2 × 2 case, easily extended)

               Exposed   Not exposed   Total
  Disease        n11        n12         n1●
  No disease     n21        n22         n2●
  Total          n●1        n●2         n●● = n

Three possibilities
One overall sample, each subject
classified according to 2 attributes
- this is R × R
Retrospective study
Prospective study (use of treated and
control groups; drug and placebo etc)
4.2 Distribution theory and tests for r × s tables
(a) R × R case
(a1) Nij ~ Pn(λij) , independent
or, with fixed table total
(a2) Condition on n = ΣΣ nij :
N | n ~ Mn(n ; p)
where N = {Nij} , p = {pij}.

(b) F × R case
Condition on the observed marginal totals n•j = Σi nij for the s categories of F (≡ condition on n and n•1)
⇒ s independent multinomials

Usual hypotheses
(a1) Nij ~ Pn(λij) , independent
H0: variables/responses are independent
λij = λi• λ•j / λ••
(a2) Multinomial data (table total fixed)
H0: variables/responses are independent
P(row i and column j) = P(row i) P(column j)

(b) Condition on n and n•j (fixed column totals)
Nij ~ Bi(n•j , pij) j = 1,2, …, s ; independent
H0: response is homogeneous (pij = pi• for all j)
i.e. response has the same distribution for
all levels of the factor
Tests of H0
The χ² (Pearson) statistic:
Σ (Nij − mij)² / mij  ≈  χ²_{(r−1)(s−1)}
where mij = ni• n•j / n as before

OR: test based on the LR statistic Y²
Illustration: tonsils data – see p27
In R
Pearson/X²: read the data in using “matrix”, then use “chisq.test”
LR Y²: calculate it directly (or get it from the results of fitting a “log-linear model” – see later)

4.3 The 2 × 2 table
Statistical tests
(a) Using Pearson’s χ²

                    Drug   Placebo   Total
  Side-effects       15       4        19
  No side-effects    35      46        81
  Total              50      50       100

Σ (Nij − mij)² / mij  ≈  χ²_{1}
where mij = ni• n•j / n
i.e. mij = (row total × column total) / grand total

Yates (continuity) correction
Subtract 0.5 from |O – E| before squaring it
Performing the test in R
n.pat = matrix(c(15, 35, 4, 46), 2, 2)
chisq.test(n.pat)

(b) Using deviance/LR statistic Y2
(c) Comparing binomial probabilities
(d) Fisher’s exact test
                    Drug   Placebo    Total
  Side-effects       15      4 (= N)    19
  No side-effects    35     46          81
  Total              50     50         100

Under a random allocation
P(N = 4) = (50 choose 4)(50 choose 15) / (100 choose 19)
         = 50! 50! 19! 81! / (4! 46! 15! 35! 100!)
         = 0.0039
one-sided P-value = P(N ≤ 4) = 0.0047
Note: probability = (product of marginal factorials) / (n! × product of cell factorials)

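The same numbers come out of R’s hypergeometric functions, or directly from fisher.test:

```r
# Fisher's exact test for the clinical-trial table
# N = number of the 19 side-effect patients who received placebo
n.pat <- matrix(c(15, 35, 4, 46), 2, 2)
dhyper(4, m = 19, n = 81, k = 50)    # P(N = 4), about 0.0039
phyper(4, m = 19, n = 81, k = 50)    # one-sided P(N <= 4), about 0.0047
fisher.test(n.pat, alternative = "greater")$p.value  # same one-sided P-value
```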
4.4 Log odds, combining and collapsing tables, interactions
In the 2 × 2 table, the
H0 : independence
condition is equivalent to
λ11 λ22 = λ12 λ21
Let λ = log(λ11 λ22 / λ12 λ21)
Then we have H0: λ = 0
λ is the “log odds ratio”

The “λ = 0” hypothesis is often called the “no association” hypothesis.

The odds ratio is
λ11 λ22 / λ12 λ21
Sample equivalent is
n11 n22 / (n12 n21) = (n11/n21) / (n12/n22)
                    = [(n11/n●1) / (n21/n●1)] / [(n12/n●2) / (n22/n●2)]
= odds on for column 1 / odds on for column 2
= odds ratio (observed / sample version)

The odds ratio (or log odds ratio) provides a measure of association for the factors in the table.
no association ⇔ odds ratio = 1 ⇔ log odds ratio = 0

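For the Example 15 patients table the observed odds ratio is a quick R check (numbers taken from the table):

```r
# sample odds ratio for the drug/placebo table
#                 Drug  Placebo
# Side-effects     15       4
# No side-effects  35      46
or  <- (15 * 46) / (4 * 35)   # n11 n22 / (n12 n21), about 4.93
lor <- log(or)                # log odds ratio; 0 would indicate no association
c(or, lor)
```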
Don’t combine heterogeneous tables!

Interaction
An interaction exists between two factors when the effect of one factor is different at different levels of another factor.
[Figures: two plots of d.rate (0.000–0.012) against age (45–60)]

§5 INTRODUCTION TO GENERALISED LINEAR MODELS (GLMs)
Normal linear model
Y | x ~ N with
E[Y|x] = α + βx
or
E[Y|x] = β0 + β1x1 + β2x2 + … + βrxr = βᵀx
i.e. E[Y|x] = μ(x) = βᵀx

We are explaining μ(x) using a linear predictor (a linear function of the explanatory data)
Generalised linear model
Now we set g(μ(x)) = βᵀx for some function g
We explain g(μ(x)) using a linear function of the explanatory data, where g is called the link function

e.g. modelling a Poisson mean λ we use a log link g(λ) = log λ
We use a linear predictor to explain log λ rather than λ itself: the model is
Y | x ~ Pn with mean λx
with log λx = α + βx
or
log λx = βᵀx
This is a log-linear model

An example is a trend model in which we use
log λi = α + βi
Another example is a cyclic model in which we use
log λi = β0 + β1 cos θi + β2 sin θi

§6 MODELS FOR SINGLE CLASSIFICATIONS
6.1 Single classifications - trend models
Data: numbers in r categories
Model: Ni , i = 1, 2, …, r,
independent Pn(λi)
Basic case
H0: λi’s equal  v  H1: λi’s follow a trend
Let Xj be the category of observation j
P(Xj = i) = 1/r
Test based on X̄
see Illustration 6.1

A more general model
Ni independent Pn(λi) with
λi = e^(α + βi)
Log-linear model
log λi = α + βi

It is a linear regression model for log λi and a non-linear regression model for λi.
It is a generalised linear model.
Here the link between the parameter we are estimating and the linear predictor is the log function – it is a “log link”.

Fitting in R
Example 13: stressful events data
> n = c(15, 11, …, 1, 4)      # response vector
> r = length(n)
> i = 1:r                     # explanatory vector
> stress = glm(n ~ i, family = poisson)      # model

> summary(stress)
Call:
glm(formula = n ~ i, family = poisson)      ← model being fitted

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.9886  -0.9631   0.1737   0.5131   2.0362
← summary information on the residuals

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)   2.80316    0.14816  18.920  < 2e-16 ***
i            -0.08377    0.01680  -4.986 6.15e-07 ***
← information on the fitted parameters

Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 50.843 on 17 degrees of freedom
Residual deviance: 24.570 on 16 degrees of freedom      ← deviances (Y² statistics)
AIC: 95.825
Number of Fisher Scoring iterations: 4

Fitted mean is
λ̂i = exp(2.80316 − 0.08377 i)
e.g. for date 6, i = 6 and fitted mean is exp(2.30054) = 9.980

Fitted model
[Figure: log-linear trend model for stress data – number (0–15) against Date, observed counts with fitted curve]

Test of H0: no trend
⇒ the null fit, all fitted values equal (to the observed mean)
Y² = 50.84 (~ χ² on 17 df)
The trend model
⇒ fitted values exp(2.80316 − 0.08377 i)
Y² = 24.57 (~ χ² on 16 df)
Crude 95% CI for slope is −0.084 ± 2(0.0168), i.e. −0.084 ± 0.034

The lower the value of the residual
deviance, the better in general is
the fit of the model.
[Figure: basic residuals (−6 to 4) against i (1–15)]

6.2 Taking into account a deterministic denominator – using an “offset” for the “exposure”
See the Gompertz model example (p 40, data in Example 26)
Model: Nx ~ Pn(λx) where
E[Nx] = λx = Ex b θ^x
log λx = log Ex + c + dx

We include a term “offset(logE)” in the formula for the linear predictor: in R
model = glm(n.deaths ~ age + offset(log(exposure)), family = poisson)
Fitted value is the estimate of the expected response per unit of exposure (i.e. per unit of the offset E)

§7 LOGISTIC REGRESSION
• for modelling proportions
• we have a binary response for each item
and a quantitative explanatory variable
for example: dependence of the
proportion of insects killed in a chamber
on the concentration of a chemical
present – we want to predict the
proportion killed from the concentration
for example: dependence of the proportion of
• women who smoke – on age
• metal bars on test which fail – on pressure applied
• policies which give rise to claims – on sum insured
Model: # successes at value xi of explanatory variable: Ni ~ B(ni , πi)

We use a glm – we do not predict πi directly; we predict a function of πi called the logit of πi.
The logit function is given by:
logit(π) = log(π / (1 − π))
It is the “log odds”.
See Illustration 7.1 p 43:
[Figure: proportion v dose – observed proportions (about 0.2 to 1.0) against dose (3.8–4.8)]
[Figure: logit(proportion) v dose – roughly linear, from about −2 to 3 over dose 3.8–4.8]

This leads to the “logistic regression” model
log(πi / (1 − πi)) = a + b xi
[c.f. log linear model Ni ~ Poisson(λi) with log λi = a + b xi]

We are using a logit link
g(π) = log(π / (1 − π))
We use a linear predictor to explain log(π / (1 − π)) rather than π itself

The method based on the use
of this model is called
logistic regression
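A minimal sketch of the logit link and its inverse in R (the helper names are our own, not from the notes):

```r
# logit link and its inverse (the logistic function)
logit     <- function(p) log(p / (1 - p))
inv.logit <- function(x) 1 / (1 + exp(-x))
logit(0.5)               # 0: probability 1/2 means even odds
inv.logit(logit(0.9))    # recovers 0.9
# in a fitted model the probability is inv.logit(a + b * x)
```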
Data:

  explanatory       # successes   group   observed
  variable value                  size    proportion
  x1                n11           n1      n11/n1
  x2                n21           n2      n21/n2
  …….
  xs                ns1           ns      ns1/ns

In R we declare the proportion of successes as the response and include the group sizes as a set of weights
drug.mod1 = glm(propdead ~ dose, weights = groupsize, family = binomial)
explanatory vector is dose; note the family declaration

RHS of model can be extended if required to
include additional explanatory variables and
factors
e.g. mod3 = glm(mat3 ~ age+socialclass+gender)
drug.mod – see output p44
Coefficients very highly significant (***)
Null deviance 298 on 9df
Residual deviance 17.2 on 8df
But … residual v fitted plot
and … fitted v observed proportions plot
[Figure: Residuals vs Fitted for glm(formula = num.mat ~ dose, family = binomial)]
[Figure: drug.mod1$fit against observed proportions prop]
[Figure: drug.mod2$fit against observed proportions prop – model with a quadratic term (dose^2)]

§8 MODELS FOR TWO-WAY AND THREE-WAY CLASSIFICATIONS
8.1 Log-linear models for two-way classifications
Nij ~ Pn(λij) , i = 1, 2, …, r ; j = 1, 2, …, s
H0: variables are independent
λij = λi• λ•j / λ••

log λij = log λi• + log λ•j − log λ••
i.e. row effect + column effect − overall effect

We “explain” log λij in terms of additive effects:
log λij = μ + αi + βj
Fitted values are the expected frequencies
λ̂ij = exp(μ̂ + α̂i + β̂j)
Fitting process gives us the value of Y² = −2 log λ

Fitting a log-linear model
Nij ~ Pn(λij) , independent, with
log λij = μ + αi + βj
Declare the response vector (the cell frequencies) and the row/column codes as factors
then use > name = glm(…)

Tonsils data (Example 16)
n.tonsils = c(19, 497, 29, 560, 24, 269)
rc = factor(c(1, 2, 1, 2, 1, 2))
cc = factor(c(1, 1, 2, 2, 3, 3))
tonsils.mod1 = glm(n.tonsils ~ rc + cc, family = poisson)

Call:
glm(formula = n.tonsils ~ rc + cc, family = poisson)

Deviance Residuals:
       1         2         3         4         5         6
-1.54915   0.34153  -0.24416   0.05645   2.11018  -0.53736

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)   3.27998    0.12287  26.696  < 2e-16 ***
rc2           2.91326    0.12094  24.087  < 2e-16 ***
cc2           0.13232    0.06030   2.195   0.0282 *
cc3          -0.56593    0.07315  -7.737 1.02e-14 ***
---
Null deviance: 1487.217 on 5 degrees of freedom
Residual deviance: 7.321 on 2 degrees of freedom      ← Y² = −2 log λ

The fit of the “independent attributes”
model is not good
Patients data (Example 15)
> n.patients = c(15, 4, 35, 46)
> rc = factor(c(1, 1, 2, 2))
> cc = factor(c(1, 2, 1, 2))
> pat.mod1 = glm(n.patients ~ rc + cc, family = poisson)

Call:
glm(formula = n.patients ~ rc + cc, family = poisson)

Deviance Residuals:
      1        2        3        4
 1.6440  -2.0199  -0.8850   0.8457

Coefficients:
             Estimate Std. Error  z value Pr(>|z|)
(Intercept) 2.251e+00  2.502e-01    8.996  < 2e-16 ***
rc2         1.450e+00  2.549e-01    5.689 1.28e-08 ***
cc2         2.184e-10  2.000e-01 1.09e-09        1
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 49.6661 on 3 degrees of freedom
Residual deviance: 8.2812 on 1 degrees of freedom
AIC: 33.172

fitted coefficients: coef(pat.mod1)
 (Intercept)          rc2          cc2
2.251292e+00 1.450010e+00 2.183513e-10

fitted values: fitted(pat.mod1)
   1    2    3    4
 9.5  9.5 40.5 40.5

Estimates are
μ̂ = 2.251292 , α̂1 = 0 , α̂2 = 1.450010 , β̂1 = 0 , β̂2 ≈ 0
Predictors for cells 1,1 and 1,2 are 2.251292 :
λ̂1j = exp(2.251292) = 9.5
Predictors for cells 2,1 and 2,2 are 2.251292 + 1.450010 = 3.701302 :
λ̂2j = exp(3.701302) = 40.5

Residual deviance: 8.2812 on 1 degree of freedom
⇒ Y² for testing the model
i.e. for testing H0:
response is homogeneous / column distributions are the same / no association between response and treatment group
The lower the value of the residual deviance, the better in general is the fit of the model.
Here the fit of the additive model is very poor (we have of course already concluded that there is an association – P-value about 1%).

8.2 Two-way classifications – taking into account a deterministic denominator
See the grouse data (Illustration 8.3 p50, data in Example 25)
Model: Nij ~ Pn(λij) where
E[Nij] = λij = Eij exp(μ + αi + βj)
log E[Nij/Eij] = μ + αi + βj
i.e. log λij = log Eij + μ + αi + βj

We include a term “offset(logE)” in the formula
for the linear predictor
Fitted value is the estimate of the expected
response per unit of exposure (i.e. per unit of
the offset E)
8.3 Log-linear models for three-way classifications
Each subject classified according to 3 factors/variables with r, s, t levels respectively
Nijk ~ Pn(λijk) with
log λijk = μ + αi + βj + γk + (αβ)ij + (αγ)ik + (βγ)jk + (αβγ)ijk
r × s × t parameters

Recall “interaction”
Model with two factors and an interaction (no longer additive) is
log λij = μ + αi + βj + (αβ)ij

8.4 Hierarchic log-linear models
Interpretation!
Range of possible models/dependencies
From
1 Complete independence
model formula: A + B + C
link: log λijk = μ + αi + βj + γk
notation: [A][B][C]
df: rst – r – s – t + 2

…. through
2 One interaction (B and C say)
model formula: A + B*C
link: log λijk = μ + αi + βj + γk + (βγ)jk
notation: [A][BC]
df: rst – r – st + 1

…. to
5 All possible interactions
model formula: A*B*C
notation: [ABC]
df: 0
Model selection: by
backward elimination or
forward selection
through the hierarchy of models
containing all 3 variables
The hierarchy, from saturated down to independence:

  [ABC]                               saturated
  [AB][AC]    [AB][BC]    [AC][BC]
  [AB][C]     [AC][B]     [A][BC]
  [A][B][C]                           independence

Our models can include
mean (intercept)
+ factor effects
+ 2-way interactions
+ 3-way interaction
Illustration 8.4 Models for lizards data (Example 29)
liz = array(c(32, 86, 11, 35, 61, 73, 41, 70), dim = c(2, 2, 2))
n.liz = as.vector(liz)
s = factor(c(1, 1, 1, 1, 2, 2, 2, 2))    # species
d = factor(c(1, 1, 2, 2, 1, 1, 2, 2))    # diameter of perch
h = factor(c(1, 2, 1, 2, 1, 2, 1, 2))    # height of perch

Forward selection
liz.mod1 = glm(n.liz ~ s + d + h, family = poisson)      25.04 on 4 df
liz.mod2 = glm(n.liz ~ s*d + h, family = poisson) †      12.43 on 3 df
liz.mod3 = glm(n.liz ~ s + d*h, family = poisson)
liz.mod4 = glm(n.liz ~ s*h + d, family = poisson)
liz.mod5 = glm(n.liz ~ s*d + s*h, family = poisson) †     2.03 on 2 df
liz.mod6 = glm(n.liz ~ s*d + d*h, family = poisson)

> summary(liz.mod5)
Call:
glm(formula = n.liz ~ s * d + s * h, family = poisson)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   3.4320     0.1601  21.436  < 2e-16 ***
s2            0.5895     0.1970   2.992 0.002769 **
d2           -0.9420     0.1738  -5.420 5.97e-08 ***
h2            1.0346     0.1775   5.827 5.63e-09 ***
s2:d2         0.7537     0.2161   3.488 0.000486 ***
s2:h2        -0.6967     0.2198  -3.170 0.001526 **

Null deviance: 98.5830 on 7 degrees of freedom
Residual deviance: 2.0256 on 2 degrees of freedom

[Figure: fitted values liz.mod5$fit against observed counts n.liz (20–80)]
[Figure: residuals liz.mod5$res (−0.10 to 0.10) against fitted values liz.mod5$fit (20–80)]

FIN
MAJOR ILLUSTRATION 1

  Number of papers per author:  1     2    3    4   5   6  7  8  9  10  11
  Number of authors:            1062  263  120  50  22  7  6  2  0  1   1

Model
P(X = x) = c θ^x / x! ,  x = 1, 2, 3, …
[Figure: log-likelihood logl2 (4.5–5.5) against th (0.90–1.00)]
[Figure: bar chart of frequencies (0–1000) for 1 to 11+ papers per author]

MAJOR ILLUSTRATION 2

  Hedge type i:          A     B     C     D     E     F     G
  Hedge length (m) li:   2320  2460  2455  2805  2335  2645  2099
  Number of pairs ni:    14    16    14    26    15    40    71

Model
Ni ~ Pn(θi li)
[Figure: density (0–50) against hedge type (1–7), observed values marked ×]

Cyclic models
leukaemia data
[Figure: cases (30–60) by Month (J–D)]

Model
Ni independent Pn(λi) with
λi = λ0 exp(κ cos(θi − φ))
   = λ0 exp(a cos θi + b sin θi) ,  i = 1, …, r
   = exp(c + a cos θi + b sin θi)
Explanatory variable: the category/month i has been transformed into an angle θi

It is another example of a
non-linear regression model
for Poisson responses.
It is a
generalised linear model.
Fitting in R
> n = c(40, 34, …, 33, 38)     # response vector
> r = length(n)
> i = 1:r
> th = 2*pi*i/r                # explanatory vector
> leuk = glm(n ~ cos(th) + sin(th), family = poisson)   # model

Fitted mean is
λ̂i = exp(3.73069 + 0.17177 cos θi + 0.11982 sin θi)

Fitted model
[Figure: cyclic model for leukaemia data – cases (30–60) by Month (J–D), with fitted curve]

F73DB3 CDA Data from class

                 Male   Female
  Cinema often    22      21
  Not often       20      12

                 Male   Female   Total
  Cinema often    22      21       43
  Not often       20      12       32
  Total           42      33       75

P(often | male) = 22/42 = 0.524
P(often | female) = 21/33 = 0.636
significant difference (on these numbers)?
is there an association between gender and cinema attendance?

Null hypothesis H0: no association between gender and cinema attendance
Alternative: not H0
Under H0 we expect 42 × 43/75 = 24.08 in cell 1,1 etc.

> matcinema = matrix(c(22, 20, 21, 12), 2, 2)
> chisq.test(matcinema)
Pearson's Chi-squared test with Yates' continuity correction
data: matcinema
X-squared = 0.5522, df = 1, p-value = 0.4574
> chisq.test(matcinema)$expected
      [,1]  [,2]
[1,] 24.08 18.92
[2,] 17.92 14.08
null hypothesis can stand – no association between gender and cinema attendance

more students, same proportions

                 Male   Female   Total
  Cinema often   110     105      215
  Not often      100      60      160
  Total          210     165      375

P(often | male) = 110/210 = 0.524
P(often | female) = 105/165 = 0.636
significant difference (on these numbers)?

> matcinema2 = matrix(c(110, 100, 105, 60), 2, 2)
> chisq.test(matcinema2)
Pearson's Chi-squared test with Yates' continuity correction
data: matcinema2
X-squared = 4.3361, df = 1, p-value = 0.03731
> chisq.test(matcinema2)$expected
      [,1] [,2]
[1,] 120.4 94.6
[2,]  89.6 70.4
null hypothesis is rejected – there IS an association between gender and cinema attendance

FIN