Statistical modelling
Gil McVean, Department of Statistics
Tuesday 24th Jan 2012
Questions to ask…
• How can we use models in statistical inference?
• Where do commonly-used models come from?
• What is a likelihood?
• How can we use likelihood to estimate parameters, measure certainty and test hypotheses?
• How do we add covariates to simple models?
• How do you construct and use linear models?
• What's so general about generalised linear modelling?
2
What is a random variable?
• A random variable is a number associated with the outcome of a stochastic process
  – Waiting time for the next bus
  – Average number of hours of sunshine in May
  – Age of the current prime minister
• In statistics, we want to take observations of random variables and use them to make statements about the underlying stochastic process
  – Did this vaccine have any effect?
  – Which genes contribute to disease susceptibility?
  – Will it rain tomorrow?
• Parametric models provide much power in the analysis of variation (parameter estimation, hypothesis testing, model choice, prediction)
  – Statistical models of the random variables
  – Models of the underlying stochastic process
3
What is a distribution?
• A distribution characterises the probability (mass) associated with each possible outcome of a stochastic process
• Distributions of discrete data are characterised by probability mass functions (pmf):
  P(X = x),   with   \sum_x P(X = x) = 1
• Distributions of continuous data are characterised by probability density functions (pdf):
  f(x),   with   \int_{-\infty}^{\infty} f(x)\,dx = 1
[Figure: example pmf over x = 0, 1, 2, 3 and example pdf]
• For RVs that map to the integers or the real numbers, the cumulative distribution function (cdf) is a useful alternative representation
4
Expectations and variances
• Suppose we took a large sample from a particular distribution; we might want to summarise something about what observations look like 'on average' and how much variability there is
• The expectation of a distribution is the average value of a random variable over a large number of samples:
  E(X) = \sum_x x\,P(X = x)   or   E(X) = \int x f(x)\,dx
• The variance of a distribution is the average squared difference between randomly sampled observations and the expected value:
  Var(X) = \sum_x (x - E(X))^2\,P(X = x)   or   Var(X) = \int (x - E(X))^2 f(x)\,dx
5
iid
• In most cases, we assume that the random variables we observe are independent and identically distributed
• The iid assumption allows us to make all sorts of statements both about what we expect to see and how much variation to expect
• Suppose X, Y and Z are iid random variables and a and b are constants (see the R sketch below):
  E(X + Y + Z) = E(X) + E(Y) + E(Z) = 3E(X)
  Var(X + Y + Z) = Var(X) + Var(Y) + Var(Z) = 3Var(X)
  E(aX + b) = aE(X) + b
  Var(aX + b) = a^2 Var(X)
  Var\left(\tfrac{1}{n}\sum_i X_i\right) = \tfrac{1}{n}Var(X)
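A minimal R sketch (not part of the original slides) checking these rules by simulation; the Exp(1) distribution and the constants a = 2, b = 5 are arbitrary choices.

# Check the iid rules by simulation (any distribution would do)
set.seed(1)
n.rep <- 1e5
X <- rexp(n.rep, rate = 1); Y <- rexp(n.rep, rate = 1); Z <- rexp(n.rep, rate = 1)

mean(X + Y + Z)        # ~ 3 * E(X) = 3
var(X + Y + Z)         # ~ 3 * Var(X) = 3
mean(2 * X + 5)        # ~ 2 * E(X) + 5 = 7
var(2 * X + 5)         # ~ 2^2 * Var(X) = 4
var((X + Y + Z) / 3)   # ~ Var(X) / 3, i.e. Var of a mean of n iid RVs is Var(X)/n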
6
Where do ‘commonly-used’ distributions come from?
• At the core of much statistical theory and methodology lie a series of key distributions (e.g. Normal, Poisson, Exponential, etc.)
• These distributions are closely related to each other and can be 'derived' as the limit of simple stochastic processes when the random variable can be counted or measured
• In many settings, more complex distributions are constructed from these 'simple' distributions
  – Ratios: E.g. Beta, Cauchy
  – Compound: E.g. Geometric, Beta
  – Mixture models
7
The simplest model
• Bernoulli trials
  – Outcomes that can take only two values, 1 and 0, with probabilities q and 1 - q respectively. E.g. coin flipping, indicator functions
• The likelihood function calculates the probability of the data:
  P(\mathbf{X} | q) = \prod_i P(X = x_i | q) = q^{k}(1 - q)^{n - k}
• What is the probability of observing the sequence (if q = 0.5)
  – 01001101100111101001?
  – 11111111111000000000?
• Are they both equally probable? (See the R calculation below.)
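A quick R calculation (not in the original slides): with q = 0.5 every particular sequence of 20 outcomes has the same probability, 0.5^20, so the two sequences are equally probable.

q <- 0.5
s1 <- as.integer(strsplit("01001101100111101001", "")[[1]])
s2 <- as.integer(strsplit("11111111111000000000", "")[[1]])

# Probability of each exact sequence: product over trials of q^x * (1-q)^(1-x)
prod(q^s1 * (1 - q)^(1 - s1))   # 0.5^20, about 9.5e-07
prod(q^s2 * (1 - q)^(1 - s2))   # identical: the order does not matter when q = 0.5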
8
The binomial distribution
• Often, we don't care about the exact order in which successes occurred. We might therefore want to ask about the probability of k successes in n trials. This is given by the binomial distribution
• For example, the probability of exactly 3 heads in 4 coin tosses =
  – P(HHHT) + P(HHTH) + P(HTHH) + P(THHH)
  – Each order has the same Bernoulli probability = (1/2)^4
  – There are 4 choose 3 = 4 orders
• Generally, if the probability of success is q, the probability of k successes in n trials is
  P(k | n, q) = \binom{n}{k} q^{k}(1 - q)^{n - k}
[Figure: binomial pmf for n = 10, q = 0.2]
• The expected number of successes is nq and the variance is nq(1 - q) (see the R check below)
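A short R check of the worked example and of the moments; the values n = 4, q = 0.5 and n = 10, q = 0.2 are those used on the slide.

# Probability of exactly 3 heads in 4 tosses
choose(4, 3) * 0.5^3 * (1 - 0.5)^(4 - 3)   # 0.25
dbinom(3, size = 4, prob = 0.5)            # same value, via the built-in pmf

# Mean and variance of Bin(10, 0.2): nq = 2, nq(1 - q) = 1.6
k <- 0:10
sum(k * dbinom(k, 10, 0.2))                # 2
sum((k - 2)^2 * dbinom(k, 10, 0.2))        # 1.6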
9
The geometric distribution
• Bernoulli trials have a memory-less property
  – The probability of success (X = 1) next time is independent of the number of successes in the preceding trials
• The number of trials between subsequent successes follows a geometric distribution
  – The probability that the first success occurs at the kth trial is
    P(k | q) = q(1 - q)^{k - 1}
• You can expect to wait an average of 1/q trials for a success, but the variance is
  Var(k) = \frac{1 - q}{q^{2}}
[Figure: geometric distributions for q = 0.5 and q = 0.05]
10
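A small simulation sketch of the geometric waiting time, using the slide's q = 0.05. Note that R's rgeom() counts the failures before the first success, so 1 is added to get the trial number of the first success.

q <- 0.05
trials <- rgeom(1e5, prob = q) + 1   # trial on which the first success occurs
mean(trials)                         # ~ 1/q = 20
var(trials)                          # ~ (1 - q)/q^2 = 380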
The Poisson distribution
• The Poisson distribution is often used to model 'rare events'
• It can be derived in two ways
  – The limit of the binomial distribution as q → 0 and n → ∞ (with nq = m fixed)
  – The number of events observed in a given time for a Poisson process (more later)
• It is parameterised by the expected number of events, m
  – The probability of k events is
    P(k | m) = \frac{m^{k} e^{-m}}{k!}
[Figure: red = Poisson(5), blue = Bin(100, 0.05); see the R comparison below]
• The expected number of events is m, and the variance is also m
• For large m, the Poisson is well approximated by the normal distribution
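A brief R comparison of Bin(100, 0.05) with Poisson(5), matching the figure legend above (not part of the original slides).

k <- 0:15
round(cbind(k,
            binomial = dbinom(k, size = 100, prob = 0.05),
            poisson  = dpois(k, lambda = 5)), 4)
# The two columns agree closely; the approximation improves as n grows with nq held fixed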
11
Going continuous
• In many situations, while the outcome space of random variables may really be discrete (or at least measurably discrete), it is convenient to allow the random variables to be continuously distributed
• For example, the distribution of height in mm is actually discrete, but is well approximated by a continuous distribution (e.g. normal)
• Commonly-used continuous distributions arise as the limit of discrete processes
12
The Poisson process
• Consider a process in which, in every unit of time, some event might occur
• E.g. every generation there is some chance of a gene mutating (with probability of approx 1 in 100,000)
• The probability of exactly one change in a sufficiently small interval h ≡ 1/n is P = vh ≡ v/n, where P is the probability of one change and n is the number of trials
• The probability of two or more changes in a sufficiently small interval h is essentially 0
• In the limit of the number of trials becoming large, the total number of events (e.g. mutations) follows the Poisson distribution
[Figure: a timeline divided into small intervals of width h]
13
The exponential distribution
• In the Poisson process, the time between successive events follows an exponential distribution
  – This is the continuous analogue of the geometric distribution
  – It is memory-less, i.e. f(x + t | X > t) = f(x)
  f(x | \lambda) = \lambda e^{-\lambda x},   E(X) = 1/\lambda,   Var(X) = 1/\lambda^2
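A brief simulation of the mean, variance and memory-less property of the exponential; the rate 2 and threshold t = 1 are arbitrary illustrative choices.

lambda <- 2
x <- rexp(1e6, rate = lambda)
mean(x)              # ~ 1/lambda = 0.5
var(x)               # ~ 1/lambda^2 = 0.25

# Memory-less: the excess waiting time beyond t has the same mean as the original
t <- 1
mean(x[x > t] - t)   # ~ 1/lambda again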
14
The gamma distribution
• The gamma distribution arises naturally as the distribution of a sum of iid exponential random variables:
  X_i ~ Exp(\beta),   S = X_1 + X_2 + \cdots + X_n   =>   S ~ Gamma(n, \beta)
  f(x | \alpha, \beta) = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\, x^{\alpha - 1} e^{-\beta x}
[Figure: gamma densities for \alpha = \beta = 0.5, \alpha = \beta = 2 and \alpha = \beta = 10]
• The gamma distribution has expectation \alpha/\beta and variance \alpha/\beta^2
• More generally, \alpha need not be an integer (for example, the chi-square distribution with one degree of freedom is a Gamma(1/2, 1/2) distribution)
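A simulation sketch (not from the slides) checking that a sum of iid exponentials behaves like a gamma; n = 5 and rate b = 2 are arbitrary.

n <- 5; b <- 2
S <- replicate(1e4, sum(rexp(n, rate = b)))      # sums of n iid Exp(b) variables
mean(S); var(S)                                  # ~ n/b = 2.5 and n/b^2 = 1.25
# Simulated quantiles against Gamma(shape = n, rate = b) quantiles
quantile(S, c(0.1, 0.5, 0.9))
qgamma(c(0.1, 0.5, 0.9), shape = n, rate = b)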
15
The beta distribution
• The beta distribution models random variables that take values in [0, 1]
• It arises naturally as the proportional ratio of two gamma-distributed random variables:
  X ~ Gamma(\alpha_1, \theta),   Y ~ Gamma(\alpha_2, \theta)   =>   \frac{X}{X + Y} ~ Beta(\alpha_1, \alpha_2)
  f(x | \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\, x^{\alpha - 1}(1 - x)^{\beta - 1}
[Figure: beta densities for \alpha = \beta = 0.5, \alpha = \beta = 2 and \alpha = \beta = 10]
• The expectation is \alpha/(\alpha + \beta)
• In Bayesian statistics, the beta distribution is the natural (conjugate) prior for binomial proportions (beta-binomial)
  – The Dirichlet distribution generalises the beta to more than 2 proportions
16
The normal distribution
• The normal distribution is related to most distributions through the central limit theorem
• The normal distribution naturally describes variation of characters influenced by a large number of processes (height, weight) or the distribution of large numbers of events (e.g. the limit of a binomial with large np, or a Poisson with large m)
  f(x; \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)
[Figure: blue = Poisson(100), red = N(100, 10)]
17
The exponential family of distributions
• Many of the distributions covered (e.g. normal, binomial, Poisson, gamma) belong to the exponential family of probability distributions
• A k-parameter member of the family has a density or frequency function of the form
  f(x; \theta) = \exp\left[\sum_{i=1}^{k} c_i(\theta) T_i(x) + d(\theta) + S(x)\right]
• E.g. the Bernoulli distribution (x = 0 or 1) is
  P(X = x) = q^{x}(1 - q)^{1 - x} = \exp\left[x \ln\left(\frac{q}{1 - q}\right) + \ln(1 - q)\right]
• Such distributions have the useful property that simple functions of the data, T(x), contain all the information about the model parameters
  – E.g. in the Bernoulli case, T(x) = x
18
Estimating parameters from data
19
Parameter estimation
• We can formulate most questions in statistics in terms of making statements about underlying parameters
• We want to devise a framework for estimating those parameters and making statements about our certainty
• There are several different approaches to making such statements
  – Moment estimators
  – Likelihood
  – Bayesian estimation
20
Moment estimation
• Moment methods are the simplest way of estimating parameters
• In such techniques, parameter values are found that match sample moments (mean, variance, etc.) to those expected
• E.g. for random variables X_1, X_2, \ldots, X_n sampled from a N(\mu, \sigma^2) distribution (see the R sketch below):
  \bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i,    s^2 = \frac{1}{n}\sum_{i=1}^{n} X_i^2 - \bar{X}^2
  E(\bar{X}) = \mu,    E(s^2) = \frac{n-1}{n}\sigma^2
  \hat{\mu} = \bar{X},    \hat{\sigma}^2 = \frac{n}{n-1}\, s^2
21
Problems with moment estimation
• It is not always possible to exactly match sample moments with their expectations
• It is not clear, when using moment methods, how much of the information in the data about the parameters is being used
  – Often not much
• Why should MSE be the best way of measuring the value of an estimator?
22
Is there an optimal way to estimate parameters?
• For any model, the maximum information about model parameters is obtained by considering the likelihood function
• The likelihood function is proportional to the probability of observing the data given a specified parameter value
• One natural choice for point estimation of parameters is the maximum likelihood estimate: the parameter values that maximise the probability of observing the data
• The maximum likelihood estimate (mle) has some useful properties (though it is not always optimal in every sense)
23
An intuitive view on likelihood
[Figure: normal densities for (\mu = 2, \sigma^2 = 1), (\mu = 0, \sigma^2 = 1) and (\mu = 0, \sigma^2 = 4)]
24
An example
• Suppose we have data generated from a Poisson distribution, and we want to estimate the parameter of the distribution
• The probability of observing a particular random variable is
  P(X; \mu) = \frac{e^{-\mu}\mu^{X}}{X!}
• If we have observed a series of iid Poisson RVs, we obtain the joint likelihood by multiplying the individual probabilities together (see the R sketch below):
  P(X_1, X_2, \ldots, X_n; \mu) = \frac{e^{-\mu}\mu^{X_1}}{X_1!} \times \frac{e^{-\mu}\mu^{X_2}}{X_2!} \times \cdots \times \frac{e^{-\mu}\mu^{X_n}}{X_n!}
  L(\mu; \mathbf{X}) \propto \prod_i e^{-\mu}\mu^{X_i} = e^{-n\mu}\mu^{n\bar{X}}
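A minimal R sketch of this likelihood: the log-likelihood (constants dropped) is maximised numerically and agrees with the sample mean derived analytically on the next slide. The data vector here is invented for illustration.

x <- c(3, 7, 4, 6, 5)                                        # illustrative iid Poisson observations
loglik <- function(mu) -length(x) * mu + sum(x) * log(mu)    # -n*mu + n*xbar*log(mu)

opt <- optimize(loglik, interval = c(0.01, 50), maximum = TRUE)
opt$maximum   # numerical MLE
mean(x)       # analytical MLE: the sample mean, 5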
25
Comments
• Note in the likelihood function the factorials have disappeared. This is because they provide a constant that does not influence the relative likelihood of different values of the parameter
• It is usual to work with the log likelihood rather than the likelihood. Note that maximising the log likelihood is equivalent to maximising the likelihood
• We can find the mle of the parameter analytically:
  L(\mu; \mathbf{X}) = e^{-n\mu}\mu^{n\bar{X}}
  \ell(\mu; \mathbf{X}) = -n\mu + n\bar{X}\log\mu        (take the natural log of the likelihood function)
  \frac{d\ell}{d\mu} = -n + \frac{n\bar{X}}{\mu} = 0     (find where the derivative of the log likelihood is zero)
  \hat{\mu} = \bar{X}                                    (note that here the mle is the same as the moment estimator)
26
Sufficient statistics
• In this example we could write the likelihood as a function of a simple summary of the data: the mean
• This is an example of a sufficient statistic. These are statistics that contain all the information about the parameter(s) under the specified model
• For example, suppose we have a series of iid normal RVs:
  P(X_1, X_2, \ldots, X_n; \mu, \sigma^2) = \prod_i \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(X_i - \mu)^2 / 2\sigma^2}
  L(\mu, \sigma^2; \mathbf{X}) \propto \sigma^{-n} \exp\left(-\frac{1}{2\sigma^2}\sum_i (X_i - \mu)^2\right)
  \ell(\mu, \sigma^2; \mathbf{X}) = -n\log\sigma - \frac{n}{2\sigma^2}\left(\overline{X^2} - 2\mu\bar{X} + \mu^2\right)
  (so the mean square \overline{X^2} and the mean \bar{X} are sufficient)
27
Properties of the maximum likelihood estimate
• The maximum likelihood estimate can be found either analytically or by numerical maximisation
• The mle is consistent, in that it converges to the truth as the sample size gets infinitely large
• The mle is asymptotically efficient, in that it achieves the minimum possible variance (the Cramér-Rao lower bound) as n → ∞
• However, the mle is often biased for finite sample sizes
  – For example, the mle for the variance parameter of a normal distribution is the sample variance s^2 = \frac{1}{n}\sum_i (X_i - \bar{X})^2, which has expectation \frac{n-1}{n}\sigma^2
28
Comparing parameter estimates
• Obtaining a point estimate of a parameter is just one problem in statistical inference
• We might also like to ask how good different parameter values are
• One way of comparing parameters is through relative likelihood
• For example, suppose we observe counts of 12, 22, 14 and 8 from a Poisson process
• The maximum likelihood estimate is 14. The relative likelihood is given by (see the R sketch below)
  \frac{L(\mu; \mathbf{X})}{L(\hat{\mu}; \mathbf{X})} = e^{-n(\mu - \hat{\mu})}\left(\frac{\mu}{\hat{\mu}}\right)^{n\bar{X}}
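A short R sketch computing this relative likelihood for the counts above (12, 22, 14, 8).

x <- c(12, 22, 14, 8)
n <- length(x); mu.hat <- mean(x)    # MLE = 14

rel.lik <- function(mu) exp(-n * (mu - mu.hat)) * (mu / mu.hat)^(n * mu.hat)
mu <- seq(8, 22, by = 0.1)
plot(mu, rel.lik(mu), type = "l", ylab = "relative likelihood")
rel.lik(14)   # equals 1 at the MLE by construction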
29
Using relative likelihood
• The relative likelihood and log-likelihood surfaces are shown below
[Figure: relative likelihood and log-likelihood surfaces]
30
Interval estimation
• In most cases the chance that the point estimate you obtain for a parameter is actually the correct one is zero
• We can generalise the idea of point estimation to interval estimation
• Here, rather than estimating a single value of a parameter, we estimate a region of parameter space
  – We make the inference that the parameter of interest lies within the defined region
• The coverage of an interval estimator is the fraction of times the parameter actually lies within the interval
• The idea of interval estimation is intimately linked to the notion of confidence intervals
31
Example
• Suppose I'm interested in estimating the mean of a normal distribution with known variance of 1 from a sample of 10 observations
• I construct an interval estimator
  \hat{\theta} = [\bar{X} - a, \bar{X} + a]
• The chart below shows how the coverage properties of this estimator vary with a
  – If I choose a to be 0.62 I would have coverage of 95% (checked by simulation below)
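A simulation sketch of this coverage claim, using n = 10, sigma = 1 and a = 0.62; the true mean of 0 is an arbitrary choice, since coverage does not depend on it.

set.seed(3)
a <- 0.62; n <- 10; mu <- 0
covered <- replicate(1e4, {
  xbar <- mean(rnorm(n, mean = mu, sd = 1))
  (xbar - a <= mu) && (mu <= xbar + a)
})
mean(covered)   # ~ 0.95; note that 1.96 / sqrt(10) = 0.62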
32
Confidence intervals
• It is a short step from here to the notion of confidence intervals
• We find an interval estimator of the parameter that, for any value of the parameter that might be possible, has the desired coverage properties
• We then apply this interval estimator to our observed data to get a confidence interval
• We can guarantee that among repeat performances of the same experiment the true value of the parameter would be in this interval 95% of the time
• We cannot say "There is a 95% chance of the true parameter being in this interval"
33
Example – confidence intervals for normal distribution
• Creating confidence intervals for the mean of a normal distribution is relatively easy because the coverage properties of interval estimators do not depend on the mean (for a fixed variance)
• For example, the interval estimator below has 95% coverage for any mean:
  \hat{\mu} \in [\bar{X} - a, \bar{X} + a],   where   a = 1.96\,\frac{\sigma}{\sqrt{n}}
• There is an intimate link between confidence intervals and hypothesis testing
34
Confidence intervals and likelihood
• Thanks to the central limit theorem there is another useful result that allows us to define confidence intervals from the log-likelihood surface
• Specifically, the set of parameter values for which the log-likelihood is no more than 1.92 less than the maximum log-likelihood defines a 95% confidence interval
  – In the limit of large sample size the LRT is approximately chi-squared distributed under the null
• This is a very useful result, but shouldn't be assumed to hold
  – i.e. check with simulation (see the R sketch below)
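A sketch of this construction for the earlier Poisson example (counts 12, 22, 14 and 8): take all values of mu whose log-likelihood lies within 1.92 units of the maximum.

x <- c(12, 22, 14, 8); n <- length(x); mu.hat <- mean(x)
loglik <- function(mu) -n * mu + sum(x) * log(mu)

mu <- seq(5, 30, by = 0.01)
inside <- loglik(mu) >= loglik(mu.hat) - 1.92
range(mu[inside])   # approximate 95% confidence interval for mu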
35
Making simple models complex: Adding covariates
36
What is a covariate?
• A covariate is a quantity that may influence the outcome of interest
  – Genotype at a SNP
  – Age of mice when a measurement was taken
  – Batch of chips from which gene expression was measured
• Previously, we have looked at using likelihood to estimate parameters of underlying distributions
• We want to generalise this idea to ask how covariates might influence the underlying parameters
• Much statistical modelling is concerned with considering linear effects of covariates on underlying parameters
37
What is a linear model?
• In a linear model, the expectation of the response variable is defined as a linear combination of explanatory variables:
  Y_i = \beta_0 + \beta_1 X_i + \beta_2 Z_i + \beta_3 X_i Z_i + \ldots + \varepsilon
  (response variable; intercept; linear relationships with explanatory variables; interaction term; Gaussian error)
• Explanatory variables can include any function of the original data, e.g.
  Y_i = \beta_0 + \beta_1 X_i^2 + \beta_2 e^{-Z_i} + \beta_3 X_i^{Z_i} + \ldots + \varepsilon
• But the link between E(Y) and X (or some function of X) is ALWAYS linear and the error is ALWAYS Gaussian
38
What is a GLM?
• There are many settings where the error is non-Gaussian and/or the link between E(Y) and X is not necessarily linear
  – Discrete data (e.g. counts in multinomial or Poisson experiments)
  – Categorical data (e.g. disease status)
  – Highly-skewed data (e.g. income, ratios)
• Generalised linear models keep the notion of linearity, but enable the use of non-Gaussian error models:
  E(Y_i) = \mu_i = g^{-1}(\beta_0 + \beta_1 X_i + \beta_2 Z_i + \beta_3 X_i Z_i + \ldots)
• g is called the link function
  – In linear models, the link function is the identity
• The response variable can be drawn from any distribution of interest (the distribution function)
  – In linear models this is Gaussian
39
Poisson regression
• In Poisson regression, the expected value of the response variable is given by the exponential of the linear term:
  E(Y_i) = \mu_i = \exp(\beta_0 + \beta_1 X_i + \beta_2 Z_i + \beta_3 X_i Z_i + \ldots)
• The link function is the log
• Note that several distribution functions are possible (normal, Poisson, binomial counts), though in practice Poisson regression is typically used to model count data (particularly when counts are low)
40
Example: Caesarean sections in public and private hospitals
41
Boxplots of rates of C sections
42
Fitting a model without covariates
> analysis <- glm(d$Caes ~ d$Births, family = "poisson")
> summary(analysis)

Call:
glm(formula = d$Caes ~ d$Births, family = "poisson")

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-2.81481  -0.73305  -0.08718   0.74444   2.19103

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept)  2.132e+00  1.018e-01  20.949  < 2e-16 ***
d$Births     4.406e-04  5.395e-05   8.165 3.21e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 99.990  on 19  degrees of freedom
Residual deviance: 36.415  on 18  degrees of freedom
AIC: 127.18

Number of Fisher Scoring iterations: 4

(Annotation on the slide: implies an average of 13 Caesarean sections per 1000 births.)
43
Fitting a model with covariates
glm(formula = d$Caes ~ d$Births + d$Hospital, family = "poisson")

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.3270  -0.6121  -0.0899   0.5398   1.6626

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept)  1.351e+00  2.501e-01   5.402 6.58e-08 ***
d$Births     3.261e-04  6.032e-05   5.406 6.45e-08 ***
d$Hospital   1.045e+00  2.729e-01   3.830 0.000128 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

• Unexpectedly, this indicates that public hospitals actually have a higher rate of Caesarean sections than private ones
  – Implies an average of 15.2 per 1000 births in public hospitals and 5.4 per 1000 births in private ones
44
Checking model fit
• Look at the distribution of residuals and at how well observed values are predicted
45
What’s going on?
• The initial summary suggested the opposite result to the GLM analysis. Why?

                 Normal   Caesarean
  Private           412          16
  Public          21121         285

  Relative risk, Private (compared to Public) = 2.8

• The relationship between the number of births and the number of C sections does not appear to be linear
• Accounting for this removes most of the apparent differences between hospital types
• There is also one quite influential outlier
46
Simpson’s paradox
• Be careful about adding together observations – this can be misleading
• E.g. the Berkeley sex-bias case

             Applicants   % admitted
  Men            8442         44%        (appears that women have lower success)
  Women          4321         35%

                   Men                      Women
  Major    Applicants  % admitted   Applicants  % admitted
  A             825        62%           108        82%
  B             560        63%            25        68%
  C             325        37%           593        34%
  D             417        33%           375        35%
  E             191        28%           393        24%
  F             272         6%           341         7%

  But women are actually typically more successful at the departmental level; they just apply to more competitive subjects
47
48
Finding MLEs in GLM
• In linear modelling we can use the beautiful compactness of linear algebra to find MLEs and estimates of the variance for parameters
• Consider an n by k+1 data matrix, X, where n is the number of observations and k is the number of explanatory variables, and a response vector Y
  – the first column is '1' for the intercept term
• The MLEs for the coefficients (\beta) can be estimated using (see the R sketch below)
  \hat{\beta} = (X^T X)^{-1} X^T Y,   with   \hat{\beta} \sim N(\beta, (X^T X)^{-1}\sigma^2)
• In GLMs, there is usually no such compact analytical expression for the MLEs
  – Use numerical methods to maximise the likelihood
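A minimal R illustration of the linear-algebra expression above on simulated data; the data and true coefficients are invented for illustration.

set.seed(4)
n <- 100
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n)     # true beta = (1, 2, -0.5)

X <- cbind(1, x1, x2)                      # first column of 1s for the intercept
beta.hat <- solve(t(X) %*% X) %*% t(X) %*% y
cbind(beta.hat, coef(lm(y ~ x1 + x2)))     # the two sets of estimates agree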
49
Testing hypotheses in GLMs
• For the parameters we are interested in we typically want to ask how much evidence there is that these are different from zero
• For this we need to construct confidence intervals for the parameter estimates
• We could estimate the confidence interval by finding all parameter values with log-likelihood no more than 1.92 units worse than the MLE (2-unit support interval)
• Alternatively, we might appeal to the central limit theorem (CLT) and use bootstrap techniques to estimate the variance of parameter estimates
• However, we can also appeal to theoretical considerations of likelihood (again based on the CLT) that show that parameter estimates are asymptotically normal, with variance described by the Fisher information matrix
• Informally, the information matrix describes the sharpness of the likelihood curve around the MLE and the extent to which parameter estimates are correlated
50
Logistic regression
• When only two types of outcome are possible (e.g. disease/not-disease) we can model counts by the binomial
• If we want to perform inference about the factors that influence the probability of 'success', it is usual to use the logistic model:
  E(Y_i) = \frac{\exp(\beta_0 + \beta_1 X_i + \beta_2 Z_i + \ldots)}{1 + \exp(\beta_0 + \beta_1 X_i + \beta_2 Z_i + \ldots)}
• The link function here is the logit:
  g(\mu) = \log\left(\frac{\mu}{1 - \mu}\right)
51
Example: testing for genotype association
• In a cohort study, we observe the number of individuals in a population that get a particular disease
• We want to ask whether a particular genotype is associated with increased risk
• The simplest test is one in which we consider a single coefficient for the genotypic value:

  Genotype                    AA     Aa     aa
  Genotypic value              0      1      2
  Frequency in population      2     21     12
  Probability of disease      p0     p1     p2

  p_i = \frac{1}{1 + e^{-[\beta_0 + \beta_1 G_i]}}

[Figure: disease probability against genotypic value for \beta_0 = -4, \beta_1 = 2; see the R calculation below]
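A quick R calculation of the disease probabilities implied by the slide's values \beta_0 = -4 and \beta_1 = 2.

b0 <- -4; b1 <- 2
G <- 0:2
p <- 1 / (1 + exp(-(b0 + b1 * G)))
round(p, 3)   # 0.018, 0.119, 0.500 for genotypic values 0, 1, 2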
52
A note on the model
• Note that each copy of the risk allele contributes in an additive way to the exponent
• This does not mean that each allele 'adds' a fixed amount to the probability of disease
• Rather, each allele contributes a fixed amount to the log-odds:
  \log\text{odds} = \log\left(\frac{P(\text{disease})}{P(\text{not disease})}\right) = \beta_0 + \beta_1 G_i
• This has the effect of maintaining Hardy-Weinberg equilibrium within both the cases and controls
53
Cont.
• Suppose in a given study we observe the following counts:

  Genotype                    0      1     2
  Counts with disease        26     39    21
  Counts without disease   1298    567    49

• We can fit a GLM using the logit link function and binomial probabilities
• We have genotype data stored in the vector gt and disease status in the vector status (one possible layout is sketched below)
• Using R, this is specified by the command
  – glm(formula = status ~ gt, family = binomial)
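A hedged sketch of how data behind this command might be laid out: gt and status are expanded from the count table above (the vector names follow the slide; the construction itself is illustrative, not the authors' code).

# Expand the table: genotypes 0/1/2, first the cases, then the controls
gt     <- rep(c(0, 1, 2, 0, 1, 2), times = c(26, 39, 21, 1298, 567, 49))
status <- rep(c(1, 1, 1, 0, 0, 0), times = c(26, 39, 21, 1298, 567, 49))

fit <- glm(status ~ gt, family = binomial)
summary(fit)   # coefficients close to those shown on the next slide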
54
Interpreting results
Call:
glm(formula = status ~ gt, family = binomial)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-0.8554  -0.4806  -0.2583  -0.2583   2.6141

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -4.6667     0.2652 -17.598   <2e-16 ***
gt            1.2833     0.1407   9.123   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 967.36  on 1999  degrees of freedom
Residual deviance: 886.28  on 1998  degrees of freedom
AIC: 890.28

Number of Fisher Scoring iterations: 6

Annotations from the slide:
– Deviance residuals: measure of the contribution of individual observations to overall goodness of fit (for the MLE model)
– Estimate: MLE for the coefficients; Std. Error: standard error of the estimates
– z value: estimate / std. error; Pr(>|z|): P-value from the normal distribution
– Null deviance: measure of goodness of fit of the null model (compared to the saturated model)
– Residual deviance: measure of goodness of fit of the fitted model
– AIC: penalised likelihood used in model choice
– Number of Fisher Scoring iterations: number of iterations used to find the MLE
55
Adding in extra covariates
• Adding in additional explanatory variables in a GLM is essentially the same as in linear model analysis
• Likewise, we can look at interactions
• In the disease study we might want to consider age as a potentially important covariate

glm(formula = status ~ gt + age, family = binomial)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -7.00510    0.79209  -8.844   <2e-16 ***
gt           1.46729    0.17257   8.503   <2e-16 ***
age          0.04157    0.01903   2.185   0.0289 *
56
Adding model complexity
• In the disease status analysis we might want to generalise the fitted model to one in which each genotype is allowed its own risk:

  p_i = \frac{1}{1 + e^{-[\beta_0 + \beta_1 I(G_i = 1) + \beta_2 I(G_i = 2)]}}

glm(formula = status ~ g1 + g2 + age, family = binomial)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.46870    0.73155  -7.476 7.69e-14 ***
g1TRUE       1.25694    0.26278   4.783 1.72e-06 ***
g2TRUE       3.00816    0.33871   8.881  < 2e-16 ***
age          0.04224    0.01915   2.206   0.0274 *
---
    Null deviance: 684.44  on 1999  degrees of freedom
Residual deviance: 609.09  on 1996  degrees of freedom
AIC: 617.09
57
Model choice
• It is worth remembering that we cannot simply identify the 'significant' parameters and put them in our chosen model
• Significance for a parameter tests the marginal null hypothesis that the coefficient associated with that parameter is zero
• If two explanatory variables are highly correlated then marginally neither may be significant, yet a linear model containing only one of them may show it as highly significant
• There are several ways to choose appropriate models from data. These typically involve adding parameters one at a time and applying some penalty to avoid over-fitting (see the sketch below)
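One common version of this, sketched below with R's step() on the disease example (assuming vectors status, gt and age as in the earlier slides): terms are added one at a time and models are compared by AIC, which penalises the number of parameters. This is an illustration of penalised stepwise selection, not a recommendation from the slides.

null <- glm(status ~ 1, family = binomial)
full <- glm(status ~ gt + age, family = binomial)

# Forward selection from the null model towards the full model, guided by AIC
step(null, scope = formula(full), direction = "forward")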
58