Download LinearModellingII - Wellcome Trust Centre for Human Genetics

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Regression analysis wikipedia , lookup

Interaction (statistics) wikipedia , lookup

Linear regression wikipedia , lookup

Expectation–maximization algorithm wikipedia , lookup

Data assimilation wikipedia , lookup

Choice modelling wikipedia , lookup

Coefficient of determination wikipedia , lookup

Transcript
Linear Modelling II
Richard Mott
Wellcome Trust Centre for Human
Genetics
Recap
• So far, we have learnt about
–
–
–
–
Correlation
Linear Regression
One-Way Analysis of Variance
Non-parametric alternatives
• In this lecture we will cover
–
–
–
–
–
–
Relationship between Least Squares and Max Likelihood
Fitting complex linear models
Testing the fits of models
Comparing Models
Checking Goodness of Fit
Prediction
The Likelihood
•
•
Likelihood = “probability of the data given the model”
Basis of parametric statistical inference
L(q )   Pr(y i | q )
i
•
•
Different hypotheses can often be expressed in terms of different values of the
parameters q
Hypothesis testing equivalent to comparing the likelihoods for different q

The Likelihood Ratio Test
•
General Framework for constructing hypothesis tests:
S = Likelihood( data | H1) / Likelihood( data | H0)
Reject H0 if S > s(H0, H1)
Threshold s is chosen such that there is a probability a of making a false rejection under H0.
a
is the size of the test (or false positive rate or Type I error) e.g. a=0.05

is the power of the test, the probability of correctly rejecting H0 when H1 is true e.g. =0.8
1-  is the type II error, the false negative rate
Generally, for fixed sample size n, if we fix a then we can’t fix 
If we fix a and then we must let n vary.
The Neyman Pearson Lemma states that the likelihood ratio test is the most powerful of all tests
of a given size
Type III error: "correctly rejecting the null hypothesis for the wrong reason".
Maximum Likelihood Principle
• Often a hypothesis does not completely specify all the parameters
in the model
–
–
–
–
–
–
eg comparing mean of a sample to 0
don’t specify the variance in the model
H0 : data are distributed as N(0,s2)
H1 : data are distributed as N(m, s2)
assume Gaussian (Normal) errors
Have to estimate the free parameters s,m to perform test
• Maximum Likelihood:
– first choose estimates for any free parameters to maximise the
likelihood
– Use the maximised likelihood in the LR test
– ML estimates of parameters have good properties in large samples
Asymptotic Distribution of loglikelihood ratio statistics
• H0 is a submodel of H1
• parameters q
• are fixed qo under H0
• unknown q1 under H1, estimated by max likelihood
• max likelihood estimates under H1 are qˆ
– ie  log L(q )

0
q
at
ˆ
q  q



Taylor expansion of log-likelihood ratio

ˆ
ˆ
log L(q) log L(q 0 )  (q q 0 ) log L(q 0 )
q
1 ˆ

ˆ  q )  ...
 (q q 0 )' 2 log L(q 0 )(q
0
2
q
2

2
1

ˆ ) log L(q )  (q
ˆ  q )'
ˆ q )
log L(q
log
L(
q
)(
q
0
0
0
0
2
q 2
a quadratic form with rank k (= difference in number of free parameters)

Asymptotic distribution of 2 * log-likelihood ratio is

2
k
Example: A Simple Contingency Table
• Objects are classified into K classes
– Probability of class i is pi
• Ho: Probability of each class is the same = 1/K
• H1: Probability of classes are different
• Data: Ni observed in class i
Contingency Table
Likelihood-ratio test
p
Ni
i
L
N i!
Likelihood
i
N!
log( L)  N i log pi C
log-Likelihood
i
log( L | H 0 )  N i log(1/K)  C
log-Likelihood under Ho
i
max(log( L | H1 )  N i log( N i /N)
maximised log-Likelihood under H1
i
2(max(log( L | H1) log( L | H 0 ))  2 N i log( N iK /N)  2 Oi log( Oi / E i )
i
i
2*log-Likelihood ratio
(Oi  Ei )2 / Ei
i
Compare to chi-square test, both distributed as

2
K 1
Least Squares and Maximum Likelihood
Density function of Normal distribution
Likelihood for linear regression model
Collect terms, converting to a sum
Take logarithms
Maximising the log Likelihood with respect to a, b is
identical to minimising the residual sum of squares (RSS)
Least Squares = Maximum Likelihood for Normal Data
MLE’s for a, b are the same as LS estimates.
The MLE for s2 RSS/n (slightly biased estimate)
Residual Mean Square = RSS/(n-1)
The general linear model
• A linear model with multiple explanatory
variables
– e.g. in the Biochemistry data set, HDL (y) may depend on
Tot.Cholesterol (x) and GENDER (z)
yi = a + b1xi + b2 zi + ei
– In general:
yi = b0 + b1x1i + b2 x2i + ….. + bp xpi + ei
b0 is the intercept (grand mean)
E(yi)= b0 + b1x1i + b2 x2i + ….. + bp xpi
ei ~ N(0,s2)
linear predictor
errors are iid Normal
Example: One-Way ANOVA
In original formulation, data are grouped in a table
Rewrite the data as a single vector of numbers y1 …. yn
and define p indicator variables, one for each group k=1..p:
In new formulation, data are a vector but there are p+1 variables
The Design Matrix
Linear Models are best understood using matrix notation.
The Design Matrix, X captures all the information about the explanatory variables
n
p+1
rows
columns
Then, using matrix multiplication, the linear model is written as
(subjects)
(variables)
General Least Squares Solution
The least squares estimators in the general linear model minimise
There is a “solution”: if the square matrix
is invertible then
is the transpose of
If the matrix is not invertible then the model is over-parameterised,
ie some of the explanatory variables can be constructed from combinations
of the other variables. Then it is necessary to reduce the design matrix by
removing columns until
is invertible.
This process is done automatically in R, but can result in some parameters
being thrown out. A factor with q levels will only require q-1 parameters.
Equivalent Models: re-parameterisation
One area of difficulty is that two apparently different models can produce
identical fits to the data, even though the parameter estimates may look
completely different
Mathematically, this can happen if the design matrix for one model can be
derived from the other by forming appropriate linear combinations of the columns
This is called re-parameterisation:
One example is the reduction in columns sometimes necessary to make
invertible
Another example: fitting a factor with just two levels (eg sex) is equivalent to
coding sex numerically by the number of X chromosomes (ie 1 or 2).
The parameter estimates will appear different but the ANOVA and fitted values
will be identical.
Another example: fitting a factor with q levels usually requires q-1 parameters because
one level can be constructed from the others plus the overall mean.
However, one can omit the mean, in which case the number of parameters for the factor increases to q
Model Formulae in R
If X is an R data frame with columns named “y”, “A”, “B” “C” etc,
then we can form model formulae
y
y
y
y
~
~
~
~
A
A + B
A + B + C
B + C
etc
If a column is numeric then adding a term for the column will mean
an additional parameter must be estimated
If the column is a factor with q levels then the column is implicitly expanded
as q-1 indicator variables, and q-1 additional parameters must be estimated
Warning: if an factor column is accidentally interpreted as numeric then
the model fit will almost certainly be wrong !
Model Formulae in R
By default, the overall mean is included in an R model formula.
It can be included explicitly:
y ~ 1
y ~ 1 + A
(just fit a model with the overall mean, the null model)
(fit the overall mean plus A)
or excluded:
y ~ A -1
In the latter case:
if A is numeric, this corresponds to a linear regression
with intercept=0, ie the line of fit is forced to pass through the origin (0,0)
if A is a factor, this model is a reparameterisation of y ~ A. the fit is the same
but the parameter estimates look different.
Additive Models
Consider a model with two factors
A (Sex: male vs female) and
B (Treatment: treated vs untreated):
y~A+B
This means that the effects of sex and treatment act additively and independently
of each other, for example
being female might add +1 unit to the average response compared to males,
regardless of treatment,
being in the treatment group might add +2 units, regardless of sex.
Overall, two parameters are needed for the model
male
female
no treatment
0+0=0
1+0=1
treatment
0+3=3
1+3=4
Interactions
Now consider a model with two factors
A (Sex: male vs female) and
B (Treatment: treated vs untreated):
y~A*B
This means that the effects of sex and treatment interact with each other, for example
being female with treatment adds +2 units to the average response
being female with no treatment adds 1 unit
being male with treatment adds +4 units
being male with no treatment adds 0 units
Now three parameters are required to model the data
male
female
no treatment
0
1
treatment
4
2
Comparing Nested Linear Models
the additive model
y ~ A + B
f.additive <- lm(y ~ A + B )
is a submodel of (nested in) the interaction model
y ~ A * B
f.interaction <- lm( y ~ A * B )
This means that the additive model is a special case of the interaction model,
corresponding to setting certain parameters to be 0
R will let us compare nested models, to see if there is evidence for an interaction:
a <- anova(f.additive, f.interaction)
Comparing Nested Models:
The Partial F-test
A model M1 is nested within M2 (M1 is a submodel of M2) if
M1 corresponds to forcing some of the parameters in M2 to 0
[ More generally, the n x p1 design matrix X1 for M1 satisfies
X1 = A X2
where X2 is the n x p2 design matrix for M2 and A is a p2 x p1 matrix ]
The partial F test is used to compare the models:
1. Fit the models M1, M2 with fitting SS FSS1 <= FSS2
2. If M1 is true then the additional p2-p1 parameters in the model contribute
nothing of substance, and the extra fitting SS = FSS2 – FSS1 is “stolen” from
RSS1 and should be distributed as a residual sum of squares with p2-p1 df
3. Therefore Fpartial = [(FSS2 – FSS1)/(p2-p1)]/[ RSS2/(n-p2) ] should be distributed
like an F(p2-p1, n-p2 ) random variable.
4. The partial F test is an equivalent of the likelihood-ratio test; it is exact if the
errors are Normally distributed.
Example (simulated data)
> df <- read.table(“int.txt”, h=TRUE)
> f.add.add <- lm( y.add ~ sex + treat )
> anova(f.add.add)
Analysis of Variance Table
# simulated to only have additive effects
Response: y.add
Df Sum Sq Mean Sq F value
Pr(>F)
sex
1 27.043 27.043
9.5686 0.002586 **
treat
1 312.262 312.262 110.4870 < 2.2e-16 ***
Residuals 97 274.144
2.826
> f.add.int <- lm( y.add ~ sex * treat )
> anova(f.add.int)
Analysis of Variance Table
Response: y.add
Df Sum Sq Mean Sq F value
Pr(>F)
sex
1 27.043 27.043
9.5124 0.002666 **
treat
1 312.262 312.262 109.8378 < 2.2e-16 ***
sex:treat 1
1.223
1.223
0.4300 0.513538
Residuals 96 272.921
2.843
> anova(f.add.add, f.add.int)
Analysis of Variance Table
Model 1:
Model 2:
Res.Df
1
97
2
96
>
>
y.add ~ sex + treat
y.add ~ sex * treat
RSS Df Sum of Sq
F Pr(>F)
274.144
272.921 1
1.223 0.43 0.5135
# fitting an interaction
# interaction not significant
# comparison of models
Example (simulated data)
> df <- read.table(“int.txt”, h=TRUE)
> f.int.int <- lm( y.int ~ sex * treat )
> anova(f.int.int)
Analysis of Variance Table
Response: y.int
Df Sum Sq Mean Sq F value
Pr(>F)
sex
1 169.34 169.34 46.783 7.376e-10 ***
treat
1 54.74
54.74 15.121 0.0001857 ***
sex:treat 1 252.93 252.93 69.876 4.909e-13 ***
Residuals 96 347.49
3.62
f.add.int <- lm( y.add ~ sex * treat )
> anova(f.add.int)
Analysis of Variance Table
# simulated to have an interaction
# main effect for sex
# main effect for treatment
# interaction for sex and treatment
# simulated to only have additive effects
Response: y.add
Df Sum Sq Mean Sq F value
Pr(>F)
sex
1 27.043 27.043
9.5124 0.002666 **
treat
1 312.262 312.262 109.8378 < 2.2e-16 ***
sex:treat 1
1.223
1.223
0.4300 0.513538
Residuals 96 272.921
2.843
> anova(f.int.add,f.int.int)
Analysis of Variance Table
Model 1:
Model 2:
Res.Df
1
97
2
96
y.int ~ sex + treat
y.int ~ sex * treat
RSS Df Sum of Sq
F
Pr(>F)
600.42
347.49 1
252.93 69.876 4.909e-13 ***
# comparison of additive and interaction
Digression: Experimental Design
• The two factor model is a simple example of an experimental
design,
– the objective is to distribute the individuals across different factor
combinations so that we can estimate effects efficiently at minimum
cost
– In this example, equal numbers were assigned to each factor
combination, a balanced orthogonal design
– Balance = equal numbers of replicates in each factor combination
– Orthogonal = estimates of a factor’s main effects are unaffected by
adding other factors or interactions (in the example, the columns
representing sex and treatment are uncorrelated)
• Experimental design is a vast subject –
– Optimal design depends on what is wanted
– Have to distinguish between factors we can control and random
factors.
Mixing Factors and Numeric Variables in R
Consider the R models
y ~ A + X
y ~ A * X
(1)
(2)
where A is a factor (with p levels) and X is numeric.
What do they represent?
(1) will fit a series of p parallel linear regression lines
(2) will fit a series of p linear regression lines, with no
constraint on being parallel
Model (1) is nested inside (2) so they can be compared
with a partial F-test
anova(fit1,fit2) tests the hypothesis of parallel lines
Interpreting Model Comparison:
multicolinarity
•
Suppose we fit the models (X1, X2 are numeric)
Y ~ X1
Y ~ X2
Y ~ X1 + X2
(1)
(2)
(3)
•
•
Often we find
FSS(Y ~ X1 + X2) < FSS(Y ~ X1) + FSS(Y ~ X2)
•
•
The discrepancy is because in general X1, X2 are correlated with each other: only when cor(X1, X2) = 0 (we say they
are orthogonal does
FSS(Y ~ X1 + X2) = FSS(Y ~ X1) + FSS(Y ~ X2)
•
•
And in extreme case when |cor(X1, X2)| = 1
FSS(Y ~ X1 + X2) = FSS(Y ~ X1) = FSS(Y ~ X2)
•
This makes the interpretation of model comparisons difficult: in general there will be a part of the FSS which could
be explained by either X1, X2 and it is impossible to determine which is correct.
•
Multicolinearity becomes a serious problem in models with many terms
•
Another way of thinking about this problem is that it is good to design an experiment such that the X’s are
orthogonal, although this is not possible if the X’s are observed quantities.
Model Diagnostic Plots
Observed vs Residuals
In this example, negative residuals
are associated with small values of y
indicating poor model fit
Comparing Non-nested models
•
•
•
•
•
There is no equivalent to the partial F test when comparing non-nested models.
The main issue is how to compensate for the fact that a model with more
parameters will tend to fit the data better (in the sense of explaining more
variance) than a model with fewer parameters
But a model with many parameters tends to be a poorer predictor of new
observations, because of over-fitting
However, several criteria have been proposed for model comparison
AIC (Akaike Information Criterion)
where L = the likelihood for the data, p= # parameters
For Linear Models,
where n is the number of observations and L is the maximized value of the likelihood function
for the estimated model.
Among models with the same number of observations n, choose the model which minimises
the simplified AIC
• BIC (Bayesian Information Criterion)
• The BIC penalizes p more strongly than the AIC
- ie it prefers models with fewer paramters
• In R:
f <- lm( formula, data)
aic <- AIC(f)
bic <- AIC(f, k = log(nrow(data)))
Building Models Automatically
•
Example:
–
–
–
–
–
We wish to find a set of SNPs that jointly explain variation in a quantitative trait
There will be (usually) far more SNPs s than data points n
There are a vast number 2s of possible models
No model can contain more than n-1 parameters (model saturation)
Forward Selection:
•
•
–
Start with a small model and augment the model step by step so long as the improvement in fit
satisfies a criterion (eg AIC or BIC, or partial F test)
At each step, add the variable which maximises the improvement in fit
Backwards Elimination:
•
•
–
Start with a very large model and subtract terms from the model step by step
At each step, delete the variable that minimises the decrease in fit
Forward-Backward
•
•
–
At each step, either add or delete a term depending on which optimises a criterion
in R, stepAIC
Model Averaging
•
Rather than try to find a single model, integrate over many plausible models
–
–
Bootstrapping
Bayesian Model Averaging
Random Effects
Mixed Models
• So far our models have had fixed effects
– each parameter can take any value independently of the
others
– each parameter estimate uses up a degree of freedom
– models with large numbers of parameters have problems
•
•
•
•
saturation - overfitting
poor predictive properties
numerically unstable and difficult to fit
in some cases we can treat parameters as being sampled from a
distribution – random effects
• only estimate the parameters needed to specify that distribution
• can result in a more robust model
Example of a Mixed Model
• Testing genetic association across a large number
of of families
–
–
–
–
–
yi = trait value of i’th individual
bi = genotype of individual at SNP of interest
fi = family identifier (factor)
y=m+b+f+e
If we treat family effects f as fixed then we must
estimate a large number of parameters
– Better to think of these effects as having a distribution
N(0,t2) for some variance t
– genotype parameters b treated as fixed effects
Some Essential R tips
Comparing Models containing Missing Data
– R will silently omit rows of data containing missing elements, and adjust the df
accordingly
– R will only compare models using anova() if the models have been fitted to
identical observations.
– Sporadic missing values in the explanatory variables can cause problems,
because models may have a different numbers of complete cases
– Solution is to use the R function complete.cases() to identify those rows with
complete data in the most general model, and specify these rows for
modelling:
cc <- complete.cases(data.frame)
f <-lm( formula, data, subset=cc)
Joining two data frames on a common column
– Frequently data to be analysed are in several data frames, with a column of unique subject ids
used to match the rows.
– Unfortunately, often the subjects differ slightly between data frames, eg frame1 might contain
phenotype data and frame2 contain genotypes. Usually these two sets don’t agree perfectly
because there will be subjects with genotypes but no phenotypes and vice versa
– Solution 1: Make a combined data frame containing the intersection of the subjects
intersect <- frame1$id[match(frame2$id, frame1$id, nomatch = 0)]
combined <- cbind( frame1[match(intersect,frame1$id),cols1],
frame2[match(intersect,frame2$id),cols2])
– Solution 2: Make a combined data frame containing the union of the subjects
union <- unique(sort(c(as.character(frame1$id), as.character(frame2$id)))
combined <- cbind(
frame1[match(union,frame1$id),cols1],frame2[match(union,frame2$id),cols2])
– cols1 is a list of the column names to include in frame1,
– cols2 is a list of the column names to include in frame2