Modelling heterogeneity in
multi-gene data sets
Klaus Schliep, Barbara Holland,
Mike Hendy, David Penny
Allan Wilson Centre
Palmerston North, NZ
Motivation
• Phylogenomic data sets may involve hundreds of
genes for many species.
• These data sets create challenges for current
phylogenetic methods, as different genes have
different functions and hence evolve under
different processes.
• One question is how best to model this
heterogeneity to give reliable phylogenetic
estimates of the species tree.
Example
Rokas et al. (2003) produced 106 gene trees for 8 yeast taxa
S. cerevisiae
S. paradoxus
S. mikatae
S. kudriavzevii
S. bayanus
S. kluyveri
S. castellii
C. albicans
Two extremes
• How many parameters do we need to
adequately represent the branches of all
(unrooted) gene trees?
Between
13 (the 2 x 8 - 3 = 13 branches of a single consensus tree)
&
13 x 106 = 1378 (separate branch lengths for every gene tree)
(a quick check is sketched below)
• Too few parameters introduce bias
• Too many parameters increase the variance
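A quick check of these two extremes, as a minimal sketch in Python (the 2n - 3 branch count for an unrooted binary tree and the figures quoted above are the only inputs; the function name is illustrative):

```python
# Sanity check of the two extremes quoted above.
def unrooted_branch_count(n_taxa):
    # An unrooted binary tree on n taxa has 2n - 3 branches.
    return 2 * n_taxa - 3

n_taxa, n_genes = 8, 106
shared = unrooted_branch_count(n_taxa)   # 13: one consensus tree for all genes
separate = shared * n_genes              # 1378: each gene tree gets its own branch lengths
print(shared, separate)                  # -> 13 1378
```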
Stochastic partitioning
• Attempts to cluster genes into classes that
have evolved in a similar fashion.
• Each class is allowed its own set of
parameters (e.g. branch lengths or model
of nucleotide substitution)
Algorithm overview
1. Randomly assign the n genes to k classes.
2. Optimise the parameters for each class.
3. Compute the posterior probability of each gene under the parameters of each class.
4. Move each gene into the class for which it has the highest posterior probability.
5. Return to step 2; STOP when no genes change class.
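A minimal sketch of this loop in Python, assuming two placeholder callbacks that are not named in the talk: fit_class (optimises a class's parameters from its member genes) and log_score (log posterior probability of a gene under a class's parameters). It illustrates the steps above rather than the authors' implementation.

```python
import random

def stochastic_partition(genes, k, fit_class, log_score, max_iter=100):
    """Cluster genes into k classes that appear to have evolved similarly."""
    # 1. Randomly assign the n genes to k classes.
    assignment = [random.randrange(k) for _ in genes]
    params = None

    for _ in range(max_iter):
        # 2. Optimise parameters for each class from its current members
        #    (fit_class is assumed to handle an empty class, e.g. by re-seeding it).
        params = [fit_class([g for g, c in zip(genes, assignment) if c == j])
                  for j in range(k)]

        # 3./4. Score every gene under every class and move it to the class
        #       for which it has the highest posterior probability.
        new_assignment = [max(range(k), key=lambda j: log_score(g, params[j]))
                          for g in genes]

        # 5. Stop once no gene changes class; otherwise repeat from step 2.
        if new_assignment == assignment:
            break
        assignment = new_assignment

    return assignment, params
```

This is a hard, classification-EM style assignment; the number of classes k is still an open choice, which is what the next slides turn to.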
How many classes?
Gene ontology
A different approach
• Allow each gene to have its own set of
parameters
• BUT penalise models where the
parameters are too different from each
other.
Penalized (log-)likelihood
pl(\theta, x) = \sum_{i=1}^{106} l(\theta_i, x_i) - \frac{\lambda}{2}\, g(\theta)

g(\theta) = \sum_{i < j} (\theta_i - \theta_j)^{T} K (\theta_i - \theta_j)

where \theta_i are the parameters for the i-th gene tree,
K is a symmetric matrix, and \lambda is the penalty term.
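A minimal sketch of this objective in Python, assuming the per-gene log-likelihoods are available as callables and each gene tree's parameters form a row of an array; all names are illustrative, and the pairwise quadratic form for g(θ) follows the reconstruction above.

```python
import numpy as np

def penalized_log_likelihood(thetas, log_liks, K, lam):
    """pl(theta, x) = sum_i l(theta_i, x_i) - (lam / 2) * g(theta)."""
    # thetas   : (n_genes, p) array, one parameter vector per gene tree
    # log_liks : list of callables, log_liks[i](theta_i) = l(theta_i, x_i)
    # K        : (p, p) symmetric matrix weighting parameter differences
    # lam      : penalty weight lambda
    fit = sum(l_i(theta_i) for l_i, theta_i in zip(log_liks, thetas))

    # g(theta) = sum over pairs i < j of (theta_i - theta_j)^T K (theta_i - theta_j)
    g = 0.0
    n = len(thetas)
    for i in range(n):
        for j in range(i + 1, n):
            d = thetas[i] - thetas[j]
            g += d @ K @ d

    return fit - 0.5 * lam * g
```

Large λ pulls the per-gene parameters towards each other (approaching the single consensus-tree extreme), while λ = 0 leaves every gene with its own unconstrained estimate.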
Number of parameters
• Hastie and Tibshirani (1990) give an
approximation for the number of degrees
of freedom for a penalized likelihood
estimator:

df = \operatorname{tr}\left((H + \lambda K)^{-1} H\right), \qquad
H = \frac{\partial^{2} l(\theta, x)}{\partial \theta\, \partial \theta^{T}}
• This allows us to choose the best λ value
using AIC or BIC.
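A minimal sketch of this selection step in Python; here H is taken as the observed information (negative Hessian of the log-likelihood over the full stacked parameter vector) and K as the corresponding penalty matrix, and plugging the effective df into AIC/BIC this way is one common convention rather than anything specified in the talk.

```python
import numpy as np

def effective_df(H, K, lam):
    """Hastie & Tibshirani (1990): df = tr((H + lam * K)^{-1} H)."""
    return np.trace(np.linalg.solve(H + lam * K, H))

def aic(log_lik, H, K, lam):
    """AIC with the effective df in place of the raw parameter count."""
    return -2.0 * log_lik + 2.0 * effective_df(H, K, lam)

def bic(log_lik, H, K, lam, n_obs):
    """BIC variant; n_obs is the number of observations (e.g. alignment columns)."""
    return -2.0 * log_lik + np.log(n_obs) * effective_df(H, K, lam)

# To pick lambda: refit the penalized model over a grid of lambda values and
# keep the one with the lowest AIC (or BIC).
```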
Summary
• Tame statisticians are useful too!