Modelling heterogeneity in multi-gene data sets

Klaus Schliep, Barbara Holland, Mike Hendy, David Penny
Allan Wilson Centre, Palmerston North, NZ

Motivation
• Phylogenomic data sets may involve hundreds of genes for many species.
• These data sets create challenges for current phylogenetic methods, as different genes have different functions and hence evolve under different processes.
• One question is how best to model this heterogeneity to give reliable phylogenetic estimates of the species tree.

Example
Rokas et al. (2003) produced 106 gene trees for 8 yeast taxa:
S. cerevisiae, S. paradoxus, S. mikatae, S. kudriavzevii, S. bayanus, S. kluyveri, S. castellii, C. albicans

Two extremes
• How many parameters do we need to adequately represent the branches of all (unrooted) gene trees? Between 13 (a single consensus tree; an unrooted tree on 8 taxa has 2 × 8 − 3 = 13 branches) and 13 × 106 = 1378 (a separate set of branch lengths for every gene).
• Too few parameters introduce bias.
• Too many parameters increase the variance.

Stochastic partitioning
• Attempts to cluster genes into classes that have evolved in a similar fashion.
• Each class is allowed its own set of parameters (e.g. branch lengths or model of nucleotide substitution).

Algorithm overview
1. Randomly assign the n genes to k classes.
2. Optimise the parameters for each class.
3. Compute the posterior probability for each gene with the parameters from each class.
4. Move each gene into the class for which it has the highest posterior probability.
5. Go to step 2; when no genes change class, STOP.

How many classes?

Gene ontology

A different approach
• Allow each gene to have its own set of parameters.
• BUT penalise models where the parameters are too different from each other.

Penalized (log-)likelihood
$$pl(\theta, x) = \sum_{i=1}^{106} l(\theta_i, x_i) - \tfrac{1}{2}\, g(\theta)$$
$$g(\theta) = \lambda \sum_{i<j} \lVert \theta_i - \theta_j \rVert^2 = \lambda\, \theta^T K \theta$$
where θ_i are the parameters for the i-th gene tree, K is a symmetric matrix, and λ is the penalty term that controls how strongly differences between genes are penalised.
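
The five steps under "Algorithm overview" can be sketched in a few lines of Python. This is a minimal sketch and not the authors' implementation: it assumes each gene is summarised by a numeric vector and that each class is a unit-variance Gaussian, so that "optimising the parameters" reduces to taking a class mean. A real analysis would replace this with the phylogenetic likelihood of each gene alignment. The function name `stochastic_partition` and the policy for re-seeding an empty class are hypothetical.

```python
import numpy as np

def stochastic_partition(gene_vectors, k, rng, max_iter=100):
    """Hard-assignment clustering loop following the five steps.

    ASSUMPTION: each gene is a numeric vector and each class is a spherical
    Gaussian with unit variance (a stand-in for a phylogenetic likelihood).
    """
    n, dim = gene_vectors.shape
    # Step 1: randomly assign the n genes to k classes.
    labels = rng.integers(0, k, size=n)
    means = np.empty((k, dim))
    for _ in range(max_iter):
        # Step 2: optimise the parameters of each class (here: the class mean).
        for c in range(k):
            members = gene_vectors[labels == c]
            if len(members) > 0:
                means[c] = members.mean(axis=0)
            else:
                # Assumed policy: re-seed an empty class with a random gene.
                means[c] = gene_vectors[rng.integers(n)]
        # Step 3: log-likelihood of every gene under every class. With equal
        # class priors the posterior is proportional to the likelihood.
        diff = gene_vectors[:, None, :] - means[None, :, :]
        log_lik = -0.5 * (diff ** 2).sum(axis=2)
        # Step 4: move each gene to the class with the highest posterior.
        new_labels = log_lik.argmax(axis=1)
        # Step 5: stop when no gene changes class, otherwise repeat from step 2.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, means
```

Because the start is random, different seeds can end in different partitions, so in practice one would likely run the loop from several starting assignments and keep the partition with the highest total likelihood.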
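
One way to make the symmetric matrix K concrete is sketched below. It assumes the penalty sums the squared differences between every pair of gene parameter vectors; the slides only state that K is symmetric, so this particular K is an illustration rather than the authors' choice. The names `pairwise_penalty_matrix` and `penalized_loglik` are hypothetical, and the per-gene log-likelihood values are taken as already computed.

```python
import numpy as np

def pairwise_penalty_matrix(n_genes, n_params):
    """K such that theta^T K theta = sum over pairs i<j of ||theta_i - theta_j||^2.

    ASSUMPTION: theta stacks the per-gene parameter vectors gene by gene.
    """
    # n*I - 1 1^T couples every gene with every other gene.
    coupling = n_genes * np.eye(n_genes) - np.ones((n_genes, n_genes))
    # The Kronecker product applies the coupling to each parameter separately.
    return np.kron(coupling, np.eye(n_params))

def penalized_loglik(theta, gene_logliks, lam, K):
    """Sum of per-gene log-likelihoods minus half the quadratic penalty.

    theta has shape (n_genes, n_params); gene_logliks holds l(theta_i, x_i).
    """
    flat = theta.ravel()  # row-major ravel gives gene-by-gene stacking
    penalty = lam * (flat @ K @ flat)
    return sum(gene_logliks) - 0.5 * penalty
```

With this K the penalty is zero when all genes share the same parameters and grows as the gene trees drift apart, which is the behaviour described under "A different approach".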

Number of parameters
• Hastie and Tibshirani (1990) give an approximation for the number of degrees of freedom for a penalized likelihood estimator:
$$df = \operatorname{tr}\left((H + \lambda K)^{-1} H\right), \qquad H = -\frac{\partial^2 l(\theta, x)}{\partial\theta\,\partial\theta^T}$$
• This allows us to choose the best λ value using AIC or BIC.

Summary
• Tame statisticians are useful too!
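
As a small numerical illustration of the degrees-of-freedom approximation under "Number of parameters", the sketch below evaluates tr((H + λK)⁻¹H) for 106 genes with 13 branch lengths each. Two things are assumptions made only for illustration: H is set to the identity matrix (a real H would come from the second derivatives of the log-likelihood), and K is the pairwise-difference matrix, which the slides do not specify. The function names are hypothetical.

```python
import numpy as np

def pairwise_penalty_matrix(n_genes, n_params):
    """ASSUMED K: theta^T K theta sums squared differences over gene pairs."""
    coupling = n_genes * np.eye(n_genes) - np.ones((n_genes, n_genes))
    return np.kron(coupling, np.eye(n_params))

def effective_df(H, K, lam):
    """Approximate degrees of freedom tr((H + lam*K)^{-1} H).

    A linear solve is used instead of forming the inverse explicitly.
    """
    return float(np.trace(np.linalg.solve(H + lam * K, H)))

n_genes, n_params = 106, 13
K = pairwise_penalty_matrix(n_genes, n_params)
H = np.eye(n_genes * n_params)  # ASSUMPTION: placeholder for the true H

# Larger penalties shrink the effective number of parameters.
for lam in (0.0, 0.01, 1.0, 100.0):
    print(f"lambda={lam:g}  df={effective_df(H, K, lam):.1f}")
```

Under these assumptions λ = 0 gives the full 13 × 106 = 1378 parameters, and as λ grows the value falls towards 13, the single shared tree, so the two extremes from earlier in the talk are the end points of one continuous family. The λ values in the loop are arbitrary; the slides suggest choosing λ by AIC or BIC.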