Identifying and Modeling Selection Pressure
(a review of three papers)
Rose Hoberman
BioLM seminar
Feb 9, 2004

Today
• McClellan and McCracken: Estimating the Influence of Selection on the Variable Amino Acid Sites of the Cytochrome b Protein Functional Domains
• Dagan et al: Ratios of Radical to Conservative Amino Acid Replacement are Affected by Mutational and Compositional Factors and May Not be Indicative of Positive Darwinian Selection
• Halpern and Bruno: Evolutionary Distances for Protein-Coding Sequences: Modeling Site-Specific Residue Frequencies

Types of Selection
• negative purifying selection
  – non-synonymous codon changes are selected against
• neutral selection
  – non-synonymous changes in codons have an equivalent probability of elimination or fixation
• positive diversifying selection
  – non-synonymous codon changes are selected for

Identifying Regions Under Selective Pressure
• ds/dn << 1 and ds/dn >> 1 commonly used
• synonymous substitutions become saturated more quickly than non-synonymous ones
• compare conservative/radical substitution ratio to expected distribution under neutral model

A “conservative” definition
• Cluster amino acids according to physicochemical properties
  – Charge
  – Volume
  – Polarity
  – Grantham’s distance
  – ...
• Within-class = conservative
• Across-class = radical

Assessing Substitution Rates
• 2 sequences
  – average over all possible pathways between two codons
  – TTG(Leu) - ATG(Met) - AGG(Arg) - AGA(Arg)
• Many sequences
  – Build a phylogenetic tree
  – Infer most likely ancestral sequences
  – Count synonymous and non-synonymous substitutions

Cytochrome b Gene Evolution
• Matrix and Transmembrane regions have comparable rates of change
• Intermembrane region has lower rate of change
(McClellan and McCracken)

Group Non-Syn Mutations
• 5 Properties
• 4 Groups
• Neutral model
  – based only on codon frequencies
• Chi-squared test
  – observed vs.
    expected (given domain amino acid frequencies)

Question
• Do factors unrelated to selection affect the radical/conservative ratio?
  – nucleotide frequencies
    • e.g. GC content
  – transition/transversion ratio
    • transitions (A->G and T->C) are more common than transversions
  – distances between amino acids
    • genetic code
  – codon biases
    • due to tRNA availability, energy usage, or pathogen avoidance
  – amino acid frequencies
    • ??

An Initial Test
• 3 proteins: Hemoglobin, Interleukin, Ribosomal protein
• Simulated neutral evolution using a substitution matrix built from pseudogenes
• Tested for selection pressure
  – volume/polarity: 100% false positives (FP)
  – Grantham: 13-21% FP
  – charge: 0% FP
(Dagan et al)

Simulation Study
• Generate virtual ancestral sequence
  – 300 nt long
• Set mutational/compositional parameters
• Simulate evolution (ROSE software)
  – 50 substitutions
• Calculate conservative/radical ratio
• Each parameter set simulated 50 times

ANOVA Conclusion
• Many composition and mutation factors influence conservative/radical ratio
• Poor indicator of positive selection

Correlation or Causation?
• Many factors are correlated, but direction of causation is undetermined
  – transitions more likely to cause conservative changes than transversions
  – codon bias can influence nucleotide frequencies
  – purifying selective pressure will reduce the rate of change
• Generative models which model many of these relevant factors

Generative Models of Gene/Protein Evolution
• Infer relative distances between sequences
• Build a phylogenetic tree
• Infer which positions are under positive selective pressure
• Find additional homologous proteins
• Identify co-varying sites

Modeling Evolutionary Processes
• Most models
  – homogeneous, time-reversible Markov models
• Simplest models
  – DNA mutation models
  – nucleotide frequencies
  – transition/transversion ratio

Too Simplistic
• positions within codons are not independent
  – codon or amino acid models
• parameters not sufficient to explain different rates of change between specific characters
  – empirical substitution matrix (e.g. PAM)
• site-specific rates of change
  – use a gamma distribution to model variation in rates
• equilibrium frequencies are also site-specific
  – due to functional or structural constraints

Halpern & Bruno 1998
A codon-based model of evolution
1. site-invariant DNA-based mutation model
2.
   site-specific amino acid level selection model

$p_{ab}$ = probability of mutation
$f^{i}_{ab}$ = probability of fixation at site $i$

$$r^{i}_{ab} = k \, p_{ab} \, f^{i}_{ab}, \quad b \neq a$$

$$r^{i}_{aa} = -\sum_{b,\, b \neq a} r^{i}_{ab}$$

Halpern & Bruno 1998
• Assumptions
  – most importantly, selective pressures are constant at a given position for all lineages over all times
  – sites independent
  – Markov process is reversible
• Does not model
  – selection at the codon level
    • codon bias
    • DNA or RNA structural requirements
  – uncertainty in MSA

Calculating fixation rates
$s$ = relative fitness of b to a
$N$ = population size

$$f_{ab} \approx \frac{2s}{1 - e^{-2Ns}}, \qquad \frac{f_{ab}}{f_{ba}} = e^{2Ns}$$

(Kimura 1962)

Fixation rates in terms of equilibrium frequencies and mutation probabilities
$s$ = relative fitness of b to a
$N$ = population size

$$f_{ab} \approx \frac{2s}{1 - e^{-2Ns}}, \qquad \frac{f_{ab}}{f_{ba}} = e^{2Ns} = \frac{\pi_b \, p_{ba}}{\pi_a \, p_{ab}}$$

(Kimura 1962)

A Simpler Formulation

$$r_{ab} = k \, p_{ab} \, \frac{\ln\left(\dfrac{\pi_b \, p_{ba}}{\pi_a \, p_{ab}}\right)}{1 - \dfrac{\pi_a \, p_{ab}}{\pi_b \, p_{ba}}}$$

(substituting $e^{2Ns} = \pi_b p_{ba} / (\pi_a p_{ab})$ into $f_{ab}$; the constant factor $1/N$ is absorbed into $k$)

note: $\dfrac{r_{ab}}{r_{ba}} = \dfrac{\pi_b}{\pi_a}$

• p is estimated from nucleotide frequencies and the transition/transversion ratio
• π represents the frequency of each codon, and is approximated via amino acid and nucleotide frequencies
• model ignores:
  – site-specific nucleic acid selection effects (e.g. from RNA structure)
  – codon bias

Model Fallout
• Amount of “flux” between two codons depends on their relative fitness
• Rates are not explicitly modeled, but...
  – maximum substitution rate will be when all codons are equally fit
  – synonymous codons will have highest flux
  – because of degeneracy of 3rd position changes, they will be most frequent

Parameter Estimation
• Ideal
  – estimate parameters simultaneously from large data set
• What they did
  – nucleotide frequencies: from observed frequencies
  – transition/transversion ratio: using existing nucleotide-based methods
  – equilibrium amino acid frequencies:
    • estimate number of times each amino acid was introduced at each position (based on phylogenetic tree but ignores genetic code)
    • add pseudo-counts

Evaluation
• Their hypothesis:
  – methods that only model differing rates will underestimate more remote divergence times
• Test hypothesis on simulated data
  – given an MSA
    • estimate the tree (multiplied branch lengths by 6.0)
    • estimate amino acid frequencies
  – arbitrarily choose mutational parameters
  – stochastically generate sequences (how many?)

Predicting Distances Between Sequences
A: DNA model (learned?)
B: DNA model with site-rate variation
C: this model with simulation parameters
D: this model with parameters estimated from simulated data
x axis: estimated distances
y axis: true distances (based on simulation)

Conclusions
• failing to model selection effects leads to substantial underestimation of longer distances
• possible to estimate equilibrium amino acid frequencies from realistic data sets with an accuracy sufficient for estimating distances between highly divergent sequences
• model accounts for heterogeneity of rates in a novel, and more biologically realistic way
• model parameters could in theory be estimated simultaneously using ML or Bayesian estimation
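
Appendix sketch: the rate formula from the "A Simpler Formulation" slide, r_ab = k p_ab ln(x) / (1 - 1/x) with x = (π_b p_ba) / (π_a p_ab), can be tried out on a toy example. This is only an illustrative sketch, not Halpern and Bruno's implementation: the function names `hb_rate` and `rate_matrix`, the three-state alphabet (a real model would use codons), and all numeric values are assumptions made here. It checks the two properties stated on the slides: r_ab / r_ba = π_b / π_a, and r_aa = -Σ r_ab.

```python
import math


def hb_rate(p_ab, p_ba, pi_a, pi_b, k=1.0):
    """Rate of substitution a -> b: k * p_ab * ln(x) / (1 - 1/x),
    where x = (pi_b * p_ba) / (pi_a * p_ab)."""
    x = (pi_b * p_ba) / (pi_a * p_ab)
    if math.isclose(x, 1.0):
        # Neutral limit: ln(x) / (1 - 1/x) -> 1 as x -> 1,
        # so the rate reduces to the scaled mutation probability.
        return k * p_ab
    return k * p_ab * math.log(x) / (1.0 - 1.0 / x)


def rate_matrix(p, pi, k=1.0):
    """Build the full rate matrix for one site.
    p[a][b] is the mutation probability a -> b (site-invariant),
    pi[a] is the equilibrium frequency of state a at this site."""
    n = len(pi)
    r = [[0.0] * n for _ in range(n)]
    for a in range(n):
        for b in range(n):
            if a != b:
                r[a][b] = hb_rate(p[a][b], p[b][a], pi[a], pi[b], k)
        # Diagonal entry: negative sum of the off-diagonal rates in the row.
        r[a][a] = -sum(r[a][b] for b in range(n) if b != a)
    return r


# Hypothetical toy values: 3 states instead of 61 sense codons.
p = [[0.0, 0.02, 0.01],
     [0.01, 0.0, 0.03],
     [0.02, 0.01, 0.0]]
pi = [0.1, 0.3, 0.6]
R = rate_matrix(p, pi)

# Reversibility check from the slide's note: r_ab / r_ba should equal pi_b / pi_a.
for a in range(3):
    for b in range(3):
        if a != b:
            print(a, b, R[a][b] / R[b][a], pi[b] / pi[a])
```

The special case for x close to 1 avoids a 0/0 division when two states are equally fit and mutation is symmetric; in that case the rate is just the mutation term, which matches the slide's remark that the maximum substitution rate occurs when all codons are equally fit.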