
Identifying and Modeling Selection Pressure
(a review of three papers)
Rose Hoberman
BioLM seminar
Feb 9, 2004

Today
• McClellan and McCracken: Estimating the Influence of Selection on the Variable Amino Acid Sites of the Cytochrome b Protein Functional Domains
• Dagan et al: Ratios of Radical to Conservative Amino Acid Replacement are Affected by Mutational and Compositional Factors and May Not be Indicative of Positive Darwinian Selection
• Halpern and Bruno: Evolutionary Distances for Protein-Coding Sequences: Modeling Site-Specific Residue Frequencies

Types of Selection
• negative purifying selection
  – non-synonymous codon changes are selected against
• neutral selection
  – non-synonymous changes in codons have an equivalent probability of elimination or fixation
• positive diversifying selection
  – non-synonymous codon changes are selected for

Identifying Regions Under Selective Pressure
• ds/dn << 1 and ds/dn >> 1 commonly used
• synonymous substitutions become saturated more quickly than non-synonymous ones
• compare conservative/radical substitution ratio to expected distribution under neutral model

A “conservative” definition
• Cluster amino acids according to physicochemical properties
  – Charge
  – Volume
  – Polarity
  – Grantham’s distance
  – ...
• Within-class = conservative
• Across-class = radical

Assessing Substitution Rates
• 2 sequences
  – average over all possible pathways between two codons
  – e.g. TTG (Leu) → ATG (Met) → AGG (Arg) → AGA (Arg)
• Many sequences
  – Build a phylogenetic tree
  – Infer most likely ancestral sequences
  – Count synonymous and non-synonymous substitutions

Cytochrome b Gene Evolution
• Matrix and Transmembrane regions have comparable rates of change
• Intermembrane region has lower rate of change
(McClellan and McCracken)

Group Non-Syn Mutations
• 5 Properties
• 4 Groups
• Neutral model
  – based only on codon frequencies
• Chi-squared test
  – observed vs. expected (given domain amino acid frequencies)

Question
• Do factors unrelated to selection affect the radical/conservative ratio?
  – nucleotide frequencies
    • e.g. GC content
  – transition/transversion ratio
    • transitions (A<->G and C<->T) are more common than transversions
  – distances between amino acids
    • genetic code
  – codon biases
    • due to tRNA availability, energy usage, or pathogen avoidance
  – amino acid frequencies
    • ??

An Initial Test
• 3 proteins: Hemoglobin, Interleukin, Ribosomal protein
• Simulated neutral evolution using substitution matrix built from pseudogenes
• Tested for selection pressure
  – volume/polarity: 100% false positives (FP)
  – Grantham: 13-21% FP
  – charge: 0% FP
(Dagan et al)

Simulation Study
• Generate virtual ancestral sequence
  – 300 nt long
• Set mutational/compositional parameters
• Simulate evolution (ROSE software)
  – 50 substitutions
• Calculate conservative/radical ratio
• Each parameter set simulated 50 times

ANOVA Conclusion
• Many composition and mutation factors influence conservative/radical ratio
• Poor indicator of positive selection

Correlation or Causation?
• Many factors are correlated, but direction of causation is undetermined
  – transitions more likely to cause conservative changes than transversions
  – codon bias can influence nucleotide frequencies
  – purifying selective pressure will reduce the rate of change
• Generative models can model many of these relevant factors

Generative Models of Gene/Protein Evolution
• Infer relative distances between sequences
• Build a phylogenetic tree
• Infer which positions are under positive selective pressure
• Find additional homologous proteins
• Identify co-varying sites

Modeling Evolutionary Processes
• Most models
  – homogeneous, time-reversible Markov models
• Simplest models
  – DNA mutation models
  – nucleotide frequencies
  – transition/transversion ratio

Too Simplistic
• positions within codons are not independent
  – codon or amino acid models
• parameters not sufficient to explain different rates of change between specific characters
  – empirical substitution matrix (e.g. PAM)
• site-specific rates of change
  – use a gamma distribution to model variation in rates
• equilibrium frequencies are also site-specific
  – due to functional or structural constraints

Halpern & Bruno 1998
A codon-based model of evolution
1. site-invariant DNA-based mutation model
2. site-specific amino acid level selection model

p_{ab} = probability of mutation from a to b
f_{ab}^{i} = probability of fixation at site i
$$ r_{ab}^{i} = k \, p_{ab} \, f_{ab}^{i}, \quad b \neq a $$
$$ r_{aa}^{i} = -\sum_{b \neq a} r_{ab}^{i} $$

Halpern & Bruno 1998
• Assumptions
  – most importantly, selectional pressures are constant at a given position for all lineages over all times
  – sites independent
  – Markov process is reversible
• Does not model
  – selection at the codon level
    • codon bias
    • DNA or RNA structural requirements
  – uncertainty in MSA

Calculating fixation rates
s = relative fitness of b to a
N = population size
$$ f_{ab} \approx \frac{2s}{1 - e^{-2Ns}}, \qquad \frac{f_{ab}}{f_{ba}} = e^{2Ns} $$
(Kimura 1962)

Fixation rates in terms of equilibrium frequencies and mutation probabilities
$$ \frac{f_{ab}}{f_{ba}} = e^{2Ns} = \frac{\pi_b \, p_{ba}}{\pi_a \, p_{ab}} $$

A Simpler Formulation
$$ r_{ab} = k \, p_{ab} \, \frac{\ln\left( \frac{\pi_b \, p_{ba}}{\pi_a \, p_{ab}} \right)}{1 - \frac{\pi_a \, p_{ab}}{\pi_b \, p_{ba}}} $$
note: $ r_{ab} / r_{ba} = \pi_b / \pi_a $
• p is estimated from nucleotide frequencies and the transition/transversion ratio
• π represents the frequency of each codon, and is approximated via amino acid and nucleotide frequencies
• model ignores:
  – site-specific nucleic acid selection effects (e.g. from RNA structure)
  – codon bias

Model Fallout
• Amount of “flux” between two codons depends on their relative fitness
• Rates are not explicitly modeled, but...
  – maximum substitution rate will be when all codons are equally fit
  – synonymous codons will have highest flux
  – because of degeneracy of 3rd position changes, they will be most frequent

Parameter Estimation
• Ideal
  – estimate parameters simultaneously from large data set
• What they did
  – nucleotide frequencies: from observed frequencies
  – transition/transversion ratio: using existing nucleotide-based methods
  – equilibrium amino acid frequencies:
    • estimate number of times each amino acid was introduced at each position (based on phylogenetic tree but ignores genetic code)
    • add pseudocounts

Evaluation
• Their hypothesis:
  – methods that only model differing rates will underestimate more remote divergence times
• Test hypothesis on simulated data
  – given an MSA
    • estimate the tree (multiplied branch lengths by 6.0)
    • estimate amino acid frequencies
  – arbitrarily choose mutational parameters
  – stochastically generate sequences (how many?)

Predicting Distances Between Sequences
A: DNA model (learned?)
B: DNA model with site-rate variation
C: this model with simulation parameters
D: this model with parameters estimated from simulated data
x axis: estimated distances
y axis: true distances (based on simulation)

Conclusions
• failing to model selection effects leads to substantial underestimation of longer distances
• possible to estimate equilibrium amino acid frequencies from realistic data sets with an accuracy sufficient for estimating distances between highly divergent sequences
• model accounts for heterogeneity of rates in a novel, and more biologically realistic way
• model parameters could in theory be estimated simultaneously using ML or Bayesian estimation
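
Worked sketch of the rate formula
The rate on the "A Simpler Formulation" slide can be written as r_ab = k * p_ab * ln(x) / (1 - 1/x), with x = (π_b * p_ba) / (π_a * p_ab). The Python below is a minimal sketch of that expression only, not the authors' implementation. The function name hb_rate and all numeric inputs are hypothetical, chosen purely for illustration. It checks the slide's note that r_ab / r_ba = π_b / π_a, and that the rate reduces to k * p_ab when neither codon is favored.

```python
import math


def hb_rate(p_ab, p_ba, pi_a, pi_b, k=1.0):
    """Substitution rate a -> b: k * p_ab * ln(x) / (1 - 1/x),
    where x = (pi_b * p_ba) / (pi_a * p_ab).

    hb_rate is a hypothetical helper name used only in this sketch.
    """
    # d = ln(x); on the fixation-rate slides this corresponds to 2Ns.
    d = math.log(pi_b * p_ba) - math.log(pi_a * p_ab)
    if d == 0.0:
        # Neutral limit: ln(x) / (1 - 1/x) -> 1 as x -> 1.
        return k * p_ab
    # 1 - 1/x = 1 - exp(-d) = -expm1(-d), which stays accurate for small |d|.
    return k * p_ab * d / (-math.expm1(-d))


if __name__ == "__main__":
    # Hypothetical mutation probabilities and equilibrium frequencies.
    p_ab, p_ba = 0.02, 0.01
    pi_a, pi_b = 0.1, 0.4

    r_ab = hb_rate(p_ab, p_ba, pi_a, pi_b)
    r_ba = hb_rate(p_ba, p_ab, pi_b, pi_a)

    print("r_ab =", r_ab)
    print("r_ba =", r_ba)
    print("r_ab / r_ba =", r_ab / r_ba, " pi_b / pi_a =", pi_b / pi_a)
```

Working with d = ln(x) and math.expm1 avoids cancellation when the two codons are nearly equally fit, which is the regime where the slide says the substitution rate is at its maximum relative to the mutation rate.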
