Download D - Clayton State University

Biology 4900 Biocomputing
Chapter 7: Molecular Phylogeny and Evolution

Outline
• Introduction to evolution and phylogeny
• Nomenclature of relationship trees
• Five stages of molecular phylogeny:
  1. Selecting sequences
  2. Multiple sequence alignment
  3. Models of substitution
  4. Tree-building
  5. Tree evaluation

Historical background: Evolution
• Evolution: the theory that groups of organisms change over time, so that descendants are structurally and functionally different from their ancestors.
• Darwin (1859) proposed the role of natural selection in hereditary change.
  – Heredity is generally conservative: changes in offspring tend to be small.
• Prior to the 1950s, phylogeny was studied by comparing living species with fossils.
  – Modern techniques of molecular biology were not available!
Pevsner, Bioinformatics and Functional Genomics, 2009; http://www.ucsd.tv/evolutionmatters/lesson2/study.shtml

Tree of Life
http://www.allvoices.com/contributed-news/4553607-is-chimps-as-smart-as-human; After Pace NR (1997) Science 276:734; http://en.wikipedia.org/wiki/File:E_coli_at_10000x,_original.jpg

3 main mechanisms of evolutionary change
• Conditions of growth affect development of the organism.
  – Physiological response to external stimuli
  – One or more differences in an organism that increase its likelihood of surviving in an environment long enough to reproduce
• Change resulting from sexual reproduction.
  – Offspring contain genetic material from 2 parents.
• Mutation with selection/genetic drift.
  – Similar to first mechanism
Pevsner, Bioinformatics and Functional Genomics, 2009

Historical background
• Studies of molecular evolution began with the first sequencing of proteins, beginning in the 1950s.
• In 1953 Frederick Sanger and colleagues determined the primary amino acid sequence of insulin.
• (The accession number of human insulin is NP_000198.)
Pevsner, Bioinformatics and Functional Genomics, 2009; http://en.wikipedia.org/wiki/Frederick_Sanger

Historical background: insulin
• By the 1950s, it became clear that amino acid substitutions occur non-randomly.
  – For example, Sanger and colleagues noted that most amino acid changes in the insulin A chain are restricted to a disulfide loop region.
  – Such differences are called “neutral” changes (Kimura, 1968; Jukes and Cantor, 1969).
• Subsequent studies at the DNA level showed that the rate of nucleotide (and of amino acid) substitution is about six- to ten-fold higher in the C peptide, relative to the A and B chains.
Pevsner, Bioinformatics and Functional Genomics, 2009, p. 219

[Fig. 7.3, page 220: insulin sequence alignment. Note the sequence divergence in the disulfide loop region of the A chain. Rates labeled on the figure (number of nucleotide substitutions/site/year): 0.1 × 10^-9, 1 × 10^-9, 0.1 × 10^-9.]

Historical background: insulin
• Surprisingly, insulin from the guinea pig (and from the related coypu) evolves seven times faster than insulin from other species. Why?
• The answer: guinea pig and coypu insulin do not bind two zinc ions, while insulin molecules from most other species do. The structural constraints on these molecules were relaxed, and so the genes diverged rapidly.
[Figure, page 219: Sus scrofa insulin (1ZNI.pdb)]

Guinea pig and coypu insulin have undergone an extremely rapid rate of evolutionary change
[Figure: arrows indicate positions at which guinea pig insulin (A and B chains) differs from human and mouse]
Pevsner, Bioinformatics and Functional Genomics, 2009, p. 220

Historical Background: Vasopressin and Oxytocin
• Vasopressin and oxytocin were also sequenced in the 1950s.
• Their sequences differ by only 2 residues, but their functions differ significantly.
• Vasopressin
  – Reduces the volume of urine formed in the body to retain water
• Oxytocin
  – Causes contractions in smooth muscles of the uterus
Seager SL, Slabaugh MR, Chemistry for Today: General, Organic and Biochemistry, 7th Edition, 2011

Molecular clock hypothesis
• In the 1960s, sequence data accumulated for small, abundant proteins (e.g., globins, cytochromes c, fibrinopeptides).
• Some proteins appeared to evolve slowly, while others evolved rapidly.
• Linus Pauling, Emanuel Margoliash and others proposed the hypothesis of a molecular clock:
  – For every given protein, the rate of molecular evolution is approximately constant in all evolutionary lineages.
[Figure: mutations accumulating over time in protein X in species A, B, C, and D]
Pevsner, Bioinformatics and Functional Genomics, 2009, p. 221

Molecular clock hypothesis
[Figure (Dickerson, 1972): corrected amino acid changes per 100 residues (m) plotted against millions of years since divergence]
• Richard Dickerson (1971) plotted sequence data from three protein families: cytochrome c, hemoglobin, and fibrinopeptides.
• The x-axis shows the divergence times of the species, estimated from paleontological data.
• The y-axis shows m, the corrected number of amino acid changes per 100 residues.
• n is the observed number of amino acid changes per 100 residues; it is corrected to m to account for changes that occur but are not observed.

Molecular clock hypothesis: conclusions
Dickerson drew the following conclusions:
• For each protein, the data lie on a straight line. Thus, the rate of amino acid substitution has remained constant for each protein.
• The average rate of change differs for each protein. The time for a 1% change to occur between two lines of evolution is 20 MY (cytochrome c), 5.8 MY (hemoglobin), and 1.1 MY (fibrinopeptides).
• The observed variations in rate of change reflect functional constraints imposed by natural selection.
Pevsner, Bioinformatics and Functional Genomics, 2009, p. 223

Molecular clock for proteins: rate of substitutions per amino acid site per 10^9 years
Protein           Rate
Fibrinopeptides   9.0
Kappa casein      3.3
Lactalbumin       2.7
Serum albumin     1.9
Lysozyme          0.98
Trypsin           0.59
Insulin           0.44
Cytochrome c      0.22
Histone H2B       0.09
Ubiquitin         0.010
Histone H4        0.010
Table 7-1, page 223

Molecular clock hypothesis: problems
• The rate of molecular evolution varies among different organisms (e.g., viral sequences change very rapidly).
• The clock varies among different genes and parts of genes.
  – This is guided in part by selection (e.g., animals with shorter generation times may have faster clocks).
• The clock is only applicable when a gene retains its function over evolutionary time.
  – Genes may become nonfunctional
  – Rate of evolution may accelerate following gene duplication
Pevsner, Bioinformatics and Functional Genomics, 2009, p. 225

Rate of nucleotide substitution r and time of divergence T
• Rate of nucleotide substitution (r) = number of nucleotide substitutions that occur per site per year (the same definition applies to amino acids).
• Time of divergence (T) = how long ago the two sequences split from a common ancestral sequence.
• K = the number of substitutions per site between the two sequences (here, nonsynonymous substitutions). Because both lineages accumulate changes after the split, r = K/(2T).
• α-globins from rat and human differ by K = 0.093 nonsynonymous substitutions per site.
• Rate of change: r = 0.58 × 10^-9 nonsynonymous nucleotide substitutions per site per year (with K = 0.093, this corresponds to T ≈ 80 million years).
Pevsner, Bioinformatics and Functional Genomics, 2009, p. 225

Neutral theory of molecular evolution
• There is a significant amount of DNA polymorphism across all species that is difficult to account for by conventional natural selection.
• Kimura (1969, 1983) proposed an alternative model to Darwinian evolution at the DNA level.
  – Observed that the rate of amino acid substitution averages ~1 change per 28 × 10^6 years for proteins of 100 residues.
  – The corresponding rate of nucleotide substitution is 1 base pair in the genome of a population every 2 years.
  – Kimura concluded that most DNA substitutions must be neutral and that the main cause of evolutionary change at the molecular level is random drift of mutant alleles.
Pevsner, Bioinformatics and Functional Genomics, 2009, p. 225

Molecular phylogeny: nomenclature of trees
• Molecular phylogeny: the study of evolutionary relationships among organisms or molecules
  – Determined via molecular biology
• There are two main kinds of information inherent to any tree: topology and branch lengths.
• We will now describe the parts of a tree.
Page 231

Molecular phylogeny uses trees to depict evolutionary relationships among organisms. These trees are based upon DNA and protein sequence data.
[Figure: the same tree (OTUs A to E, internal nodes F to I) drawn as a cladogram (change in time) and as a phylogram (change in magnitude; scale bar = one unit)]
Pevsner, Bioinformatics and Functional Genomics, 2009, p. 232

Tree Nomenclature
• Node: intersection or terminating point of two or more branches
• Branch (edge): connects 2 nodes
Fig. 7.8, page 232

Tree Nomenclature: Nodes
• OTUs (operational taxonomic units): taxa that provide observable features (e.g., existing protein sequences, morphological features)
[Figure: tree with the root node, an internal node (taxon), and an operational taxonomic unit (OTU) labeled]
Pevsner, Bioinformatics and Functional Genomics, 2009, p. 232

Tree Nomenclature: Cladogram vs. Phylogram
• Cladogram: branches are unscaled; OTUs are neatly aligned, and nodes reflect time.
• Phylogram: branches are scaled; branch lengths are proportional to the number of amino acid changes.
Pevsner, Bioinformatics and Functional Genomics, 2009, p. 232

Tree nomenclature: Internal Nodes
[Figure: trees showing a bifurcating internal node and a multifurcating internal node]
Pevsner, Bioinformatics and Functional Genomics, 2009, p. 232

Tree nomenclature: clades
• Clade: group of taxa derived from a common ancestor
[Figure labels: Clade ABF (monophyletic group); Clade CDH (monophyletic group); Clade ABF/CDH/G (paraphyletic)]
Pevsner, Bioinformatics and Functional Genomics, 2009, p. 232

Tree nomenclature: Newick Format
(,,(,));                               no nodes are named
(A,B,(C,D));                           leaf nodes are named
(A,B,(C,D)E)F;                         all nodes are named
(:0.1,:0.2,(:0.3,:0.4):0.5);           all but root node have a distance to parent
(:0.1,:0.2,(:0.3,:0.4):0.5):0.0;       all have a distance to parent
(A:0.1,B:0.2,(C:0.3,D:0.4):0.5);       distances and leaf names (popular)
(A:0.1,B:0.2,(C:0.3,D:0.4)E:0.5)F;     distances and all names
((B:0.2,(C:0.3,D:0.4)E:0.5)F:0.1)A;    a tree rooted on a leaf node (rare)
http://en.wikipedia.org/wiki/Newick_format

Examples of clades
Lindblad-Toh et al., Nature 438: 803 (2005), fig. 10

Tree Roots
• The root of a phylogenetic tree represents the common ancestor of the sequences. Some trees are unrooted, and thus do not specify the common ancestor.
[Figure: a rooted tree (specifies evolutionary path, drawn from past to present) beside the corresponding unrooted tree]
Pevsner, Bioinformatics and Functional Genomics, 2009, p. 233

Tree Roots
• A tree can be rooted using an outgroup (that is, a taxon known to be distantly related to all other OTUs).
[Figure: rooted tree in which an outgroup is used to place the root]
Pevsner, Bioinformatics and Functional Genomics, 2009, p. 233

Enumerating trees
Cavalli-Sforza and Edwards (1967) derived the number of possible unrooted trees ($N_U$) for n OTUs (n ≥ 3):

$$N_U = \frac{(2n-5)!}{2^{n-3}\,(n-3)!}$$

The number of bifurcating rooted trees ($N_R$) is:

$$N_R = \frac{(2n-3)!}{2^{n-2}\,(n-2)!}$$

For 10 OTUs (e.g. 10 DNA or protein sequences), the number of possible rooted trees is ≈ 34 million, and the number of unrooted trees is ≈ 2 million. Many tree-making algorithms can exhaustively examine every possible tree for up to ten to twelve sequences.
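The two counting formulas above can be checked numerically. The following is a minimal Python sketch; the function names are illustrative and not taken from the textbook.

```python
from math import factorial


def num_unrooted_trees(n: int) -> int:
    """Number of bifurcating unrooted trees for n OTUs (formula valid for n >= 3)."""
    if n < 3:
        raise ValueError("the unrooted-tree formula requires n >= 3")
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))


def num_rooted_trees(n: int) -> int:
    """Number of bifurcating rooted trees for n OTUs (formula valid for n >= 2)."""
    if n < 2:
        raise ValueError("the rooted-tree formula requires n >= 2")
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))


# Print the counts for a few tree sizes.
for n in (3, 4, 5, 10):
    print(n, num_rooted_trees(n), num_unrooted_trees(n))
```

For n = 10 this gives 34,459,425 rooted and 2,027,025 unrooted trees, matching the "34 million" and "2 million" figures quoted above. Note that the rooted count for n OTUs equals the unrooted count for n + 1 OTUs, since rooting a tree is equivalent to adding one more branch.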
Page 235

Numbers of possible trees extremely large for >10 sequences
Number of OTUs   Number of rooted trees   Number of unrooted trees
2                1                        1
3                3                        1
4                15                       3
5                105                      15
10               34,459,425               2,027,025
20               8 × 10^21                2 × 10^20
Pevsner, Bioinformatics and Functional Genomics, 2009, p. 235

Five stages of phylogenetic analysis
[1] Selection of sequences for analysis
[2] Multiple sequence alignment
[3] Selection of a substitution model
[4] Tree building
[5] Tree evaluation
Pevsner, Bioinformatics and Functional Genomics, 2009, p. 243

Stage 1: Use of DNA, RNA, or protein
DNA
• DNA lets you study synonymous and nonsynonymous changes.
• Synonymous changes do not have corresponding protein changes.
• Synonymous substitution rate (dS)
• Nonsynonymous substitution rate (dN)
• If dS > dN, the DNA sequence is under negative (purifying) selection.
  – This limits change in the sequence (e.g. insulin A chain).
• If dS < dN, positive selection occurs.
  – Ex. A duplicated gene may evolve rapidly to assume new functions.
Pevsner, Bioinformatics and Functional Genomics, 2009, p. 240

Stage 1: Use of DNA, RNA, or protein
DNA
• Some substitutions in a DNA sequence alignment can be directly observed; others (e.g. parallel, convergent, and back substitutions) leave no visible difference in the alignment:

Site   Ancestral globin   Human globin   Mouse globin   Observed (human/mouse)   Type of substitution
1      A                  A              A              A/A
2      G                  G→C            G→C            C/C                      Parallel
3      T                  T              T              T/T
4      C                  C              C→G            C/G                      Single
5      C                  C              C→T→A          C/A                      Sequential
6      T                  T              T              T/T
7      G                  G              G              G/G
8      T                  T→A            T→C            A/C                      Coincidental
9      T                  T→G            T              G/T                      Single
10     C                  C→G            C→T→G          G/G                      Convergent
11     A                  A              A              A/A
12     G                  G→T→G          G              G/G                      Back

Pevsner, Bioinformatics and Functional Genomics, 2009, p. 241, Figure 7.16

Stage 1: Use of DNA, RNA, or protein
DNA
• Noncoding regions (such as 5′ and 3′ untranslated regions of genes, or introns) may be analyzed using molecular phylogeny.
• Pseudogenes (nonfunctional genes) are studied by molecular phylogeny.
• Rates of transitions and transversions can be measured.
• Transitions: purine ↔ purine (A ↔ G) or pyrimidine ↔ pyrimidine (C ↔ T) substitutions
• Transversions: purine ↔ pyrimidine substitutions
Pevsner, Bioinformatics and Functional Genomics, 2009, p. 242

Stage 1: Use of DNA, RNA, or protein
Protein
• Proteins are frequently more informative than DNA in pairwise alignment and in BLAST searching.
• Proteins have 20 states (amino acids) instead of only 4 for DNA, so there is a stronger phylogenetic signal.
• Nucleotides are unordered characters: any one nucleotide can change to any other in one step.
• An ordered character must pass through one or more intermediate states before reaching the final state.
• Amino acid sequences are partially ordered character states: there is a variable number of states between the starting value and the final value.
Pevsner, Bioinformatics and Functional Genomics, 2009, p. 243

Stage 2: Multiple sequence alignment
The fundamental basis of a phylogenetic tree is the multiple sequence alignment (MSA).
1. Confirm that all sequences are homologous.
   – A tree will still be generated even if the sequences are misaligned or a nonhomologous sequence is included, so the tree by itself will not reveal the error.
2. Adjust gap creation and extension penalties as needed to optimize the alignment.
3. Restrict phylogenetic analysis to regions of the multiple sequence alignment for which data are available for all taxa (delete columns having incomplete data or gaps; this is similar to masking).
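Step 3 above (keeping only alignment columns with data for all taxa) can be sketched as follows. This is a minimal illustration: the function name, the gap characters, and the toy sequences are assumptions made for the example, not data from the lecture.

```python
def strip_gapped_columns(alignment, gap_chars="-."):
    """Return a copy of the alignment without any column that contains a gap.

    alignment: dict mapping sequence name -> aligned sequence (equal lengths).
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    names = list(alignment)
    # zip(*rows) walks the alignment column by column.
    columns = zip(*(alignment[name] for name in names))
    kept = [col for col in columns if not any(c in gap_chars for c in col)]
    return {name: "".join(col[i] for col in kept) for i, name in enumerate(names)}


# Hypothetical toy alignment (made up for illustration).
toy = {
    "seq1": "--GADEGYFGP-VIL",
    "seq2": "MLGA-EGYFGPAVIL",
    "seq3": "MLGADEGYF-PAVIL",
}
trimmed = strip_gapped_columns(toy)
for name, seq in trimmed.items():
    print(name, seq)
```

Deleting every gapped column is the strictest choice; with many taxa it can discard most of the alignment, so in practice a threshold on the fraction of gaps per column is sometimes used instead.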
[Alignment example with gapped and unaligned end regions:]
-----GADEG-YFGPVILAADGEVA--dlgnvGA-EGDYFGPAI--AEGEVArpl
Pevsner, Bioinformatics and Functional Genomics, 2009

Stage 4: Tree-building methods: distance
• The simplest approach to measuring distances between sequences is to align pairs of sequences, and then to count the number of differences.
• The number of differences is the Hamming distance.
• For an alignment of length N with n sites at which there are differences, the degree of divergence D (the normalized Hamming distance) is: D = n/N
• Observed differences do not equal genetic distance!
  – Genetic distance involves mutations that are not observed directly.
Pevsner, Bioinformatics and Functional Genomics, 2009

Stage 4: Tree-building methods: distance
Jukes and Cantor (1969) proposed a corrective formula:

$$d_{JC} = -\frac{3}{4}\ln\left(1 - \frac{4}{3}D\right)$$

• This model describes the probability that one nucleotide will change into another.
• It is still imperfect, as it assumes that each residue is equally likely to change into any other (i.e. the rate of transversions equals the rate of transitions).
• In practice, the transition rate is typically greater than the transversion rate.

Stage 4: Tree-building methods: distance
Consider an alignment where 3/60 aligned residues differ. The normalized Hamming distance is 3/60 = 0.05.
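The relationship between the observed divergence D = n/N and the corrected distance can be sketched in Python. This assumes the standard Jukes-Cantor (JC69) formula for nucleotides, d = -(3/4) ln(1 - 4D/3); the function names are illustrative.

```python
import math


def p_distance(seq_a: str, seq_b: str) -> float:
    """Normalized Hamming distance D = n/N between two aligned sequences."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be aligned, equal in length, and non-empty")
    n = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return n / len(seq_a)


def jukes_cantor(d_obs: float) -> float:
    """Jukes-Cantor corrected distance for an observed nucleotide divergence."""
    if not 0 <= d_obs < 0.75:
        # At D >= 0.75 the logarithm is undefined: the sequences are saturated.
        raise ValueError("observed divergence must be in [0, 0.75)")
    return -0.75 * math.log(1 - 4 * d_obs / 3)


print(round(jukes_cantor(3 / 60), 3))   # small divergence: small correction
print(round(jukes_cantor(30 / 60), 3))  # large divergence: large correction
```

The correction grows with divergence because the more sites differ, the more likely it is that some sites have changed more than once.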
The Jukes-Cantor correction is

$$d_{JC} = -\frac{3}{4}\ln\left(1 - \frac{4}{3}(0.05)\right) \approx 0.052$$

When 30/60 aligned residues differ, the Jukes-Cantor correction is more substantial:

$$d_{JC} = -\frac{3}{4}\ln\left(1 - \frac{4}{3}(0.5)\right) \approx 0.82$$

Pevsner, Bioinformatics and Functional Genomics, 2009

Models of nucleotide substitution
[Figure: purines (A, G) and pyrimidines (T, C). A ↔ G and T ↔ C are transitions; purine ↔ pyrimidine changes are transversions. Transition rate > transversion rate.]

Stage 4: Tree-building methods
Distance-based methods
• Distance-based methods involve a distance metric, such as the number of amino acid changes between the sequences, or a distance score.
• Examples of distance-based algorithms are UPGMA and neighbor-joining.
Character-based methods
• Include maximum parsimony and maximum likelihood.
• Parsimony analysis involves the search for the tree with the fewest amino acid (or nucleotide) changes that account for the observed differences between taxa.

Tree-building methods: UPGMA
UPGMA (unweighted pair group method using arithmetic mean) is based on clustering of sequences.
• Step 1: Compute the pairwise distances of all the proteins. Get ready to put the numbers 1-5 at the bottom of your new tree.
• Step 2: Find the two proteins with the smallest pairwise distance. Cluster them.
• Step 3: Do it again. Find the next two proteins with the smallest pairwise distance. Cluster them.
• Step 4: Keep going. Cluster.
• Step 5: Last cluster! This is your tree.
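The clustering steps above can be sketched in Python. This is a minimal illustration, not the textbook's implementation: the function name and the Newick output are choices made for the example, and the distances are those of the five-taxon (A to E) pairwise distance matrix used as the example in this section.

```python
from itertools import combinations


def upgma(names, dist):
    """Cluster leaves by UPGMA.

    names: list of leaf labels.
    dist:  dict mapping frozenset({a, b}) to the distance between leaves a and b.
    Returns (newick_string, merges), where merges lists each join in order.
    """
    # Map each cluster (a tuple of its leaves) to (Newick text, node height).
    clusters = {(name,): (name, 0.0) for name in names}
    merges = []

    def mean_distance(c1, c2):
        # Unweighted arithmetic mean over all leaf pairs between two clusters.
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Join the two clusters with the smallest mean distance.
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: mean_distance(*pair))
        d = mean_distance(c1, c2)
        height = d / 2  # the clock assumption places the new node midway
        (text1, h1), (text2, h2) = clusters.pop(c1), clusters.pop(c2)
        text = f"({text1}:{height - h1:g},{text2}:{height - h2:g})"
        clusters[c1 + c2] = (text, height)
        merges.append((c1 + c2, d))

    newick, _ = next(iter(clusters.values()))
    return newick + ";", merges


names = ["A", "B", "C", "D", "E"]
values = {
    ("A", "B"): 9, ("A", "C"): 8, ("A", "D"): 12, ("A", "E"): 15,
    ("B", "C"): 11, ("B", "D"): 15, ("B", "E"): 18,
    ("C", "D"): 10, ("C", "E"): 13, ("D", "E"): 5,
}
dist = {frozenset(pair): d for pair, d in values.items()}
newick, merges = upgma(names, dist)
print(newick)
for members, d in merges:
    print(members, d)
```

With these distances, D and E are joined first (distance 5), then A and C (distance 8), then B joins the A/C cluster (mean distance 10), and the two remaining clusters are joined last. Averaging over all leaf pairs, rather than over the two child clusters, is what makes the method "unweighted".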
[Figure: completed UPGMA tree for sequences 1-5, with internal nodes 6 to 9]
Pevsner, Bioinformatics and Functional Genomics, 2009

Another View of UPGMA: Pairwise Distance Matrix
General form:
     A     B     C
B    dAB   --    --
C    dAC   dBC   --
D    dAD   dBD   dCD

Example:
     B    C    D    E
A    9    8    12   15
B    -    11   15   18
C    -    -    10   13
D    -    -    -    5

Start with the pair having the shortest distance (here D and E, with distance 5).
Example: http://www.icp.ucl.ac.be/~opperd/private/upgma.html
Pevsner, Bioinformatics and Functional Genomics, 2009

Distance-based methods: UPGMA trees
UPGMA is a simple approach for making trees.
• A UPGMA tree is always rooted.
• An assumption of the algorithm is that the molecular clock is constant for sequences in the tree. If there are unequal substitution rates, the tree may be wrong.
• While UPGMA is simple, it is less accurate than the neighbor-joining approach (described next).

Making trees using neighbor-joining
The neighbor-joining method of Saitou and Nei (1987) is especially useful for making trees with:
• A large number of taxa
• Lineages with varying rates of evolution
Begin by placing all the taxa in a star-like structure.

Tree-building methods: Neighbor joining
Next, identify neighbors (e.g. 1 and 2) that are most closely related. Connect these neighbors to other OTUs via an internal branch, XY. At each successive stage, minimize the sum of the branch lengths.
Pevsner, Bioinformatics and Functional Genomics, 2009

Tree-building methods: Neighbor joining
Define the distance from X to Y by

$$d_{XY} = \frac{1}{2}\left(d_{1Y} + d_{2Y} - d_{12}\right)$$

Pevsner, Bioinformatics and Functional Genomics, 2009

Example of a neighbor-joining tree: phylogenetic analysis of 13 RBPs

[Figure: "Distribution of SC angles". Unrooted neighbor-joining tree generated by TreeView (scale bar 0.1); leaves are labeled by PDB identifier and protein family (S100, calbindin D9k, penta-EF, parvalbumin, osteonectin, polcalcin), with regions R1 and R2 marked.]
[Figure: "Distribution of MC angles". Unrooted neighbor-joining tree generated by TreeView (scale bar 0.1), with the same leaf labels and regions R1 and R2 marked.]

Parsimony Principle
• The parsimony principle tells us to choose the simplest scientific explanation that fits the evidence. In tree-building, this means that the best hypothesis is the one that requires the fewest evolutionary changes.
[Figure: Hypothesis 1 is the most parsimonious because it requires fewer evolutionary changes. Hypothesis 2 requires formation of a bony skeleton at 2 different points.]
http://evolution.berkeley.edu/evolibrary/article/phylogenetics_08

Building trees using character-based methods
• Rather than pairwise distances between proteins, aligned columns of amino acid residues are evaluated.
• The main idea of character-based methods is to find the tree with the shortest branch lengths.
  Thus we seek the most parsimonious (“simple”) tree.
• Identify informative sites. For example, constant characters are not parsimony-informative.
• Construct trees, counting the number of changes required to create each tree.
• Select the shortest tree (or trees).
Pevsner, Bioinformatics and Functional Genomics, 2009

As an example of tree-building using maximum parsimony, consider these four taxa:
AAG
AAA
GGA
AGA
How might they have evolved from a common ancestor such as AAA?
Pevsner, Bioinformatics and Functional Genomics, 2009

[Figure: three candidate trees relating AAG, AAA, GGA, and AGA to the ancestor AAA, with the number of changes marked on each branch. Total costs: 3, 4, and 4.]
In maximum parsimony, choose the tree(s) with the lowest cost (shortest branch lengths).
http://evolution.berkeley.edu/evolibrary/article/phylogenetics_08

Stage 5: Evaluating trees: bootstrapping
Bootstrapping is a common approach to measuring the robustness of a tree topology. Given a branching order, how consistently does an algorithm find that branching order in a randomly resampled version of the original data set?
• To bootstrap, create an artificial dataset by randomly sampling columns, with replacement, from your multiple sequence alignment. Make the dataset the same size as the original.
• Do 100 (to 1,000) bootstrap replicates.
• Observe the percent of cases in which the assignment of clades in the original tree is supported by the bootstrap replicates. Support above 70% is commonly considered significant.
Pevsner, Bioinformatics and Functional Genomics, 2009

In 61% of the bootstrap resamplings, ssrbp and btrbp (pig and cow RBP) formed a distinct clade. In 39% of the cases, another protein joined the clade (e.g. ecrbp), or one of these two sequences joined another clade.
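The column-resampling procedure behind such bootstrap percentages can be sketched in Python. This is a minimal illustration under stated assumptions: the function names and the toy alignment are made up, and `closest_pair_is` is only a stand-in for a real tree-building step (it asks whether two taxa are the closest pair by Hamming distance, rather than building a tree and checking for a clade).

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement, keeping the original length."""
    names = list(alignment)
    length = len(alignment[names[0]])
    picks = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(alignment[name][i] for i in picks) for name in names}


def closest_pair_is(alignment, a, b):
    """Stand-in for tree building: are a and b the closest pair of sequences?"""
    names = list(alignment)
    pairs = [
        (sum(c1 != c2 for c1, c2 in zip(alignment[x], alignment[y])), x, y)
        for i, x in enumerate(names)
        for y in names[i + 1:]
    ]
    _, x, y = min(pairs)
    return {x, y} == {a, b}


def bootstrap_support(alignment, grouping_found, replicates=100, seed=0):
    """Percent of bootstrap replicates in which grouping_found(replicate) is true."""
    rng = random.Random(seed)
    hits = sum(
        1 for _ in range(replicates)
        if grouping_found(bootstrap_alignment(alignment, rng))
    )
    return 100.0 * hits / replicates


# Hypothetical toy alignment (made up for illustration).
toy = {
    "taxon1": "ACGTACGTAC",
    "taxon2": "ACGTACGTAA",
    "taxon3": "ACGAACTTGC",
    "taxon4": "TCGAACTTGG",
}
support = bootstrap_support(
    toy, lambda aln: closest_pair_is(aln, "taxon1", "taxon2"), replicates=200
)
print(f"taxon1/taxon2 grouped in {support:.0f}% of replicates")
```

In a real analysis, `grouping_found` would rebuild the tree from each replicate with the same method used for the original tree and then test whether the clade of interest is present.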
