Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Designer baby wikipedia , lookup
Artificial gene synthesis wikipedia , lookup
Genome evolution wikipedia , lookup
Point mutation wikipedia , lookup
Gene expression programming wikipedia , lookup
Quantitative comparative linguistics wikipedia , lookup
Microevolution wikipedia , lookup
Koinophilia wikipedia , lookup
9/29/2015 BINF 3350, Genomics and Bioinformatics Phylogenetic Analysis Young-Rae Cho Associate Professor Department of Computer Science Baylor University Early Evolution Studies Since Darwin Until 1960s Anatomical features were the dominant criteria used to derive evolutionary relationships between species Depend on the relatively subjective observations Hard to prove Example Giant panda 1 9/29/2015 DNA-Based Evolution Studies 1960s Emile Zuckerkandl and Linus Pauling brought reconstructing evolutionary relationships with DNA sequences Example Human vs. Gorilla Since 1960s DNA analysis is a dominant approach to study evolution Old Debates From the point of hemoglobin structure, it appears that gorilla is just an abnormal human, or man an abnormal gorilla, and the two species form actually one continuous population. Emile Zuckerkandl, Classification and Human Evolution, 1963 From any point of view other than that properly specified, that is of course nonsense. What the comparison really indicates is that hemoglobin is a bad choice and has nothing to tell us about attributes, or indeed tells us a lie. Gaylord Simpson, Science, 1964 2 9/29/2015 BINF 3350, Phylogenetic Analysis 1. Structure of Pylogenetic Tree 2. Evolutionary Distance Measures 3. Phylogenetic Tree Reconstruction Phylogenetic Tree Phylogenetics The study of evolutionary relatedness among species Phylogenetic Tree (Evolutionary Tree) Tree-structure diagram showing the inferred evolutionary relationships between a set of objects The objects are called taxa individual genes, or species when orthologous genes are used Each node represents each taxon Each edge represents the evolutionary relationship between taxa 3 9/29/2015 Types of Phylogenetic Trees (1) Rooted Tree vs. Unrooted Tree Rooted trees Root - the most ancient species External nodes (leaf nodes) - nodes with degree-1 - existing species Internal nodes - nodes with degree > 1 - hypothetical ancestral species Unrooted trees (before speciation events) Types of Phylogenetic Trees (2) Cladogram vs. Additive Tree vs. Ultrametric Tree Cladogram - Defines tree topology only - No meaning on branch lengths Additive trees - Branch lengths are a measure of evolutionary divergence evolutionary distance - Weighted trees Ultrametric trees - The vertical axis is a time scale - Rooted trees 4 9/29/2015 Evolutionary Distance Evolutionary Path The path from the root to a leaf in a rooted tree Evolutionary Distance Sum of weights of the shortest path between two leaf nodes in a weighted tree Example 2 Additive unrooted tree 3 12 16 14 13 12 1 4 13 17 12 5 13 6 Types of Phylogenetic Trees (3) Bifurcating vs. Multifurcating Bifurcating (or Dichotomous) - Each taxon as an internal node diverges into two separate descendent taxa - Fully resolved - What is the degree of internal nodes? - How many nodes a rooted bifurcating tree with N leaves has? - How many nodes an unrooted bifurcating tree with N leaves has? Multifurcating (or Polytomous) - Each taxon diverges into more than two separate descendent taxa - Partially resolved 5 9/29/2015 Types of Phylogenetic Trees (4) Condense Tree Types of Phylogenetic Trees (5) Species Tree vs. Gene Tree Species tree - Evolutionary relationships between species Gene (or gene family) tree - Evolutionary relationship between homologous genes - Some branch points represent gene duplication events - Other branch points represent speciation events 6 9/29/2015 Splitting Splitting Dividing a phylogenetic tree into a collection of subgroups by removing a branch Consensus Tree Consensus Features Consistent features supported by phylogenetic trees from the same data Consensus Tree A tree with consensus features 7 9/29/2015 BINF 3350, Phylogenetic Analysis 1. Structure of Pylogenetic Tree 2. Evolutionary Distance Measures 3. Phylogenetic Tree Reconstruction Evolutionary Distance Basics Evolutionary Distance Measured by a mutation rate between two species p-distance: the fraction of non-identical alignment positions (observed distance) Assumption All sequences evolved at a constant mutation rate The sequences diverged to a moderate degree s.t. each position had only one mutation Distance Correction p-distance underestimates the number of mutations Predict the amount of multiple mutations 8 9/29/2015 Types of Mutation (1) Why Mutation Occurs? The result of errors during DNA replication process Transition Mutations vs. Transversions Transition mutations: substitutions within the similar nucleotide structure Transversions: substitutions between different nucleotide structures Types of Mutation (2) Transition/Transversion Ratio (R) In principle, it should be ½ if all mutations are equally likely In practice, R » 1 (it varies) 9 9/29/2015 Types of Mutation (3) Synonymous vs. Nonsynonymous Synonymous mutations: substitutions that do not change the encoded Nonsynonymous mutations: substitutions that alter the encoded amino amino acid acid Types of Mutation (4) Selective Pressure Positive selection: a mutation is likely to be kept if it confers an advantage on the organism Negative selection: a mutation is likely to be lost if it confers a disadvantage on the organism Many of third-position mutations are synonymous Phenomenon of biased mutation pressure 10 9/29/2015 Gene Duplication (1) Gene Duplication Process that a gene becomes copied within the same genome Arises paralogs Review of Definitions Homologs: similar sequence + common ancestor (by divergent evolution) • Orthologs: pairs of genes which have a common ancestor immediately • Paralogs: pairs of genes which have a common ancestor immediately before a speciation event before a gene duplication event Analogs: similar sequence + no common ancestor (by convergent evolution) Gene Duplication (2) Phylogenetic Tree Reconstruction by Gene Duplication 11 9/29/2015 Gene Loss (1) Pseudogenes Genes that have mutated so as to no longer give rise to proteins Non-functional genes Gene Loss Process that a gene becomes a pseudogene Effects of Gene Loss on Phylogenetic Trees Gene Loss (2) Reconciled Tree Tree that indentifies the speciation and duplication events and the gene losses 12 9/29/2015 Species Tree Reconstruction (1) Species Tree Reconstructed by using orthologous genes COG (Clusters of Orthologous Groups) database from NCBI Resolving the difficulties by gene duplications and gene losses Examines trees from many different orthologous genes, and find the Restrict the analysis to the orthologous genes that appear not to have evolutionary history supported by a majority of trees duplication events Species Tree Reconstruction (2) Proportion of Orthologous Genes 13 9/29/2015 Horizontal Gene Transfer Horizontal Gene Tranfer (HGT) Also called Lateral Gene Transfer (LGT) Process that a gene from one species is transferred to another species Advanced Evolutionary Models p-Distance Model Poisson distance correction Jukes-Cantor Model Assumptions Formulas 14 9/29/2015 BINF 3350, Phylogenetic Analysis 1. Structure of Pylogenetic Tree 2. Evolutionary Distance Measures 3. Phylogenetic Tree Reconstruction Phylogenetic Tree Reconstruction Algorithms Phylogenetic Tree Reconstruction Algorithms Algorithm Algorithm Type Tree Type UPGMA distance-based ultrametric F-M distance-based unrooted, additive neighbor-joining distance-based unrooted, additive max parsimony character-based rooted, additive/cladogram max likelihood character-based (un)rooted, additive 15 9/29/2015 UPGMA UPGMA Unweighted pair-group method using arithmetic averages Input Distance matrix D: nn matrix of evolutionary distance for n species Output An ultrametric (rooted) tree for n species Process (1) Merge two closest nodes, x and y, to create one ancestor node z (2) Draw the height of z by distance between x and y (3) Remove the rows and columns of x and y in D (4) Insert the row and column of z with average distance in D (5) Repeat (1)~(4) until reaches the root Fitch-Margoliash Algorithm (1) 3-Leaf Tree A basic component for unrooted, additive (weighted) tree for 3 species dic + djc = Dij dic = (Dij + Dik – Djk) / 2 dic + dkc = Dik djc = (Dij + Djk – Dik) / 2 djc + dkc = Djk dkc = (Dki + Dkj – Dij) / 2 n-Leaf Tree How many edges for n species ? How many variables ? How many equations ? 16 9/29/2015 Fitch-Margoliash Algorithm (2) Neighbors A pair of nodes that are separated by just one other node Input Distance matrix D: nn matrix of evolutionary distance for n species Output An unrooted, additive (weighted) tree for n species Process (1) Find two closest nodes, x and y, as neighbors (2) Calculate distance from x and y to their ancestor z by 3-leaf tree formula (3) Remove the rows and columns of x and y in D (4) Insert the row and column of z with distance by 3-leaf tree formula (5) Repeat (1)~(4) until completes an unrooted tree Maximum Parsimony Methods Main Concept Find evolutionary history with the minimum number of mutations required to produce observed sequences Consider mutation at each position of the sequences separately (character-based approach) Parsimony Score Sum of the weights of all edges in the phylogenetic tree Examples ACCA 1 ATCA 1 ATCG ACCG 1 ACCG 1 ACCA less parsimonious 1 0 ATCG ACCG 0 ATCG 2 ACCA more parsimonious 17 9/29/2015 Fitch Algorithm (1) Unweighted Parsimony Method Build a phylogenetic tree having the min score (i.e., max parsimony) given sequences of the existing species to be considered Count mismatches for scoring Post-order traversal Input A tree with each leaf node labeled by a single nucleotide on a specific position of each sequence Output A rooted tree, cladogram with all labels on internal nodes Fitch Algorithm (2) Process (1) Assigns a set of candidate labels to each internal node by traversing from leaf nodes to root • If two sets of labels from child nodes of a node v overlap, then assigns the common set to v • If not, assigns the combined set to v if Su and Sw overlap otherwise (2) Assigns a label to each node by traversing from root to leaf nodes • For the root, chooses one arbitrarily from candidates • For all other nodes, if its parent’s label is a candidate, then assigns its parent’s label • Else, choose one arbitrarily from candidates 18 9/29/2015 Fitch Algorithm (3) Example A {A,C} A {G} C G A G A G C {A,C,G} {A,C} A G G A {A,C} {G} C G G A {G} C G G Parsimony score? Maximum Likelihood Method Maximum Likelihood (MLE) Maximum Likelihood Method for Evolutionary Tree Reconstruction Jukes-Cantor Model in Maximum Likelihood 19 9/29/2015 Examples Reconstructed Phylogenetic Trees for 16S RNA datasets Evolutionary Tree Validation Bootstrapping Method Main concept • Sampling data objects uniformly with replacement to validate statistical estimates Application to evolutionary tree validation 20