Molecular phylogenies
INTRODUCTION TO BIOINFORMATICS 2012
Paulino Gomez-Puertas. Bioinformatics.

Richard Owen

Owen’s definition of homology
• Homologue: the same organ under every variety of form and function (true or essential correspondence: homology)
• Analogy: superficial or misleading similarity
Richard Owen, 1843

Charles Darwin

Darwin and homology
• “The natural system is based upon descent with modification ... the characters that naturalists consider as showing true affinity (i.e. homologies) are those which have been inherited from a common parent, and, in so far as all true classification is genealogical; that community of descent is the common bond that naturalists have been seeking”
Charles Darwin, Origin of Species, 1859, p. 413

Homology is...
• Homology: similarity that is the result of inheritance from a common ancestor
• The identification and analysis of homologies is central to phylogenetics (the study of the evolutionary history of genes and species)
• Similarity and homology are not the same thing, although the terms are often, and wrongly, used interchangeably

Hypothesis:
SIMILARITY implies HOMOLOGY
HIGHER SIMILARITY implies CLOSER HOMOLOGY

Clustering methods:
- UPGMA (Unweighted Pair Group Method with Arithmetic mean)
- Neighbour Joining (M. Saitou & M. Nei) (corrects for unequal rates of evolution in different branches of the tree)

Cladistic methods (patterns of ancestry):
- Maximum parsimony
- Maximum likelihood (assigns quantitative probabilities to mutational events, rather than merely counting them)

Clustering methods:
- UPGMA (Unweighted Pair Group Method with Arithmetic mean)
- Neighbour Joining (M. Saitou & M. 
Nei) (corrects for unequal rates of evolution in different branches of the tree)

A (simplistic) UPGMA example:

          legs/fins   eggs/placenta   gills (branchiae)/lungs   warm/cold-blooded
cat       L           P               L                         W
whale     F           P               L                         W
lizard    L           E               L                         C
trout     F           E               B                         C

Distance matrix (number of characters that differ):

     C   W   L   T
C    0
W    1   0
L    2   3   0
T    4   3   2   0

UPGMA, step 1: the smallest nonzero distance is 1 (cat, whale), so cat and whale are joined by branches of length 0.5 each.

Reduced distance matrix:

          (C/W)             L   T
(C/W)     0
L         1/2(2+3)=2.5      0
T         1/2(4+3)=3.5      2   0

UPGMA, step 2: the smallest nonzero distance is 2 (lizard, trout), so lizard and trout are joined by branches of length 1 each.

Reduced distance matrix:

          (C/W)               (L/T)
(C/W)     0
(L/T)     1/2(2.5+3.5)=3      0

UPGMA, step 3: the smallest nonzero distance is 3, so (C/W) and (L/T) are joined at a depth of 1.5.

[Figure: final UPGMA tree. Branch lengths: cat 0.5, whale 0.5, (C/W) internal branch 1; lizard 1, trout 1, (L/T) internal branch 0.5.]

UPGMA using protein/DNA multiple sequence alignments:

          protein   DNA
cat       KEDD      TCCA
whale     KERD      TGCA
lizard    EEDR      TCGT
trout     EDRR      CGGT

Distance matrix:

     C   W   L   T
C    0
W    1   0
L    2   3   0
T    4   3   2   0

[Figure: the same UPGMA tree as above. Branch lengths: cat 0.5, whale 0.5, (C/W) internal branch 1; lizard 1, trout 1, (L/T) internal branch 0.5.]

Cladistic methods (patterns of ancestry):
- Maximum parsimony
- Maximum likelihood

Maximum parsimony example (ATCG, ATGG, TCCA, TTCA):
[Figure: two alternative trees relating the four sequences, with the inferred substitutions marked on the branches. The first tree requires four mutations, the second seven.]

Maximum likelihood: assigns quantitative probabilities to mutational events, rather than merely counting them.

Bootstrapping
• Characters are resampled with replacement to create many bootstrap replicate data sets
• Each bootstrap replicate data set is analysed (e.g. 
with parsimony, distance, ML)
• Agreement among the resulting trees is summarized with a majority-rule consensus tree
• The frequency of occurrence of a group, its bootstrap proportion (BP), is a measure of support for that group
• Additional information is given in partition tables

Bootstrapping

Original data matrix

         Characters
Taxa     1   2   3   4   5   6   7   8
A        R   R   Y   Y   Y   Y   Y   Y
B        R   R   Y   Y   Y   Y   Y   Y
C        Y   Y   Y   Y   Y   R   R   R
D        Y   Y   R   R   R   R   R   R
Outgp    R   R   R   R   R   R   R   R

Resampled data matrix

         Characters
Taxa     1   2   2   5   5   6   6   8
A        R   R   R   Y   Y   Y   Y   Y
B        R   R   R   Y   Y   Y   Y   Y
C        Y   Y   Y   Y   Y   R   R   R
D        Y   Y   Y   R   R   R   R   R
Outgp    R   R   R   R   R   R   R   R

Randomly resample characters from the original data with replacement to build many bootstrap replicate data sets of the same size as the original, then analyse each replicate data set.

Summarise the results of multiple analyses with a majority-rule consensus tree. Bootstrap proportions (BPs) are the frequencies with which groups are encountered in analyses of replicate data sets.

[Figure: trees of taxa A, B, C, D and the outgroup obtained from replicate data sets, and the majority-rule consensus tree with bootstrap proportions of 96% and 66%.]

Phylogenetic systematics
• Uses tree diagrams to portray relationships based upon recency of common ancestry
• There are two types of trees commonly displayed in publications:
  - Cladograms
  - Phylograms

Cladograms and phylograms
• Cladograms show branching order; branch lengths are meaningless
• Phylograms show branch order and branch lengths
[Figure: the same tree of Bacterium 1-3 and Eukaryote 1-4 drawn as a cladogram and as a phylogram.]

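The UPGMA walk-through and the bootstrap procedure described earlier can be tied together in a short Python sketch. It is only an illustration, not the method used to produce the slides: the names `ANIMALS`, `hamming`, `upgma` and `bootstrap_support` are hypothetical, the data are the four-character animal table from the UPGMA example, and ties between equally close clusters are broken arbitrarily (by whichever pair is seen first), which matters a lot with so few characters.

```python
import random
from itertools import combinations

# Character table from the UPGMA example (one letter per character).
ANIMALS = {"cat": "LPLW", "whale": "FPLW", "lizard": "LELC", "trout": "FEBC"}


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def upgma(seqs):
    """Return the merges as (members, height) pairs, in the order they occur.

    The distance between two clusters is the mean of all leaf-to-leaf
    distances (the UPGMA rule). Ties go to the first pair encountered,
    which is an arbitrary choice made for this sketch.
    """
    def dist(c1, c2):
        total = sum(hamming(seqs[a], seqs[b]) for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    clusters = [frozenset([name]) for name in seqs]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: dist(*pair))
        merges.append((a | b, dist(a, b) / 2))  # node height = half the distance
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return merges


def bootstrap_support(seqs, group, replicates=200, seed=0):
    """Fraction of bootstrap replicates whose UPGMA tree contains `group`."""
    rng = random.Random(seed)
    n_chars = len(next(iter(seqs.values())))
    hits = 0
    for _ in range(replicates):
        # Resample character columns with replacement, keeping the same size.
        cols = [rng.randrange(n_chars) for _ in range(n_chars)]
        resampled = {name: "".join(s[i] for i in cols) for name, s in seqs.items()}
        if any(members == group for members, _ in upgma(resampled)):
            hits += 1
    return hits / replicates


if __name__ == "__main__":
    for members, height in upgma(ANIMALS):
        print(sorted(members), height)
    print(bootstrap_support(ANIMALS, frozenset({"cat", "whale"})))
```

On the original table, `upgma` joins cat with whale at height 0.5, lizard with trout at height 1.0, and the two pairs at height 1.5, matching the worked example. The bootstrap proportion it reports for a four-character data set should not be read as meaningful support; real analyses resample alignments with many more characters.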
Rooting trees using an outgroup
[Figure: an unrooted tree of archaea and eukaryotes, and the same tree rooted by a bacterial outgroup. Labels: unrooted tree; rooted by outgroup; bacteria (outgroup); root; monophyletic group (marked twice).]

Groups on trees
• A monophyletic group (a clade) contains species derived from a unique common ancestor with respect to the rest of the tree
• A paraphyletic group is one which includes only some descendants (e.g. a group comprising animals without humans would be paraphyletic)
• A polyphyletic group is not a group at all! (e.g. if we put all things with wings in a single group)
Baldauf (2003). Phylogeny for the faint of heart: a tutorial. Trends in Genetics 19:345-351.

Is there a molecular clock?
• The idea of a molecular clock was initially suggested by Zuckerkandl and Pauling in 1962
• They noted that rates of amino acid replacements in animal haemoglobins were roughly proportional to time, as judged against the fossil record

Introducing time in trees: the molecular clock

The molecular clock for alpha-globin
[Figure: number of substitutions (y axis, 0 to 100) plotted against time to common ancestor in millions of years (x axis, 0 to 500) for shark, carp, platypus, chicken and cow. Each point represents the number of substitutions separating each animal from humans.]

Rates of amino acid replacement in different proteins

Protein            Rate (mean replacements per site per 10^9 years)
Fibrinopeptides    8.3
Insulin C          2.4
Ribonuclease       2.1
Haemoglobins       1.0
Cytochrome C       0.3
Histone H4         0.01

Small subunit ribosomal RNA: 18S or 16S rRNA
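As a rough illustration of how the tabulated rates could be used, the Python sketch below converts between divergence time and expected divergence. It rests on assumptions that are not stated in the slides: a strict clock, a rate that applies to each lineage separately (so two sequences that split t years ago differ by about 2·r·t replacements per site), and no multiple replacements at the same site. The function names are hypothetical.

```python
# Rates from the table above, in replacements per site per 10^9 years.
RATES = {
    "Fibrinopeptides": 8.3,
    "Insulin C": 2.4,
    "Ribonuclease": 2.1,
    "Haemoglobins": 1.0,
    "Cytochrome C": 0.3,
    "Histone H4": 0.01,
}


def expected_divergence(protein, split_mya):
    """Expected replacements per site between two lineages that split
    `split_mya` million years ago (strict clock, per-lineage rate,
    no multiple hits: all three are assumptions of this sketch)."""
    rate_per_myr = RATES[protein] / 1000.0  # 10^9 years = 1000 million years
    return 2 * rate_per_myr * split_mya


def estimated_split_mya(protein, replacements_per_site):
    """Invert the relation above to date a split from an observed divergence."""
    rate_per_myr = RATES[protein] / 1000.0
    return replacements_per_site / (2 * rate_per_myr)


if __name__ == "__main__":
    for name in RATES:
        print(name, expected_divergence(name, 100))
```

Under these assumptions, haemoglobins that diverged 100 million years ago would be expected to differ at about 0.2 replacements per site, while histone H4 would show almost no change over the same period. Fast-evolving proteins suit recent splits and slow ones suit deep splits, provided the clock assumption is tested for the data at hand.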
There is no universal molecular clock
• The initial proposal saw the clock as a Poisson process with a constant rate
• Now known to be more complex: differences in rates occur for:
  - different sites in a molecule
  - different genes
  - different regions of genomes
  - different genomes in the same cell
  - different taxonomic groups for the same gene
• There is no universal molecular clock affecting all genes
• There might be ‘local’ clocks but they need to be carefully tested and calibrated

Chaperonin 60 Protein Maximum Likelihood Tree (PROTML, Roger et al. 1998, PNAS 95: 229)
[Figure: maximum likelihood tree with the longest branches highlighted.]

Rate heterogeneity is a common problem in phylogenetic analyses
• Differences in rates occur between:
  - different sites in a molecule (e.g. at different codon positions)
  - different genes on genomes
  - different regions of genomes
  - different genomes in the same cell
  - different taxonomic groups for the same gene
• We need to consider these issues when we make trees, otherwise we can get the wrong tree

Multiple changes at a single site: hidden changes
Seq 1: AGCGAG
Seq 2: GCGGAC
[Figure: number of changes along each lineage at a single site; successive substitutions at the same site hide earlier changes.]

Convergence can also mislead our methods:
• Thermophilic convergence or biased codon usage patterns may obscure phylogenetic signal

% Guanine + Cytosine in 16S rRNA genes from mesophiles and thermophiles

                            %GC, all sites   %GC, variable sites
Thermophiles:
  Thermotoga maritima       62               72
  Thermus thermophilus      64               72
  Aquifex pyrophilus        65               73
Mesophiles:
  Deinococcus radiodurans   55               52
  Bacillus subtilis         55               50

Gene trees and species trees
[Figure: a gene tree drawn alongside a species tree.]
We often assume that gene trees give us species trees

Gene trees and species trees: why might they differ?
• Gene duplication
• Horizontal gene transfer between species
• Gene analysis can produce trees that conflict with accepted ideas of species relationships based upon external data

[Figure: mitochondrial (mt) genomes of Sauropsida (reptiles + birds). A. Llanes, Univ. de La Habana.]

Thanks to:
Federico Abascal, Centro Nacional de Biotecnología, Madrid
Rafael Zardoya, Museo Nacional de Ciencias Naturales, Madrid
Hernán Dopazo, CSAT - Príncipe Felipe, Valencia
Alejandro Llanes, Universidad de La Habana, Cuba

Questions…