Molecular phylogenies
INTRODUCTION TO BIOINFORMATICS
2012
Paulino Gomez-Puertas
Bioinformatics.
Richard Owen
Owen’s definition of homology
• Homologue: the same organ under every variety
of form and function (true or essential
correspondence - homology)
• Analogy: superficial or misleading similarity
Richard Owen 1843
Charles Darwin
Darwin and homology
• “The natural system is based upon descent with
modification ... the characters that naturalists consider as
showing true affinity (i.e. homologies) are those which
have been inherited from a common parent, and, in so
far as all true classification is genealogical; that
community of descent is the common bond that
naturalists have been seeking”
Charles Darwin, Origin of species 1859 p. 413
Homology is...
• Homology: similarity that is the result of inheritance from
a common ancestor
• The identification and analysis of homologies is central
to phylogenetics (the study of the evolutionary history of
genes and species)
• Similarity and homology are not the same thing,
although the terms are often wrongly used
interchangeably
hypothesis:
SIMILARITY implies HOMOLOGY
HIGHER SIMILARITY implies CLOSER HOMOLOGY
Clustering methods.
- UPGMA (Unweighted Pair Group Method with Arithmetic mean)
- Neighbour Joining (M. Saitou & M. Nei)
(corrects for unequal rates of evolution in different branches of the tree)
Cladistic methods: (patterns of ancestry)
- Maximum parsimony
- Maximum likelihood (assigns quantitative probabilities to mutational
events, rather than merely counting them).
A (simplistic) UPGMA example:
         legs/fins   eggs/placenta   branchiae/lungs   warm/cold-blooded
cat      L           P               L                 W
whale    F           P               L                 W
lizard   L           E               L                 C
trout    F           E               B                 C
distance matrix:
     C   W   L   T
C    0   1   2   4
W        0   3   3
L            0   2
T                0
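Each entry of the distance matrix is simply the number of characters in which two animals differ. A minimal sketch of that count, assuming the four characters are encoded as one letter each in the order of the table (the names `traits`, `hamming` and `distances` are illustrative, not part of the slides):

```python
from itertools import combinations

# Character states from the example table, in the order
# legs/fins, eggs/placenta, branchiae/lungs, warm/cold-blooded.
traits = {
    "cat":    "LPLW",
    "whale":  "FPLW",
    "lizard": "LELC",
    "trout":  "FEBC",
}

def hamming(a, b):
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("character strings must have the same length")
    return sum(x != y for x, y in zip(a, b))

# Upper triangle of the distance matrix, keyed by taxon pair.
distances = {
    (s, t): hamming(traits[s], traits[t])
    for s, t in combinations(traits, 2)
}

for (s, t), d in distances.items():
    print(f"{s:>6} - {t:<6} {d}")
```

The six values reproduce the upper triangle of the matrix: 1, 2, 4 for cat, 3, 3 for whale and 2 for lizard against trout.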
UPGMA
smallest nonzero distance: d(C, W) = 1
join cat and whale, each on a branch of length 0.5
reduced distance matrix:
          (C / W)   L              T
(C / W)   0         1/2(2+3)=2.5   1/2(4+3)=3.5
L                   0              2
T                                  0
UPGMA
smallest nonzero distance in the reduced matrix: d(L, T) = 2
join lizard and trout, each on a branch of length 1
reduced distance matrix:
          (C / W)   (L / T)
(C / W)   0         1/2(2.5+3.5)=3
(L / T)             0
UPGMA
smallest nonzero distance in the reduced matrix: d((C / W), (L / T)) = 3
join (C / W) and (L / T) at height 1.5
final tree (with branch lengths): ((cat:0.5, whale:0.5):1, (lizard:1, trout:1):0.5)
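The three joining steps can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `upgma`, the `frozenset` keys and the Newick-style output are assumptions made for the example. Distances to a new cluster are averaged weighted by cluster size, which for this example gives the same numbers as the plain 1/2(...) averages on the slides.

```python
def upgma(names, dist):
    """Cluster taxa by UPGMA.

    names: list of taxon labels.
    dist:  dict mapping frozenset({a, b}) to the distance between a and b.
    Returns (newick, merges); merges lists (cluster_a, cluster_b, distance)
    for every join, in the order in which the joins were made.
    """
    # Each cluster is keyed by its Newick label and stores (size, node height).
    clusters = {n: (1, 0.0) for n in names}
    d = dict(dist)
    merges = []
    while len(clusters) > 1:
        # Join the two clusters separated by the smallest distance.
        pair = min(d, key=d.get)
        a, b = sorted(pair, key=list(clusters).index)
        dab = d[pair]
        (na, ha), (nb, hb) = clusters[a], clusters[b]
        height = dab / 2  # the new node sits halfway between the two clusters
        new = f"({a}:{height - ha:g},{b}:{height - hb:g})"
        merges.append((a, b, dab))
        # Distance from the new cluster to every other one: size-weighted mean.
        for c in clusters:
            if c not in (a, b):
                dac = d.pop(frozenset((a, c)))
                dbc = d.pop(frozenset((b, c)))
                d[frozenset((new, c))] = (na * dac + nb * dbc) / (na + nb)
        del d[pair], clusters[a], clusters[b]
        clusters[new] = (na + nb, height)
    return next(iter(clusters)) + ";", merges

names = ["cat", "whale", "lizard", "trout"]
dist = {
    frozenset(("cat", "whale")): 1,
    frozenset(("cat", "lizard")): 2,
    frozenset(("cat", "trout")): 4,
    frozenset(("whale", "lizard")): 3,
    frozenset(("whale", "trout")): 3,
    frozenset(("lizard", "trout")): 2,
}
tree, merges = upgma(names, dist)
print(tree)  # ((cat:0.5,whale:0.5):1,(lizard:1,trout:1):0.5);
```

Ties between equal smallest distances are broken by dictionary order here; a real analysis would need an explicit rule for them.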
UPGMA
using protein/dna multiple sequence alignments:
          protein   dna
cat       KEDD      TCCA
whale     KERD      TGCA
lizard    EEDR      TCGT
trout     EDRR      CGGT
distance matrix (number of differing positions, the same for both alignments):
     C   W   L   T
C    0   1   2   4
W        0   3   3
L            0   2
T                0
resulting tree: ((cat:0.5, whale:0.5):1, (lizard:1, trout:1):0.5)
Cladistic methods: (patterns of ancestry)
- Maximum parsimony
- Maximum likelihood
Maximum parsimony example (ATCG, ATGG, TCCA, TTCA)
[Figure: two candidate trees with the mutations marked on their branches]
Tree grouping (ATCG, ATGG) and (TCCA, TTCA): four mutations
Tree grouping (ATCG, TCCA) and (ATGG, TTCA): seven mutations
Maximum parsimony prefers the tree that requires the fewest mutations.
Maximum likelihood: assigns quantitative probabilities to mutational events,
rather than merely counting them.
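Counting mutations on a given tree can be automated. The sketch below uses Fitch's small-parsimony algorithm, a standard technique that is not named on the slides, on the four example sequences. The labels `s1` to `s4` and the nested-tuple tree encoding are assumptions made for the example. Fitch's count is the minimum over all possible ancestral sequences, so for the second grouping it can come out lower than a count read off one particular hand-drawn reconstruction.

```python
def fitch_site(tree, states):
    """Return (state_set, changes) for one alignment column.

    tree:   a leaf name (str) or a 2-tuple of subtrees (rooted, binary).
    states: dict mapping each leaf name to its character at this site.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    lset, lcost = fitch_site(left, states)
    rset, rcost = fitch_site(right, states)
    shared = lset & rset
    if shared:
        return shared, lcost + rcost
    # No state in common: one extra substitution is needed below this node.
    return lset | rset, lcost + rcost + 1

def parsimony_score(tree, seqs):
    """Sum the Fitch change counts over all columns of the alignment."""
    length = len(next(iter(seqs.values())))
    return sum(
        fitch_site(tree, {name: s[i] for name, s in seqs.items()})[1]
        for i in range(length)
    )

seqs = {"s1": "ATCG", "s2": "ATGG", "s3": "TCCA", "s4": "TTCA"}
tree_1 = (("s1", "s2"), ("s3", "s4"))  # groups ATCG with ATGG
tree_2 = (("s1", "s3"), ("s2", "s4"))  # groups ATCG with TCCA

score_1 = parsimony_score(tree_1, seqs)
score_2 = parsimony_score(tree_2, seqs)
print(score_1, score_2)  # 4 6
```

Either way the first grouping needs fewer changes, so it is the tree that maximum parsimony selects.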
Bootstrapping
• Characters are resampled with replacement to
create many bootstrap replicate data sets
• Each bootstrap replicate data set is analysed (e.g.
with parsimony, distance, ML)
• Agreement among the resulting trees is
summarized with a majority-rule consensus tree
• Frequency of occurrence of groups, bootstrap
proportions (BPs), is a measure of support for
those groups
• Additional information is given in partition tables
Bootstrapping
Original data matrix
         Characters
Taxa     1  2  3  4  5  6  7  8
A        R  R  Y  Y  Y  Y  Y  Y
B        R  R  Y  Y  Y  Y  Y  Y
C        Y  Y  Y  Y  Y  R  R  R
D        Y  Y  R  R  R  R  R  R
Outgp    R  R  R  R  R  R  R  R
Resampled data matrix
         Characters
Taxa     1  2  2  5  5  6  6  8
A        R  R  R  Y  Y  Y  Y  Y
B        R  R  R  Y  Y  Y  Y  Y
C        Y  Y  Y  Y  Y  R  R  R
D        Y  Y  Y  R  R  R  R  R
Outgp    R  R  R  R  R  R  R  R
Randomly resample characters from the original data with
replacement to build many bootstrap replicate data sets of the
same size as the original - analyse each replicate data set
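A minimal sketch of the resampling step, assuming the data matrix is stored as one string of character states per taxon. The helper names `take_columns` and `bootstrap_replicate` are made up for the example, and the fixed seed is only there to make the sketch repeatable.

```python
import random

# Data matrix from the slide: one string of character states per taxon.
original = {
    "A":     "RRYYYYYY",
    "B":     "RRYYYYYY",
    "C":     "YYYYYRRR",
    "D":     "YYRRRRRR",
    "Outgp": "RRRRRRRR",
}

def take_columns(matrix, columns):
    """Build a new matrix from the given 0-based column indices."""
    return {
        taxon: "".join(states[i] for i in columns)
        for taxon, states in matrix.items()
    }

def bootstrap_replicate(matrix, rng):
    """Draw as many columns as the original has, with replacement."""
    n_chars = len(next(iter(matrix.values())))
    columns = sorted(rng.randrange(n_chars) for _ in range(n_chars))
    return columns, take_columns(matrix, columns)

# The resampled matrix on the slide uses characters 1 2 2 5 5 6 6 8.
slide_replicate = take_columns(original, [0, 1, 1, 4, 4, 5, 5, 7])

rng = random.Random(0)  # fixed seed, only to make the sketch repeatable
columns, replicate = bootstrap_replicate(original, rng)
print(columns)
print(replicate)
```

Each replicate would then be passed to the same tree-building method as the original data (parsimony, distance or ML).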
[Figure: trees obtained from the replicate data sets for taxa A, B, C, D and the outgroup, and the resulting majority-rule consensus tree with bootstrap proportions of 96% and 66%]
Summarise the results of multiple analyses with a majority-rule consensus tree
Bootstrap proportions (BPs) are the frequencies with which groups are encountered in analyses of replicate data sets
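Bootstrap proportions are plain frequencies, which the following sketch makes explicit. The four replicate trees are invented for illustration only and are not the trees from the slide; trees are written as nested tuples, and `clades` and `bootstrap_proportions` are hypothetical helper names.

```python
from collections import Counter

def clades(tree):
    """Return (leaves, groups) for a nested-tuple tree.

    leaves: frozenset of all leaf names below this node.
    groups: set of frozensets, one per internal node.
    """
    if isinstance(tree, str):
        return frozenset([tree]), set()
    leaves, groups = frozenset(), set()
    for child in tree:
        child_leaves, child_groups = clades(child)
        leaves |= child_leaves
        groups |= child_groups
    groups.add(leaves)
    return leaves, groups

def bootstrap_proportions(trees):
    """Fraction of replicate trees in which each group appears."""
    counts = Counter()
    for tree in trees:
        all_leaves, groups = clades(tree)
        groups.discard(all_leaves)  # the full taxon set is in every tree
        counts.update(groups)
    return {group: n / len(trees) for group, n in counts.items()}

# Hypothetical replicate trees, invented for illustration only.
replicates = [
    ((("A", "B"), "C"), "D"),
    ((("A", "B"), "C"), "D"),
    ((("A", "B"), "D"), "C"),
    (("A", ("B", "C")), "D"),
]
support = bootstrap_proportions(replicates)

# Groups found in more than half of the replicates enter the
# majority-rule consensus tree.
majority = {g: p for g, p in support.items() if p > 0.5}
for group, p in sorted(majority.items(), key=lambda item: sorted(item[0])):
    print(sorted(group), f"{p:.0%}")
```

With these made-up replicates the groups (A, B) and (A, B, C) each appear in three of four trees, so both would be drawn in the consensus tree with a support of 75%.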
Phylogenetic systematics
• Uses tree diagrams to portray relationships
based upon recency of common ancestry
• There are two types of trees commonly displayed
in publications:
– Cladograms
– Phylograms
Cladograms and phylograms
[Figure: the same tree of Bacterium 1-3 and Eukaryote 1-4, drawn once as a cladogram and once as a phylogram]
Cladograms show branching order; branch lengths are meaningless
Phylograms show branch order and branch lengths
Rooting trees using an outgroup
[Figure: an unrooted tree of archaea and eukaryotes, and the same tree rooted by a bacterial outgroup; in the rooted tree the archaea and the eukaryotes each form a monophyletic group]
Groups on trees
A polyphyletic group is not a
group at all! (e.g. if we put all
things with wings in a single
group)
A monophyletic group (a clade)
contains species derived from a
unique common ancestor with respect
to the rest of the tree
A paraphyletic group is one
which includes only some of the
descendants of a common ancestor
(e.g. a group comprising animals
without humans would be paraphyletic)
Baldauf (2003). Phylogeny for the faint of heart: a tutorial. Trends in Genetics 19:345-351.
Is there a molecular clock?
• The idea of a molecular clock was initially
suggested by Zuckerkandl and Pauling in
1962
• They noted that rates of amino acid
replacements in animal haemoglobins were
roughly proportional to time - as judged
against the fossil record
Introducing time in trees: the molecular clock
The molecular clock for alpha-globin:
[Plot: number of substitutions (vertical axis, 0 to 100) against time to common ancestor in millions of years (horizontal axis, 0 to 500), with points for cow, chicken, platypus, carp and shark]
Each point represents the number of substitutions separating each
animal from humans
Rates of amino acid replacement in different proteins
Protein           Rate (mean replacements per site per 10^9 years)
Fibrinopeptides   8.3
Insulin C         2.4
Ribonuclease      2.1
Haemoglobins      1.0
Cytochrome C      0.3
Histone H4        0.01
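Under a strict clock, a rate converts an observed distance into a divergence time. The sketch below assumes that each tabulated rate applies to a single lineage, so that two lineages separated for a time t differ by about d = 2·r·t replacements per site, and it ignores multiple changes at the same site. The distance of 0.1 replacements per site is a hypothetical value chosen only to show how strongly the answer depends on the protein.

```python
# Rates from the table, in replacements per site per 10^9 years.
RATES = {
    "Fibrinopeptides": 8.3,
    "Insulin C": 2.4,
    "Ribonuclease": 2.1,
    "Haemoglobins": 1.0,
    "Cytochrome C": 0.3,
    "Histone H4": 0.01,
}

def divergence_time_myr(distance, rate_per_gyr):
    """Time since two lineages split, in millions of years.

    Assumes a strict clock, d = 2 * r * t, with r given per lineage
    in replacements per site per 10^9 years.
    """
    if rate_per_gyr <= 0:
        raise ValueError("rate must be positive")
    return distance / (2 * rate_per_gyr) * 1000

d = 0.1  # hypothetical observed distance, replacements per site
for protein, rate in RATES.items():
    print(f"{protein:<16} {divergence_time_myr(d, rate):10.1f} Myr")
```

The same distance corresponds to roughly 6 million years for fibrinopeptides and 5000 million years for histone H4, which is why fast-evolving proteins suit recent divergences and slow ones suit ancient divergences.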
Small subunit ribosomal RNA
18S or 16S rRNA
There is no universal molecular clock
• The initial proposal saw the clock as a Poisson process with
a constant rate
• Now known to be more complex - differences in rates occur
for:
– different sites in a molecule
– different genes
– different regions of genomes
– different genomes in the same cell
– different taxonomic groups for the same gene
• There is no universal molecular clock affecting all genes
• There might be ‘local’ clocks but they need to be carefully
tested and calibrated
Chaperonin 60 Protein Maximum Likelihood Tree (PROTML, Roger et al. 1998, PNAS 95: 229)
[Figure: tree with the longest branches marked]
Rate heterogeneity is a common
problem in phylogenetic analyses
• Differences in rates occur between:
• different sites in a molecule (e.g. at different codon
positions)
• different genes on genomes
• different regions of genomes
• different genomes in the same cell
• different taxonomic groups for the same gene
• We need to consider these issues when we make
trees - otherwise we can get the wrong tree
Multiple changes at a single site - hidden changes
Seq 1   AGCGAG
Seq 2   GCGGAC
[Figure: history of one site with ancestral state C. Seq 1: C -> G -> T -> A (changes 1, 2 and 3). Seq 2: C -> A (change 1). Both sequences now show A at this site.]
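Because repeated changes at one site leave at most one visible difference, the observed proportion of differing sites underestimates the true number of substitutions. One common correction, which the slides do not name, is the Jukes-Cantor distance d = -3/4 ln(1 - 4p/3). The sketch below applies it to the two sequences of the example; it assumes equal base frequencies and equal rates for all changes, and six sites are far too few for a meaningful estimate.

```python
import math

def p_distance(a, b):
    """Observed proportion of sites at which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)

def jukes_cantor(p):
    """Estimated substitutions per site under the Jukes-Cantor model."""
    if not 0 <= p < 0.75:
        raise ValueError("the correction is only defined for 0 <= p < 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)

p = p_distance("AGCGAG", "GCGGAC")  # 4 of the 6 sites differ
d = jukes_cantor(p)
print(f"observed p = {p:.3f}, corrected d = {d:.3f}")
```

The corrected distance is larger than the observed one, and the gap grows as sequences become more divergent and hidden changes accumulate.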
Convergence can also mislead
our methods:
• Thermophilic convergence or biased codon
usage patterns may obscure phylogenetic
signal
% Guanine + Cytosine in 16S rRNA genes from mesophiles and thermophiles
                          %GC all sites   %GC variable sites
Thermophiles:
Thermotoga maritima       62              72
Thermus thermophilus      64              72
Aquifex pyrophilus        65              73
Mesophiles:
Deinococcus radiodurans   55              52
Bacillus subtilis         55              50
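The two columns of the table can be computed from an alignment as sketched below. The three short sequences are hypothetical toy data, not real 16S rRNA, and "variable site" is taken here to mean any alignment column that is not identical in all sequences, which is an assumption about how the table was built.

```python
def gc_percent(seq):
    """Percentage of G and C among the unambiguous bases of a sequence."""
    bases = [b for b in seq.upper() if b in "ACGTU"]
    if not bases:
        raise ValueError("no unambiguous bases to count")
    return 100 * sum(b in "GC" for b in bases) / len(bases)

def variable_columns(alignment):
    """Indices of the columns that are not identical in all sequences."""
    seqs = list(alignment.values())
    return [i for i in range(len(seqs[0])) if len({s[i] for s in seqs}) > 1]

def gc_table(alignment):
    """Map each name to (%GC at all sites, %GC at variable sites)."""
    cols = variable_columns(alignment)
    return {
        name: (gc_percent(seq), gc_percent("".join(seq[i] for i in cols)))
        for name, seq in alignment.items()
    }

# Hypothetical toy alignment, for illustration only.
toy = {"x": "GCGCAT", "y": "GCGGAT", "z": "GCATAT"}
table = gc_table(toy)
for name, (all_sites, variable) in table.items():
    print(f"{name}: {all_sites:5.1f} {variable:5.1f}")
```

Restricting the count to variable sites removes the conserved positions that all sequences share, so compositional differences between groups show up more clearly, as in the thermophile rows of the table.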
Gene trees and species trees
[Figure: a gene tree (genes a, b, c) shown together with a species tree (species A, B, D)]
We often assume that gene trees give us
species trees
Gene trees and species trees: why might they differ?
• Gene duplication
• Horizontal gene transfer between species
• Gene analysis can produce trees that conflict
with accepted ideas of species relationships
based upon external data
??
Mitochondrial (mt) genomes of Sauropsida (reptiles+birds).
A. Llanes. Univ. de La Habana
Thanks to:
Federico Abascal, Centro Nacional de Biotecnología, Madrid
Rafael Zardoya, Museo Nacional de Ciencias Naturales, Madrid
Hernán Dopazo, CSAT - Príncipe Felipe, Valencia
Alejandro Llanes, Universidad de La Habana, Cuba
Questions…