Molecular Phylogeny
Molecular phylogenetic analysis
Bacteria
Archaea
Eukarya
Introduction
• Natural Selection
– “Natural selection is daily, hourly, scrutinising the slightest variations, rejecting those that are bad, preserving and adding up all those that are good” (The Origin of Species)
Charles Darwin (1809-1882)
Darwin’s Travels
Lamarck - adaptations
Wallace –
natural selection
Galapagos Finches
The Galapagos Finches
The beaks of the finches are adapted to different jobs
in the same way as tools.
Artificial Selection
Natural Selection
Overproduction
Individual
Variation
Unequal
Reproductive
Success
Charles Darwin’s 1859 book (On the Origin of Species
By Means of Natural Selection, or the Preservation of
Favoured Races in the Struggle for Life) introduced the
theory of evolution.
The struggle for existence induces a natural selection.
Tree of Life
Five-kingdom system (Haeckel, 1879)
[Figure: tree of life with the labels mammals, vertebrates, animals, invertebrates, plants, fungi, protists, monera, protozoa]
Page 396
Introduction
At the molecular level, evolution is a process of mutation
with selection.
Molecular evolution is the study of changes in genes and
proteins throughout different branches of the tree of life.
Phylogeny is the inference of evolutionary relationships.
Traditionally, comparison of morphological features
Today, comparison of molecular sequence data
Introduction
In the 1920s and 1930s, a synthesis occurred between
Darwinism and Mendel’s principles of inheritance.
The basic processes of evolution are
[1] mutation, and also
[2] genetic recombination as two sources of variability;
[3] chromosomal organization (and its variation);
[4] natural selection
[5] reproductive isolation, which constrains the effects
of selection on populations
Levels of Selection
Species: Selection at the species level may lead to extinction, generally following a large environmental change.
Population: Interspecific competition and predation can lead to population decline, unless the population can exploit new niches or find novel ways of avoiding predation.
Individual: Intraspecific competition for shared resources acts on the survival of progeny. Individuals that can exploit their environment better than others survive to pass their genes to their descendants.
Gene: The importance of the phenotypic characters expressed by the genes decides how selection acts on them.
Examples of clades
Lindblad-Toh et al., Nature
438: 803, 8 Dec. 2005, fig. 10
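Orthologs and paralogs
[Figure: an ancestral hemoglobin gene undergoes gene duplication to give the α chain and β chain genes, each shown for frog, chick, and mouse. Genes for the same chain in different species are labeled orthologs; the α chain and β chain genes are labeled paralogs.]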
Gene duplication and loss
[Figure: diagram labeled 1-3 and A-C, showing gene duplication, gene loss, gene merge, and pseudogene formation]
CONCEPT and DEFINITION
Orthologs:
They represent genes derived from a common ancestor
that diverged due to divergence of the organisms they
are associated with.
They tend to have similar function.
Paralogs:
Homologs produced by gene duplication. They represent
genes derived from a common ancestral gene that
duplicated within an organism and then subsequently
diverged.
They tend to have different functions.
CONCEPT and DEFINITION
Xenologs:
Homologs resulting from horizontal gene transfer between
two organisms. Determining whether a gene of
interest was recently transferred into the current host by
horizontal gene transfer is often difficult. The function of
xenologs can vary, depending on how significant the
change in context was for the horizontally moving gene.
In general, the function tends to be similar.
CONCEPT and DEFINITION
Ohnologs:
Paralogous genes that have originated by a
process of whole-genome duplication (WGD).
The name was first given in honour of Susumu
Ohno by Ken Wolfe. Ohnologs are interesting for
evolutionary analysis because they all have been
diverging for the same length of time since their
common origin.
How to find orthologs and paralogs
In eukaryotic genomes, most genes are members of
gene families. When comparing genes from two species,
therefore, most genes in one species will be homologous to
multiple genes in the second. This often makes it difficult to
distinguish orthologs (separated through speciation) from
paralogs (separated by other types of gene duplication).
Combining phylogenetic relationships, gene function and
genomic position in both genomes helps to distinguish
between these scenarios. There are many publications on
this topic, such as:
• Steven B Cannon and Nevin D Young, OrthoParaMap: Distinguishing
orthologs from paralogs by integrating comparative genome data and
gene phylogenies, BMC Bioinformatics 2003, 4:35
Bidirectional best hits (BBH)
The best hit of a particular gene to a target
genome is the gene in that genome that
represents a best match. The match is
bidirectional if the two genes are best hits of
each other. A bidirectional best hit represents
a very strong similarity between two genes,
and is considered evidence that the genes
may be orthologs arising from a common
ancestor.
Formally, the paper "The use of gene
clusters to infer functional coupling" defines a
bidirectional best hit (or BBH) as follows:
Given two genes Xa and Xb from two genomes Ga and Gb, Xa
and Xb are called a “bidirectional best hit (BBH)” if and only if
recognizable similarity exists between them (in our case, we required
Similarity Scores lower than 1.0 × 10^-5), there is no gene Zb in Gb that
is more similar than Xb is to Xa, and there is no gene Za in Ga that is
more similar than Xa is to Xb.
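The BBH definition can be sketched in a few lines. This is a minimal illustration, not the method of the cited paper: the input format (dictionaries of pairwise scores where higher means more similar), the function names, and the toy scores are all assumptions, and the "recognizable similarity" threshold is left out.

```python
def best_hits(scores):
    """Map each query gene to its best-scoring target gene.

    `scores` maps (query, target) pairs to a similarity score where a
    higher value means more similar (hypothetical input format).
    """
    best = {}
    for (query, target), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (target, score)
    return {query: target for query, (target, _) in best.items()}


def bidirectional_best_hits(a_vs_b, b_vs_a):
    """Return sorted (gene_a, gene_b) pairs that are best hits of each other."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted(
        (xa, xb) for xa, xb in best_ab.items() if best_ba.get(xb) == xa
    )


# Toy scores for genome Ga (genes a1, a2) and genome Gb (genes b1, b2).
a_vs_b = {("a1", "b1"): 250, ("a1", "b2"): 90,
          ("a2", "b1"): 120, ("a2", "b2"): 80}
b_vs_a = {("b1", "a1"): 245, ("b1", "a2"): 118,
          ("b2", "a1"): 95, ("b2", "a2"): 70}
pairs = bidirectional_best_hits(a_vs_b, b_vs_a)
print(pairs)
```

Here a2 also has b1 as its best hit, but b1 prefers a1, so only the a1/b1 pair is reported. This is the behavior that lets BBH separate likely orthologs from other members of a gene family.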
Gene mapping
Genome Informatics 12: 44–53 (2001)
Use the bidirectional best hits (BBH) criterion to define orthologs when two genomes are compared by the Smith-Waterman algorithm at the amino acid sequence level, with a threshold similarity score of 70.
To characterize the genes of an organism, its genes S(G1) are first mapped to the nodes of the graph G2, which encodes functional orthologs in another organism. G2 is then compared with an additional graph G3 of the original organism, instead of comparing G1 and G3 directly.
Gene mapping
Gene-gene relationships on a specific attribute
can be denoted by using a set of binary
relationships in a general manner. For example,
let a binary operator ' ∼ ' denote a binary
relationship between two genes, and let g1, g2,
g3, and g4 be a series of genes arranged in this
order in a genome sequence, their geometrical
relationships are broken down into a set of
binary relationships {g1 ∼ g2, g2 ∼ g3, g3 ∼ g4}.
A set of binary relationships among genes forms a graph structure as a whole. The figure shows three graphs G1 (genome), G2 (pathway), and G3 (similarity), where each graph node corresponds to a gene or a gene product. In a graph, two nodes are connected by an edge (shown as a solid line) when they are related by a binary relationship.
In a set of genes, if all or most of the genes preserve their mutual relationships in multiple graphs, like the light gray nodes and the dark gray nodes, the biological relevance among those genes is considered to be well supported. We call such a set of genes a correlated gene cluster (or simply, correlated cluster), by which we can characterize, classify, and predict the activities of genes.
A. Mouse
B. Human
Overview of the defensin gene cluster region in mouse
(top) and human (bottom). A clone tiling path is shown for
the corresponding regions in mouse (top) and human
(bottom). Clones are displayed in yellow but regions
overlapping with adjacent clones are shown in black. Genes
are indicated by arrows. Genes in shadowed boxes are
duplicated and the color indicates the pairs;
A -- highlights all potential Defcr5 genes (see color legend
for more details). The mouse assembly is based on
NCBIM37, in which three gaps currently exist; two gaps are
indicated by grey bars and the biggest gap between the two
clusters is joined by a 'V'.
Annotation of the mouse defensin genes: Amid et al. BMC Genomics 2009
10:606 doi:10.1186/1471-2164-10-606.
The concept of a phylogenetic tree
Phylogenetic Trees: In each panel, the phylogenetic group is depicted by a
green shaded circle.
A) Monophyletic group. Species C and D share a common ancestor (E)
not shared by any other species.
B) Paraphyletic group. All species in the group share a common ancestor
(F), but some species (D) have been excluded from the group.
C) Polyphyletic group. A grouping of lineages each more closely related to
other species not in the group than they are to each other.
--From Barton et al. (2007) Evolution, p. 111.
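Rooted and unrooted trees
Scaled trees
The concept of a phylogenetic tree
In general, a phylogenetic tree is a two-dimensional graph showing the evolutionary relationships among species; it can also reflect the evolutionary relationships among molecules (genes) from different species.
1. Rooted tree
[Figure: rooted tree of sequences A, B, C, and D, with nodes and branches labeled]
The length of branches reflects the number of sequence changes.
Often: assume a uniform rate of mutations (molecular clock hypothesis).
2. Unrooted tree
[Figure: unrooted tree of sequences A, B, C, and D]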
Molecular phylogeny: nomenclature of trees
There are two main kinds of information inherent
to any tree: topology and branch lengths.
We will now describe the parts of a tree.
Page 366
Molecular phylogeny uses trees to depict evolutionary
relationships among organisms.
These trees are based upon DNA and
protein sequence data.
[Figure: example tree with taxa A-E, internal nodes F-I, numeric branch lengths, and a time axis]
Tree nomenclature
Node (intersection or terminating point of two or more branches)
Branch (edge)
[Figure: the example tree of taxa A-E, with a node and a branch labeled]
Tree nomenclature
Taxon
[Figure: the example tree with two taxa labeled]
Tree nomenclature
Operational taxonomic unit (OTU), such as a protein sequence
Hypothetical taxonomic unit (HTU)
[Figure: the example tree of taxa A-E, with an OTU and an HTU labeled]
Tree nomenclature
Branches are unscaled: OTUs are neatly aligned, and nodes reflect time.
Branches are scaled: branch lengths are proportional to the number of amino acid changes.
[Figure: the example tree of taxa A-E, drawn with unscaled and with scaled branches]
Fig. 11.4
Page 366
Tree nomenclature
Bifurcating internal node
Multifurcating internal node
Fig. 11.5
Page 367
Tree nomenclature: clades
Clade ABF (monophyletic group)
[Figure: the example tree of taxa A-E, with clade ABF highlighted]
Fig. 11.4
Page 366
Tree nomenclature
Clade CDH
[Figure: the example tree of taxa A-E, with clade CDH highlighted]
Fig. 11.4
Page 366
Tree nomenclature
Clade ABF/CDH/G
[Figure: the example tree of taxa A-E, with the clade containing ABF, CDH, and G highlighted]
Fig. 11.4
Page 366
Monophyletic, paraphyletic, and polyphyletic groups
Ingroup, outgroup, and sister group
Species trees versus gene/protein trees
Molecular evolutionary studies can be complicated
by the fact that both species and genes evolve.
Speciation usually occurs when a species becomes
reproductively isolated. In a species tree, each
internal node represents a speciation event.
Genes (and proteins) may duplicate or otherwise evolve
before or after any given speciation event. The topology
of a gene (or protein) based tree may differ from the
topology of a species tree.
Page 370
Molecular clock hypothesis
In the 1960s, sequence data were accumulated for small,
abundant proteins such as globins, cytochromes c, and
fibrinopeptides. Some proteins appeared to evolve slowly,
while others evolved rapidly.
Linus Pauling, Emanuel Margoliash and others proposed
the hypothesis of a molecular clock:
For every given protein, the rate of molecular evolution is
approximately constant in all evolutionary lineages
Molecular clock hypothesis
As an example, Richard Dickerson (1971) plotted data
from three protein families: cytochrome c,
hemoglobin, and fibrinopeptides.
The x-axis shows the divergence times of the species,
estimated from paleontological data. The y-axis shows
m, the corrected number of amino acid changes per
100 residues.
n is the observed number of amino acid changes per
100 residues, and it is corrected to m to account for
changes that occur but are not observed:
$$\frac{n}{100} = 1 - e^{-m/100}$$
[Figure: Dickerson (1971). Corrected amino acid changes per 100 residues (m) plotted against millions of years since divergence.]
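Solving the correction for m gives m = -100 ln(1 - n/100). The sketch below simply evaluates that rearrangement. The function name is an assumption, and the sample values of n are arbitrary illustrations rather than Dickerson's data.

```python
import math


def corrected_changes(n):
    """Corrected changes per 100 residues (m) from observed changes (n).

    Rearranges n/100 = 1 - exp(-m/100) into m = -100 * ln(1 - n/100).
    """
    if not 0 <= n < 100:
        raise ValueError("n must satisfy 0 <= n < 100")
    return -100.0 * math.log(1.0 - n / 100.0)


# Arbitrary observed values: the correction grows as n approaches saturation.
for n in (10, 50, 80):
    print(n, round(corrected_changes(n), 1))
```

At small n the observed and corrected values nearly agree. At large n many changes are hidden by repeated substitutions at the same site, so m exceeds n by a wide margin.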
Molecular clock hypothesis: conclusions
Dickerson drew the following conclusions:
• For each protein, the data lie on a straight line. Thus,
the rate of amino acid substitution has remained
constant for each protein.
• The average rate of change differs for each protein.
The time for a 1% change to occur between two lines
of evolution is 20 MY (cytochrome c), 5.8 MY
(hemoglobin), and 1.1 MY (fibrinopeptides).
• The observed variations in rate of change reflect
functional constraints imposed by natural selection.
Molecular clock hypothesis: λ and PAM
The rate of amino acid substitution is measured by λ,
the number of substitutions per amino acid site per year.
Consider serum albumin:
λ = 1.9 × 10^-9
λ × 10^9 = 1.9
Dayhoff et al. reported the rate of mutation acceptance for
serum albumin as 19 PAMs (accepted point mutations per 100 amino acid residues) per 100
million years.
(19 subst./100 aa/10^8 years = 1.9 subst./1 aa/10^9 years)
Molecular clock for proteins:
rate of substitutions per aa site per 10^9 years
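The unit conversion for serum albumin can be checked with a short calculation. This sketch assumes the usual convention that one PAM is one accepted point mutation per 100 residues; the function name is hypothetical.

```python
def pam_rate_to_lambda(pams_per_100_my):
    """Convert PAMs per 100 million years to substitutions per site per year.

    Assumes 1 PAM = 1 accepted point mutation per 100 residues.
    """
    per_site = pams_per_100_my / 100.0   # changes per single residue
    return per_site / 1e8                # per year instead of per 10^8 years


lam = pam_rate_to_lambda(19)   # serum albumin
print(lam)                     # substitutions per site per year
print(lam * 1e9)               # substitutions per site per 10^9 years
```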
Protein            Rate
Fibrinopeptides    9.0
Kappa casein       3.3
Lactalbumin        2.7
Serum albumin      1.9
Lysozyme           0.98
Trypsin            0.59
Insulin            0.44
Cytochrome c       0.22
Histone H2B        0.09
Ubiquitin          0.010
Histone H4         0.010
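Steps in phylogenetic data analysis
The four main steps in the phylogenetic analysis of DNA/protein sequences:
1. Multiple sequence alignment
2. Choosing a substitution model
3. Building the tree
4. Evaluating the tree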
Partial alignment of histones from PFAM (λ = 0.05)
H2A1_HUMAN/4-119    R.KGNYAERV GAGAPVYLAA VLEYLTAEIL ELAGNAARDN KKTRIIPR
H2A1_YEAST/3-120    R.RGNYAQRI GSGAPVYLTA VLEYLAAEIL ELAGNAARDN KKTRIIPR
H2A3_VOLCA/5-119    K.KGKYAERI GAGAPVYLAA VLEYLTAEVL ELAGNAARDN KKNRIVPR
H2A_PLAFA/5-120     K.KGKYAKRV GAGAPVYLAA VLEYLCAEIL ELAGNAARDN KKSRITPR
H2A1_PEA/11-128     K.KGRYAQRV GTGAPVYLAA VLEYLAAEVL ELAGNAARDN KKNRISPR
H2A1_TETPY/7-123    K.HGRYSERI GTGAPVYLAA VLEYLAAEVL ELAGNAAKDN KKTRIVPR
H2AM_RAT/4-116      K.KGHPKYRI GVGAPVYMAA VLEYLTAEIL ELAGNAARDN KKGRVTPR
H2A_EUGGR/18-134    R.AGRYAKRV GKGAPVYLAA VLEYLSAELL ELAGNASRDN KKKRITPR
H2A2_XENLA/4-119    R.KGNYAERV GAGAPVYLAA VLEYLTAEIL ELAWERLPEI TKRPVLSP
H2AV_CHICK/6-121    KTRTTSHGRV GATAAVYSAA ILEYLTAEVL ELAGNASKDL KVKRITPR
H2AV_TETTH/6-131    KGRVSAKNRV GATAAVYAAA ILEYLTAEVL ELAGNASKDF KVRRITPR
Partial alignment of casein from PFAM (λ = 3.3)
CASK_BOVIN/2-190    VLSRYPSYGL NYYQQKPVAL .INNQFLPYP YYAKPAAVRS PAQILQWQVL
CASK_CERNI/2-190    ALSRYPSYGL NYYQHRPVAL .INNQFLPYP YYVKPGAVRS PAQILQWQVL
CASK_CAMDR/1-182    VQSRYPSYGI NYYQHRLAVP .INNQFIPYP NYAKPVAIRL HAQIPQCQAL
CASK_PIG/2-188      MLNRFPSYGF .FYQHRSAVS .PNRQFIPYP YYARPVVAGP HAQKPQWQDQ
CASK_HUMAN/1-182    VPNSYPYYGT NLYQRRPAIA .INNPYVPRT YYANPAVVRP HAQIPQRQYL
CASK_RABIT/2-179    VMNRYPQYEP SYYLRRQAVP .TLNPFMLNP YYVKPIVFKP NVQVPHWQIL
CASK_CAVPO/2-181    VLNNYLRTAP SYYQNRASVP .INNPYLCHL YYVPSFVLWA QGQIPKGPVS
CASK_MOUSE/2-181    VLN.FNQYEP NYYHYRPSLP ATASPYMYYP LVVRLLLLRS PAPISKWQSM
CASK_RAT/2-178      VLN.RNHYEP IYYHYRTSVP ..VSPYAYFP VGLKLLLLRS PAQILKWQPM
Most conserved proteins in worm, human, and yeast
Protein         worm/human   worm/yeast   yeast/human
H4 histone      99% id       91% id       92% id
H3.3 histone    99           89           90
Actin B         98           88           89
Ubiquitin       98           95           96
Calmodulin      96           59           58
Tubulin         94           75           76
See Copley et al. (1999)
Sanger and colleagues sequenced insulin (1950s)
Human       CGERGFFYTPKTRREAEDLQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLEN
chimpanzee  CGERGFFYTPKTRREAEDLQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLEN
rabbit      CGERGFFYTPKSRREVEELQVGQAELGGGPGAGGLQPSALELALQKRGIVEQCCTSICSLYQLEN
dog         CGERGFFYTPKARREVEDLQVRDVELAGAPGEGGLQPLALEGALQKRGIVEQCCTSICSLYQLEN
horse       CGERGFFYTPKAXXEAEDPQVGEVELGGGPGLGGLQPLALAGPQQXXGIVEQCCTGICSLYQLEN
mouse       CGERGFFYTPMSRREVEDPQVAQLELGGGPGAGDLQTLALEVAQQKRGIVDQCCTSICSLYQLEN
rat         CGERGFFYTPMSRREVEDPQVAQLELGGGPGAGDLQTLALEVARQKRGIVDQCCTSICSLYQLEN
pig         CGERGFFYTPKARREAENPQAGAVELGG--GLGGLQALALEGPPQKRGIVEQCCTSICSLYQLEN
chicken     CGERGFFYSPKARRDVEQPLVSSPLRG---EAGVLPFQQEEYEKVKRGIVEQCCHNTCSLYQLEN
sheep       CGERGFFYTPKARREVEGPQVGALELAGGPGAG-----GLEGPPQKRGIVEQCCAGVCSLYQLEN
bovine      CGERGFFYTPKARREVEGPQVGALELAGGPGAG-----GLEGPPQKRGIVEQCCASVCSLYQLEN
whale       CGERGFFYTPKA-----------------------------------GIVEQCCTSICSLYQLEN
elephant    CGERGFFYTPKT-----------------------------------GIVEQCCTGVCSLYQLEN
We can make a multiple sequence alignment of insulins
from various species, and see conserved regions…
Mature insulin consists of an A chain and B chain
heterodimer connected by disulphide bridges
The signal peptide and C peptide are cleaved, and
their sequences display fewer functional constraints.
Note the sequence divergence in the
disulfide loop region of the A chain
[Figure annotation: number of nucleotide substitutions/site/year across the three regions shown: 0.1 × 10^-9, 1 × 10^-9, 0.1 × 10^-9]
http://evolution.genetics.washington.edu/phylip/software.html
This site lists 200 phylogeny packages. Perhaps the best-known programs are PAUP (David Swofford and colleagues)
and PHYLIP (Joe Felsenstein).
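• In phylogenetic analysis, a guide tree is introduced during alignment. The guide tree produced by an alignment program such as CLUSTAL is converted to the PHYLIP tree file format and then passed to a tree-drawing program.
• Commonly used tree-drawing programs include TreeTool (X Windows), PHYLIP, TREEVIEW, PAUP, and MEGA.
• The three main tree-building methods are:
1. Distance matrix methods
2. Maximum parsimony (MP)
3. Maximum likelihood (ML)
• Distance trees examine the pairwise alignments of all sequences in the data set, and use the pairwise differences between sequences to determine the topology and branch lengths of the tree.
• Maximum parsimony examines the multiple alignment of the sequences in the data set and optimizes a tree from it.
• Maximum likelihood examines the multiple alignment of the sequences in the data set and optimizes a tree with a given topology and branch lengths, namely the tree that gives rise to the observed multiple alignment with the highest probability.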
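Distance matrix methods
1. Neighbor-joining method (NJ)
2. UPGMA
Both methods first require a symmetric distance matrix (a square matrix of order m) D = {dij}m×m, where m is the number of OTUs (taxa).
There are many formulas for distance coefficients. For example, Nei's (1972) genetic distance is suited to restriction enzyme and isozyme data, while the Jukes-Cantor one-parameter distance and the Kimura two-parameter distance are widely used for all kinds of sequence data.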
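The two sequence distances named above can be sketched as follows, using the standard formulas d = -(3/4) ln(1 - 4p/3) for Jukes-Cantor and d = -(1/2) ln[(1 - 2P - Q) sqrt(1 - 2Q)] for Kimura two-parameter, where p is the fraction of differing sites, P the fraction of transitions, and Q the fraction of transversions. The function names and toy DNA sequences are assumptions. Gapped or ambiguous sites are simply skipped, and no guard is included for saturated or empty comparisons.

```python
import math

TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def count_differences(seq1, seq2):
    """Return (sites compared, transitions, transversions), skipping gaps."""
    sites = transitions = transversions = 0
    for x, y in zip(seq1.upper(), seq2.upper()):
        if x not in "ACGT" or y not in "ACGT":
            continue
        sites += 1
        if x != y:
            if (x, y) in TRANSITIONS:
                transitions += 1
            else:
                transversions += 1
    return sites, transitions, transversions


def jukes_cantor(seq1, seq2):
    """Jukes-Cantor one-parameter distance."""
    sites, ts, tv = count_differences(seq1, seq2)
    p = (ts + tv) / sites
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def kimura_2p(seq1, seq2):
    """Kimura two-parameter distance."""
    sites, ts, tv = count_differences(seq1, seq2)
    P, Q = ts / sites, tv / sites
    return -0.5 * math.log((1.0 - 2.0 * P - Q) * math.sqrt(1.0 - 2.0 * Q))


def distance_matrix(seqs, distance):
    """Symmetric m x m matrix D = {dij} for a dict of aligned sequences."""
    names = list(seqs)
    return [[0.0 if a == b else distance(seqs[a], seqs[b]) for b in names]
            for a in names]


# Toy aligned sequences (hypothetical data).
seqs = {"s1": "ACGTACGTAC", "s2": "ACGTACGTAT", "s3": "ACGAACGTTT"}
for row in distance_matrix(seqs, jukes_cantor):
    print([round(v, 3) for v in row])
```

The resulting matrix is the input expected by distance methods such as UPGMA and neighbor-joining.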
Tree-building methods
We will discuss two tree-building methods:
distance-based and character-based.
Distance-based methods involve a distance metric,
such as the number of amino acid changes between
the sequences, or a distance score. Examples of
distance-based algorithms are UPGMA and
neighbor-joining.
Tree-building methods
We can introduce distance-based and character-based tree-building methods by referring to a tree
of 13 orthologous retinol-binding proteins, and the
multiple sequence alignment from which the tree
was generated.
[Figure: tree of retinol-binding proteins (RBP) from common carp, zebrafish, rainbow trout, teleost, African clawed frog, chicken, human, mouse, rat, horse, pig, cow, and rabbit. Scale bar: 10 changes.]
Orthologs:
members of a
gene (protein)
family in various
organisms.
This tree shows
RBP orthologs.
[Figure: the same RBP tree with two groups marked. Fish RBP orthologs: common carp, zebrafish, rainbow trout, teleost. Other vertebrate RBP orthologs: African clawed frog, chicken, human, mouse, rat, horse, pig, cow, rabbit. Scale bar: 10 changes.]
Distance-based tree
Calculate the pairwise alignments;
if two sequences are related,
put them next to each other on the tree
Character-based tree: identify
positions that best describe how
characters (amino acids) are
derived from common ancestors
Stage 3: Tree-building methods: distance
Many software packages are available for making
phylogenetic trees. We will describe two programs.
[1] MEGA (Molecular Evolutionary Genetics Analysis) by
Sudhir Kumar, Koichiro Tamura, and Masatoshi Nei.
Download it from http://www.megasoftware.net/
[2] Phylogeny Analysis Using Parsimony (PAUP), written
by David Swofford. See http://paup.csit.fsu.edu/.
We will next use MEGA and PAUP to generate trees by
the distance-based method UPGMA.
How to use MEGA to make a tree
[1] Enter a multiple sequence alignment (.meg) file
[2] Under the phylogeny menu, select one of these
methods…
Maximum Likelihood (ML)
Neighbor-Joining (NJ)
Minimum Evolution (ME)
UPGMA
Maximum Parsimony (MP)
Use of MEGA for a distance-based tree: UPGMA
Click green boxes
to obtain options
Click compute
to obtain tree
A variety of styles are available for tree display
Use of MEGA for a distance-based tree: UPGMA
Flipping branches around a node creates
an equivalent topology
How to use PAUP to make a tree
Step 1: Obtain MSF
Step 2: Convert
Step 3: Import to PAUP and execute
Step 4: Perform analyses (generate trees)
Step 5: More analyses (evaluate trees)
Step 6: View, export: Print Trees
How to use PAUP to make a tree
Step 1: Get a multiple sequence alignment (e.g. from PFAM)
Step 2: Convert it with ReadSeq
(use a Google search to identify a site offering ReadSeq,
such as the Baylor College of Medicine)
Step 3: Import as new file into PAUP
Fig. 11.15
Page 380
PAUP allows input of multiple sequence alignments,
data editing, creation and analysis of phylogenetic trees
Fig. 11.15
Page 380
Making trees using UPGMA
In PAUP, you can set the tree-making criterion to
“distance” then choose UPGMA (unweighted pair
group method with arithmetic mean)
Page 379
PAUP performs UPGMA (distance-based tree)
Fig. 11.16
Page 381
Tree-building methods: UPGMA
UPGMA is
unweighted pair group method
using arithmetic mean
[Figure: five proteins, numbered 1-5]
Tree-building methods: UPGMA
Step 1: Compute the pairwise distances of all
the proteins. Get ready to put the numbers 1-5
at the bottom of your new tree.
Tree-building methods: UPGMA
Step 2: Find the two proteins with the
smallest pairwise distance. Cluster them.
[Figure: proteins 1 and 2 joined at node 6]
Tree-building methods: UPGMA
Step 3: Do it again. Find the next two proteins
with the smallest pairwise distance. Cluster them.
[Figure: proteins 4 and 5 joined at node 7]
Tree-building methods: UPGMA
Step 4: Keep going. Cluster.
[Figure: a further cluster formed at node 8]
Tree-building methods: UPGMA
Step 5: Last cluster! This is your tree.
[Figure: the finished tree, joined at node 9, with leaves in the order 1, 2, 4, 5, 3]
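The clustering steps above can be sketched in Python. This is a minimal illustration under stated assumptions: the distance values are invented so that the merges follow the order shown on the slides (1 with 2, then 4 with 5, then 3, then everything), and the data layout (a dictionary keyed by `frozenset` pairs) and function names are hypothetical choices, not the algorithm as implemented in MEGA or PAUP.

```python
from itertools import combinations


def make_dist(triples):
    """Build a distance lookup keyed by unordered pairs of taxa."""
    return {frozenset((a, b)): v for a, b, v in triples}


def upgma(names, dist):
    """Cluster taxa by UPGMA.

    Returns (tree, merges): the tree as nested tuples, and the list of
    (left, right, node height) in the order the clusters were joined.
    """
    clusters = {name: 1 for name in names}   # cluster -> number of leaves
    d = dict(dist)
    merges = []
    while len(clusters) > 1:
        # Find the two clusters with the smallest pairwise distance.
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: d[frozenset(pair)])
        new = (a, b)
        merges.append((a, b, d[frozenset((a, b))] / 2.0))
        size_a, size_b = clusters.pop(a), clusters.pop(b)
        # Distance to each remaining cluster: mean over all leaf pairs.
        for c in clusters:
            d[frozenset((new, c))] = (
                size_a * d[frozenset((a, c))] + size_b * d[frozenset((b, c))]
            ) / (size_a + size_b)
        clusters[new] = size_a + size_b
    (tree,) = clusters
    return tree, merges


# Invented distances for five proteins labeled "1" to "5".
dist = make_dist([
    ("1", "2", 2), ("1", "3", 8), ("1", "4", 9), ("1", "5", 9),
    ("2", "3", 8), ("2", "4", 9), ("2", "5", 9),
    ("3", "4", 5), ("3", "5", 5), ("4", "5", 3),
])
tree, merges = upgma(["1", "2", "3", "4", "5"], dist)
print(tree)
for left, right, height in merges:
    print(left, right, height)
```

Placing each node at half the distance between the joined clusters reflects the molecular clock assumption built into UPGMA: all leaves end up equidistant from the root.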
Making trees using neighbor-joining
The neighbor-joining
method of Saitou and Nei
(1987) is especially useful
for making a tree having a
large number of taxa.
Begin by placing all the taxa in a star-like structure.
Tree-building methods: Neighbor joining
Next, identify neighbors (e.g. 1 and 2) that are most closely
related. Connect these neighbors to other OTUs via an
internal branch, XY. At each successive stage, minimize
the sum of the branch lengths.
Tree-building methods: Neighbor joining
Define the distance from X to Y by
$$d_{XY} = \tfrac{1}{2}\,(d_{1Y} + d_{2Y} - d_{12})$$
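One neighbor-joining step can be sketched as below. The distance update is the formula above, applied with Y taken as each remaining OTU in turn. The rule for choosing which pair to join (the Q criterion, Q(i,j) = (n-2) d(i,j) - r(i) - r(j), with r the row sums) and the branch-length formula are the standard textbook formulation, assumed here because the slides do not spell them out. The five-taxon distances and all names are illustrative.

```python
def symmetric(pairs):
    """Build a dict-of-dicts distance matrix from (a, b, value) triples."""
    d = {}
    for a, b, v in pairs:
        d.setdefault(a, {})[b] = v
        d.setdefault(b, {})[a] = v
    return d


def nj_step(names, d):
    """Perform one neighbor-joining step.

    Returns (i, j, len_i, len_j, to_node): the joined pair, their branch
    lengths to the new internal node, and the distances from that node
    to every remaining taxon.
    """
    n = len(names)
    r = {a: sum(d[a][b] for b in names if b != a) for a in names}
    best = None
    for x in range(n):
        for y in range(x + 1, n):
            a, b = names[x], names[y]
            q = (n - 2) * d[a][b] - r[a] - r[b]   # Q criterion
            if best is None or q < best[0]:
                best = (q, a, b)
    _, i, j = best
    # Distance from the new node to each remaining taxon k.
    to_node = {k: 0.5 * (d[i][k] + d[j][k] - d[i][j])
               for k in names if k not in (i, j)}
    len_i = 0.5 * d[i][j] + (r[i] - r[j]) / (2.0 * (n - 2))
    len_j = d[i][j] - len_i
    return i, j, len_i, len_j, to_node


# Illustrative distances among five taxa.
d = symmetric([
    ("a", "b", 5), ("a", "c", 9), ("a", "d", 9), ("a", "e", 8),
    ("b", "c", 10), ("b", "d", 10), ("b", "e", 9),
    ("c", "d", 8), ("c", "e", 7), ("d", "e", 3),
])
i, j, len_i, len_j, to_node = nj_step(["a", "b", "c", "d", "e"], d)
print(i, j, len_i, len_j, to_node)
```

Note that d and e have the smallest raw distance, yet a and b are joined first. Unlike UPGMA, neighbor-joining corrects for unequal rates by taking each taxon's total distance to all others into account.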
Example of a
neighbor-joining
tree: phylogenetic
analysis of 13
RBPs
Tree-building methods: character based
Rather than pairwise distances between proteins,
evaluate the aligned columns of amino acid
residues (characters).
Tree-building methods based on characters include
maximum parsimony and maximum likelihood.
As an example of tree-building using maximum
parsimony, consider these four taxa:
AAG
AAA
GGA
AGA
How might they have evolved from a
common ancestor such as AAA?
Tree-building methods: Maximum parsimony
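[Figure: three candidate trees deriving the four taxa from the ancestor AAA, with the number of changes marked on each branch. Grouping (AAG, AAA) with (GGA, AGA): cost = 3. Grouping (AAG, AGA) with (AAA, GGA): cost = 4. Grouping (AAG, GGA) with (AAA, AGA): cost = 4.]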
In maximum parsimony, choose the tree(s) with the
lowest cost (shortest branch lengths).
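The costs of the three candidate trees can be reproduced with Fitch's small-parsimony algorithm. This is a swapped-in technique rather than the slide's own procedure: the slide fixes the ancestor as AAA, whereas Fitch's algorithm infers ancestral states, although for these four taxa the costs coincide. The taxon labels t1 to t4 and the nested-tuple tree encoding are assumptions.

```python
def fitch_site(tree, column):
    """Return (possible states, cost) for one alignment column.

    tree: a leaf name, or a 2-tuple of subtrees (rooted binary tree).
    column: dict mapping leaf name -> character at this position.
    """
    if isinstance(tree, str):
        return {column[tree]}, 0
    left_set, left_cost = fitch_site(tree[0], column)
    right_set, right_cost = fitch_site(tree[1], column)
    common = left_set & right_set
    if common:
        return common, left_cost + right_cost          # no change needed
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_cost(tree, seqs):
    """Sum the minimum number of changes over all aligned positions."""
    length = len(next(iter(seqs.values())))
    return sum(
        fitch_site(tree, {name: s[pos] for name, s in seqs.items()})[1]
        for pos in range(length)
    )


seqs = {"t1": "AAG", "t2": "AAA", "t3": "GGA", "t4": "AGA"}
trees = [
    (("t1", "t2"), ("t3", "t4")),   # (AAG, AAA) with (GGA, AGA)
    (("t1", "t4"), ("t2", "t3")),   # (AAG, AGA) with (AAA, GGA)
    (("t1", "t3"), ("t2", "t4")),   # (AAG, GGA) with (AAA, AGA)
]
costs = [parsimony_cost(t, seqs) for t in trees]
print(costs)  # [3, 4, 4]
```

The first tree has the lowest cost, so it is the one maximum parsimony selects.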
Phylogram (values are proportional to branch lengths)
Rectangular phylogram (values are proportional to branch lengths)
Cladogram (values are not proportional to branch lengths)
Rectangular cladogram (values are not proportional to branch lengths)
These four trees display the same data
in different formats.
Making trees using maximum likelihood
Maximum likelihood is an alternative to maximum
parsimony. It is computationally intensive. A likelihood
is calculated for the probability of each residue in
an alignment, based upon some model of the
substitution process.
ML is implemented in the TREE-PUZZLE program,
as well as PAUP and PHYLIP.
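Character-based tree-building methods
• The character-based tree-building methods are maximum parsimony and maximum likelihood.
• Maximum parsimony (MP): Maximum parsimony is an optimality criterion. The tree is built on the principle of explaining the observed differences among the taxa under study with the fewest changes. The explanation with the fewest assumptions is the simplest; in practice, the MP tree is the shortest tree, the one with the fewest changes.
Maximum parsimony method
Step 1: Input: a multiple sequence alignment.
Step 2: For each aligned position, identify the trees that require the smallest number of evolutionary changes to produce the observed sequence changes.
Step 3: Continue the analysis for every position in the alignment.
Step 4: With the sequence variation at each aligned position placed at the tips of the tree, identify the tree that produces the smallest number of changes over all sequence positions.
Suited to cases with many informative sites.
Maximum likelihood (ML)
ML performs an exhaustive search of the phylogenetic problem. ML aims to find an evolutionary model (including a search over the tree itself) such that the data the model would generate are most similar to the observed data.
ML computes the probability of the pattern of variation at a site under a particular substitution process; the likelihood of the site is the sum of the probabilities of every possible reconstruction of substitutions under that process. Multiplying the likelihoods of all sites gives the likelihood of the whole tree.
Maximum likelihood
Uses probability calculations to find the tree that best accounts for the sequence variation.
Each column of the multiple sequence alignment is analyzed. All possible trees are considered.
An evolutionary model of sequence change provides estimates of the rate at which one base changes into another: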
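Base   A                 C                 G                 T
A      -u(aC+bG+cT)      uaC               ubG               ucT
C      ugA               -u(gA+dG+eT)      udG               ueT
G      uhA               ujC               -u(hA+jC+fT)      ufT
T      uiA               ukC               ulG               -u(iA+kC+lG)
(Rows give the original base and columns the new base. In the entries, u is the mean substitution rate, a to l are relative rate parameters, and A, C, G, T stand for the base frequencies; each row sums to zero.)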
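Steps of the maximum likelihood method
Step 1: Align the set of sequences.
Step 2: Test whether the substitutions in each column fit a set of trees describing the phylogenetic relationships among the sequences.
Given the data set, each tree has a likelihood:
$$P(\text{tree}) = \prod_{i=\text{branch}_1}^{\text{branch}_n} (\text{mutation rate})_i = \prod_{i=\text{branch}_1}^{\text{branch}_n} (\text{rate of substitution in branch } i) \times (\text{length of branch } i)$$
Advantages: can be used to evaluate trees with rate variation, and can be applied to highly divergent sequences.
Disadvantage: computationally intensive.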
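The idea of scoring a hypothesis by the probability of the alignment can be shown on the smallest possible case: two sequences joined by a single branch. The sketch below assumes the Jukes-Cantor (JC69) model, chosen only because it is the simplest substitution model, and uses a plain grid search instead of a real optimizer. The sequences and function names are invented for illustration; real ML programs also search over tree topologies.

```python
import math


def jc69_prob(x, y, t):
    """Probability that base x is replaced by base y after branch length t
    (expected substitutions per site) under the Jukes-Cantor model."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if x == y else 0.25 - 0.25 * e


def log_likelihood(seq1, seq2, t):
    """Log-likelihood of two aligned sequences separated by branch length t.

    Site likelihoods are multiplied, so their logarithms are summed.
    The factor 0.25 is the equal base frequency assumed by JC69.
    """
    return sum(math.log(0.25 * jc69_prob(x, y, t))
               for x, y in zip(seq1, seq2))


def ml_branch_length(seq1, seq2, step=0.001, max_t=2.0):
    """Grid search for the branch length with the highest likelihood."""
    grid = [step * k for k in range(1, int(max_t / step) + 1)]
    return max(grid, key=lambda t: log_likelihood(seq1, seq2, t))


# Invented sequences differing at 2 of 20 sites.
seq1 = "ACGTACGTACGTACGTACGT"
seq2 = "ACGTACGTATGTACGAACGT"
t_hat = ml_branch_length(seq1, seq2)
print(t_hat, log_likelihood(seq1, seq2, t_hat))
```

For two sequences the maximum-likelihood branch length agrees, to within the grid step, with the Jukes-Cantor distance computed from the observed fraction of differing sites. That agreement is a useful sanity check on the likelihood function.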
NEXUS format
((IM21:100.0,((((((Pa10:100.0,((((NI1k:100.0,NIM3:100.0):84.0,MU4k:100.0):79.0,(((LZ11:100.0,PT18:100.0):71.0,LR20:100.0):19.0,FL19:100.0):13.0):5.0,(AC15:100.0,(MC16:100.0,FU14:100.0):99.0):89.0):13.0):6.0,((PI7k:100.0,TU6k:100.0):45.0,TE80:100.0):33.0):11.0,LG12:100.0):15.0,(XI22:100.0,(CH17:100.0,GR13:100.0):104.0):89.0):19.0,PU5k:100.0):34.0,LI90:100.0):43.0):61.0,out:100.0);
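A parenthesized tree description like the one above nests each clade in brackets, with `label:value` pairs for the leaves and a `:value` after each closing bracket. As a rough sketch of how such a string can be read, the snippet below pulls out the leaf labels with a regular expression. It is not a full parser, the function name is hypothetical, and the short tree string is a made-up example rather than the tree shown above.

```python
import re


def leaf_names(tree_string):
    """Return the leaf labels of a parenthesized tree string, in order.

    A leaf label directly follows "(" or ","; values that follow ")"
    belong to internal nodes and are skipped.
    """
    return re.findall(r"[(,]\s*([^\s():,;]+)", tree_string)


example = "((A:1.0,B:1.0):84.0,C:2.0);"
print(leaf_names(example))
```

For real analyses an established library or the tree programs listed in this lecture would be the safer choice, since they also handle quoted labels, comments, and branch lengths.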
Rodent polyphyly?
[Figure: three alternative trees relating mouse, human, and guinea pig (Tree-1, Tree-2, Tree-3), one of them the traditional view in which mouse and guinea pig group together as rodents. Sources and methods cited on the figure: D'Erchia et al. (1996) Nature 381, 597; Graur, Hide and Li (1991) Nature 351, 649; Reyes et al. (2000); ProtML; ME.]
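Where did HIV come from?
Freeman & Herron, 2001. Evolutionary Analysis. Prentice Hall
Science, 13 June 2003
Two viruses from different monkey species recombined in African chimpanzees to form the SIV strain that gave rise to human AIDS.
SIVcpz arose in chimpanzees through successive transmission and recombination of SIVs from red-capped mangabeys and greater spot-nosed monkeys. Chimpanzees prey on both of these monkeys, and the monkeys and chimpanzees have overlapping ranges in west-central Africa.
Humans are not the only species to have acquired two different SIV strains through natural cross-species transmission, which most likely results from predation.
Has chimpanzee predation on small monkeys led them to acquire other SIV infections? How likely are these SIVs to co-infect with SIVcpz or to recombine with it? Are these chimpanzee-adapted SIVs ultimately more likely to infect humans?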
Hasegawa, 1998
TreeBASE at Harvard University
TreeFam: Tree families database
http://treefam.genomics.org.cn/
Pfam: Protein families database
http://pfam.xfam.org/
Stage 4: Evaluating trees
The main criteria by which the accuracy of a
phylogenetic tree is assessed are consistency,
efficiency, and robustness. Evaluation of accuracy
can refer to an approach (e.g. UPGMA) or
to a particular tree.
Strict consensus tree
Majority-rule consensus tree
Stage 4: Evaluating trees: bootstrapping
Bootstrapping is a commonly used approach to
measuring the robustness of a tree topology.
Given a branching order, how consistently does
an algorithm find that branching order in a
randomly resampled version of the original data set?
To bootstrap, make an artificial dataset obtained by
randomly sampling columns from your multiple
sequence alignment. Make the dataset the same size
as the original. Do 100 (to 1,000) bootstrap replicates.
Observe the percent of cases in which the assignment
of clades in the original tree is supported by the
bootstrap replicates. >70% is considered significant.
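The resampling step can be sketched as follows. Columns are drawn with replacement until the replicate has the same length as the original alignment. The function names and toy alignment are invented, and `clade_test` stands in for the real work of building a tree from each replicate and checking whether a clade is present. Here it is a crude Hamming-distance check, an assumption made only to keep the example self-contained.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement, keeping the length."""
    length = len(next(iter(alignment.values())))
    cols = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[c] for c in cols)
            for name, seq in alignment.items()}


def clade_support(alignment, clade_test, replicates=100, seed=0):
    """Percent of bootstrap replicates in which clade_test(replicate) holds."""
    rng = random.Random(seed)
    hits = sum(bool(clade_test(bootstrap_alignment(alignment, rng)))
               for _ in range(replicates))
    return 100.0 * hits / replicates


def hamming(a, b):
    """Number of positions at which two aligned sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def pig_cow_closest(aln):
    """Toy stand-in for 'pig and cow form a clade' in a rebuilt tree."""
    d = hamming(aln["pig"], aln["cow"])
    return (d < hamming(aln["pig"], aln["horse"])
            and d < hamming(aln["cow"], aln["horse"]))


# Invented three-sequence alignment.
alignment = {
    "pig":   "MKWVWALLLLAAWAAAERDC",
    "cow":   "MKWVWALLLLAALGSAERDC",
    "horse": "MEWVWALVLLAALGSGRAEC",
}
support = clade_support(alignment, pig_cow_closest, replicates=200)
print(support)
```

The printed percentage plays the role of the bootstrap value reported on a branch of the tree.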
Bootstrap
In 61% of the bootstrap
resamplings, ssrbp and btrbp
(pig and cow RBP) formed a
distinct clade. In 39% of the
cases, another protein joined
the clade (e.g. ecrbp), or one
of these two sequences joined
another clade.
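Approach to single-gene phylogenetic analysis
1. Choose a set of related sequences.
2. Build a multiple sequence alignment.
3. Is the similarity very high? Yes: use the MP method. No: go to step 4.
4. Is there clearly recognizable sequence similarity? Yes: use distance methods. No: use the ML method.
5. Analyze how well the data support the hypothesis (bootstrap).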
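Components of phylogenetic models
• Every tree-building method presupposes an evolutionary model. For example, all widely used methods assume that evolutionary divergence is strictly branching, so that the known data can be described by a tree-like topology.
• In a given data set this assumption may well be violated, because of hybridization between species and the transfer of genetic material between species. Therefore, if the observed sequences were not inherited strictly vertically, most phylogenetic methods will give wrong results.
• Drawbacks of computational phylogenetic analysis:
It is easy to obtain wrong results, and the risk of error is almost unavoidable. Other disciplines generally have an experimental basis, whereas phylogenetic analysis is unlikely to have one; at most there are some simulation experiments or virus experiments.
In fact, phylogenetic events are completed history: they can only be inferred or evaluated, not replayed.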
• More and more cases of lateral gene transfer (LGT)
are being discovered and reported. Some estimates
suggest that 1.5% to 14.5% of the genes in a genome are
related to LGT, and even rRNA molecules are
involved in LGT;
Garcia-Vallvé S, Romeu A, Palau J., Genome Res, 2000, 11,
1719~1725
Yap W H, Zhang Z, Wang Y., J. Bacteriol. 1999, 181: 5201~5209
• Some argue that it is impossible to reconstruct
a universal tree of life;
Pennisi E. ,Science, 1999, 284: 1305~1307
Doolittle R F.,Nature, 1998, 392: 339~342
• As more whole-genome sequences and
related data become available, it becomes possible
to reconsider the phylogeny and clustering
of species using broader
measures, even at the level of whole genomes.
Related websites
• Compilation of available phylogeny programs
http://evolution.genetics.washington.edu/phylip/software.html
• BLAST2 & Orthologue Search http://www.Bork.EMBLHeidelberg.DE/Blast2e/
• CLUSTAL W http://www-igbmc.u-strasbg.fr/BioInfo/
• PHYLIP http://evolution.genetics.washington.edu/phylip.html
• PhyloBLAST http://www.pathogenomics.bc.ca/phyloBLAST/
• Phylogenetic Resources
http://www.ucmp.berkeley.edu/subway/phylogen.html
• PUZZLEBOOT http://www.tree-puzzle.de
• TreeView http://taxonomy.zoology.gla.ac.uk/rod/treeview.html
• WebPHYLIP http://sdmc.krdl.org.sg:8080/lxzhang/phylip/
Internet addresses of software for evolutionary analysis
********************************************************
Sequence analysis and multiple sequence comparison
# BLAST Web site
http://www.ncbi.nlm.nih.gov/BLAST/
# FASTA at EBI
http://www2.ebi.ac.uk/fasta3/
# CLUSTALW software
ftp://ftp-igbmc.u-strasbg.fr/pub/ClustalW
# HMMER software
http://hmmer.wustl.edu/
# SAM profile software
http://www.cse.ucsc.edu/research/compbio/sam.html
# BCM Search Launcher
http://kiwi.imgen.bcm.tmc.edu:8088/searchlauncher/launcher.html
Phylogenetic tree construction and robustness analysis
# PHYLIP
http://evolution.genetics.washington.edu/phylip.html
# Hennig86
http://www.vims.edu/~mes/hennig/software.html
# MEGA/METREE
http://www.bio.psu.edu/faculty/nei/imeg
# GAMBIT
http://www.lifesci.ucla.edu/mcdbio/Faculty/Lake/Research/Programs/
# MacClade
http://phylogeny.arizona.edu/macclade/macclade.html
# PAUP
http://onyx.si.edu/PAUP/
# GCG software package
http://www.gcg.com/
*******************************************************
Neutral theory of evolution
An often-held view of evolution is that just as organisms
propagate through natural selection, so also DNA and
protein molecules are selected for.
According to Motoo Kimura’s 1968 neutral theory
of molecular evolution, the vast majority of DNA
changes are not selected for in a Darwinian sense.
The main cause of evolutionary change is random
drift of mutant alleles that are selectively neutral
(or nearly neutral). Positive Darwinian selection does
occur, but it has a limited role.
As an example, the divergent C peptide of insulin
changes according to the neutral mutation rate.
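Molecular Evolution and Phylogenetics
Masatoshi Nei and Sudhir Kumar (USA)
Translated by Lü Baozhong, Zhong Yang, Gao Liping, et al.
Reviewed by Zhao Shouyuan, Zhang Jianzhi, et al.
Higher Education Press
Beijing, June 2002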