Multiple Alignment by Profile HMM Training and Phylogenetic Trees
Elze de Groot & Anastacia Berdnikova
Topics

Multiple alignment with known HMM
 HMM training from unaligned sequences
 Avoiding local maxima
– Simulated annealing
– Noise injection
– Stochastic sampling traceback algorithm

Model surgery
 Phylogenetic trees
Multiple alignment with known profile HMM

A multiple alignment and the model built from it are known -> use the model to align a large number of other family members
Calculate the Viterbi alignment for every sequence
Residues assigned to the same match state are aligned in columns
Difference from traditional multiple alignment: residues emitted by insert states are not aligned with each other
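The column rule can be sketched in Python as follows. The path encoding (a list of `(state, k, residue)` tuples) and the name `paths_to_alignment` are assumptions made for this illustration; the Viterbi paths themselves are taken as given rather than computed.

```python
def paths_to_alignment(paths, n_match):
    """Build a multiple alignment from one state path per sequence.

    Each path is a list of (state, k, residue) tuples with state in
    {"M", "I", "D"}; k is the model position (1..n_match for match and
    delete states, 0..n_match for insert states); residue is None for
    delete states.
    """
    # match[s][k]: residue of sequence s in match column k ("-" if deleted)
    match = [["-"] * (n_match + 1) for _ in paths]
    # inserts[s][k]: residues that sequence s emitted from insert state k
    inserts = [[""] * (n_match + 1) for _ in paths]
    for s, path in enumerate(paths):
        for state, k, residue in path:
            if state == "M":
                match[s][k] = residue.upper()
            elif state == "I":
                inserts[s][k] += residue.lower()
            # delete states emit nothing, so the column keeps its "-"

    # Each insert region is as wide as the longest insertion seen there.
    width = [max(len(ins[k]) for ins in inserts) for k in range(n_match + 1)]

    rows = []
    for s in range(len(paths)):
        row = inserts[s][0].ljust(width[0], ".")
        for k in range(1, n_match + 1):
            row += match[s][k] + inserts[s][k].ljust(width[k], ".")
        rows.append(row)
    return rows


paths = [
    [("M", 1, "F"), ("M", 2, "P"), ("M", 3, "H")],
    [("M", 1, "F"), ("I", 1, "e"), ("I", 1, "s"), ("M", 2, "G"), ("D", 3, None)],
]
alignment = paths_to_alignment(paths, n_match=3)
print("\n".join(alignment))
```

Match-state residues are upper case and share columns; insert-state residues are lower case and only padded with ".", not aligned.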
Example

Model estimated from an alignment
Example continued

The most probable paths and alignment
Profile HMM training from
unaligned sequences

Algorithm:
Initial Model

Choose the length of the model
- M is the number of match states
- set M to the average length of the training sequences

Choose the initial model carefully
Introduce some randomness in the choice of initial model
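A minimal sketch of these two choices, under stated assumptions: the function names are hypothetical, the example sequences are only illustrative, and drawing the random emission probabilities from normalised gamma variates (a flat Dirichlet sample) is one possible way to randomise the start, not necessarily the one used by the authors.

```python
import random

def initial_model_length(sequences):
    """Length rule: M = average length of the training sequences."""
    return round(sum(len(s) for s in sequences) / len(sequences))

def random_distribution(alphabet, rng):
    """Random probability distribution over the alphabet (assumed scheme)."""
    weights = [rng.gammavariate(1.0, 1.0) for _ in alphabet]
    total = sum(weights)
    return {a: w / total for a, w in zip(alphabet, weights)}

sequences = ["FPHFDLSHGSAQ", "FESFGDLSTPDAVMGNPK", "FDRFKHLKTEAEMKASED"]
rng = random.Random(0)  # fixed seed so that the sketch is reproducible
M = initial_model_length(sequences)
# One random emission distribution per match state.
emissions = [random_distribution("ACDEFGHIKLMNPQRSTVWY", rng) for _ in range(M)]
print(M)
```

Running the training from several such random starts and keeping the best result is a common way to use this randomness.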
Parameter Estimation

Use the forward and backward variables to re-estimate the emission and transition probability parameters (Baum-Welch)

Baum-Welch re-estimation can be replaced by the Viterbi alternative
Forward Algorithm
Backward algorithm
Baum-Welch re-estimation equations

Expected emission counts from sequence x:

$$E_{M_k}(a) = \frac{1}{P(x)} \sum_{i \mid x_i = a} f_{M_k}(i)\, b_{M_k}(i)$$

$$E_{I_k}(a) = \frac{1}{P(x)} \sum_{i \mid x_i = a} f_{I_k}(i)\, b_{I_k}(i)$$
Baum-Welch re-estimation equations

Expected transition counts from sequence x ($X_k$ stands for any of $M_k$, $I_k$, $D_k$):

$$A_{X_k M_{k+1}} = \frac{1}{P(x)} \sum_i f_{X_k}(i)\, a_{X_k M_{k+1}}\, e_{M_{k+1}}(x_{i+1})\, b_{M_{k+1}}(i+1)$$

$$A_{X_k I_k} = \frac{1}{P(x)} \sum_i f_{X_k}(i)\, a_{X_k I_k}\, e_{I_k}(x_{i+1})\, b_{I_k}(i+1)$$

$$A_{X_k D_{k+1}} = \frac{1}{P(x)} \sum_i f_{X_k}(i)\, a_{X_k D_{k+1}}\, b_{D_{k+1}}(i)$$
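The expected-count idea can be tried out on a small example. The sketch below is a simplification: it uses a plain HMM with only emitting states (no delete states and no explicit End state), so it illustrates the emission-count formula E_k(a) = (1/P(x)) Σ f_k(i) b_k(i) rather than the full profile HMM equations. All function names and the toy parameters are assumptions.

```python
def forward(x, init, a, e):
    """f[i][k] = P(x[0..i], state at position i is k)."""
    n = len(init)
    f = [[init[k] * e[k][x[0]] for k in range(n)]]
    for sym in x[1:]:
        prev = f[-1]
        f.append([e[l][sym] * sum(prev[k] * a[k][l] for k in range(n))
                  for l in range(n)])
    return f

def backward(x, a, e):
    """b[i][k] = P(x[i+1..] | state at position i is k)."""
    n = len(a)
    b = [[1.0] * n]  # last position: nothing left to emit
    for sym in reversed(x[1:]):
        nxt = b[0]
        b.insert(0, [sum(a[k][l] * e[l][sym] * nxt[l] for l in range(n))
                     for k in range(n)])
    return b

def expected_emission_counts(x, init, a, e):
    """counts[k][s] = expected number of times state k emits symbol s."""
    f, b = forward(x, init, a, e), backward(x, a, e)
    px = sum(f[-1])  # P(x)
    counts = [dict.fromkeys(e[k], 0.0) for k in range(len(init))]
    for i, sym in enumerate(x):
        for k in range(len(init)):
            counts[k][sym] += f[i][k] * b[i][k] / px
    return counts, px

# Toy two-state model over the alphabet {A, B} (illustrative values only).
init = [0.5, 0.5]
a = [[0.9, 0.1], [0.2, 0.8]]
e = [{"A": 0.8, "B": 0.2}, {"A": 0.3, "B": 0.7}]
x = "AABAB"
counts, px = expected_emission_counts(x, init, a, e)
print(counts)
```

A useful sanity check is that, for each symbol, the expected counts summed over all states equal the number of times the symbol occurs in x.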
Avoiding local maxima

Baum-Welch is guaranteed to find a local maximum
There is no guarantee that it is anywhere near the global optimum or a biologically reasonable solution
Reason: the models are long -> many ways to reach a wrong solution
Avoiding local maxima

Use stochastic search algorithm

Commonly used: Simulated annealing
Simulated annealing

Some compounds only crystallise if they are slowly annealed from high to low temperature
Optimisation problem: minimise a function, the 'energy' E(x)
Maximising a function is the same as minimising the negative of that function
Simulated annealing (2)

'Temperature' T
The probability of a 'state' x is given by the Gibbs distribution:

$$P(x) = \frac{1}{Z} \exp\left(-\frac{1}{T} E(x)\right)$$

Partition function:

$$Z = \int \exp\left(-\frac{1}{T} E(x)\right) dx$$

x is usually multidimensional, so it is impossible to calculate Z
Simulated annealing (3)

T → 0: all configurations except those with the lowest energy have probability 0 (the system is 'frozen')
T → ∞: all configurations have the same probability (the system is 'molten')
As with crystallisation: the minimum can be found by sampling this distribution at a high temperature first and then at decreasing temperatures
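The two temperature limits are easy to see numerically. The sketch below assumes a finite set of states, so that the partition function Z is a plain sum rather than an integral; the energies are made-up values.

```python
import math

def gibbs(energies, T):
    """Gibbs distribution over a finite list of states with the given energies."""
    e_min = min(energies)
    # Subtracting e_min leaves the distribution unchanged and avoids overflow.
    weights = [math.exp(-(e - e_min) / T) for e in energies]
    Z = sum(weights)
    return [w / Z for w in weights]

energies = [1.0, 2.0, 4.0]
for T in (100.0, 1.0, 0.01):
    print(T, [round(p, 3) for p in gibbs(energies, T)])
```

At the highest temperature the three probabilities are nearly equal ('molten'); at the lowest, almost all of the probability sits on the state with the lowest energy ('frozen').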
Simulated annealing for HMM

Natural energy function: the negative log of the likelihood, $-\log P(\mathrm{data} \mid \theta)$

$$\mathrm{Prob}(\theta) = \frac{1}{Z} \exp\left(-\frac{1}{T}\bigl(-\log P(\mathrm{data} \mid \theta)\bigr)\right) = \frac{P(\mathrm{data} \mid \theta)^{1/T}}{\int P(\mathrm{data} \mid \theta')^{1/T}\, d\theta'}$$

Sampling from this distribution is non-trivial; the two methods I'm going to mention are approximations
Noise injection

Add noise to the counts estimated in the forward-backward procedure, and let the size of the noise decrease slowly
In Krogh et al. [1994] the noise was generated by a random walk in the initial model
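A rough sketch of the idea, with loud assumptions: the noise here is simple uniform noise (not the random walk of Krogh et al.), the counts are a fixed stand-in rather than the output of a real forward-backward pass, and the decay factor 0.8 is arbitrary.

```python
import random

def noisy_counts(counts, noise_size, rng):
    """Add non-negative random noise of the given size to the expected counts."""
    return [c + noise_size * rng.random() for c in counts]

def normalise(counts):
    total = sum(counts)
    return [c / total for c in counts]

rng = random.Random(1)
counts = [8.0, 1.0, 1.0]   # stand-in for counts from the forward-backward procedure
noise_size = 5.0
history = []
for iteration in range(40):
    # In real training the counts would be re-estimated in every iteration.
    probs = normalise(noisy_counts(counts, noise_size, rng))
    history.append((noise_size, probs))
    noise_size *= 0.8      # let the size of the noise decrease slowly

print(history[0][1], history[-1][1])
```

Early iterations give strongly perturbed parameters; by the end the noise is negligible and the estimates settle on the values implied by the counts.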
Simulated annealing Viterbi estimation

If there are N sequences, there's an exact translation from the N paths $\pi^1, \ldots, \pi^N$ to the parameters of the model
Treat the paths as the fundamental parameters in which to maximise the likelihood
Simulated annealing is done in these variables instead of the model parameters
Simulated annealing Viterbi estimation

$$\mathrm{Prob}(\pi) = \frac{P(\pi, x \mid \theta)^{1/T}}{\sum_{\pi'} P(\pi', x \mid \theta)^{1/T}}$$

The denominator is Z, the partition function -> a sum over all paths
It can be obtained by a modified forward algorithm using exponentiated transition and emission parameters
Simulated annealing Viterbi estimation

Exponentiated transition parameter: $\hat{a}_{ij} = a_{ij}^{1/T}$

Exponentiated emission parameter: $\hat{e}_j(x) = e_j(x)^{1/T}$

These are used in place of the unmodified probability parameters in the forward algorithm
Z is the result of the forward algorithm
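That the forward algorithm with exponentiated parameters really yields the sum over all paths can be checked by brute force on a tiny model. The sketch assumes a plain HMM with emitting states only, an initial distribution `init` treated like transitions out of Begin, and no explicit End state; the names and toy values are illustrative.

```python
from itertools import product

def annealed_forward(x, init, a, e, T):
    """Forward algorithm on parameters raised to the power 1/T; returns (f, Z)."""
    n = len(init)
    p = 1.0 / T
    f = [[(init[k] * e[k][x[0]]) ** p for k in range(n)]]
    for sym in x[1:]:
        prev = f[-1]
        f.append([e[l][sym] ** p * sum(prev[k] * a[k][l] ** p for k in range(n))
                  for l in range(n)])
    return f, sum(f[-1])

def brute_force_Z(x, init, a, e, T):
    """Sum P(path, x)^(1/T) over every path explicitly (tiny cases only)."""
    total = 0.0
    for path in product(range(len(init)), repeat=len(x)):
        prob = init[path[0]] * e[path[0]][x[0]]
        for i in range(1, len(x)):
            prob *= a[path[i - 1]][path[i]] * e[path[i]][x[i]]
        total += prob ** (1.0 / T)
    return total

init = [0.5, 0.5]
a = [[0.9, 0.1], [0.2, 0.8]]
e = [{"A": 0.8, "B": 0.2}, {"A": 0.3, "B": 0.7}]
x = "AABAB"
_, Z = annealed_forward(x, init, a, e, T=2.0)
print(Z, brute_force_Z(x, init, a, e, T=2.0))
```

At T = 1 the exponentiated parameters are the ordinary ones, and Z reduces to the usual likelihood P(x).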
Simulated annealing Viterbi estimation

Algorithm: Stochastic sampling traceback algorithm for HMMs
Initialisation: $\pi_{L+1} = \mathrm{End}$.
Recursion: for $L+1 \ge i \ge 1$,
$$\mathrm{Prob}(\pi_{i-1} \mid \pi_i) = \frac{f_{i-1,\pi_{i-1}}\, \hat{a}_{\pi_{i-1},\pi_i}}{\sum_k f_{i-1,k}\, \hat{a}_{k,\pi_i}}$$
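A sketch of the traceback under simplifying assumptions: a plain HMM with emitting states only, an initial distribution instead of an explicit Begin state, and every state assumed to reach End with the same probability. The function names and the toy model are hypothetical.

```python
import random

def annealed_forward(x, init, a, e, T):
    """Forward variables computed with parameters raised to the power 1/T."""
    n = len(init)
    p = 1.0 / T
    f = [[(init[k] * e[k][x[0]]) ** p for k in range(n)]]
    for sym in x[1:]:
        prev = f[-1]
        f.append([e[l][sym] ** p * sum(prev[k] * a[k][l] ** p for k in range(n))
                  for l in range(n)])
    return f

def sample_path(x, init, a, e, T, rng):
    """Sample a state path with probability proportional to P(path, x)^(1/T)."""
    n = len(init)
    p = 1.0 / T
    f = annealed_forward(x, init, a, e, T)
    # Last state: weights are the final forward variables (uniform End transition).
    state = rng.choices(range(n), weights=f[-1])[0]
    path = [state]
    for i in range(len(x) - 1, 0, -1):
        # Prob(previous = k | current = state) is proportional to f[i-1][k] * a_hat[k][state].
        weights = [f[i - 1][k] * a[k][state] ** p for k in range(n)]
        state = rng.choices(range(n), weights=weights)[0]
        path.append(state)
    path.reverse()
    return path

init = [0.5, 0.5]
a = [[0.5, 0.5], [0.5, 0.5]]
e = [{"A": 0.9, "B": 0.1}, {"A": 0.1, "B": 0.9}]
rng = random.Random(0)
print(sample_path("AABB", init, a, e, T=1.0, rng=rng))   # a random sample
print(sample_path("AABB", init, a, e, T=0.05, rng=rng))  # nearly frozen
```

At low temperature the sampler concentrates on the most probable path, so it behaves like Viterbi; at T = 1 it samples paths from their posterior distribution.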
Simulated annealing Viterbi vs Viterbi

Key difference:

Viterbi selects the most probable path for each sequence
Simulated annealing samples each path according to the likelihood of the path
Model Surgery

While training a model, two things can happen:
(a) some match states are redundant and should be absorbed into an insert state
(b) one or more insert states absorb too much sequence, in which case they should be expanded
Model Surgery

Measure how much a certain transition is used by the training sequences

The usage of a match state is the sum of the counts for all letters in that state
Model surgery

If a match state is used by fewer than ½ of the sequences -> delete its module

If more than ½ of the sequences use the transitions into an insert state, that state is expanded into new modules
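The two thresholds can be written as a small decision rule. The sketch below is illustrative only: `plan_surgery`, its parameter names, and the usage numbers are hypothetical, and how many new modules an expanded insert state receives is left out.

```python
def plan_surgery(match_usage, insert_usage, n_sequences,
                 cut_match=0.5, cut_insert=0.5):
    """Return (indices of match states to delete, indices of insert states to expand).

    match_usage[k]  : number of sequences that use match state k
    insert_usage[k] : number of sequences that use the transitions into insert state k
    """
    delete = [k for k, used in enumerate(match_usage)
              if used < cut_match * n_sequences]
    expand = [k for k, used in enumerate(insert_usage)
              if used > cut_insert * n_sequences]
    return delete, expand

match_usage = [7, 7, 2, 6]    # made-up usage counts for 7 training sequences
insert_usage = [0, 5, 1, 0]
print(plan_surgery(match_usage, insert_usage, n_sequences=7))
```

With 7 sequences the cut-off is 3.5, so the match state used by only 2 sequences is deleted and the insert state entered by 5 sequences is expanded.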
Model surgery – Example
SAM

I tried a set of sequences in SAM with and without model surgery
Same 7 sequences as in the example before
Parameters <cutinsert 0.25> <cutmatch 0.5> -> delete any match state used by fewer than half of the sequences, and insert match states for any insert node used by more than one quarter of the sequences
Model surgery – Example
SAM

Without model surgery
>seq1
FPHFD.....L...S.....-HGSAQ
>seq2
FESFG.....D...LstpdaVMGNPK
>seq3
FDRFKhlkteA...E.....MKASED
>seq4
FTQFA.....G...Kdles.IKGTAP
>seq5
FPKFK.....G...LttadqLKKSAD
>seq6
FSFLK.....GtseV.....PQNNPE
>seq7
FGFSG.....A...-.....--SDPG

With model surgery
>seq1
FPHF.DLS-..-..--HGSAQ
>seq2
FESF.GDLStpD..AVMGNPK
>seq3
FDRF.KHLK..TeaEMKASED
>seq4
FTQFaGKDL..E..SIKGTAP
>seq5
FPKF.KGLTtaD..QLKKSAD
>seq6
FSFL.KGTS..E..VPQNNPE
>seq7
FGFS.G---..-..--ASDPG
Building phylogenetic trees
Overview
The tree of life – description
 Background on trees

Multiple alignment and trees

Alignment of sequences should take
account of their evolutionary relationship.
[Sankoff, Morel & Cedergren, 1973]

Several progressive alignment algorithms
use a ‘guide tree’ (to guide the clustering
process).

We begin to build trees.
The tree of life

The similarity of the molecular mechanisms of the organisms that have been studied strongly suggests that all organisms on Earth had a common ancestor. Thus any set of species is related, and this relationship is called a phylogeny.
Usually the relationship can be represented by a phylogenetic tree.

Zuckerkandl & Pauling's paper [1962] showed that molecular sequences provide sets of morphological characters that can carry a large amount of information.

An assumption: the sequences whose phylogeny we want to analyse have descended from some common ancestral gene in a common ancestral species.

Gene duplication exists => we have to check the assumption carefully.
Gene duplication and speciation

By another mechanism, gene duplication, two
sequences can also be separated and diverge
from the common ancestor.

Genes which diverged because of speciation
are called orthologues. Genes which diverged
by gene duplication are called paralogues.
A tree of orthologues: alpha haemoglobins HBA_ACCGE,
HBA_AEGMO, HBA_AILFU, HBA_AILME, HBA_ALCAA,
HBA_ALLMI, HBA_AMBME, HBA_ANAPL (SWISS-PROT).
A tree of paralogues: HBAT_HUMAN, HBAZ_HUMAN,
HBA_HUMAN, HBB_HUMAN, HBD_HUMAN, HBE_HUMAN,
HBG_HUMAN, MYG_HUMAN (SWISS-PROT).
Background on trees

All trees will be assumed to be binary (an edge that branches splits into two daughter edges).
Each edge of the tree has a certain amount of evolutionary divergence associated with it. We adopt the general term 'length', which is represented by the lengths of the edges in the figures.
A true biological phylogeny has a 'root', or ultimate ancestor of all sequences.
Rooted and unrooted tree

A tree with a given labelling will be called a
labelled branching pattern.

We refer to this as the tree topology and denote
it by T.

Lengths of the edges: $t_i$, with a suitable numbering scheme for the i's.
Counting and labelling
Rooted tree:
- n leaves plus (n-1) branch nodes -> 2n-1 nodes in all, and 2n-2 edges.
- labelling: leaves are 1..n, branch nodes are n+1..2n-1, and the (2n-1)th node is the root.
Counting and labelling
Unrooted tree:
– n leaves, 2n-2 nodes and 2n-3 edges.
– a root can be added at any of its
edges => we can get 2n-3 rooted
trees.
Number of rooted and unrooted
trees
A root can be added at any edge, producing 2n-3
rooted trees from unrooted tree => there are
(2n-3) times as many rooted trees as unrooted
trees, for a given number n of leaves.
Instead of the root, we can add an extra edge or
‘branch’ with a distinct label in its leaf.
● There are three such trees with four leaves, each with (2n-3)=5 edges; they are distinct labelled branching patterns.
● There are then five ways of adding a further branch labelled with a distinct label ('5'), giving in all 3x5=15 unrooted trees with five leaves.
● The number of unrooted trees with n leaves is equal to 3*5*...*(2n-5) = (2n-5)!!. So we have (2n-3)!! rooted trees with n leaves.
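The counting argument can be checked with a few lines of Python. The function names are chosen for this sketch; the formulas are the ones stated above.

```python
def double_factorial(k):
    """k!! = k * (k-2) * (k-4) * ... down to 1 (taken as 1 for k <= 1)."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result

def num_unrooted(n):
    """Number of unrooted labelled binary trees with n >= 3 leaves: (2n-5)!!"""
    return double_factorial(2 * n - 5)

def num_rooted(n):
    """Number of rooted labelled binary trees with n >= 2 leaves: (2n-3)!!"""
    return double_factorial(2 * n - 3)

for n in range(3, 8):
    # A root can be placed on any of the 2n-3 edges of an unrooted tree.
    assert num_rooted(n) == (2 * n - 3) * num_unrooted(n)
    print(n, num_unrooted(n), num_rooted(n))
```

The numbers grow very quickly with n, which is why exhaustive search over tree topologies is only feasible for a small number of leaves.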
Building phylogenetic trees
Questions?
Exercise 7.2

The trees with three and four leaves in
Figure 7.3 all have the same unlabelled
branching pattern. For both rooted and
unrooted trees, how many leaves do there
have to be to obtain more than one
unlabelled branching pattern? Find a
recurrence relation for the number of rooted
trees. (Hint: consider the trees formed by
joining two trees at their root).
Exercise 7.3

All trees considered so far have been binary,
but one can envisage ternary trees that, in
their rooted form, have three branches
descending from a branch node. If there are
m branch nodes in an unrooted ternary tree,
how many leaves are there and how many
edges?
Exercise 7.4

Consider next a composite unrooted tree with m ternary branch nodes and n binary branch nodes. How many leaves are there, and how many edges? Let $N_{m,n}$ denote the number of distinct labelled branching patterns of this tree. Extend the counting argument for binary trees to show that
$$N_{m,n} = (3m+2n-1)\,N_{m,n-1} + (n+1)\,N_{m-1,n+1}$$
(Hint: the first term after the '=' counts the number of ways that a new edge can be added to an existing edge, thereby creating an additional binary node; the second term corresponds to edges added at binary nodes, thereby producing ternary nodes.)