Centre for Integrative Bioinformatics VU
Bioinformatics
Introduction to bioinformatics
2007
“Nothing in Biology makes sense except in
the light of evolution” (Theodosius
Dobzhansky (1900-1975))
Lecture 5
Pair-wise Sequence Alignment
“Nothing in bioinformatics makes sense
except in the light of Biology”
Divergent evolution
Ancestral sequence: ABCD
Descendant 1: ACCD (mutation: B → C)
Descendant 2: ABD (deletion: C → ø)
Pairwise alignment:
ACCD
AB─D  (true alignment)
or
ACCD
A─BD
What can be observed about divergent evolution
1: ACCTGTAATC
2: ACGTGCGATC
     *  **
D = 3/10 (fraction of sites (nucleotides) that differ)
Observed differences can underestimate the number of substitutions since the ancestral sequence:
• One substitution: one visible
• Two substitutions: one visible
• Two substitutions: none visible
• Back mutation: not visible
Convergent evolution
• Often with shorter motifs (e.g. active sites)
• Motif (function) has evolved more than once independently, e.g. starting with two very different sequences adopting different folds
• Sequences and associated structures remain different, but (functional) motif can become identical
• Classical example: the serine proteinases subtilisin and chymotrypsin
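The fraction of differing sites D for the two example sequences (ACCTGTAATC and ACGTGCGATC) can be computed with a few lines of Python. This is a minimal sketch; the function name `p_distance` is an illustrative choice, and it assumes two already aligned sequences of equal length without gaps.

```python
def p_distance(seq1, seq2):
    """Fraction of aligned sites at which two equal-length sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    # Count columns in which the two letters differ
    differences = sum(1 for x, y in zip(seq1, seq2) if x != y)
    return differences / len(seq1)


d = p_distance("ACCTGTAATC", "ACGTGCGATC")
print(d)  # 3 of the 10 sites differ
```

Note that D only counts visible differences; multiple substitutions at one site and back mutations are not reflected in it.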
Serine proteinase (subtilisin) and
chymotrypsin
• Different evolutionary origins
• Similarities in the reaction mechanisms. Chymotrypsin,
subtilisin and carboxypeptidase C have a catalytic triad of
serine, aspartate and histidine in common: serine acts as a
nucleophile, aspartate as an electrophile, and histidine as a
base.
• The geometric orientations of the catalytic residues are
similar between families, despite different protein folds.
• The linear arrangements of the catalytic residues reflect
different family relationships. For example the catalytic
triad in the chymotrypsin clan is ordered HDS, but is
ordered DHS in the subtilisin clan and SDH in the
carboxypeptidase clan.
Serine proteinase (subtilisin) and chymotrypsin
Catalytic triads (order of the catalytic residues along the sequence):
chymotrypsin: H D S
serine proteinase (subtilisin): D H S
carboxypeptidase C: S D H
Read http://www.ebi.ac.uk/interpro/potm/2003_5/Page1.htm
There is also divergent evolution.
Håkansson K, Wang AH-J, Miller CG (2000) The structure of aspartyl dipeptidase reveals a unique fold with a Ser-His-Glu catalytic triad, Proc Natl Acad Sci U S A. 97(26):14097–14102.
A protein sequence alignment
MSTGAVLIY--TSILIKECHAMPAGNE-----
---GGILLFHRTHELIKESHAMANDEGGSNNS
   *  *    *  **** ***
(* marks identical aligned residues)
A DNA sequence alignment
attcgttggcaaatcgcccctatccggccttaa
att---tggcggatcg-cctctacgggcc---
Searching for similarities
What is the function of the new gene?
The “lazy” investigation (i.e., no biological experiments, just bioinformatics techniques):
– Find a set of similar protein sequences to the unknown sequence
– Identify similarities and differences
– For long proteins: first identify domains
Intermezzo: what is a domain
A domain is a:
• Compact, semi-independent unit (Richardson, 1981).
• Stable unit of a protein structure that can fold autonomously (Wetlaufer, 1973).
• Recurring functional and evolutionary module (Bork, 1992).
“Nature is a ‘tinkerer’ and not an inventor” (Jacob, 1977).
Protein domains recur in different combinations
The DEATH Domain (DD)
• Present in a variety of eukaryotic proteins involved with cell death.
• Six helices enclose a tightly packed hydrophobic core.
• Some DEATH domains form homotypic and heterotypic dimers.
http://www.mshri.on.ca/pawson
Structural domain organisation can be intricate…
Pyruvate kinase
Phosphotransferase
• β barrel regulatory domain
• α/β barrel catalytic substrate binding domain
• α/β nucleotide binding domain
1 continuous + 2 discontinuous domains
Evolutionary and functional relationships
Reconstruct evolutionary relation:
•Based on sequence
-Identity (simplest method)
-Similarity
•Homology (common ancestry: the ultimate goal)
•Other (e.g., 3D structure)
Functional relation:
Sequence → Structure → Function
Searching for similarities
Common ancestry is more interesting:
Makes it more likely that genes share
the same function
Homology: sharing a common ancestor
– a binary property (yes/no)
– it’s a nice tool:
When (an unknown) gene X is homologous to (a
known) gene G it means that we gain a lot of
information on X: what we know about G can be
transferred to X as a good suggestion.
How to go from DNA to protein sequence
A piece of double stranded DNA:
5’ attcgttggcaaatcgcccctatccggc 3’
3’ taagcaaccgtttagcggggataggccg 5’
DNA direction is from 5’ to 3’
6-frame translation using the codon table (last lecture)
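A 6-frame translation of the DNA fragment above could be sketched as follows. This is an illustrative sketch, not the lecture's own code: the names `translate` and `six_frame_translation` are made up here, the standard genetic code is assumed, and stop codons are written as `*`.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate a DNA string codon by codon, ignoring trailing bases."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        c1, c2, c3 = dna[i], dna[i + 1], dna[i + 2]
        index = 16 * BASES.index(c1) + 4 * BASES.index(c2) + BASES.index(c3)
        protein.append(AMINO[index])
    return "".join(protein)


def six_frame_translation(dna):
    """Return the three forward and three reverse-strand translations."""
    forward = dna.upper()
    # The other strand is read 5' to 3' as well, so reverse the complement
    reverse = forward.translate(COMPLEMENT)[::-1]
    frames = [translate(forward[offset:]) for offset in range(3)]
    frames += [translate(reverse[offset:]) for offset in range(3)]
    return frames


for frame in six_frame_translation("attcgttggcaaatcgcccctatccggc"):
    print(frame)
```

The first forward frame reads att cgt tgg caa atc gcc cct atc cgg, i.e. IRWQIAPIR.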
Evolution and three-dimensional protein structure information
Isocitrate dehydrogenase: the distance from the active site (in yellow) determines the rate of evolution (red = fast evolution, blue = slow evolution)
Dean, A. M. and G. B. Golding: Pacific Symposium on Biocomputing 2000
Bioinformatics tool
[Diagram: Data, Algorithm, tool, Biological Interpretation (model)]
Example today: Pairwise sequence alignment needs sense of evolution
Global dynamic programming
[Diagram: the sequences MDAGSTVILCFVG and MDAASTILCGS, related by evolution, span a search matrix; an Amino Acid Exchange Matrix and gap penalties (open, extension) yield the alignment]
MDAGSTVILCFVG-
MDAAST-ILC--GS
How to determine similarity
Frequent evolutionary events at the DNA level:
1. Substitution
2. Insertion, deletion
3. Duplication
4. Inversion
We will restrict ourselves to events 1 and 2
Dynamic programming
Scoring alignments
$S_{a,b} = \sum_{l} s(a_i, b_j) - \sum_{k} N_k \cdot gp(k)$
where the first sum runs over the aligned (ungapped) columns $l$, $s(a_i, b_j)$ is the substitution score of the residue pair in that column, and $N_k$ is the number of gaps of length $k$
Affine gap penalties: $gp(k) = gap_{init} + k \cdot gap_{extension}$
Scoring alignments
– Substitution (or match/mismatch)
• DNA
• proteins
– Gap penalty
• Linear: gp(k)=ak
• Affine: gp(k)=b+ak
• Concave, e.g.: gp(k)=log(k)
The score for an alignment is the sum of the scores of all alignment columns
DNA: define a score for match/mismatch of letters
Simple:
     A    C    G    T
A    1   -1   -1   -1
C   -1    1   -1   -1
G   -1   -1    1   -1
T   -1   -1   -1    1
Used in genome alignments:
     A    C    G    T
A   91 -114  -31 -123
C -114  100 -125  -31
G  -31 -125  100 -114
T -123  -31 -114   91
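The three gap penalty shapes and the simple DNA match/mismatch score can be written down directly. The sketch below is illustrative only: the function names and the default parameter values (a = 2, b = 10) are assumptions, not values prescribed by the slides.

```python
import math


def linear_gap(k, a=2):
    """Linear penalty gp(k) = a*k for a gap of length k."""
    return a * k


def affine_gap(k, b=10, a=2):
    """Affine penalty gp(k) = b + a*k (b: opening, a: extension)."""
    return b + a * k


def concave_gap(k):
    """Concave penalty, here gp(k) = log(k) as in the example above."""
    return math.log(k)


def simple_dna_score(x, y):
    """Simple DNA exchange matrix: +1 for a match, -1 for a mismatch."""
    return 1 if x == y else -1


for k in (1, 2, 5, 10):
    print(k, linear_gap(k), affine_gap(k), round(concave_gap(k), 2))
```

With an affine penalty a single long gap is cheaper than several short gaps of the same total length, because the opening cost b is paid only once.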
Dynamic programming
Scoring alignments
T D W V T A L K
T D W L - - I K
Amino Acid Exchange Matrix (20×20)
Affine gap penalties (open, extension), e.g. 10, 1
Score: s(T,T)+s(D,D)+s(W,W)+s(V,L)-Po-2Px+s(L,I)+s(K,K)
Amino acid exchange matrices
20×20
How do we get one?
And how do we get associated gap penalties?
First systematic method to derive a.a. exchange matrices by Margaret Dayhoff et al. (1968) – Atlas of Protein Sequence and Structure
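The score of the example alignment (T D W V T A L K over T D W L - - I K) can be evaluated column by column. The sketch below uses the PAM250 values for the six residue pairs that occur; the gap parameters Po = 10 and Px = 2 and the function name `score_alignment` are assumptions chosen for illustration.

```python
# PAM250 values for the residue pairs occurring in the example alignment
PAM250_SUBSET = {
    ("T", "T"): 3, ("D", "D"): 4, ("W", "W"): 17,
    ("V", "L"): 2, ("L", "I"): 2, ("K", "K"): 5,
}


def pair_score(x, y, matrix):
    """Look up a symmetric exchange value regardless of pair order."""
    if (x, y) in matrix:
        return matrix[(x, y)]
    return matrix[(y, x)]


def score_alignment(top, bottom, matrix, gap_open, gap_ext):
    """Sum the column scores; a gap of length k costs gap_open + k*gap_ext."""
    score = 0
    in_gap = False
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            if not in_gap:
                score -= gap_open  # opening penalty, once per gap
                in_gap = True
            score -= gap_ext       # extension penalty, once per gap position
        else:
            score += pair_score(x, y, matrix)
            in_gap = False
    return score


total = score_alignment("TDWVTALK", "TDWL--IK", PAM250_SUBSET, 10, 2)
print(total)  # 3 + 4 + 17 + 2 - 10 - 2*2 + 2 + 5
```

For simplicity the sketch treats adjacent gap columns as one gap even if they switch between the two sequences.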
PAM250 matrix
amino acid exchange matrix (log odds)
   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  B  Z
A  2
R -2  6
N  0  0  2
D  0 -1  2  4
C -2 -4 -4 -5 12
Q  0  1  1  2 -5  4
E  0 -1  1  3 -5  2  4
G  1 -3  0  1 -3 -1  0  5
H -1  2  2  1 -3  3  1 -2  6
I -1 -2 -2 -2 -2 -2 -2 -3 -2  5
L -2 -3 -3 -4 -6 -2 -3 -4 -2  2  6
K -1  3  1  0 -5  1  0 -2  0 -2 -3  5
M -1  0 -2 -3 -5 -1 -2 -3 -2  2  4  0  6
F -4 -4 -4 -6 -4 -5 -5 -5 -2  1  2 -5  0  9
P  1  0 -1 -1 -3  0 -1 -1  0 -2 -3 -1 -2 -5  6
S  1  0  1  0  0 -1  0  1 -1 -1 -3  0 -2 -3  1  2
T  1 -1  0  0 -2 -1  0  0 -1  0 -2  0 -1 -3  0  1  3
W -6  2 -4 -7 -8 -5 -7 -7 -3 -5 -2 -3 -4  0 -6 -2 -5 17
Y -3 -4 -2 -4  0 -4 -4 -5  0 -1 -1 -4 -2  7 -5 -3 -3  0 10
V  0 -2 -2 -2 -2 -2 -2 -1 -2  4  2 -2  2 -1 -1 -1  0 -6 -2  4
B  0 -1  2  3 -4  1  2  0  1 -2 -3  1 -2 -5 -1  0  0 -5 -3 -2  2
Z  0  0  1  3 -5  3  3 -1  2 -2 -3  0 -2 -5  0  0 -1 -6 -4 -2  2  3
Positive exchange values denote mutations that are more likely than randomly expected, while negative numbers correspond to avoided mutations compared to the randomly expected situation
Amino acid exchange matrices
Amino acids are not equal:
1. Some are easily substituted because they have similar:
• physico-chemical properties
• structure
2. Some mutations between amino acids occur more often due to similar codons
Pair-wise alignment
T D W V T A L K
T D W L - - I K
Combinatorial explosion
- 1 gap in 1 sequence: n+1 possibilities
- 2 gaps in 1 sequence: (n+1)n
- 3 gaps in 1 sequence: (n+1)n(n-1), etc.
$\binom{2n}{n} = \frac{(2n)!}{(n!)^2} \approx \frac{2^{2n}}{\sqrt{\pi n}}$
2 sequences of 300 a.a.: ~10^179 alignments
2 sequences of 1000 a.a.: ~10^600 alignments!
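The size of the combinatorial explosion can be checked numerically by comparing the exact binomial coefficient (2n)!/(n!)² with the estimate 2^(2n)/√(πn). The sketch below works with base-10 logarithms because the numbers are far too large for floating point; the function names are illustrative, and `math.comb` requires Python 3.8 or later.

```python
import math


def log10_alignment_count(n):
    """log10 of the exact count (2n)! / (n!)^2 for two sequences of length n."""
    return math.log10(math.comb(2 * n, n))


def log10_estimate(n):
    """log10 of the approximation 2^(2n) / sqrt(pi * n)."""
    return 2 * n * math.log10(2) - 0.5 * math.log10(math.pi * n)


for n in (10, 300, 1000):
    print(n, round(log10_alignment_count(n), 2), round(log10_estimate(n), 2))
```

For n = 300 both values are close to 179, and for n = 1000 close to 600, so the counts are of the order 10^179 and 10^600.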
To say the same more statistically…
•To perform statistical analyses on messages or sequences, we need a
reference model.
•The model: each letter in a sequence is selected from a defined
alphabet in an independent and identically distributed (i.i.d.) manner.
•This choice of model system will allow us to compute the statistical
significance of certain characteristics of a sequence, its subsequences,
or an alignment.
•Given a probability distribution, Pi, for the letters in an i.i.d. message,
the probability of seeing a particular sequence of letters i, j, k, ... n is
simply Pi Pj Pk···Pn.
•As an alternative to multiplication of the probabilities, we could sum
their logarithms and exponentiate the result. The probability of the
same sequence of letters can be computed by exponentiating log Pi +
log Pj + log Pk+ ··· + log Pn.
•In practice, when aligning sequences we only add log-odds values
(residue exchange matrix) but we do not exponentiate the final score.
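The equivalence between multiplying probabilities and summing their logarithms can be illustrated with a small sketch. The letter probabilities below are hypothetical values chosen for the example, not taken from the lecture.

```python
import math

# Hypothetical i.i.d. letter probabilities (assumed for illustration only)
LETTER_PROB = {"A": 0.1, "C": 0.2, "G": 0.3, "T": 0.4}


def sequence_probability(seq):
    """Multiply the per-letter probabilities: Pi * Pj * Pk * ... * Pn."""
    probability = 1.0
    for letter in seq:
        probability *= LETTER_PROB[letter]
    return probability


def sequence_log_probability(seq):
    """Sum the logarithms instead: log Pi + log Pj + ... + log Pn."""
    return sum(math.log(LETTER_PROB[letter]) for letter in seq)


seq = "GATTACA"
print(sequence_probability(seq))
print(math.exp(sequence_log_probability(seq)))  # same value up to rounding
```

Summing logarithms also avoids numerical underflow for long sequences, which is one practical reason alignment scores are kept in log space.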
The two observations about amino acids (similar properties; similar codons) give us ways to define substitution matrices
• Alignment is simulated as a Markov process; all sequence positions are seen as independent
• Chances of sequence events are independent
– Therefore, probabilities per aligned position need to be multiplied
– Amino acid matrices contain so-called log-odds values (log10 of the ratio between observed and randomly expected exchange probabilities), so scores can be summed instead of multiplying probabilities
Technique to overcome the combinatorial explosion:
Dynamic Programming
Sequence alignment
History of Dynamic
Programming algorithm
1970 Needleman-Wunsch global pair-wise
alignment
Needleman SB, Wunsch CD (1970) A general method applicable to the search for
similarities in the amino acid sequence of two proteins, J Mol Biol. 48(3):443-53.
1981 Smith-Waterman local pair-wise
alignment
Smith, TF, Waterman, MS (1981) Identification of common molecular subsequences.
J. Mol. Biol. 147, 195-197.
Pairwise sequence alignment
Global dynamic programming
[Diagram: the sequences MDAGSTVILCFVG and MDAASTILCGS, related by evolution, span a search matrix with rows i-1, i and columns j-1, j. Cell values come from the residue exchange matrix (Amino Acid Exchange Matrix) and the gap penalties (open, extension). Resulting alignment:]
MDAGSTVILCFVG-
MDAAST-ILC--GS
$H(i,j) = \max \begin{cases}
H(i-1,j-1) + S(i,j) & \text{diagonal} \\
H(i-1,j) - g & \text{vertical} \\
H(i,j-1) - g & \text{horizontal}
\end{cases}$
This is a recursive formula
Global dynamic programming
PAM250, Gap = 6 (linear)
           S    P    E    A    R    E
      0   -6  -12  -18  -24  -30  -36
S    -6    2   -4  -10  -16  -22  -28 |  -24
H   -12   -4    2   -3   -9  -14  -20 |  -18
A   -18  -10   -3    2   -1   -7  -13 |  -12
K   -24  -16   -9   -3    1    2   -4 |   -6
E   -30  -22  -15   -5   -3    0    6 |
         -30  -24  -18  -12   -6    0
The extra bottom row and rightmost column give the penalties that would need to be applied due to end gaps
     S    P    E    A    R    E
S    2    1    0    1    0    0
H   -1    0    1   -1    2    1
A    1    1    0    2   -2    0
K    0   -1    0   -1    3    0
E    0   -1    4    0   -1    4
These values are copied from the PAM250 matrix (see earlier slide)
Affine gap penalties
$S_{i,j} = s_{i,j} + \max \begin{cases}
\max_{0<x<i-1} \{ S_{x,j-1} - P_o - (i-x-1) P_x \} \\
S_{i-1,j-1} \\
\max_{0<y<j-1} \{ S_{i-1,y} - P_o - (j-y-1) P_x \}
\end{cases}$
$P_o$: gap opening penalty; $P_x$: gap extension penalty (Higgs & Attwood, p. 124)
Easy DP recipe for using affine gap penalties
• M[i,j] is optimal alignment (highest scoring alignment until [i,j])
• Check
– preceding row until j-2: apply appropriate gap penalties
– preceding column until i-2: apply appropriate gap penalties
– and cell[i-1, j-1]: apply score for cell[i-1, j-1]
Global dynamic programming
Gapo=10, Gape=2
           D    W    V    T    A    L    K
      0  -12  -14  -16  -18  -20  -22  -24
T   -12    8   -9   -6   -5   -9  -11  -14 |  -34
D   -14    0    9    2    2    3   -5   -3 |  -21
W   -16  -13   25   11    5    4    9    0 |  -16
V   -18  -10   -4   37   21   19   19   15 |    1
L   -20  -14   -2   23   46   31   37   26 |   14
K   -22  -12   -9   17   33   53   39   50 |
         -34  -29   -1   17   39   27   50
The extra bottom row and rightmost column give the final global alignment scores
     D    W    V    T    A    L    K
T    8    3    8   11    9    9    8
D   12    1    6    8    8    4    8
W    1   25    2    3    2    6    5
V    6    2   12    8    8   10    6
L    4    6   10    9    6   14    5
K    8    5    6    8    7    5   13
These values are copied from the PAM250 matrix (see earlier slide), after being made non-negative by adding 8 to each PAM250 matrix cell (-8 is the lowest number in the PAM250 matrix)
Global dynamic programming
DP is a two-step process
• Forward step: calculate scores
• Trace back: start at highest score and
reconstruct the path leading to the highest
score
– These two steps lead to the highest scoring
alignment (the optimal alignment)
– This is guaranteed when you use DP!
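The two steps (forward calculation and trace back) can be sketched for the linear-gap example with the sequences SHAKE and SPEARE, PAM250 exchange values and Gap = 6. This is a minimal sketch under those assumptions; the function name `needleman_wunsch` and the tie-breaking order in the trace back (diagonal, then vertical, then horizontal) are choices made here, not something the slides prescribe.

```python
ROWS, COLS = "SHAKE", "SPEARE"
# PAM250 values for the residue pairs of the example (rows SHAKE, columns SPEARE)
VALUES = [
    [2, 1, 0, 1, 0, 0],
    [-1, 0, 1, -1, 2, 1],
    [1, 1, 0, 2, -2, 0],
    [0, -1, 0, -1, 3, 0],
    [0, -1, 4, 0, -1, 4],
]
SUB = {(r, c): VALUES[i][j] for i, r in enumerate(ROWS) for j, c in enumerate(COLS)}


def needleman_wunsch(a, b, sub, gap):
    """Global alignment with a linear gap penalty; returns score and alignment."""
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        H[i][0] = -gap * i
    for j in range(1, m + 1):
        H[0][j] = -gap * j
    # Forward step: fill the search matrix with the recursive formula
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            H[i][j] = max(H[i - 1][j - 1] + sub[(a[i - 1], b[j - 1])],  # diagonal
                          H[i - 1][j] - gap,                            # vertical
                          H[i][j - 1] - gap)                            # horizontal
    # Trace back: start at the bottom-right cell and follow the path back
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and H[i][j] == H[i - 1][j - 1] + sub[(a[i - 1], b[j - 1])]:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and H[i][j] == H[i - 1][j] - gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return H[n][m], "".join(reversed(top)), "".join(reversed(bottom))


score, aligned_a, aligned_b = needleman_wunsch(ROWS, COLS, SUB, 6)
print(score)
print(aligned_a)
print(aligned_b)
```

The sketch penalises end gaps like any other gap, so the global score is read from the bottom-right cell of the matrix.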
Semi-global dynamic programming
- two examples with different gap penalties
Semi-global pairwise alignment
• Global alignment: all gaps are penalised
• Semi-global alignment: N- and C-terminal gaps (end-gaps) are not penalised
MSTGAVLIY--TS-----
---GGILLFHRTSGTSNS
(End-gaps at both ends)
Global score is 65 – 10 – 1*2 – 10 – 2*2
Semi-global pairwise alignment
Applications of semi-global:
– Finding a gene in genome
– Placing marker onto a chromosome
– One sequence much longer than the other
Danger: if gap penalties high -- really bad alignments for divergent sequences
Local dynamic programming (Smith & Waterman, 1981)
[Diagram: the sequences LCFVMLAGSTVIVGTR and EDASTILCGS span a search matrix; an Amino Acid Exchange Matrix containing negative numbers and gap penalties (open, extension) yield the local alignment]
AGSTVIVG
A-STILCG
Local dynamic programming (Smith & Waterman, 1981)
$S_{i,j} = \max \begin{cases}
s_{i,j} + \max_{0<x<i-1} \{ S_{x,j-1} - P_o - (i-x-1) P_x \} \\
s_{i,j} + S_{i-1,j-1} \\
s_{i,j} + \max_{0<y<j-1} \{ S_{i-1,y} - P_o - (j-y-1) P_x \} \\
0
\end{cases}$
$s_{i,j}$: value from the residue exchange matrix; $P_o$: gap opening penalty; $P_x$: gap extension penalty
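The effect of the extra 0 in the local recursion can be seen in a small sketch. To keep it short, the sketch below deliberately simplifies the formula above: it uses a linear gap penalty instead of the affine one and a plain match/mismatch score instead of an amino acid exchange matrix. The function name, the scoring values and the two DNA test sequences are assumptions made for illustration.

```python
def smith_waterman(a, b, match=1, mismatch=-1, gap=2):
    """Local alignment (simplified: linear gap penalty, match/mismatch scores)."""
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # The 0 lets an alignment restart anywhere, so no cell is negative
            H[i][j] = max(0, H[i - 1][j - 1] + s, H[i - 1][j] - gap, H[i][j - 1] - gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the highest-scoring cell until a cell with score 0
    top, bottom = [], []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] - gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(smith_waterman("GGGACGTCCC", "TTTACGTAAA"))
```

Only the shared ACGT segment is reported; the dissimilar flanks are left out instead of being forced into the alignment as in the global case.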
Dot plots
• Way of representing (visualising) sequence
similarity without doing dynamic
programming (DP)
• Make same matrix, but locally represent
sequence similarity by averaging using a
window
Comparing two sequences
We want to be able to choose the best alignment between two
sequences.
A simple method of visualising similarities between two sequences
is to use dot plots. The first sequence to be compared is assigned to
the horizontal axis and the second is assigned to the vertical axis.
Dot plots can be filtered by
window approaches (to
calculate running averages)
and applying a threshold
They can identify
insertions, deletions,
inversions
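A dot plot with window filtering and a threshold can be sketched in a few lines. The sketch follows the convention above (first sequence on the horizontal axis, second on the vertical axis); the function name `dot_plot`, the use of identity counts within the window, and the parameter defaults are assumptions for illustration.

```python
def dot_plot(first, second, window=1, threshold=1):
    """Return a matrix of booleans: rows follow `second`, columns follow `first`.

    A dot is set when at least `threshold` of the `window` letter pairs
    along the diagonal starting at that cell are identical.
    """
    n_rows = len(second) - window + 1
    n_cols = len(first) - window + 1
    plot = []
    for r in range(n_rows):
        row = []
        for c in range(n_cols):
            matches = sum(1 for k in range(window) if second[r + k] == first[c + k])
            row.append(matches >= threshold)
        plot.append(row)
    return plot


def render(plot):
    """Draw the plot as text, one line per row."""
    return "\n".join("".join("*" if dot else "." for dot in row) for row in plot)


print(render(dot_plot("ACGTACGT", "ACGTTCGT", window=3, threshold=3)))
```

Raising the window size and the threshold removes isolated dots caused by chance identities, so that diagonals of real similarity stand out.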