Download Bioinformatics for Vet. Part XVII

Bioinformatics for Vet. Part V
Sung Youn Lee, PhD. Student
Veterinary college, Room 320
02 450 3719, 016 293 6059
[email protected]
Sequence alignment
• Are two sequences related?
– Ex) PEAR and TEAR
PEAR
 :::
TEAR
• If they are related, they might be functionally, structurally and/or evolutionarily related.
Sequence alignment
• Are two sequences related?
– Ex) ALIGNMENT and LIGAMENT
ALIGNMENT
 ::: ::::
_LIGAMENT
• Gap: insertion/deletion
Football Game
• FC Seoul
– Score 16 points
• Win 5 games, Lose 4 games, Tie 1 game
• Points per result: Win/Lose = 3/0, Tie = 1
• 5*3 + 1*1 + 4*0 = 16
Sequence alignment
• Are two sequences related? How do we score the alignment?
– Ex) ALIGNMENT and LIGAMENT
ALIGNMENT
 ::: ::::
_LIGAMENT
• Match/Mismatch, Gap (Insertion/Deletion) = +2/-1, -2
• 7 matches, 1 mismatch, 1 gap: 7*2 + 1*(-1) + 1*(-2) = 11
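The arithmetic above can be reproduced with a short Python sketch. The function name `score_alignment` and the use of `_` as the gap character are assumptions made for illustration; the scores (+2, -1, -2) are the ones given on the slide.

```python
def score_alignment(top, bottom, match=2, mismatch=-1, gap=-2, gap_char="_"):
    """Score two already-aligned strings column by column."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    for a, b in zip(top, bottom):
        if a == gap_char or b == gap_char:
            score += gap        # insertion/deletion column
        elif a == b:
            score += match      # identical residues
        else:
            score += mismatch   # substituted residues
    return score


print(score_alignment("ALIGNMENT", "_LIGAMENT"))  # 11
```

The function only scores an alignment that is already given; it does not search for one.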
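Better alignment
1st ACGGACT, 2nd ATCGGATCT
Scoring: Match/Mismatch, Gap = +2/-1, -2
A_C_GG_ACT
: : :   ::
ATCGGAT_CT
Match/Mismatch, Gap = 5/1, 4
Score = 5*2 + 1*(-1) + 4*(-2) = 1
A_CGG_ACT
: :::  ::
ATCGGATCT
Match/Mismatch, Gap = 6/1, 2
Score = 6*2 + 1*(-1) + 2*(-2) = 7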
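Comparing alignments by hand raises the question of what the best possible score is. The slides do not name a method; the sketch below swaps in a standard Needleman-Wunsch style dynamic program, which is an assumption on our part, and returns only the best global score under the same +2/-1, -2 scheme. The function name `best_global_score` is hypothetical.

```python
def best_global_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best global alignment score by dynamic programming (score only)."""
    # prev[j] = best score of aligning the processed prefix of a with b[:j]
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # a[:i] aligned against an empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


print(best_global_score("ACGGACT", "ATCGGATCT"))  # 10
```

For these two sequences the optimum is 10 (7 matches and 2 gaps), so a hand-built alignment scoring 7 is better than one scoring 1 but is not yet the best available.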
Scoring Matrices
• Match/mismatch score
– Works reasonably well for similar sequences
– Does not detect distantly related sequences
• Likelihood matrix
– Scores each residue pair according to how likely that substitution is to be found in nature
– More applicable to amino acid sequences
Nucleic Acid Scoring Matrices
• Two mutation models:
– Uniform mutation rates (Jukes-Cantor)
– Two separate mutation rates (Kimura)
• Transitions (α)
• Transversions (β)
DNA Mutations
PURINES: A, G
PYRIMIDINES: C, T
Transitions: A↔G; C↔T
Transversions: A↔C, A↔T, C↔G, G↔T
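The purine/pyrimidine rule can be written as a small check. This is a minimal sketch; the name `classify_substitution` is hypothetical and only the four DNA bases are handled.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(x, y):
    """Label a base change as a transition or a transversion."""
    x, y = x.upper(), y.upper()
    for base in (x, y):
        if base not in PURINES | PYRIMIDINES:
            raise ValueError(f"not a DNA base: {base!r}")
    if x == y:
        return "identical"
    # Same chemical class (purine<->purine or pyrimidine<->pyrimidine)
    same_class = (x in PURINES) == (y in PURINES)
    return "transition" if same_class else "transversion"


print(classify_substitution("A", "G"))  # transition
print(classify_substitution("A", "C"))  # transversion
```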
PAM1 DNA odds matrices
A. Model of uniform mutation rates among nucleotides.
   A        G        T        C
A  0.99
G  0.00333  0.99
T  0.00333  0.00333  0.99
C  0.00333  0.00333  0.00333  0.99
B. Model of 3-fold higher transitions than transversions.
   A      G      T      C
A  0.99
G  0.006  0.99
T  0.002  0.002  0.99
C  0.002  0.002  0.006  0.99
PAM1 DNA log-odds matrices
A. Model of uniform mutation rates among nucleotides.
    A   G   T   C
A   2
G  -6   2
T  -6  -6   2
C  -6  -6  -6   2
B. Model of 3-fold higher transitions than transversions.
    A   G   T   C
A   2
G  -5   2
T  -7  -7   2
C  -7  -7  -5   2
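The slides do not state how the log-odds entries (2, -6, -5, -7) follow from the PAM1 probabilities (0.99, 0.00333, 0.006, 0.002). One reading that reproduces them, offered here as an assumption, is a base-2 logarithm of the probability divided by a uniform background frequency of 0.25, rounded to the nearest integer. The name `log_odds` is hypothetical.

```python
import math


def log_odds(prob, background=0.25):
    """Rounded log2 odds; the 0.25 background and base 2 are assumptions."""
    return round(math.log2(prob / background))


for label, p in [
    ("no change", 0.99),
    ("any change, uniform model", 0.00333),
    ("transition", 0.006),
    ("transversion", 0.002),
]:
    print(label, log_odds(p))
# no change 2
# any change, uniform model -6
# transition -5
# transversion -7
```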
PAM1 matrix
normalized probabilities multiplied by 10000
   Ala  Arg  Asn  Asp  Cys  Gln  Glu  Gly  His  Ile  Leu  Lys  Met  Phe  Pro  Ser  Thr  Trp  Tyr  Val
     A    R    N    D    C    Q    E    G    H    I    L    K    M    F    P    S    T    W    Y    V
A 9867    2    9   10    3    8   17   21    2    6    4    2    6    2   22   35   32    0    2   18
R    1 9913    1    0    1   10    0    0   10    3    1   19    4    1    4    6    1    8    0    1
N    4    1 9822   36    0    4    6    6   21    3    1   13    0    1    2   20    9    1    4    1
D    6    0   42 9859    0    6   53    6    4    1    0    3    0    0    1    5    3    0    0    1
C    1    1    0    0 9973    0    0    0    1    1    0    0    0    0    1    5    1    0    3    2
Q    3    9    4    5    0 9876   27    1   23    1    3    6    4    0    6    2    2    0    0    1
E   10    0    7   56    0   35 9865    4    2    3    1    4    1    0    3    4    2    0    1    2
G   21    1   12   11    1    3    7 9935    1    0    1    2    1    1    3   21    3    0    0    5
H    1    8   18    3    1   20    1    0 9912    0    1    1    0    2    3    1    1    1    4    1
I    2    2    3    1    2    1    2    0    0 9872    9    2   12    7    0    1    7    0    1   33
L    3    1    3    0    0    6    1    1    4   22 9947    2   45   13    3    1    3
PAM250 Log odds matrix
Example
• Using PAM250, the following alignment is found:
F W L E V E G N S M T A P T G
F W L D V Q G D S M T A P A G
Example
• Using PAM250, the score is calculated:
F W L E V E G N S M T A P T G
F W L D V Q G D S M T A P A G
• S = 9 + 17 + 6 + 3 + 4 + 2 + 5 + 2 + 2 + 6 + 3 + 2 + 6 + 1 + 5 = 73
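The sum can be checked with a lookup table. The sketch below is not a full PAM250 matrix: `PAM250_SUBSET` (a hypothetical name) holds only the pair scores that appear in the sum above, read off in column order.

```python
# Pair scores taken from the terms of the sum; keys are alphabetically sorted pairs.
PAM250_SUBSET = {
    ("F", "F"): 9, ("W", "W"): 17, ("L", "L"): 6, ("D", "E"): 3,
    ("V", "V"): 4, ("E", "Q"): 2, ("G", "G"): 5, ("D", "N"): 2,
    ("S", "S"): 2, ("M", "M"): 6, ("T", "T"): 3, ("A", "A"): 2,
    ("P", "P"): 6, ("A", "T"): 1,
}


def pair_score(a, b):
    """Look up a symmetric pair score; raises KeyError for pairs not listed."""
    return PAM250_SUBSET[tuple(sorted((a, b)))]


seq1 = "FWLEVEGNSMTAPTG"
seq2 = "FWLDVQGDSMTAPAG"
total = sum(pair_score(a, b) for a, b in zip(seq1, seq2))
print(total)  # 73
```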
Quick Calculation
• If a bit scoring system is used, the significance cutoff is:
$\log_2(mn)$
where m and n are the lengths of the two sequences
Example
• 2 sequences, each 250 amino acids long
• Significance cutoff:
– $\log_2(250 \times 250) \approx 16$ bits
Significance Example
• PAM250 scores are in 1/3-bit units, so S' = S/3 = 73/3 ≈ 24.33 bits
• Significance cutoff ≈ 16 bits
• 16 < 24.33
• Therefore, this alignment is significant
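The quick check can be sketched in a few lines. The function names are hypothetical, and the conversion assumes, as the example does, that the raw score is divided by 3 to obtain bits.

```python
import math


def significance_cutoff_bits(m, n):
    """Cutoff in bits for two sequences of lengths m and n: log2(m * n)."""
    return math.log2(m * n)


def raw_to_bits(raw_score, units_per_bit=3):
    """Convert a raw score to bits (assumes 3 raw units per bit, as in the example)."""
    return raw_score / units_per_bit


cutoff = significance_cutoff_bits(250, 250)
bits = raw_to_bits(73)
print(round(cutoff, 2), round(bits, 2), bits > cutoff)  # 15.93 24.33 True
```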
Probability of Alignment Score
• Expected # of alignments with score at least S (E-value):
$E = K m n \, e^{-\lambda S}$
– m, n: lengths of the sequences
– K, λ: natural scales for
• Search space size (K)
• Scoring system (λ)
• For PAM250, K = 0.09; λ = 0.229
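As a rough illustration, the E-value formula can be evaluated directly. The sketch assumes that S is the raw PAM250 score (73 in the earlier example) and reuses the 250-residue sequence lengths; `e_value` and `lam` are hypothetical names.

```python
import math


def e_value(score, m, n, K=0.09, lam=0.229):
    """E = K * m * n * exp(-lambda * S), with the PAM250 constants as defaults."""
    return K * m * n * math.exp(-lam * score)


E = e_value(73, 250, 250)
print(f"{E:.2e}")
```

Under these assumptions E comes out far below 1, meaning that a score this high is not expected to arise by chance in a search space of this size.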
P-Value
• P-Value: probability of obtaining at least a given score at random
$P = 1 - e^{-E}$
which is approximately E when E is small
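A minimal sketch of the conversion from E-value to P-value follows. It uses `math.expm1` so that very small E-values do not lose precision; the function name `p_value` is hypothetical.

```python
import math


def p_value(e):
    """P = 1 - exp(-E), computed as -expm1(-E) for numerical accuracy."""
    return -math.expm1(-e)


for e in (0.001, 0.1, 1.0, 5.0):
    print(e, p_value(e))
```

For small E the result is close to E itself, while for large E it approaches 1.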
Thank you for your attention ~
Standard Codes (IUPAC)
A = adenine
C = cytosine
G = guanine
T = thymine
U = uracil
R = G A (purine)
Y = T C (pyrimidine)
K = G T (keto)
M = A C (amino)
S = G C
W = A T
B = G T C
D = G A T
H = A C T
V = G C A
N = A G C T (any)
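The nucleotide codes map naturally onto a lookup table. The sketch below copies the list above into a dictionary and adds a hypothetical helper, `codes_compatible`, that tests whether two codes can stand for the same base.

```python
# Each IUPAC code mapped to the bases it can represent (from the list above).
IUPAC_NUCLEOTIDES = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "U",
    "R": "GA", "Y": "TC", "K": "GT", "M": "AC",
    "S": "GC", "W": "AT",
    "B": "GTC", "D": "GAT", "H": "ACT", "V": "GCA",
    "N": "AGCT",
}


def codes_compatible(x, y):
    """True if the two codes share at least one possible base."""
    bases_x = set(IUPAC_NUCLEOTIDES[x.upper()])
    bases_y = set(IUPAC_NUCLEOTIDES[y.upper()])
    return bool(bases_x & bases_y)


print(codes_compatible("R", "A"))  # True
print(codes_compatible("R", "Y"))  # False
```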
Standard IUPAC Codes
A  Ala  Alanine
R  Arg  Arginine
N  Asn  Asparagine
D  Asp  Aspartic acid
C  Cys  Cysteine
Q  Gln  Glutamine
E  Glu  Glutamic acid
G  Gly  Glycine
H  His  Histidine
I  Ile  Isoleucine
L  Leu  Leucine
K  Lys  Lysine
M  Met  Methionine
F  Phe  Phenylalanine
P  Pro  Proline
S  Ser  Serine
T  Thr  Threonine
W  Trp  Tryptophan
Y  Tyr  Tyrosine
V  Val  Valine
B  Asx  Aspartic acid or Asparagine
Z  Glx  Glutamine or Glutamic acid
X  Xaa or Xxx  Any amino acid