Bioinformatics for Vet. Part V
Sung Youn Lee, PhD student
Veterinary college, Room 320
02 450 3719, 016 293 6059
[email protected]

Sequence alignment
• Are two sequences related?
  – Ex) PEAR and TEAR

    PEAR
     :::
    TEAR

• If they are related, they might be functionally, structurally and/or evolutionarily related.

Sequence alignment
• Are two sequences related?
  – Ex) ALIGNMENT and LIGAMENT

    ALIGNMENT
     ::: ::::
    _LIGAMENT

• Gap: insertion/deletion

Football Game
• FC Seoul
  – Score: 16 points
• Win 5 games, lose 4 games, tie 1 game
• 5*3 + 1*1 + 4*0 = 16
• Win/Lose, Tie = 3/0, 1

Sequence alignment
• Are two sequences related? How to score?
  – Ex) ALIGNMENT and LIGAMENT

    ALIGNMENT
     ::: ::::
    _LIGAMENT

• Match/Mismatch, Gap (insertion/deletion) = 2/-1, -2
• 7 matches, 1 mismatch, 1 gap: 7*2 + 1*(-1) + 1*(-2) = 11

Better alignment
• 1st: ACGGACT, 2nd: ATCGGATCT
• Match/Mismatch, Gap = +2/-1, -2

    A_C_GG_ACT
    : : :   ::
    ATCGGAT_CT

  – 5 matches, 1 mismatch, 4 gaps: 5*2 + 1*(-1) + 4*(-2) = 1

    A_CGG_ACT
    : :::  ::
    ATCGGATCT

  – 6 matches, 1 mismatch, 2 gaps: 6*2 + 1*(-1) + 2*(-2) = 7

Scoring Matrices
• Match/mismatch score
  – Not bad for similar sequences
  – Does not detect distantly related sequences
• Likelihood matrix
  – Scores each residue pair by how likely that substitution is to be found in nature
  – More applicable to amino acid sequences

Nucleic Acid Scoring Matrices
• Two mutation models:
  – Uniform mutation rates (Jukes-Cantor)
  – Two separate mutation rates (Kimura)
    • Transitions (alpha)
    • Transversions (beta)

DNA Mutations
• Purines: A, G
• Pyrimidines: C, T
• Transitions (within purines or within pyrimidines): A↔G, C↔T
• Transversions (between a purine and a pyrimidine): A↔C, A↔T, C↔G, G↔T

PAM1 DNA odds matrices
A. Model of uniform mutation rates among nucleotides.

         A        G        T        C
    A    0.99
    G    0.00333  0.99
    T    0.00333  0.00333  0.99
    C    0.00333  0.00333  0.00333  0.99

B. Model of 3-fold higher transitions than transversions.

         A        G        T        C
    A    0.99
    G    0.006    0.99
    T    0.002    0.002    0.99
    C    0.002    0.002    0.006    0.99

PAM1 DNA log-odds matrices
A. Model of uniform mutation rates among nucleotides.

         A    G    T    C
    A    2
    G   -6    2
    T   -6   -6    2
    C   -6   -6   -6    2

B. Model of 3-fold higher transitions than transversions.

         A    G    T    C
    A    2
    G   -5    2
    T   -7   -7    2
    C   -7   -7   -5    2

PAM1 matrix: normalized probabilities multiplied by 10000 (rows Ala to Leu)

          A    R    N    D    C    Q    E    G    H    I    L    K    M    F    P    S    T    W    Y    V
A Ala  9867    2    9   10    3    8   17   21    2    6    4    2    6    2   22   35   32    0    2   18
R Arg     1 9913    1    0    1   10    0    0   10    3    1   19    4    1    4    6    1    8    0    1
N Asn     4    1 9822   36    0    4    6    6   21    3    1   13    0    1    2   20    9    1    4    1
D Asp     6    0   42 9859    0    6   53    6    4    1    0    3    0    0    1    5    3    0    0    1
C Cys     1    1    0    0 9973    0    0    0    1    1    0    0    0    0    1    5    1    0    3    2
Q Gln     3    9    4    5    0 9876   27    1   23    1    3    6    4    0    6    2    2    0    0    1
E Glu    10    0    7   56    0   35 9865    4    2    3    1    4    1    0    3    4    2    0    1    2
G Gly    21    1   12   11    1    3    7 9935    1    0    1    2    1    1    3   21    3    0    0    5
H His     1    8   18    3    1   20    1    0 9912    0    1    1    0    2    3    1    1    1    4    1
I Ile     2    2    3    1    2    1    2    0    0 9872    9    2   12    7    0    1    7    0    1   33
L Leu     3    1    3    0    0    6    1    1    4   22 9947    2   45   13    3    1    3

PAM250 Log odds matrix

Example
• Using PAM250, the following alignment is found:

    F W L E V E G N S M T A P T G
    F W L D V Q G D S M T A P A G

• Using PAM250, the score is calculated column by column:
  – S = 9 + 17 + 6 + 3 + 4 + 2 + 5 + 2 + 2 + 6 + 3 + 2 + 6 + 1 + 5 = 73

Quick Calculation
• If a bit scoring system is used, the significance cutoff is log2(mn), where m and n are the sequence lengths

Example
• 2 sequences, each 250 amino acids long
• Significance cutoff:
  – log2(250 * 250) ≈ 15.9, about 16 bits

Significance Example
• S' = S/3 = 73/3 ≈ 24.33 bits
• Significance cutoff = 16 bits
• 16 < 24.33
• Therefore, this alignment is significant

Probability of Alignment Score
• Expected number of alignments with score at least S (E-value): $E = K m n\, e^{-\lambda S}$
  – m, n: lengths of the sequences
  – K, λ: natural scales for
    • search space size (K)
    • scoring system (λ)
  – For PAM250, K = 0.09; λ = 0.229

P-Value
• P-value: probability of obtaining at least a given score at random
• $P = 1 - e^{-E}$, which is approximately E when E is small

Thank you for your attention ~

Standard Codes (IUPAC)

    A = adenine
    C = cytosine
    G = guanine
    T = thymine
    U = uracil
    R = G A (purine)
    Y = T C (pyrimidine)
    K = G T (keto)
    M = A C (amino)
    S = G C
    W = A T
    B = G T C
    D = G A T
    H = A C T
    V = G C A
    N = A G C T (any)

Standard IUPAC Codes

    A   Ala          Alanine
    R   Arg          Arginine
    N   Asn          Asparagine
    D   Asp          Aspartic acid
    C   Cys          Cysteine
    Q   Gln          Glutamine
    E   Glu          Glutamic acid
    G   Gly          Glycine
    H   His          Histidine
    I   Ile          Isoleucine
    L   Leu          Leucine
    K   Lys          Lysine
    M   Met          Methionine
    F   Phe          Phenylalanine
    P   Pro          Proline
    S   Ser          Serine
    T   Thr          Threonine
    W   Trp          Tryptophan
    Y   Tyr          Tyrosine
    V   Val          Valine
    B   Asx          Aspartic acid or Asparagine
    Z   Glx          Glutamine or Glutamic acid
    X   Xaa or Xxx   Any amino acid
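
Worked sketch: scoring a given alignment

The match/mismatch/gap scheme from the "How to score" and "Better alignment" slides (+2/-1, -2) can be checked with a few lines of Python. This is only a minimal sketch: the function name `score_alignment` and the use of `_` as the gap character are assumptions made for illustration, and the code scores an alignment that is already given rather than searching for the best one.

```python
def score_alignment(top, bottom, match=2, mismatch=-1, gap=-2, gap_char="_"):
    """Score two already-aligned strings of equal length, column by column."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for a, b in zip(top, bottom):
        if a == gap_char or b == gap_char:
            total += gap        # insertion/deletion column
        elif a == b:
            total += match      # identical residues
        else:
            total += mismatch   # different residues
    return total


# The three alignments shown on the slides.
print(score_alignment("ALIGNMENT", "_LIGAMENT"))    # 11
print(score_alignment("A_C_GG_ACT", "ATCGGAT_CT"))  # 1
print(score_alignment("A_CGG_ACT", "ATCGGATCT"))    # 7
```

The second and third calls align the same pair of sequences; the higher score (7 versus 1) is what marks the second arrangement as the better alignment.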
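
Worked sketch: from PAM1 DNA odds to log-odds

The slides list the PAM1 DNA odds and log-odds matrices without the step between them. One way to reproduce the log-odds values is to divide each probability by a background frequency and take a base-2 logarithm, rounded to the nearest integer. The background frequency of 0.25 (all four bases equally common) and the rounding rule are assumptions here, not something the slides state, and the helper names are hypothetical.

```python
import math

BASES = "AGTC"
PURINES = {"A", "G"}
BACKGROUND = 0.25  # assumption: all four bases equally frequent


def log_odds(probability, background=BACKGROUND):
    """Base-2 log-odds score, rounded to the nearest integer."""
    return round(math.log2(probability / background))


def log_odds_matrix(transition, transversion, identity=0.99):
    """Build a full 4x4 log-odds table keyed by (base, base)."""
    matrix = {}
    for x in BASES:
        for y in BASES:
            if x == y:
                p = identity
            elif (x in PURINES) == (y in PURINES):
                p = transition      # A<->G or C<->T
            else:
                p = transversion    # purine <-> pyrimidine
            matrix[x, y] = log_odds(p)
    return matrix


uniform = log_odds_matrix(transition=0.00333, transversion=0.00333)  # model A
biased = log_odds_matrix(transition=0.006, transversion=0.002)       # model B

for name, table in (("A", uniform), ("B", biased)):
    print("Model", name)
    for x in BASES:
        print(x, [table[x, y] for y in BASES])
```

Under these assumptions the printed tables agree with the slide values: 2 on the diagonal, -6 everywhere else for model A, and -5 (transitions) or -7 (transversions) for model B.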
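
Worked sketch: significance of the PAM250 example

The bit-score cutoff, E-value and P-value from the last slides can be put together for the example alignment (S = 73, two sequences of 250 residues). This is a sketch under the slide's own numbers: K = 0.09, λ = 0.229, and a conversion of one bit per three raw PAM250 score units. The function names are hypothetical.

```python
import math


def bit_score(raw_score, units_per_bit=3):
    """Convert a raw PAM250 score to bits (S' = S / 3 on the slides)."""
    return raw_score / units_per_bit


def significance_cutoff_bits(m, n):
    """Cutoff log2(m * n) for sequences of lengths m and n."""
    return math.log2(m * n)


def e_value(raw_score, m, n, K=0.09, lam=0.229):
    """Expected number of alignments scoring at least raw_score: K*m*n*exp(-lam*S)."""
    return K * m * n * math.exp(-lam * raw_score)


def p_value(e):
    """P = 1 - exp(-E); expm1 keeps precision when E is tiny."""
    return -math.expm1(-e)


S, m, n = 73, 250, 250
bits = bit_score(S)
cutoff = significance_cutoff_bits(m, n)
E = e_value(S, m, n)
P = p_value(E)

print(f"bit score  = {bits:.2f}")
print(f"cutoff     = {cutoff:.2f}")
print(f"significant: {bits > cutoff}")
print(f"E-value    = {E:.3g}")
print(f"P-value    = {P:.3g}")
```

For an E-value this small, P and E are nearly identical, which is why the two are often used interchangeably for strong hits.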