Scoring Matrices
April 23, 2009

Learning objectives
1) Last word on Global Alignment
2) Understand how the Smith-Waterman algorithm can be applied to perform local alignment.
3) Have a general understanding of PAM and BLOSUM scoring matrices.

Homework 3 and 4 due today
Quiz 1 today
Writing topic due today
Homework 5 due Thursday, April 30.

Global Alignment output file

Global: HBA_HUMAN vs HBB_HUMAN
Score: 290.50

HBA_HUMAN    1 VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFP 44
               |:| :|: | | |||| : | | ||| |: : :| |: :|
HBB_HUMAN    1 VHLTPEEKSAVTALWGKV..NVDEVGGEALGRLLVVYPWTQRFFE 43

HBA_HUMAN   45 HF.DLS.....HGSAQVKGHGKKVADALTNAVAHVDDMPNALSAL 83
               | ||| |: :|| ||||| | :: :||:|:: : |
HBB_HUMAN   44 SFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATL 88

HBA_HUMAN   84 SDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKF 128
               |:|| || ||| ||:|| : |: || | |||| | |: |
HBB_HUMAN   89 SELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKV 133

HBA_HUMAN  129 LASVSTVLTSKYR 141
               :| |: | ||
HBB_HUMAN  134 VAGVANALAHKYH 146

%id = 45.32   %similarity = 63.31 (88/139 *100)
Overall %id = 43.15; Overall %similarity = 60.27 (88/146 *100)

Smith-Waterman Algorithm
Advances in Applied Mathematics, 2:482-489 (1981)

Smith-Waterman algorithm
- Can be used for local alignment
- Memory intensive
- Common search programs such as BLAST use a faster heuristic approximation of the SW algorithm

Smith-Waterman (cont. 1)
a. Initializes the edges of the matrix with zeros.
b. Searches for sequence matches.
c. Assigns a score to each pair of amino acids
   - uses similarity scores
   - uses positive scores for related residues
   - uses negative scores for unfavorable substitutions and gaps
d. Scores are summed for placement into Mi,j. If any sum is below 0, a 0 is placed into Mi,j.
e. Backtrace begins at the maximum value found anywhere in the matrix.
f. Backtrace continues until it meets an Mi,j value of 0.

Smith-Waterman (cont. 2)

          H   E   A   G   A   W   G   H   E   E
      0   0   0   0   0   0   0   0   0   0   0
  P   0   0   0   0   0   0   0   0   0   0   0
  A   0   0   0   5   0   5   0   0   0   0   0
  W   0   0   0   0   3   0  20  12   4   0   0
  H   0  10   2   0   0   1  12  18  22  14   6
  E   0   2  16   8   0   0   4  10  18  28  20
  A   0   0   8  21  13   5   0   4  10  20  27
  E   0   0   6  13  19  12   4   0   4  16  26

Put zeros on the top row and left column.
Assign initial scores based on a scoring matrix.
Calculate new scores based on adjacent cell scores.
If a sum is less than or equal to zero, begin new scoring with the next cell.
This example uses the BLOSUM45 scoring matrix with a gap penalty of -8.

Smith-Waterman (cont. 3)

Begin the backtrace at the maximum value found anywhere in the matrix (28).
Continue the backtrace until the score falls to zero.
Backtrace path through the matrix above (scores): 28, 22, 12, 20, 5, 0

AWGHE
|| ||
AW-HE
Score = 28

Calculation of similarity score and percent similarity

  A    W    G    H    E
  A    W    -    H    E
  5   15   -8   10    6

5, 15, 10 and 6 are BLOSUM45 scores; -8 is the gap penalty (novel).

Similarity score = 5 + 15 - 8 + 10 + 6 = 28
% similarity = number of positive scores divided by number of AAs in region x 100
% similarity = 4/5 x 100 = 80%

Why search sequence databases?
1. I have just sequenced something. What is known about the thing I sequenced?
2. I have a unique sequence. Does it have similarity to another gene of known function?
3. I found a new protein sequence in a lower organism. Is it similar to a protein from another species?

Perfect searches for similar sequences in a database
- First “hit” should be an exact match.
- Next “hits” should contain all of the genes that are related to your gene (homologs).
- Next “hits” should be similar but are not homologs.

How does one achieve the “perfect search”?
Consider the following:
- Scoring matrices (PAM vs. BLOSUM)
- Local alignment algorithm
- Database
- Search parameters
   - Expect value: change the threshold for score reporting
   - Translation: of DNA sequence into protein
   - Filtering: remove repeat sequences

Which Scoring Matrix to use?

PAM-1      BLOSUM-100    Small evolutionary distance    High identity within short sequences
PAM-250    BLOSUM-20     Large evolutionary distance    Low identity within long sequences

BLOSUM Scoring Matrices
Which BLOSUM Matrix to use?

BLOSUM    Identity (up to)
80        80%
62        62% (usually default value)
35        35%

If you are comparing sequences that are very similar, use BLOSUM 80. Sequences that are more divergent (dissimilar) than 20% are given very low scores in this matrix.
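
Smith-Waterman sketch (Python)

The Smith-Waterman steps a-f and the HEAGAWGHEE / PAWHEAE example from the slides can be sketched as below. This is a minimal illustration, not the code used to produce the slides. It assumes a linear gap penalty of -8 and uses only the four identity scores quoted on the slides (A/A = 5, W/W = 15, H/H = 10, E/E = 6). The names toy_score and smith_waterman, and the flat mismatch score of -4, are hypothetical placeholders standing in for a real BLOSUM45 lookup, so some intermediate cells differ from the slide matrix even though the best local alignment and its score come out the same.

```python
from typing import Callable, List, Tuple

GAP = -8  # linear gap penalty from the slide example

# Identity scores quoted on the slides (A/A, W/W, H/H, E/E).
IDENTITY = {"A": 5, "W": 15, "H": 10, "E": 6}
MISMATCH = -4  # ASSUMPTION: flat placeholder, not a real BLOSUM45 entry


def toy_score(a: str, b: str) -> int:
    """Hypothetical stand-in for a BLOSUM45 lookup."""
    if a == b:
        # ASSUMPTION: 5 for identities that are not quoted on the slides.
        return IDENTITY.get(a, 5)
    return MISMATCH


def smith_waterman(
    row_seq: str,
    col_seq: str,
    score_fn: Callable[[str, str], int] = toy_score,
    gap: int = GAP,
) -> Tuple[int, str, str]:
    """Return (best score, aligned col_seq segment, aligned row_seq segment)."""
    rows, cols = len(row_seq) + 1, len(col_seq) + 1
    # Step a: the top row and left column stay at zero.
    m: List[List[int]] = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Steps b-d: fill each cell from its neighbours, never going below zero.
    for i in range(1, rows):
        for j in range(1, cols):
            diag = m[i - 1][j - 1] + score_fn(row_seq[i - 1], col_seq[j - 1])
            up = m[i - 1][j] + gap
            left = m[i][j - 1] + gap
            m[i][j] = max(0, diag, up, left)
            if m[i][j] > best:
                best, best_pos = m[i][j], (i, j)

    # Steps e-f: backtrace from the maximum until a zero cell is reached.
    i, j = best_pos
    top: List[str] = []
    bottom: List[str] = []
    while i > 0 and j > 0 and m[i][j] > 0:
        pair = score_fn(row_seq[i - 1], col_seq[j - 1])
        if m[i][j] == m[i - 1][j - 1] + pair:
            top.append(col_seq[j - 1])
            bottom.append(row_seq[i - 1])
            i, j = i - 1, j - 1
        elif m[i][j] == m[i][j - 1] + gap:
            top.append(col_seq[j - 1])  # gap in the row sequence
            bottom.append("-")
            j -= 1
        else:
            top.append("-")  # gap in the column sequence
            bottom.append(row_seq[i - 1])
            i -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(smith_waterman("PAWHEAE", "HEAGAWGHEE"))  # -> (28, 'AWGHE', 'AW-HE')
```

Passing a real substitution-matrix lookup as score_fn in place of toy_score would be the natural next step; the fill and backtrace logic does not change.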
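
Percent identity and percent similarity sketch (Python)

The percent calculations shown on the slides (4/5 x 100 = 80% for the local alignment, and 88/139 and 88/146 for the global alignment) can be reproduced with a few lines. This is a hedged sketch: alignment_stats, percent and toy_score are hypothetical names, gaps are assumed to be written as "-" or ".", and the score function uses only the identity scores quoted on the slides plus an assumed flat mismatch of -4 rather than a real BLOSUM45 table.

```python
from typing import Callable, Tuple

GAP_CHARS = "-."  # both gap symbols that appear in the slide alignments

# Identity scores quoted on the slides; MISMATCH is an assumed placeholder.
IDENTITY = {"A": 5, "W": 15, "H": 10, "E": 6}
MISMATCH = -4


def toy_score(a: str, b: str) -> int:
    """Hypothetical stand-in for a BLOSUM45 lookup."""
    return IDENTITY.get(a, 5) if a == b else MISMATCH


def percent(count: int, length: int) -> float:
    """Count divided by region length, times 100."""
    return 100.0 * count / length


def alignment_stats(
    top: str,
    bottom: str,
    score_fn: Callable[[str, str], int] = toy_score,
) -> Tuple[float, float]:
    """Return (% identity, % similarity) over the aligned region, gaps included."""
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have the same length")
    identical = positive = 0
    for a, b in zip(top, bottom):
        if a in GAP_CHARS or b in GAP_CHARS:
            continue  # a gap column is neither identical nor similar
        if a == b:
            identical += 1
        if score_fn(a, b) > 0:
            positive += 1
    return percent(identical, len(top)), percent(positive, len(top))


print(alignment_stats("AWGHE", "AW-HE"))  # -> (80.0, 80.0)
print(round(percent(88, 139), 2))         # -> 63.31
print(round(percent(88, 146), 2))         # -> 60.27
```

The two denominators in the global example differ only in what counts as the region: 139 aligned positions versus the 146 residues of the longer sequence.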