Protein diversity and scoring matrices
Oct. 10, 2012

Quiz 1 today

Learning objectives
- General concepts of the molecular basis of evolution.
- The development of scoring matrices.

Homework assignment: do problems 1-5 in Chapter 4.

Evolutionary Basis of Sequence Alignment
Why are there regions of identity when comparing protein sequences?
1) Conserved function: amino acid residues participate in the reaction.
2) Structural (for example, conserved cysteine residues that form a disulfide linkage).
3) Historical: residues that are conserved solely due to a common ancestor gene.
library.thinkquest.org/19012/treeolif.htm

Conserved regions within a protein.

Molecular evolution and cancer
- DNA damage
- Mutation
- Positive selection
- Negative selection (purifying selection)
- Neutral mutations

Mutations in p53 gene in cancers

Duplicated domains

Retrovirus
Host cell: viral DNA integrated into genome (5' LTR ... 3' LTR)

Identity Matrix

     A  C  I  L
A    1
C    0  1
I    0  0  1
L    0  0  0  1

Simplest type of scoring matrix.

Similarity
It is easy to score whether an amino acid is identical to another (the score is 1 if identical and 0 if not). However, it is not easy to give a score for amino acids that are somewhat similar.

Isoleucine vs. leucine (structures shown with their +NH3 and CO2- groups)
Should they get a 0 (non-identical), a 1 (identical), or something in between?

One is mouse trypsin and the other is crayfish trypsin. They are homologous proteins. The sequences share 41% identity.

BLOSUM62 scoring matrix

BLOSUM Scoring Matrices
Which BLOSUM Matrix to use?

BLOSUM   Identity (up to)
80       80%
62       62% (usually default value)
35       35%

If you are comparing sequences that are very similar, use BLOSUM 80. Sequences that are more divergent (dissimilar) than 20% are given very low scores in this matrix.

Which Scoring Matrix to use?
PAM        BLOSUM       Evolutionary distance   Identity
PAM-1      BLOSUM-100   Small                   High identity within short sequences
PAM-250    BLOSUM-20    Large                   Low identity within long sequences

P53 protein binds DNA
5'-PuPuPuC(A/T)(T/A)GPyPyPy-(0-13 nucleotides)-PuPuPuC(A/T)(T/A)GPyPyPy-3'

Human p53 and placozoa p53

human      1 MEEPQSDPSVEPPLSQETF----SDLWKLL-------------PENNVLS  33
placozoa   1 -------MSDEPTLSQLSFSQELSSSWQLMIDEITQGKFNTNEDEGTAIY  43

human     34 PLPSQAMDDLMLSPDDIEQWFTEDPGPDEAPRMP-EAAPPVAPAP-----  77
placozoa  44 SYSEQNPDDRYLMRPNEPQYISAGYPDGQVGQLPREFAVNQIPSPRTFSD  93

human     78 ---------------AAPTPAAPAPAPSWPLSSSVPSQKTYQGSYGFRLG 112
placozoa  94 NVSSSADKAREAYYGQAVNGVSAETSPPLKRDPSLPSNAEYIGNFGFDIA 143

human    113 F-LHSGTAKSVTCTYSPALNKMFCQLAKTCPV------------------ 143
placozoa 144 IDQNDNPTKATNNTYSTMLKKLFIKMECLFPIHITIERMDYTFKIAYGSL 193

human    144 -------QLWVDSTPPPGTRVRAMAIYKQSQHMTEVVRRCPHHERCSDSD 186
placozoa 194 ATRRNCNQLIIPGEPPANSYIRAYVMYTKPQDVYEPVRRCPNH-ALRDQG 242

human    187 GLAPPQHLIRVEGNLRVEYLDDRNTFRHSVVVPYEPPEVGSDCTTIHYNY 236
placozoa 243 KYESSDHILRCESQ-RAEYYED-TSGRHSVRVPYTAPAVGELRSTLLYQF 290

human    237 MCNSSCMGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCACPGRDRRTEE 286
placozoa 291 MCFSSCSGSINRRPIELVITLENGT-NVLGRKKVEVRVCACPGRD-RSNE 338

human    287 ENLRKKGEPHHELPP-------------------GSTKRALPNNTSSSPQ 317
placozoa 339 ERAAMKSEKEHKQPPNKKLKTSKTVSREVTGVISNESKRIMKRSVESTS- 387

human    318 PKKKPLDGEYFTLQIRGRERFEMFRELNEALE----LKDAQAG--KEPGG 361
placozoa 388 ------NDDVFTITVRGRKNYEILAKMSESLEVLDKLSDAQINEIKSHGT 431

human    362 -----SRAHSSHLKSKKGQSTSRHKKLMFKTEGPDSD------------- 393
placozoa 432 LTAPLERTNTEELVRRQSRNLDTLQNAVTTKENSDGADLNLSISRWLSNI 481

human    394 -------------------------------------------------- 393
placozoa 482 NMEKYTQEFIKHGFKVCGHLANVSYSDMKKIIKNMEDCKKISAYLLESNF 531

human    394 --------------------------------------------- 393
placozoa 532 SSGNEEDIPCSQIGNSFRASQMSMNSTASQELDITRFTLRQTITL 576

Scoring Matrices
April 23, 2009

Learning objectives
- Last word on Global Alignment.
- Understand how the Smith-Waterman algorithm can be applied to perform local alignment.
- Have a general understanding about PAM and BLOSUM scoring matrices.

Homework 3 and 4 due today
Quiz 1 today
Writing topic due today
Homework 5 due Thursday, April 30.

Global Alignment output file

Global: HBA_HUMAN vs HBB_HUMAN
Score: 290.50

HBA_HUMAN   1 VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFP   44
HBB_HUMAN   1 VHLTPEEKSAVTALWGKV..NVDEVGGEALGRLLVVYPWTQRFFE  43

HBA_HUMAN  45 HF.DLS.....HGSAQVKGHGKKVADALTNAVAHVDDMPNALSAL  83
HBB_HUMAN  44 SFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATL  88

HBA_HUMAN  84 SDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKF 128
HBB_HUMAN  89 SELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKV 133

HBA_HUMAN 129 LASVSTVLTSKYR 141
HBB_HUMAN 134 VAGVANALAHKYH 146

%id = 45.32; %similarity = 63.31 (88/139 * 100)
Overall %id = 43.15; Overall %similarity = 60.27 (88/146 * 100)

Smith-Waterman Algorithm
Advances in Applied Mathematics, 2:482-489 (1981)
Smith-Waterman algorithm
- Can be used for local alignment
- Memory intensive
- Common searching programs such as BLAST use a faster heuristic approximation of the SW algorithm

Smith-Waterman (cont. 1)
a. Initializes the edges of the matrix with zeros.
b. Searches for sequence matches.
c. Assigns a score to each pair of amino acids
   - uses similarity scores
   - uses positive scores for related residues
   - uses negative scores for substitutions and gaps
d. Scores are summed for placement into Mi,j. If any sum is below 0, a 0 is placed into Mi,j.
e. Backtracing begins at the maximum value found anywhere in the matrix.
f. Backtrace continues until it meets an Mi,j value of 0.

Smith-Waterman (cont. 2)

          H   E   A   G   A   W   G   H   E   E
      0   0   0   0   0   0   0   0   0   0   0
  P   0   0   0   0   0   0   0   0   0   0   0
  A   0   0   0   5   0   5   0   0   0   0   0
  W   0   0   0   0   3   0  20  12   4   0   0
  H   0  10   2   0   0   1  12  18  22  14   6
  E   0   2  16   8   0   0   4  10  18  28  20
  A   0   0   8  21  13   5   0   4  10  20  27
  E   0   0   6  13  19  12   4   0   4  16  26

Put zeros on the top row and left column. Assign initial scores based on a scoring matrix. Calculate new scores based on adjacent cell scores. If the sum is less than or equal to zero, begin new scoring with the next cell.
This example uses the BLOSUM45 Scoring Matrix with a gap penalty of -8.

Smith-Waterman (cont. 3)
(Same matrix as in cont. 2.)

AWGHE
|| ||
AW-HE
Score = 28

Begin the backtrace at the maximum value found anywhere in the matrix. Continue the backtrace until the score falls to zero.

Calculation of similarity score and percent similarity

  A    W    G    H    E
  A    W    -    H    E
  5   15   -8   10    6     (BLOSUM45 scores; -8 is the gap penalty (novel))

% similarity = (number of positive scores / number of amino acids in region) x 100
% similarity = 4/5 x 100 = 80%
Similarity score = 28

Why search sequence databases?
1. I have just sequenced something. What is known about the thing I sequenced?
2. I have a unique sequence. Does it have similarity to another gene of known function?
3. I found a new protein sequence in a lower organism. Is it similar to a protein from another species?

Perfect searches for similar sequences in a database
- First “hit” should be an exact match.
- Next “hits” should contain all of the genes that are related to your gene (homologs).
- Next “hits” should be similar but are not homologs.

How does one achieve the “perfect search”?
Consider the following:
- Scoring Matrices (PAM vs. BLOSUM)
- Local alignment algorithm
- Database
- Search Parameters
  - Expect Value: change threshold for score reporting
  - Translation: of DNA sequence into protein
  - Filtering: remove repeat sequences

Placozoa
Mobile element
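
The Smith-Waterman steps (a-f) and the percent-similarity calculation from the earlier slides can be sketched in Python as below. This is a minimal illustration under stated assumptions, not the software that produced the slide output: only the identity scores for A, W, H and E (5, 15, 10, 6) and the gap penalty of -8 come from the slides. The flat mismatch score of -3 and the default identity score of 5 are placeholders standing in for the real BLOSUM45 off-diagonal values, so intermediate matrix cells will not match the slide matrix. The function names are hypothetical.

```python
# Identity scores quoted on the slide for the aligned residues.
IDENTITY_SCORES = {"A": 5, "W": 15, "H": 10, "E": 6}
ASSUMED_IDENTITY = 5    # assumption: score for identical residues not listed above
ASSUMED_MISMATCH = -3   # assumption: flat stand-in for real BLOSUM45 mismatch scores
GAP = -8                # gap penalty used on the slide


def score(x, y):
    """Return the substitution score for residues x and y."""
    if x == y:
        return IDENTITY_SCORES.get(x, ASSUMED_IDENTITY)
    return ASSUMED_MISMATCH


def smith_waterman(a, b):
    """Local alignment of a and b; returns (aligned_a, aligned_b, best_score)."""
    rows, cols = len(a) + 1, len(b) + 1
    # Step a: the first row and column stay at zero.
    m = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            # Steps c-d: sum scores from adjacent cells; negative sums become 0.
            m[i][j] = max(
                0,
                m[i - 1][j - 1] + score(a[i - 1], b[j - 1]),
                m[i - 1][j] + GAP,
                m[i][j - 1] + GAP,
            )
            if m[i][j] > best:
                best, best_pos = m[i][j], (i, j)
    # Steps e-f: backtrace from the maximum cell until a zero is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and m[i][j] > 0:
        if m[i][j] == m[i - 1][j - 1] + score(a[i - 1], b[j - 1]):
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif m[i][j] == m[i - 1][j] + GAP:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom)), best


def percent_similarity(top, bottom):
    """Positive-scoring columns divided by alignment length, times 100."""
    positive = sum(
        1 for x, y in zip(top, bottom)
        if "-" not in (x, y) and score(x, y) > 0
    )
    return 100.0 * positive / len(top)


aligned_a, aligned_b, best_score = smith_waterman("HEAGAWGHEE", "PAWHEAE")
print(aligned_a, aligned_b, best_score, percent_similarity(aligned_a, aligned_b))
```

With these placeholder scores the sketch still recovers the slide's local alignment (AWGHE over AW-HE, score 28, 80% similarity), because that path uses only the identity scores and the gap penalty given on the slides. Swapping `score` for a lookup into a full BLOSUM table would be needed to reproduce the intermediate cell values.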