Multiple Sequence Alignment

Pairwise sequence alignment is the most fundamental operation of bioinformatics
• It is used to decide if two proteins (or genes) are related structurally or functionally
• It is used to identify domains or motifs that are shared between proteins
• It is the basis of BLAST searching
• It is used in the analysis of genomes

Pairwise alignment: protein sequences can be more informative than DNA
• protein is more informative (20 vs 4 characters); many amino acids share related biophysical properties
• codons are degenerate: changes in the third position often do not alter the amino acid that is specified
• protein sequences offer a longer “look-back” time
• DNA sequences can be translated into protein, and then used in pairwise alignments
• DNA can be translated into six potential proteins

5’ CAT CAA
5’ ATC AAC
5’ TCA ACT
5’ CATCAACTACAACTCCAAAGACACCCTTACACATCAACAAACCTACCCAC 3’
3’ GTAGTTGATGTTGAGGTTTCTGTGGGAATGTGTAGTTGTTTGGATGGGTG 5’
5’ GTG GGT
5’ TGG GTA
5’ GGG TAG

Definitions
Homology: Similarity attributed to descent from a common ancestor.
Identity: The extent to which two (nucleotide or amino acid) sequences are invariant.

RBP:        26 RVKENFDKARFSGTWYAMAKKDPEGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWD- 84
glycodelin: 23 QTKQDLELPKLAGTWHSMAMA-TNNISLMATLKAPLRVHITSLLPTPEDNLEIVLHRWEN 81

Definitions: two types of homology
Orthologs: Homologous sequences in different species that arose from a common ancestral gene during speciation; may or may not be responsible for a similar function.
Paralogs: Homologous sequences within a single species that arose by gene duplication.

Orthologs: members of a gene (protein) family in various organisms. This tree shows RBP orthologs.
Tree labels: common carp, zebrafish, rainbow trout, teleost,
African clawed frog, chicken, human, mouse, rat, horse, pig, cow, rabbit
(scale bar: 10 changes)

Paralogs: members of a gene (protein) family within a species
Tree labels: apolipoprotein D, retinol-binding protein 4, complement component 8, alpha-1 microglobulin/bikunin, prostaglandin D2 synthase, progestagen-associated endometrial protein, odorant-binding protein 2A, neutrophil gelatinase-associated lipocalin, lipocalin 1
(scale bar: 10 changes)

Pairwise alignment of retinol-binding protein and β-lactoglobulin
retinol-binding protein (NP_006735), β-lactoglobulin (P02754)

  1 MKWVWALLLLAAWAAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDPEG  50 RBP
  1 ...MKCLLLALALTCGAQALIVT..QTMKGLDIQKVAGTWYSLAMAASD.  44 lactoglobulin

 51 LFLQDNIVAEFSVDETGQMSATAKGRVR.LLNNWD..VCADMVGTFTDTE  97 RBP
 45 ISLLDAQSAPLRV.YVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTK  93 lactoglobulin

 98 DPAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAV...........QYSC 136 RBP
 94 IPAVFKIDALNENKVL........VLDTDYKKYLLFCMENSAEPEQSLAC 135 lactoglobulin

137 RLLNLDGTCADSYSFVFSRDPNGLPPEAQKIVRQRQ.EELCLARQYRLIV 185 RBP
136 QCLVRTPEVDDEALEKFDKALKALPMHIRLSFNPTQLEEQCHI....... 178 lactoglobulin

Definitions
Similarity: The extent to which nucleotide or protein sequences are related. It is based upon identity plus conservation.
Identity: The extent to which two sequences are invariant.
Conservation: Changes at a specific position of an amino acid (or, less commonly, DNA) sequence that preserve the physico-chemical properties of the original residue.
Match-line symbols in the RBP / β-lactoglobulin alignment figure: identity (bar), somewhat similar (one dot), very similar (two dots).
Gaps in the RBP / β-lactoglobulin alignment figure: internal gap, terminal gap.

Multiple sequence alignment of glyceraldehyde 3-phosphate dehydrogenase ‘orthologs’

fly        GAKKVIISAP SAD.APM..F VCGVNLDAYK PDMKVVSNAS CTTNCLAPLA
human      GAKRVIISAP SAD.APM..F VMGVNHEKYD NSLKIISNAS CTTNCLAPLA
plant      GAKKVIISAP SAD.APM..F VVGVNEHTYQ PNMDIVSNAS CTTNCLAPLA
bacterium  GAKKVVMTGP SKDNTPM..F VKGANFDKY. AGQDIVSNAS CTTNCLAPLA
yeast      GAKKVVITAP SS.TAPM..F VMGVNEEKYT SDLKIVSNAS CTTNCLAPLA
archaeon   GADKVLISAP PKGDEPVKQL VYGVNHDEYD GE.DVVSNAS CTTNSITPVA

fly        KVINDNFEIV EGLMTTVHAT TATQKTVDGP SGKLWRDGRG AAQNIIPAST
human      KVIHDNFGIV EGLMTTVHAI TATQKTVDGP SGKLWRDGRG ALQNIIPAST
plant      KVVHEEFGIL EGLMTTVHAT TATQKTVDGP SMKDWRGGRG ASQNIIPSST
bacterium  KVINDNFGII EGLMTTVHAT TATQKTVDGP SHKDWRGGRG ASQNIIPSST
yeast      KVINDAFGIE EGLMTTVHSL TATQKTVDGP SHKDWRGGRT ASGNIIPSST
archaeon   KVLDEEFGIN AGQLTTVHAY TGSQNLMDGP NGKP.RRRRA AAENIIPTST

fly        GAAKAVGKVI PALNGKLTGM AFRVPTPNVS VVDLTVRLGK GASYDEIKAK
human      GAAKAVGKVI PELNGKLTGM AFRVPTANVS VVDLTCRLEK PAKYDDIKKV
plant      GAAKAVGKVL PELNGKLTGM AFRVPTSNVS VVDLTCRLEK GASYEDVKAA
bacterium  GAAKAVGKVL PELNGKLTGM AFRVPTPNVS VVDLTVRLEK AATYEQIKAA
yeast      GAAKAVGKVL PELQGKLTGM AFRVPTVDVS VVDLTVKLNK ETTYDEIKKV
archaeon   GAAQAATEVL PELEGKLDGM AIRVPVPNGS ITEFVVDLDD DVTESDVNAA

Multiple sequence alignment of human lipocalin ‘paralogs’

~~~~~EIQDVSGTWYAMTVDREFPEMNLESVTPMTLTTL.GGNLEAKVTM lipocalin 1
LSFTLEEEDITGTWYAMVVDKDFPEDRRRKVSPVKVTALGGGNLEATFTF odorant-binding protein 2a
TKQDLELPKLAGTWHSMAMATNNISLMATLKAPLRVHITSEDNLEIVLHR progestagen-assoc. endo.
VQENFDVNKYLGRWYEIEKIPTTFENGRCIQANYSLMENGNQELRADGTV apolipoprotein D
VKENFDKARFSGTWYAMAKDPEGLFLQDNIVAEFSVDETGNWDVCADGTF retinol-binding protein
LQQNFQDNQFQGKWYVVGLAGNAI.LREDKDPQKMYATIDKSYNVTSVLF neutrophil gelatinase-ass.
VQPNFQQDKFLGRWFSAGLASNSSWLREKKAALSMCKSVDGGLNLTSTFL prostaglandin D2 synthase
VQENFNISRIYGKWYNLAIGSTCPWMDRMTVSTLVLGEGEAEISMTSTRW alpha-1-microglobulin
PKANFDAQQFAGTWLLVAVGSACRFLQRAEATTLHVAPQGSTFRKLD... complement component 8

Calculation of an alignment score

PAM matrices: Point-accepted mutations
PAM matrices are based on global alignments of closely related proteins.
The PAM1 is the matrix calculated from comparisons of sequences with no more than 1% divergence.
Other PAM matrices are extrapolated from PAM1.
All the PAM data come from closely related proteins (>85% amino acid identity).

Comparing two proteins with a PAM1 matrix gives completely different results than PAM250!
Consider two distantly related proteins. A PAM40 matrix is not forgiving of mismatches, and penalizes them severely. Using this matrix you can find almost no match.

hsrbp,  136 CRLLNLDGTC
btlact,   3 CLLLALALTC
            * ** *  **

A PAM250 matrix is very tolerant of mismatches.
24.7% identity in 81 residues overlap; Score: 77.0; Gap frequency: 3.7%

hsrbp,  26 RVKENFDKARFSGTWYAMAKKDPEGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWDV
btlact, 21 QTMKGLDIQKVAGTWYSLAMAASD-ISLLDAQSAPLRVYVEELKPTPEGDLEILLQKWEN
                 *     ****  *        * *   *   *               **  *

hsrbp,  86 --CADMVGTFTDTEDPAKFKM
btlact, 80 GECAQKKIIAEKTKIPAVFKI
             **        *  ** **

BLOSUM Matrices
BLOSUM matrices are based on local alignments.
BLOSUM stands for blocks substitution matrix.
BLOSUM62 is a matrix calculated from comparisons of sequences that share no more than 62% identity.
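The identity and gap figures quoted for the PAM40 and PAM250 examples can be checked by counting columns. The following is a minimal sketch, not the program that produced the slide output: the function name `alignment_stats` is hypothetical, and it assumes that percent identity and gap frequency are both taken over the full alignment length (81 columns here), which is what the reported numbers suggest.

```python
def alignment_stats(row_a, row_b, gap_chars="-."):
    """Return (identical columns, total columns, gapped columns) for two aligned rows."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    identical = 0
    gapped = 0
    for a, b in zip(row_a, row_b):
        if a in gap_chars or b in gap_chars:
            gapped += 1          # a gap on either side is never an identity
        elif a == b:
            identical += 1
    return identical, len(row_a), gapped


# Short window from the PAM40 example (hsrbp from 136, btlact from 3).
short = alignment_stats("CRLLNLDGTC", "CLLLALALTC")

# The two rows of the PAM250 example, with both blocks joined end to end.
hsrbp = ("RVKENFDKARFSGTWYAMAKKDPEGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWDV"
         "--CADMVGTFTDTEDPAKFKM")
btlact = ("QTMKGLDIQKVAGTWYSLAMAASD-ISLLDAQSAPLRVYVEELKPTPEGDLEILLQKWEN"
          "GECAQKKIIAEKTKIPAVFKI")
ident, cols, gaps = alignment_stats(hsrbp, btlact)
pct_identity = round(100.0 * ident / cols, 1)
gap_frequency = round(100.0 * gaps / cols, 1)
print(short, ident, cols, gaps, pct_identity, gap_frequency)
```

Counting this way gives 20 identical columns and 3 gapped columns out of 81, which agrees with the 24.7% identity and 3.7% gap frequency reported on the slide. A score such as 77.0 additionally needs the substitution matrix and gap penalties, which are not reproduced here.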
[Figure panels: rat versus mouse RBP; rat versus bacterial lipocalin]

PAM matrices reflect different degrees of divergence (PAM250)

Ancestral sequence: ACCCTAC

Position  Ancestral  Sequence 1     Sequence 2     Type of change
1         A          A              A              no change
2         C          C              C --> A        single substitution
3         C          C              C --> A --> T  multiple substitutions
4         C          C --> G        C --> A        coincidental substitutions
5         T          T --> A        T --> A        parallel substitutions
6         A          A --> C --> T  A --> T        convergent substitutions
7         C          C              C --> T --> C  back substitution

Li (1997) p.70

                                  homologous sequences   non-homologous sequences
Sequences reported as related     True positives         False positives
Sequences reported as unrelated   False negatives        True negatives

Sensitivity: ability to find true positives
Specificity: ability to minimize false positives

Outline
-Why Do We Need Multiple Sequence Alignment?
-The Progressive Alignment Algorithm
-A Possible Strategy…
-Potential Difficulties

Pre-requisite
-How Do Sequences Evolve?
-How Can We COMPARE Sequences?
-How Can We ALIGN Sequences?

What Is A Multiple Sequence Alignment?

chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP

chite AATAKQNYIRALQEYERNGG
wheat ANKLKGEYNKAIAAYNKGESA
trybr AEKDKERYKREM--------
mouse AKDDRIRYDNEMKSWEEQMAE

Structural Criteria: Residues are arranged so that those playing a similar role end up in the same column.
Evolution Criteria: Residues are arranged so that those having the same ancestor end up in the same column.
Phylogenetic Relation
Functional Relation

How Can I Use A Multiple Sequence Alignment?
-Extrapolation: an unknown sequence (shown in place of the mouse row) is compared with SwissProt sequences. Less than 30% identity, BUT conserved where it MATTERS. Extrapolation beyond the twilight zone: homology?
-Prosite Patterns: P-K-R-[PA]-x(1)-[ST]…
-Profiles and HMMs: more sensitive, more specific
-Phylogeny: evolution, paralogy/orthology
-Structure Prediction: column constraint, evolution constraint, structure constraint. PsiPred or PhD for secondary structure prediction: 75% accurate. Threading is improving but is not yet as good.

Automatic multiple sequence alignment methods are not always perfect… You know better… with your big BRAIN.

Why Is It Difficult To Compute A Multiple Sequence Alignment?
A CROSSROAD PROBLEM
BIOLOGY: What is A Good Alignment?
COMPUTATION: What is THE Good Alignment?

chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP

The Biological Problem: How to Evaluate an Alignment
-A nice set of Sequences
-Substitution Matrix (Blosum)
-Gap Penalties
-An Evaluation Function

Sums of Pairs
[Figure: a column containing A, A, A, C, C. Sums of Pairs: Cost=6]
-Over-estimation of the Substitutions
-Easy to compute

The COMPUTATIONAL Problem: Producing the Alignment
-A nice set of Sequences
-Substitution Matrix (Blosum)
-Gap Penalties
-An Evaluation Function
-An Alignment Algorithm

Will It Work? GLOBAL Alignment

HOW CAN I ALIGN MANY SEQUENCES
2 Globins => 1 minute
3 Globins => 2 hours
4 Globins => 10 days
5 Globins => 3 years
6 Globins => 300 years
7 Globins => 30,000 years
8 Globins => 3 million years

The Progressive Multiple Alignment Algorithm (Clustal W)

Making An Alignment
Any Exact Method would be TOO SLOW.
We will use a Heuristic Algorithm.
Progressive Alignment Algorithm is the most Popular
-ClustalW
-Greedy Heuristic (No Guarantee)
-Fast

Progressive Alignment (Feng and Doolittle, 1987)
Clustering
Progressive Alignment
Dynamic Programming Using A Substitution Matrix

Progressive Alignment
-Depends on the CHOICE of the sequences.
-Depends on the ORDER of the sequences (Tree).
-Depends on the PARAMETERS:
•Substitution Matrix
•Penalties (Gop, Gep)
•Sequence Weight
•Tree making Algorithm

Progressive Alignment: When Does It Work?
Works Well When Phylogeny is Dense
No outlier Sequence.
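The sums-of-pairs evaluation function named on the evaluation slide can be sketched in a few lines. This is an illustration under stated assumptions, not the scoring code of any particular aligner: the function names are hypothetical, every differing pair costs 1 (a real scorer would look each pair up in a substitution matrix such as Blosum), and a gap character is treated like any other symbol.

```python
from itertools import combinations


def sum_of_pairs_column(column, mismatch_cost=1):
    """Cost of one column: every unordered pair of rows that differs adds mismatch_cost."""
    return sum(mismatch_cost for a, b in combinations(column, 2) if a != b)


def sum_of_pairs(rows, mismatch_cost=1):
    """Total cost of an alignment given as equal-length strings, summed column by column."""
    if len({len(row) for row in rows}) > 1:
        raise ValueError("all rows must have the same length")
    return sum(sum_of_pairs_column(col, mismatch_cost) for col in zip(*rows))


# The column from the slide: three A's and two C's give 3 x 2 differing pairs.
column_cost = sum_of_pairs_column("AAACC")
print(column_cost)
```

Under these assumptions the slide's column costs 6, matching "Cost=6". One plausible reading of "over-estimation of the substitutions" is that a single A to C change on the underlying tree would be counted six times here, once for every A/C pair.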
Progressive Alignment: When Doesn’t It Work?

Input sequences
SeqA GARFIELD THE LAST FAT CAT
SeqB GARFIELD THE FAST CAT
SeqC GARFIELD THE VERY FAST CAT
SeqD THE FAT CAT

First pairwise step (SeqA with SeqB)
SeqA GARFIELD THE LAST FAT CAT
SeqB GARFIELD THE FAST CAT ---

CLUSTALW (Score=20, Gop=-1, Gep=0, M=1)
SeqA GARFIELD THE LAST FA-T CAT
SeqB GARFIELD THE FAST CA-T ---
SeqC GARFIELD THE VERY FAST CAT
SeqD -------- THE ---- FA-T CAT

CORRECT (Score=24)
SeqA GARFIELD THE LAST FA-T CAT
SeqB GARFIELD THE ---- FAST CAT
SeqC GARFIELD THE VERY FAST CAT
SeqD -------- THE ---- FA-T CAT

Building the Right Multiple Sequence Alignment
Recognizing The Right Sequences When You Meet Them…

Gathering Sequences: BLAST

Common Mistake: Sequences Too Closely Related

PRVA_MACFU SMTDLLNAEDIKKAVGAFSAIDSFDHKKFFQMVGLKKKSADDVKKVFHILDKDKSGFIEE
PRVA_HUMAN SMTDLLNAEDIKKAVGAFSATDSFDHKKFFQMVGLKKKSADDVKKVFHMLDKDKSGFIEE
PRVA_GERSP SMTDLLSAEDIKKAIGAFAAADSFDHKKFFQMVGLKKKTPDDVKKVFHILDKDKSGFIEE
PRVA_MOUSE SMTDVLSAEDIKKAIGAFAAADSFDHKKFFQMVGLKKKNPDEVKKVFHILDKDKSGFIEE
PRVA_RAT   SMTDLLSAEDIKKAIGAFTAADSFDHKKFFQMVGLKKKSADDVKKVFHILDKDKSGFIEE
PRVA_RABIT AMTELLNAEDIKKAIGAFAAAESFDHKKFFQMVGLKKKSTEDVKKVFHILDKDKSGFIEE
           :**::*.*******:***:* :****************..::******:***********

PRVA_MACFU DELGFILKGFSPDARDLSAKETKTLMAAGDKDGDGKIGVDEFSTLVAES
PRVA_HUMAN DELGFILKGFSPDARDLSAKETKMLMAAGDKDGDGKIGVDEFSTLVAES
PRVA_GERSP DELGFILKGFSSDARDLSAKETKTLLAAGDKDGDGKIGVEEFSTLVSES
PRVA_MOUSE DELGSILKGFSSDARDLSAKETKTLLAAGDKDGDGKIGVEEFSTLVAES
PRVA_RAT   DELGSILKGFSSDARDLSAKETKTLMAAGDKDGDGKIGVEEFSTLVAES
PRVA_RABIT EELGFILKGFSPDARDLSVKETKTLMAAGDKDGDGKIGADEFSTLVSES
           :*** ******.******.**** *:************.:******:**

-IDENTICAL SEQUENCES BRING NO INFORMATION FOR THE MULTIPLE SEQUENCE ALIGNMENT
-MULTIPLE SEQUENCE ALIGNMENTS THRIVE ON DIVERSITY…

Selecting Diverse Sequences (Opus II)
Respect Information!
PRVA_MACFU ------------------------------------------SMTDLLN----AEDIKKA
PRVA_HUMAN ------------------------------------------SMTDLLN----AEDIKKA
PRVA_GERSP ------------------------------------------SMTDLLS----AEDIKKA
PRVA_MOUSE ------------------------------------------SMTDVLS----AEDIKKA
PRVA_RAT   ------------------------------------------SMTDLLS----AEDIKKA
PRVA_RABIT ------------------------------------------AMTELLN----AEDIKKA
TPCC_MOUSE MDDIYKAAVEQLTEEQKNEFKAAFDIFVLGAEDGCISTKELGKVMRMLGQNPTPEELQEM

PRVA_MACFU VGAFSAIDS--FDHKKFFQMVG------LKKKSADDVKKVFHILDKDKSGFIEEDELGFI
PRVA_HUMAN VGAFSATDS--FDHKKFFQMVG------LKKKSADDVKKVFHMLDKDKSGFIEEDELGFI
PRVA_GERSP IGAFAAADS--FDHKKFFQMVG------LKKKTPDDVKKVFHILDKDKSGFIEEDELGFI
PRVA_MOUSE IGAFAAADS--FDHKKFFQMVG------LKKKNPDEVKKVFHILDKDKSGFIEEDELGSI
PRVA_RAT   IGAFTAADS--FDHKKFFQMVG------LKKKSADDVKKVFHILDKDKSGFIEEDELGSI
PRVA_RABIT IGAFAAAES--FDHKKFFQMVG------LKKKSTEDVKKVFHILDKDKSGFIEEEELGFI
TPCC_MOUSE IDEVDEDGSGTVDFDEFLVMMVRCMKDDSKGKSEEELSDLFRMFDKNADGYIDLDELKMM

This alignment is not informative about the relation between TPCC_MOUSE and the rest of the sequences.
-A better Spread of the Sequences is needed

Selecting Diverse Sequences (Opus II)

PRVB_CYPCA -AFAGVLNDADIAAALEACKAADSFNHKAFFAKVGLTSKSADDVKKAFAIIDQDKSGFIE
PRVB_BOACO -AFAGILSDADIAAGLQSCQAADSFSCKTFFAKSGLHSKSKDQLTKVFGVIDRDKSGYIE
PRV1_SALSA MACAHLCKEADIKTALEACKAADTFSFKTFFHTIGFASKSADDVKKAFKVIDQDASGFIE
PRVB_LATCH -AVAKLLAAADVTAALEGCKADDSFNHKVFFQKTGLAKKSNEELEAIFKILDQDKSGFIE
PRVB_RANES -SITDIVSEKDIDAALESVKAAGSFNYKIFFQKVGLAGKSAADAKKVFEILDRDKSGFIE
PRVA_MACFU -SMTDLLNAEDIKKAVGAFSAIDSFDHKKFFQMVGLKKKSADDVKKVFHILDKDKSGFIE
PRVA_ESOLU --AKDLLKADDIKKALDAVKAEGSFNHKKFFALVGLKAMSANDVKKVFKAIDADASGFIE
PRVB_CYPCA EDELKLFLQNFKADARALTDGETKTFLKAGDSDGDGKIGVDEFTALVKA-
PRVB_BOACO EDELKKFLQNFDGKARDLTDKETAEFLKEGDTDGDGKIGVEEFVVLVTKG
PRV1_SALSA VEELKLFLQNFCPKARELTDAETKAFLKAGDADGDGMIGIDEFAVLVKQ-
PRVB_LATCH DEELELFLQNFSAGARTLTKTETETFLKAGDSDGDGKIGVDEFQKLVKA-
PRVB_RANES QDELGLFLQNFRASARVLSDAETSAFLKAGDSDGDGKIGVEEFQALVKA-
PRVA_MACFU EDELGFILKGFSPDARDLSAKETKTLMAAGDKDGDGKIGVDEFSTLVAES
PRVA_ESOLU EEELKFVLKSFAADGRDLTDAETKAFLKAADKDGDGKIGIDEFETLVHEA

-A REASONABLE Model Now Exists.
-Going Further: Remote Homologues.

Aligning Remote Homologues

PRVA_MACFU ------------------------------------------SMTDLLNA----EDIKKA
PRVA_ESOLU -------------------------------------------AKDLLKA----DDIKKA
PRVB_CYPCA ------------------------------------------AFAGVLND----ADIAAA
PRVB_BOACO ------------------------------------------AFAGILSD----ADIAAG
PRV1_SALSA -----------------------------------------MACAHLCKE----ADIKTA
PRVB_LATCH ------------------------------------------AVAKLLAA----ADVTAA
PRVB_RANES ------------------------------------------SITDIVSE----KDIDAA
TPCS_RABIT -TDQQAEARSYLSEEMIAEFKAAFDMFDADGG-GDISVKELGTVMRMLGQTPTKEELDAI
TPCS_PIG   -TDQQAEARSYLSEEMIAEFKAAFDMFDADGG-GDISVKELGTVMRMLGQTPTKEELDAI
TPCC_MOUSE MDDIYKAAVEQLTEEQKNEFKAAFDIFVLGAEDGCISTKELGKVMRMLGQNPTPEELQEM

PRVA_MACFU VGAFSAIDS--FDHKKFFQMVG------LKKKSADDVKKVFHILDKDKSGFIEEDELGFI
PRVA_ESOLU LDAVKAEGS--FNHKKFFALVG------LKAMSANDVKKVFKAIDADASGFIEEEELKFV
PRVB_CYPCA LEACKAADS--FNHKAFFAKVG------LTSKSADDVKKAFAIIDQDKSGFIEEDELKLF
PRVB_BOACO LQSCQAADS--FSCKTFFAKSG------LHSKSKDQLTKVFGVIDRDKSGYIEEDELKKF
PRV1_SALSA LEACKAADT--FSFKTFFHTIG------FASKSADDVKKAFKVIDQDASGFIEVEELKLF
PRVB_LATCH LEGCKADDS--FNHKVFFQKTG------LAKKSNEELEAIFKILDQDKSGFIEDEELELF
PRVB_RANES LESVKAAGS--FNYKIFFQKVG------LAGKSAADAKKVFEILDRDKSGFIEQDELGLF
TPCS_RABIT IEEVDEDGSGTIDFEEFLVMMVRQMKEDAKGKSEEELAECFRIFDRNADGYIDAEELAEI
TPCS_PIG   IEEVDEDGSGTIDFEEFLVMMVRQMKEDAKGKSEEELAECFRIFDRNMDGYIDAEELAEI
TPCC_MOUSE IDEVDEDGSGTVDFDEFLVMMVRCMKDDSKGKSEEELSDLFRMFDKNADGYIDLDELKMM

PRVA_MACFU LKGFSPDARDLSAKETKTLMAAGDKDGDGKIGVDEFSTLVAES-
PRVA_ESOLU LKSFAADGRDLTDAETKAFLKAADKDGDGKIGIDEFETLVHEA-
PRVB_CYPCA LQNFKADARALTDGETKTFLKAGDSDGDGKIGVDEFTALVKA--
PRVB_BOACO LQNFDGKARDLTDKETAEFLKEGDTDGDGKIGVEEFVVLVTKG-
PRV1_SALSA LQNFCPKARELTDAETKAFLKAGDADGDGMIGIDEFAVLVKQ--
PRVB_LATCH LQNFSAGARTLTKTETETFLKAGDSDGDGKIGVDEFQKLVKA--
PRVB_RANES LQNFRASARVLSDAETSAFLKAGDSDGDGKIGVEEFQALVKA--
TPCS_RABIT FR---ASGEHVTDEEIESLMKDGDKNNDGRIDFDEFLKMMEGVQ
TPCS_PIG   FR---ASGEHVTDEEIESIMKDGDKNNDGRIDFDEFLKMMEGVQ
TPCC_MOUSE LQ---ATGETITEDDIEELMKDGDKNNDGRIDYDEFLEFMKGVE

Some Guidelines…
Do Not Use Too Many Sequences…

Reading Your Alignment
WHAT MAKES A GOOD ALIGNMENT…
-THE MORE DIVERGENT THE SEQUENCES, THE BETTER
-THE FEWER INDELS, THE BETTER
-NICE UNGAPPED BLOCKS SEPARATED WITH INDELS
-DIFFERENT CLASSES OF RESIDUES WITHIN A BLOCK:
•Completely Conserved
•Conserved For Size and Hydropathy
•Conserved For Size or Hydropathy
-THE ULTIMATE EVALUATION IS A MATTER OF PERSONAL JUDGEMENT AND KNOWLEDGE.

Potential Difficulties

DO NOT OVERTUNE!!!

chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP

DO NOT PLAY WITH PARAMETERS
IF YOU KNOW THE ALIGNMENT YOU WANT: MAKE IT YOURSELF!

chite ---ADKPKRPL-SAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat --DPNKPKRAP-SAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
mouse -----KPKRPR-SAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP
chite AATAKQNYIRALQEYERNGG
wheat ANKLKGEYNKAIAAYNKGESA
trybr AEKDKERYKREM--------
mouse AKDDRIRYDNEMKSWEEQMAE

TUNING or NOT TUNING!!!
-PARAMETERS TO TUNE USUALLY INCLUDE:
•GOP/GEP
•MATRIX
•SENSITIVITY vs SPEED

Substitution Matrices (Etzold et al. 1993)
[Figure axes: GOP, GEP]
Gonnet    61.7 %
Blosum50  59.7 %
Pam250    59.2 %

-MOST METHODS ARE TUNED FOR WORKING WELL ON AVERAGE
-PARAMETER BEHAVIOUR DOES NOT NECESSARILY FOLLOW THE THEORY (e.g. Substitution Matrices).
-A GOOD ALIGNMENT IS USUALLY ROBUST (i.e. Changes little).
-TUNE IF YOU WANT TO CONVINCE YOURSELF.

KEEP A BIOLOGICAL PERSPECTIVE

chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP

DIFFERENT PARAMETERS

chite AD--K----PKR-PLYMLWLNS-ARESIKRENPDFK-VT-EVAKKGGELWRGL
wheat -DPNK----PKRAP-FFVFMGE-FREEFKQKNPKNKSVA-AVGKAAGERWKSLS
trybr -K--KDSNAPKR-AMT-MFFSSDFR-S-KH-S-DLS-IV-EMSKAAGAAWKELG
mouse ----K----PKR-PRYNIYVSESFQEA-K--D-D-S-AQGKL-KLVNEAWKNLS

WRONG ALIGNMENT !!!

REPEATS
THERE IS A PROBLEM WHEN TWO SEQUENCES DO NOT CONTAIN THE SAME NUMBER OF REPEATS.
IT IS THEN BETTER TO MANUALLY EXTRACT THE REPEATS AND TO ALIGN THEM.
INDIVIDUAL REPEATS CAN BE RECOGNIZED USING DOTTER.

Naming Your Sequences The Right Way

Choosing the Right Method

Simultaneous Alignments: MSA
1) Set bounds on each pair of sequences (Carrillo and Lipman)
2) Compute the multiple alignment within the hyperspace
-Few Small Closely Related Sequences
-Memory and CPU hungry
-Do Well When They Can Run

Dialign II
1) Identify the best chain of segments on each pair of sequences. Assign a P-value to each segment pair.
2) Re-evaluate each segment pair according to its consistency with the others.
3) Assemble the alignment according to the segment pairs.
Dialign II
-May Align Too Few Residues
-No Gap Penalty
-Does well with ESTs

Iterative Methods
-HMMs, HMMER, SAM
-Slow, Sometimes Inaccurate
-Good Profile Generators

Mixing Local and Global Alignments
[Figure labels: Local Alignment, Global Alignment, Extension, Multiple Sequence Alignment]

What Is BaliBase?
Source: BaliBase, Thompson et al, NAR, 1999

Which Method?

PROBLEM (BaliBase description)    Strategy
Even Phylogenetic Spread          ClustalW, T-Coffee, MSA, DCA
One Outlier Sequence              T-Coffee
Two Distantly Related Groups      PrrP, T-Coffee
Long Internal Indel               Dialign, T-Coffee
Long Terminal Indel               Dialign, T-Coffee

Strategy: Methods / Situations
1-Carrillo and Lipman:
-MSA, DCA
-Few Small Closely Related Sequences
-Do Well When They Can Run
2-Segment Based:
-DIALIGN, MACAW
-May Align Too Few Residues
-Good For Long Indels
3-Iterative:
-HMMs, HMMER, SAM
-Slow, Sometimes Inaccurate
-Good Profile Generators
4-Progressive:
-ClustalW, Pileup, Multalign…
-Fast and Sensitive

Conclusion: Multiple Alignment
-The BEST Alignment Method: Your Brain and The Right Data
-The Best Evaluation Procedure: Experimental Data (SwissProt)
-Choosing The Sequences Well is Important
-Beware of repeated elements

Editing Multiple Alignments
There are a variety of tools that can be used to modify a multiple alignment. These programs can be very useful in formatting and annotating an alignment for publication. An editor can also be used to make modifications by hand to improve biologically significant regions in a multiple alignment created by one of the automated alignment programs.
BioEdit

Editors on the Web
Check out CINEMA (Colour INteractive Editor for Multiple Alignments)
-It is an editor created completely in JAVA (old browsers beware)
-It includes a fully functional version of CLUSTAL, BLAST, and a DotPlot module
http://www.biochem.ucl.ac.uk/bsm/dbbrowser/CINEMA2.1

Addresses: Some URLs
EMBL-EBI
http://www.ebi.ac.uk/clustalw/
BCM Search Launcher: Multiple Alignment
http://dot.imgen.bcm.tmc.edu:9331/multi-align/multi-align.html
Multiple Sequence Alignment for Proteins (Wash. U. St. Louis)
http://www.ibc.wustl.edu/service/msa/