Multiple Sequence Alignment
Carlow IT Bioinformatics
November 2006

MSA
• One of the central techniques in bioinformatics:
  – homology searching
  – multiple sequence alignment
  – phylogenetic trees

An example
“All you have to do” is rewrite your sequences so that similar features finish up in the same columns.

Evolutionary relationship
• “Similar features” ideally means homologous
  – with a shared ancestor
• ClustalW and T-Coffee mimic the process of evolution
  – by weighting similar residues by how conserved they are in evolution
    • Important AAs don’t mutate
    • Less important AAs change easily, even randomly
  – by inserting judicious gaps

Criteria for alignment
• Amino acids in the same column have
  – Structural similarity (used by threading programs)
    • Practical exercise: inferring position of Bsu recA AAs
  – Evolutionary similarity: residues have a common ancestor
  – Functional similarity (active site, C-C bonds); you may have to hand edit known functions
  – Sequence similarity
• The first 3 (clear biological attributes) are, you hope, reflected by the last (an abstraction), which is what MSA programs use

Applications
• Discover conserved patterns/motifs
  – A step to describing a protein domain
  – MSA can add a distant relative to your protein family
• A step to defining DNA regulatory elements
• Prediction of secondary structure; also helps 3-D prediction
• A step to phylogenetic trees: to describe or show the process of evolution
• PCR analysis/primer design
  – find most and least degenerate regions of your sequence

So why is it difficult?
Where do you put the gap?

FGDERTHHS    FGDERTHHS    FGDERTHHS
FGD-D-HRS    FGD--DHRS    FGDD--HRS

A trivial 2-sequence alignment: 3 possibilities.
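The three gap placements above can be compared with a simple count. This is a minimal sketch, not what ClustalW or T-Coffee actually compute: the function name `describe` and the choice of counting identities, mismatches and gap runs are assumptions made for illustration only.

```python
import re


def describe(top: str, bottom: str) -> tuple[int, int, int]:
    """Return (identities, mismatches, gap_runs) for two aligned strings."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    pairs = list(zip(top, bottom))
    # A column is an identity when both residues match and neither is a gap.
    identities = sum(a == b and a != "-" for a, b in pairs)
    # A column is a mismatch when the residues differ and neither is a gap.
    mismatches = sum(a != b and "-" not in (a, b) for a, b in pairs)
    # A gap run is one or more consecutive '-' characters in either sequence.
    gap_runs = len(re.findall(r"-+", top)) + len(re.findall(r"-+", bottom))
    return identities, mismatches, gap_runs


reference = "FGDERTHHS"
candidates = ["FGD-D-HRS", "FGD--DHRS", "FGDD--HRS"]

for candidate in candidates:
    print(candidate, describe(reference, candidate))
```

All three candidates have the same number of identities and mismatches, so identity alone cannot choose between them. Counting gap runs separates the first (two separate gaps) from the other two (one longer gap); choosing between the last two then depends on whether D is better paired with T or with E, which is the kind of question a substitution matrix answers.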
As length and number of sequences increase, the number of possible alignments becomes astronomical.

Some data
• Cat ATGAAACGTCGGATCTAA
• Dog ATGAATCGACCCATCTAA
• Mus ATGGCGTGGCTTGGCATGTGA
• Rat ATGGCATGTCGTGGCATGTAG

Protocol step 1
• Align each pair of seqs: C-D, C-M, C-R etc.
• Get a score for each alignment
• And make a …

Similarity matrix

      Cat   Dog   Mus   Rat
Cat   ID
Dog   14    ID
Mus   10    10    ID
Rat   10    10    16    ID

• Score = number of identical residues
  – Which pair of sequences is most similar?

Progressive alignment
• Align the two most similar sequences, inserting any gaps.
• Mus/Rat: lock these sequences together (call it “RODent”)
• Return to similarity matrix to find next most similar seqs or sequence cluster
• Dog/Cat: align and lock (call it “CARnivore”)
  – if the next step requires a gap, then the gap is inserted in both carnivore sequences
• Align next most … (now it’s iterative)

An alignment

Cat   ATGAAACGTCGG---ATCTAA
Dog   ATGAATCGACCC---ATCTAA
Mus   ATGGCGTGGCTTGGCATGTGA
Rat   ATGGCATGTCGTGGCATGTAG
      ***    * *     ** *

• Good: always a two-“sequence” problem
  – So computationally possible
• Bad: can’t rewrite or decouple (part of) the dog/cat alignment in the light of later info. Locked in a (suboptimal?) trough.

More complex 10 seq example

Choosing the right seqs
• Use MSA to inform you!
• Always use AA/protein if possible
  – can copy gaps back to DNA later
• Start with 6-15 sequences
• Eliminate very different (<30% id) seqs
• Eliminate identical sequences
• Watch out for partial sequences
• …or sequences that need ++ gaps to align
• Check for repeats with dotlet, Lalign

Less is more
• Large alignments
  – take ++ CPU and time
  – are hard to do well
  – are difficult to display
  – are difficult to use: in trees for example
  – may include marginal seqs that wreck the whole alignment
• So start small and add/eliminate seqs until you have a clear, informative picture

Level of variation is important
• Choose the sequence family with the best rate of evolution for your taxonomic group
  – Histones evolve very slowly (compare kingdoms)
  – Transferrins are fast (compare classes, orders)
• Closely related sequences may have identical protein (but variable DNA)
• Distantly related sequences have no DNA signal (“saturated”)

ClustalW at embnet.ch.org
Paste in your FASTA sequences
Output choices

ClustalW at EBI
Paste in your (FASTA) sequences

EBI: loads of options

T-coffee
Minimal input parameters and STILL a better job than ClustalW

Output EBI clustalW
• Jalview alignment editor
• Pairwise distance etc
• Alignment
• Guide tree
• What you submitted

An alignment fragment

ACT_CANAL    -MDGEEVAALIIDNGSGMCKA
ACT_CANDU    -MDGEEVAALVIDNGSGMCKA
ACT_PICAN    -MDGEDVAALVIDNGSGMCKA
ACT_PICPA    -MDGEDVAALVIDNGSGMCKA
ACT_KLULA    -MDS-EVAALVIDNGSGMCKA
ACT_YEAST    -MDS-EVAALVIDNGSGMCKA
ACT_YARLI    -MED-ETVALVIDNGSGMCKA
ACT2_ABSGL   MSMEEDIAALVIDNASGMCKA
ACT2_SCHCO   --MDDEIQAVVIDNGSGMCKA
                  :  *:::**.******

*  All AA in column identical
:  AA similar size & hydrophobicity
.  AA similar size or hydrophobicity

ClustalW format

The alignment, so what next?
• Look at it very closely
• Hand edit if necessary (probably)
• Eliminate problem sequences and redo?
• Use the display option best for the next step
  – Phylip format for trees

Parameter changes
• Substitution matrix: PAM, Gonnet, Blosum
  – ClustalW chooses which matrix within a family
    • PAM30 for closely related pairs; PAM120; PAM250 for more distant
  – Difficult alignment: a matrix change may help
• Gap penalties (open and extend) have optimal values for each family: find them by trial and error.
  – ClustalW puts gaps (which are often external loops) near previous gaps (longer loop)
• MSA does the grunt work. YOU do the fine tuning.

Guide tree
• To figure out which pairs of sequences to align first, a phylogenetic tree is calculated from the pairwise distance matrix.
  – Stored in a DND (dendrogram) file
    • Never use this file to draw a tree
• ClustalW can construct a tree from the multiple sequence alignment (better than pairwise)

Alignment display: weblogo

Always remember: sequence represents a 3-D structure

Patterns to recognise (more reliable in MSA than in single seq)
MSA improves secondary structure (α-helix, β-sheet) prediction by >6%
• Alternate hydrophobic residues
  – Surface β-sheet (zig-zag-zig-zag)
• Runs of hydrophobic residues
  – Interior/buried β-sheet
• Residues with 3.5-AA spacing (amphipathic)
  – α-helix: WNNWFNNFNNWNNNF
• Gaps/indels
  – Probably surface, not core

Conserved residues
• W, F, Y: large hydrophobic, internal/core
  – conserved WFY is the best signal for domains
• G, P: turns, can mark the end of an α-helix or β-sheet
• C: conserved with reliable spacing suggests C-C disulphide bridges, e.g. defensins
• H, S: often catalytic sites in proteases (and other enzymes)
• K, R, D, E: charged; ligand binding or salt bridge
• L: very common AA but not conserved
  – except in a leucine zipper: L234567L234567L234567L

Finish with an alignment: defensins
3 pairs of C residues: 3 disulphide bridges
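Two of the patterns listed under “Conserved residues” are easy to search for by program: the leucine-zipper spacing L234567L234567L234567L and the spacing between cysteines. The following is a minimal sketch; the function names and the toy sequences are hypothetical and chosen only for illustration (they are not real zipper or defensin sequences), and a hit is only a hint, not proof of a zipper or a disulphide bridge.

```python
import re

# Four leucines, each separated from the next by six residues of any kind.
# Wrapped in a lookahead so that overlapping occurrences are all reported.
HEPTAD_LEUCINES = re.compile(r"(?=L.{6}L.{6}L.{6}L)")


def find_heptad_leucines(protein: str) -> list[int]:
    """Return 0-based start positions of every L-x6-L-x6-L-x6-L occurrence."""
    return [match.start() for match in HEPTAD_LEUCINES.finditer(protein.upper())]


def cysteine_spacings(protein: str) -> list[int]:
    """Return the distances between consecutive cysteines in one sequence."""
    positions = [i for i, residue in enumerate(protein.upper()) if residue == "C"]
    return [later - earlier for earlier, later in zip(positions, positions[1:])]


# Hypothetical toy sequences, not taken from any database.
toy_zipper = "MK" + "LAAAAAA" * 3 + "L" + "GG"
toy_cysteines = "ACAAACAC"

print(find_heptad_leucines(toy_zipper))
print(cysteine_spacings(toy_cysteines))
```

Running `cysteine_spacings` on every sequence of an alignment and comparing the results is one quick way to see whether the cysteine spacing is as reliable as the slide requires; gaps should be stripped from aligned sequences first so the distances are in residues rather than columns.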