F1: Multiple alignment and its meaning

There is alignment, and there is multiple alignment…

Multiple alignment - what it means (F1)

• A sequence: 1 string of nucleotide or amino acid characters
• An alignment (or pair-wise alignment): 2 sequences side by side in a “good” way
• A multiple (sequence) alignment: 3 or more sequences aligned

Why multiple alignment?

• Multiple alignments often say more
  – Of course! More data tends to say more than less data
• Example: consider 3 amino acids matched in a pair-wise alignment
  – Suppose the 1st matches, the 2nd & 3rd do not match
  – What can we now understand?
• But multiple alignment reveals that
  – 1st is conserved across homologs
  – 2nd varies greatly across homologs
  – 3rd is conserved across only some homologs
  – What can we now understand?

Inference from multiple alignments

• See Fig. 1 p. 84, Westhead et al.
• What is conserved across all homologs?
• What might explain such conservation?

Inference from multiple alignments II

• See Westhead Fig. 1 p. 84 again
• It is a serine protease
  – What is a protease?
  – Why is it useful?
• What is conserved across all homologs?
• What might explain such conservation?
  – Preservation of protein function
  – Preservation of protein structure
• Active sites (i.e. binding sites) are hard to mutate successfully (why?)
• Structure-defining amino acids are hard to change successfully (why?)

How to do multiple alignment?

• Answer 1: use existing software
  – “Clustal” is one
• Answer 2: understand an algorithm
  – Clustal uses “progressive alignment”
• What is the “technician” answer?
• What is the “scientist” answer?
• Which is “better”?
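Returning to the three-position example above, a minimal sketch of how conservation could be read off a multiple alignment column by column. The four aligned strings are made up for illustration (they are not taken from Westhead et al.), and `column_conservation` is a hypothetical helper name.

```python
from collections import Counter

def column_conservation(alignment):
    """For each column, return the fraction of sequences that share
    the most common character in that column (gaps count as characters)."""
    n = len(alignment)
    return [Counter(column).most_common(1)[0][1] / n
            for column in zip(*alignment)]

# Hypothetical toy alignment: 4 homologs, 3 aligned positions.
toy_alignment = ["HGD",
                 "HTD",
                 "HWD",
                 "HLE"]

print(column_conservation(toy_alignment))  # [1.0, 0.25, 0.75]
```

Position 1 is conserved across all four toy homologs, position 2 varies completely, and position 3 is conserved across only some of them, which is the pattern a single pair-wise alignment cannot show.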
Progressive alignment

• Start with a set of sequences to align
• Repeat
  – Find the most closely related pair of sequences
  – How do we define “most closely related”? (Single linkage is one variation; you can design your own variation)
  – Align them as well as possible: either use a fast, approximate method like FASTA, or use a slow, high-quality method like Smith-Waterman
  – Add to (or start) a phylogenetic tree
  – Delete the pair from the set, add their merge
• Until done

Progressive alignment II

• Let’s do an example
  – Name a few organisms and we’ll assume results of applying the component algorithms as needed
• Note the phylogenetic tree that results!
  – Cool
• What could go wrong, algorithmically?

MSA and databases (F2)

• Obtaining multiple sequence alignments is computationally expensive
  – What if there are a dozen proteins to align?
• So, we’d like to store results in a DB
  – …so we don’t have to reinvent the wheel every time
• Consensus sequences are one way

Consensus sequences

• Given a multiple alignment
  – That is, a set of aligned strings (Fig. 1 p. 84, Westhead et al.)
• Store a single summarizing “consensus sequence”
  – Use ‘x’ for places which lack consensus
  – Consensus: e.g. 30% or more of the sequences agree
• What is the consensus sequence for Fig. 1?
• Do consensus sequences lose information?
  – Consensus sequences throw away a lot of info, so a better solution is needed

PROSITE

• A database of multiple alignments
  – See the consensus textbook: Wikipedia
• Alignments are described more flexibly than consensus sequences
• Examples (from p. 88, Westhead et al.)
  – [LIVM]-[ST]-A-[STAG]-H-C
  – …[GSTAPIMVQH]-x(2)-G-[DE]…
  – N-{P}-[ST]-{P}
• Why the “{P}”s?
• Is any information lost?
  – A limitation (info lost): no proportions associated with the variations

Consensus sequence notations

• Consider the PROSITE examples (from p. 88, Westhead et al.)
• Write down a hypothetical PROSITE sequence and let’s all decode it…

PRINTS, BLOCKS

• These are also multiple alignment databases
• Consider a family of related proteins
  – Some regions are likely highly conserved
  – PRINTS calls these motifs
  – BLOCKS calls these, uh, blocks
  – No gaps allowed! (PROSITE permits x(2,4))
• A set of highly conserved regions (motifs) for a family is a fingerprint
  – So gaps come into play to give fingerprints
• Why not allow gaps in a motif?
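The progressive alignment loop described earlier (find the closest pair, add to the tree, delete the pair, add their merge, until done) can be sketched as follows. This is only the tree-building skeleton under loud assumptions: the pairwise alignment step is omitted entirely, "most closely related" is defined by single linkage, and the distance is a stand-in based on Python's `difflib`, not FASTA or Smith-Waterman. The names `distance` and `guide_tree`, the organism labels, and the sequences are all hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations

def distance(a, b):
    """Stand-in dissimilarity (assumption): 1 minus difflib's similarity ratio."""
    return 1.0 - SequenceMatcher(None, a, b).ratio()

def guide_tree(named_seqs):
    """Repeatedly merge the two closest clusters; return the merge order
    as nested tuples, a crude stand-in for the phylogenetic tree."""
    # Each cluster is (subtree, list of member sequences).
    clusters = [(name, [seq]) for name, seq in named_seqs.items()]
    while len(clusters) > 1:
        def linkage(pair):
            i, j = pair
            # Single linkage: cluster distance = smallest member-to-member distance.
            return min(distance(a, b)
                       for a in clusters[i][1] for b in clusters[j][1])
        i, j = min(combinations(range(len(clusters)), 2), key=linkage)
        merged = ((clusters[i][0], clusters[j][0]),
                  clusters[i][1] + clusters[j][1])
        # Delete the pair from the set, add their merge.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]

# Made-up sequences with made-up labels, purely for illustration.
toy = {"human": "MKTAYIAKQR", "chimp": "MKTAYIAKQK", "yeast": "GWPLLNGWPL"}
print(guide_tree(toy))  # ('yeast', ('human', 'chimp'))
```

A real progressive aligner would also align the two clusters at each merge and carry the resulting gaps forward. Because earlier merges are never revisited, an early mistake stays in the final alignment, which is one answer to "what could go wrong, algorithmically?".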
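The consensus sequence rule from the slides above (keep a residue where enough sequences agree, write ‘x’ elsewhere) can be sketched like this. The 30% threshold comes from the slide's example. The toy alignment, the function name `consensus`, and the choice to write ‘x’ for gap-dominated columns are assumptions for illustration.

```python
from collections import Counter

def consensus(alignment, threshold=0.30):
    """Return one summarizing string: the most common residue in each column
    if at least `threshold` of the sequences agree on it, otherwise 'x'.
    Ties go to whichever residue appears first in the column."""
    result = []
    for column in zip(*alignment):
        residue, count = Counter(column).most_common(1)[0]
        agrees = count / len(column) >= threshold
        # Assumption: a column dominated by gaps is reported as 'x' too.
        result.append(residue if agrees and residue != "-" else "x")
    return "".join(result)

# Hypothetical toy alignment, not the one in Westhead et al. Fig. 1.
toy_alignment = ["HGD", "HTD", "HWD", "HLE"]
print(consensus(toy_alignment))                 # HxD
print(consensus(toy_alignment, threshold=0.8))  # Hxx
```

The output shows the information loss the slides point to: the ‘x’ says nothing about which residues occurred at position 2, and the ‘D’ hides the fact that one sequence has ‘E’ there.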
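One way to practice decoding the PROSITE notation shown above is to translate it into a regular expression. The sketch below is a simplification that handles only the elements appearing in these slides: `-` separators, `[..]` alternatives, `{..}` exclusions, `x` wildcards, and `(n)` or `(n,m)` repeats. The function name `prosite_to_regex` and the test strings are hypothetical, and anchors such as `<` and `>` are not handled.

```python
import re

def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regex string."""
    regex = pattern.strip().rstrip(".")                  # optional trailing period
    regex = regex.replace("-", "")                       # '-' only separates positions
    regex = regex.replace("{", "[^").replace("}", "]")   # {P}  -> [^P] (anything but P)
    regex = regex.replace("(", "{").replace(")", "}")    # x(2) -> x{2} (repeat count)
    regex = regex.replace("x", ".")                      # x    -> any residue
    return regex

print(prosite_to_regex("[LIVM]-[ST]-A-[STAG]-H-C"))  # [LIVM][ST]A[STAG]HC
print(prosite_to_regex("N-{P}-[ST]-{P}"))            # N[^P][ST][^P]

motif = re.compile(prosite_to_regex("N-{P}-[ST]-{P}"))
print(motif.search("AANGSAA").group())  # NGSA
print(motif.search("AANPSAA"))          # None
```

The regex makes the limitation noted on the slide concrete: `[ST]` accepts S or T but records nothing about how often each occurs in the underlying alignment.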
Families of protein domains (F3)

• Even PRINTS, BLOCKS, & PROSITE…
  – …have too little information about families
• Proteins tend to be built of domains
  – A domain is a chunk or “module” that is in many different proteins
• The fact that proteins share a domain makes them related
  – They are related by virtue of sharing domain x

Protein domain families

• Many proteins share e.g. the PH domain
  – PH: Pleckstrin Homology
  – This domain’s sequence details vary
  – …but are lumped into the PH domain family
• A subsequence from a protein matches a given domain better or worse
  – If it matches well enough the subsequence is in that domain family
  – (see e.g. Fig. 1 p. 93, Westhead et al.)
• What is the score of one of the sequences? A random sequence?
• How could we make a cladogram from the figure?

Reference: The Amino Acid Abbreviations

Ala   A   Alanine
Arg   R   Arginine
Asn   N   Asparagine
Asp   D   Aspartic acid (Aspartate)
Cys   C   Cysteine
Gln   Q   Glutamine
Glu   E   Glutamic acid (Glutamate)
Gly   G   Glycine
His   H   Histidine
Ile   I   Isoleucine
Leu   L   Leucine
Lys   K   Lysine
Met   M   Methionine
Phe   F   Phenylalanine
Pro   P   Proline
Ser   S   Serine
Thr   T   Threonine
Trp   W   Tryptophan
Tyr   Y   Tyrosine
Val   V   Valine
Asx   B   Aspartic acid or Asparagine
Glx   Z   Glutamine or Glutamic acid
Xaa   X   Any amino acid
TERM      termination codon

Let’s Review CATH and SCOP

• …since we had rushed it (lecture11notes.pdf)
