F1: Multiple alignment and its meaning
There is alignment, and there is multiple alignment…

Multiple alignment - what it means (F1)
  • a Sequence: 1 string of nucleotide or amino acid characters
  • an Alignment (or Pair-Wise Alignment): 2 sequences side by side in a “good” way
  • a Multiple (Sequence) Alignment: 3 or more sequences aligned

Why multiple alignment?
  • Multiple alignments often say more
      - Of course! More data tends to say more than less data
  • Example: Consider 3 amino acids matched in a pair-wise alignment
      - Suppose the 1st matches, the 2nd & 3rd do not match
      - What can we now understand?
  • But multiple alignment reveals that
      - the 1st is conserved across homologs
      - the 2nd varies greatly across homologs
      - the 3rd is conserved across only some homologs
      - What can we now understand?

Inference from multiple alignments
  • See Fig. 1 p. 84, Westhead et al.
  • What is conserved across all homologs?
  • What might explain such conservation?

Inference from multiple alignments II
  • See Westhead Fig. 1 p. 84 again
  • It is a serine protease: what is a protease? Why is it useful?
  • What is conserved across all homologs?
  • What might explain such conservation?
      - Preservation of protein function
      - Preservation of protein structure
  • Active sites (i.e. binding sites) are hard to mutate successfully (why?)
  • Structure-defining amino acids are hard to change successfully (why?)

How to do multiple alignment?
  • Answer 1: use existing software
      - “Clustal” is one
  • Answer 2: understand an algorithm
      - Clustal uses “progressive alignment”
  • What is the “technician” answer?
  • What is the “scientist” answer?
  • Which is “better”?
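The three-position example from the “Why multiple alignment?” slide can be made concrete with a small sketch. It assumes five made-up aligned homologs (hypothetical data, not taken from Westhead et al.) and scores each column by the fraction of sequences that share its most common residue. The function name `column_conservation` is an illustrative choice, and gap characters are not treated specially.

```python
from collections import Counter

def column_conservation(aligned):
    """For each alignment column, return the fraction of sequences
    that carry the most common residue in that column."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must all have the same length")
    scores = []
    for column in zip(*aligned):  # walk the alignment column by column
        top_count = Counter(column).most_common(1)[0][1]
        scores.append(top_count / len(aligned))
    return scores

# Hypothetical homologs, invented for illustration only.
homologs = ["SAG", "STG", "SVG", "SLA", "SKA"]
scores = column_conservation(homologs)
print(scores)  # [1.0, 0.2, 0.6]
```

Comparing only “SAG” with “SKA” shows one match and two mismatches. The five-sequence view adds that the 1st position never changes, the 2nd changes freely, and the 3rd is shared by only some of the homologs.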
Progressive alignment
  • Start with a set of sequences to align
  • Repeat
      - Find the most closely related pair of sequences
      - Align them as well as possible
          Use a fast, approximate method like FASTA
          Use a slow, high-quality method like Smith-Waterman
      - Add to (or start) a phylogenetic tree
      - Delete the pair from the set, add their merge
  • Until done

Progressive alignment (single linkage variation; you can design your own variation)
  • Same steps as above
  • How do we define “most closely related”?

Progressive alignment II
  • Let’s do an example
  • Note the phylogenetic tree that results!
  • Name a few organisms and we’ll assume results of applying the component algorithms as needed
  • Cool
  • What could go wrong, algorithmically?

MSA and databases
  • Obtaining multiple sequence alignments is computationally expensive
      - What if there are a dozen proteins to align?
  • So, we’d like to store results in a DB (F2)
      - …so we don’t have to reinvent the wheel every time
  • Consensus sequences are one way

Consensus Sequences
  • Given a multiple alignment
      - That is, a set of aligned strings (Fig. 1 p. 84, Westhead et al.)
  • Store a summarizing “consensus sequence”
      - Use ‘x’ for places which lack consensus
      - Consensus: e.g. 30% or more of the sequences agree
  • What is the consensus sequence for Fig. 1?
  • Do consensus sequences lose information?
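The consensus rule from the slides (keep a residue when enough sequences agree, otherwise write ‘x’) might be sketched as follows. The 30% cutoff is the “e.g.” value from the slide. The function name `consensus`, the tie-breaking behaviour, and the sample sequences are assumptions made for illustration, and gaps are not handled specially.

```python
from collections import Counter

def consensus(aligned, threshold=0.3):
    """Build a consensus string: per column, keep the most common residue
    if at least `threshold` of the sequences carry it, else write 'x'."""
    result = []
    for column in zip(*aligned):
        # On ties, Counter keeps the residue seen first (an arbitrary choice).
        residue, count = Counter(column).most_common(1)[0]
        result.append(residue if count / len(column) >= threshold else "x")
    return "".join(result)

# Hypothetical aligned sequences, invented for illustration only.
aligned = ["SAG", "STG", "SVG", "SLA", "SKA"]
print(consensus(aligned))  # SxG
```

The result “SxG” hints at what gets lost: it no longer records that two of the five sequences had A in the last position, nor which residues appeared in the middle one.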
  • Consensus sequences throw away a lot of info, so a better solution is needed

PROSITE
  • A database of multiple alignments (see the textbook and Wikipedia)
  • Alignments are described more flexibly than consensus sequences
  • Examples (from p. 88, Westhead et al.)
      [LIVM]-[ST]-A-[STAG]-H-C
      …[GSTAPIMVQH]-x(2)-G-[DE]…
      N-{P}-[ST]-{P}
  • Why the “{P}”s?
  • Is any information lost?
  • A limitation (info lost): no proportions associated with the variations

Consensus sequence notations
  • Consider the examples above (from p. 88, Westhead et al.)
  • Write down a hypothetical PROSITE sequence and let’s all decode it…

PRINTS, BLOCKS
  • These are also multiple alignment databases
  • Consider a family of related proteins
  • Some regions are likely highly conserved
      - PRINTS calls these motifs
      - BLOCKS calls these, uh, blocks
      - No gaps allowed! (PROSITE permits x(2,4))
  • A set of highly conserved regions (motifs) for a family is a fingerprint
      - So gaps come into play to give fingerprints
  • Why not allow gaps in a motif?

Families of protein domains
  • Even PRINTS, BLOCKS, & PROSITE… (F3)
      - …have too little information about families
  • Proteins tend to be built of domains
  • A domain is a chunk or “module” that is in many different proteins
  • The fact that proteins share a domain makes them related:
      - They are related by virtue of sharing domain x

Protein domain families
  • Many proteins share e.g. the PH domain
      - PH: Pleckstrin Homology
  • This domain’s sequence details vary
      - …but are lumped into the PH domain family
  • A subsequence from a protein matches a given domain better or worse
  • If it matches well enough the subsequence is in that domain family (see e.g. Fig. 1 p. 93, Westhead et al.)
  • What is the score of one of the sequences? A random sequence?
  • How could we make a cladogram from the figure?

Reference: The Amino Acid Abbreviations
  Ala   A   Alanine
  Arg   R   Arginine
  Asn   N   Asparagine
  Asp   D   Aspartic acid (Aspartate)
  Cys   C   Cysteine
  Gln   Q   Glutamine
  Glu   E   Glutamic acid (Glutamate)
  Gly   G   Glycine
  His   H   Histidine
  Ile   I   Isoleucine
  Leu   L   Leucine
  Lys   K   Lysine
  Met   M   Methionine
  Phe   F   Phenylalanine
  Pro   P   Proline
  Ser   S   Serine
  Thr   T   Threonine
  Trp   W   Tryptophan
  Tyr   Y   Tyrosine
  Val   V   Valine
  Asx   B   Aspartic acid or Asparagine
  Glx   Z   Glutamine or Glutamic acid
  Xaa   X   Any amino acid
  TERM      termination codon

Let’s Review CATH and SCOP
  • …since we had rushed it (lecture11notes.pdf)
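As a decoding aid for the PROSITE notation quoted from p. 88, here is a minimal sketch that turns a pattern into a Python regular expression. It covers only the constructs used in these slides: single residues, `x`, `[...]`, `{...}`, and repeat counts such as `x(2)` or `x(2,4)`. The helper name `prosite_to_regex` and the test peptides are hypothetical; other PROSITE features (for example terminal anchors) are deliberately left out.

```python
import re

# One pattern element: x, a residue, [allowed], or {forbidden},
# optionally followed by (n) or (n,m).
ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")

def prosite_to_regex(pattern):
    """Translate a simple PROSITE pattern into a regex string."""
    parts = []
    for element in pattern.split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = match.groups()
        if core == "x":
            regex = "."                       # any amino acid
        elif core.startswith("{"):
            regex = "[^" + core[1:-1] + "]"   # anything except these residues
        else:
            regex = core                      # residue or [set] carries over as is
        if low is not None:
            regex += "{" + low + ("," + high if high else "") + "}"
        parts.append(regex)
    return "".join(parts)

site = prosite_to_regex("N-{P}-[ST]-{P}")
print(site)                                   # N[^P][ST][^P]
print(bool(re.search(site, "AANGSAK")))       # True  (made-up peptide)
print(bool(re.search(site, "AANPSAK")))       # False (P follows N)
```

Like the notation itself, the resulting regex says which residues are allowed at a position but not how often each one occurs, which is the limitation noted on the PROSITE slide.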