Homologues Finding and Multiple Sequence Alignment
Maya Schushan
November 2010

Outline: introduction to alignments
1. Introduction
2. Applications
3. General Alignment Methodology
4. Pairwise Alignment:
   • Smith-Waterman
   • Needleman-Wunsch
5. Multiple Sequence Alignment:
   • ClustalW
   • MUSCLE
   • T-Coffee

Introduction: What Is An Alignment?
A process of lining up 2 or more sequences to achieve the maximum level of identity, in order to find homologies.
Example: how should TCATG and CATTG be lined up?
[Figure: two alternative placements of CATTG under TCATG]
• Comparing 2 (pairwise) or more (multiple) sequences.
• Searching for a series of identical or similar characters in the sequences.

VLSPADKTNVKAAWAKVGAHAAGHG
||| |    |   |||| |  ||||
VLSEAEWQLVLHVWAKVEADVAGHG

Introduction: Basic Terms
• Homology: a relation between sequences that results from divergence from a common ancestor.
• Identity: sequences or sub-sequences that are invariant.
• Similarity: sequences or sub-sequences that are related.
Example: CAG, CAT

Introduction: Homologues: Orthology vs Paralogy
[Figure: orthology vs paralogy. Reproduced from NCBI education website]

Introduction: The Limits of Sequence Similarity
[Figure]

Applications: Why Sequence Alignment?
1. Predict characteristics of a protein

VTISCTGSSSNIGAG-NHVKWYQQLPG
VTISCTGSSSNIGS--ITVNWYQQLPG
LRLSCTGSGFIFSS--YAMYWYQQAPG
LSLTCTGSGTSFDD-QYYSTWYQQPPG

A model is generated according to a template structure of a homologous protein.

2. Learn about evolutionary relationships
• If two sequences from different organisms are similar, they may have a common ancestor.
• Needed for the construction of phylogenetic trees.

3. Research of disease
• Comparison of sequences between individuals can detect changes that are related to diseases.
• Analysis of residue substitutions: mutation or polymorphism?

4. Find similar sequences in a database
• The commonly used BLAST and FASTA search programs have to use a form of alignment to detect sequences similar to the sequence in hand.
• The methods employed have to be very fast, to make a search in a database containing millions of sequences feasible.

Examples of specific applications:
• Evolutionary conservation analysis (ConSeq/ConSurf)
• Motif and domain prediction (Prosite/InterPro/Pfam)
• Phylogenetic trees
• …
[Figure: ConSurf analysis of PDB entry 1hyt, a hydrolase]

General Alignment Methodology: Example: Aligning Two Globins
Human Hemoglobin (HH):      VLSPADKTNVKAAWGKVGAHAGYEG
Sperm Whale Myoglobin (SWM): VLSEGEWQLVLHVWAKVEADVAGHG

No gaps:
• Percent identity: 36
• Percent similarity: 40
(HH)  VLSPADKTNVKAAWGKVGAHAGYEG
(SWM) VLSEGEWQLVLHVWAKVEADVAGHG

With gaps (gaps: 2):
• Percent identity: 45.833 (instead of 36 without gaps)
• Percent similarity: 54.167 (instead of 40 without gaps)
(HH)  VLSPADKTNVKAAWGKVGAH-AGYEG
(SWM) VLSEGEWQLVLHVWAKVEADVAGH-G

General Alignment Methodology: Sequence Modifications
1. Insertion: inserting one or several letters into the sequence. AAGA → AAGTA
2. Deletion: deleting a letter (or more) from the sequence. AAGA → AGA
3. Substitution: replacing a sequence letter by another. AAGA → AACA
InDel: Insertions + Deletions

General Alignment Methodology: Measuring An Alignment
S = ACTG
T = AGT

S' = AC_TG     S' = ACTG     S' = ACTG
T' = A_GT_     T' = AGT_     T' = _AGT

Good: identical characters (match).
Bad: different characters (mismatch); gap (InDel).
• Each pair of characters gets a value, depending on its identity.
• The similarity score of the alignment is the sum of the pair values.

General Alignment Methodology: Alignment Scoring
1. Assume an independent mutation model
2. Score at each position
   – Positive if the same/similar
   – Negative if different or gap
3. The score of an alignment is the sum of the position scores

• Different scoring → different best alignments
• Scoring systems implicitly represent a particular theory of evolution
   – Some mismatches are more plausible than others
      • Transition vs. transversion
      • Lys→Arg ≠ Lys→Cys
   – Gap extension vs. gap opening

General Alignment Methodology: Scoring Matrix
• An n × n matrix: n=4 for DNA, n=20 for proteins
• Each matrix entry defines the score for observing the two letters in the alignment
   – Positive if likely to change
   – Negative otherwise

      A    G    C    T
A     1
G    -5    1
C    -5   -5    1
T    -5   -5   -5    1

General Alignment Methodology: DNA scoring matrices
• Transitions: purine to purine or pyrimidine to pyrimidine (4 possibilities)
• Transversions: purine to pyrimidine or pyrimidine to purine (8 possibilities)
• By chance alone, transversions should occur twice as often as transitions.
• In practice, transitions are more frequent than transversions.
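The "4 possibilities" and "8 possibilities" quoted for transitions and transversions can be checked by enumerating every ordered substitution between the four bases. The following is a minimal sketch; the helper name `substitution_type` is introduced here for illustration and does not come from the slides.

```python
from itertools import permutations

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_type(x, y):
    """Classify a base change x -> y as match, transition or transversion."""
    if x == y:
        return "match"
    # A transition stays inside one chemical class (purine or pyrimidine).
    same_class = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


# Count all 12 ordered substitutions between distinct bases.
counts = {"transition": 0, "transversion": 0}
for x, y in permutations("ACGT", 2):
    counts[substitution_type(x, y)] += 1

print(counts)  # {'transition': 4, 'transversion': 8}
```

The 4:8 ratio is where the "twice as often by chance" expectation comes from; the observed excess of transitions is what motivates penalising transversions more heavily.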
General Alignment Methodology: DNA scoring matrices

From \ To    A    G    C    T
A            2   -4   -6   -6
G           -4    2   -6   -6
C           -6   -6    2   -4
T           -6   -6   -4    2

Match: 2. Transition: -4. Transversion: -6.

General Alignment Methodology: Protein scoring matrices
• Observation: some substitutions are more frequent than others, e.g., between chemically similar amino acids
• As for DNA, protein matrices define the probabilities of change between the different amino acids
• Popular matrices are based on empirical data: PAM & BLOSUM

TLYDK
TLYDK
TLYEK
TLYDK
TLYQK
TLYDK

In the fourth column E and D are found in 7 / 8

General Alignment Methodology: Protein scoring matrices

Category            Amino acids
Acids and Amides    Asp (D), Glu (E), Asn (N), Gln (Q)
Basic               His (H), Lys (K), Arg (R)
Aromatic            Phe (F), Tyr (Y), Trp (W)
Hydrophilic         Ala (A), Cys (C), Gly (G), Pro (P), Ser (S), Thr (T)
Hydrophobic         Ile (I), Leu (L), Met (M), Val (V)

General Alignment Methodology: BLOSUM Matrices
• Based on the BLOCKS database: ~2000 blocks from 500 families of related proteins
• Blocks: short conserved patterns of 3-60 aa without gaps
• Different BLOSUMn matrices are calculated independently from BLOCKS
• BLOSUMn is built after clustering sequences that share at least n percent identity, so that each cluster is counted as a single sequence

• Low BLOSUM numbers for distant sequences
• High BLOSUM numbers for similar sequences
• Generally:
   – BLOSUM62 for general use
   – BLOSUM80 for close relations
   – BLOSUM45 for distant relations

General Alignment Methodology: Types of Gap Penalties (insertions or deletions)
InDels are rare in evolution, but once created they are easy to extend:
• Gap open: penalty for the first residue in a gap
• Gap extension: penalty for each additional residue in a gap
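To make the scoring rules concrete, here is a minimal sketch that scores an already-aligned pair of DNA rows with the match/transition/transversion values from the DNA matrix above and separate gap-open and gap-extension penalties. The penalty values (`gap_open=-8`, `gap_extend=-1`), the function names, and the example sequences are assumptions chosen for illustration; the slides do not specify them.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def pair_score(x, y):
    """Match = 2, transition = -4, transversion = -6 (values from the slide)."""
    if x == y:
        return 2
    if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
        return -4
    return -6


def score_alignment(row_s, row_t, gap_open=-8, gap_extend=-1, gap_char="-"):
    """Sum column scores; gap penalties here are illustrative assumptions."""
    if len(row_s) != len(row_t):
        raise ValueError("aligned rows must have equal length")
    total = 0
    gap_row = None  # which row the currently open gap run is in, if any
    for x, y in zip(row_s, row_t):
        if x == gap_char and y == gap_char:
            raise ValueError("a column cannot be a gap in both rows")
        if x == gap_char or y == gap_char:
            current = "s" if x == gap_char else "t"
            # First residue of a run pays gap_open, later ones gap_extend.
            total += gap_extend if gap_row == current else gap_open
            gap_row = current
        else:
            total += pair_score(x, y)
            gap_row = None
    return total


# One gap of length 3 versus three separate gaps of length 1.
print(score_alignment("AAGTTTA", "AAG---A"))  # -2
print(score_alignment("AAGTTTA", "A-G-T-A"))  # -16
```

With these assumed penalties a single long gap is much cheaper than several short ones, which is the behaviour the open/extension distinction is meant to produce.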
General Alignment Methodology: Types of Gap Penalties
Motivation: aligning cDNAs to genomic DNA
[Figure: cDNA query aligned against genomic DNA]
Conclusion: gap opening and gap extension should be penalized differently to align the sequences properly.

General Alignment Methodology: Summary: Scoring and Alignment
• The final score of the alignment is the sum of the positive scores and the penalty scores:

   + Number of identities        (scoring matrix)
   + Number of similarities      (scoring matrix)
   - Number of gap insertions    (gap penalties)
   - Number of gap extensions    (gap penalties)
   = Alignment score

Pairwise Alignment: Local vs. Global
• Global alignment: finds the best alignment across the whole of the two sequences.

ADLGAVFALCDRYFQ
||||     |||| |
ADLGRTQN-CDRYYQ

• Local alignment: finds regions of similarity in parts of the sequences.

ADLG      CDRYFQ
||||      |||| |
ADLG      CDRYYQ

Pairwise Alignment: Global (Needleman & Wunsch, 1970)
• The best alignment over the entire length of the two sequences only
• The Needleman-Wunsch algorithm is appropriate for aligning two sequences that are:
   (i) of similar length;
   (ii) similar across their entire lengths.
Example:
SIMILARITY
PI-LLAR--
Needleman, S. B. and Wunsch, C. D., 1970

Pairwise Alignment: Local (Smith & Waterman, 1981)
• Makes an optimal alignment of the best segment of similarity
• Suitable when comparing substantially different sequences that have only short patches of similarity
• Use when one sequence is short and the other is very long
• Can return a number of highly aligned segments
• For example, the local alignment of SIMILARITY and PILLAR:
MILAR
ILLAR
Smith, T.F. and Waterman, M.S., 1981

Pairwise Alignment: User Input
• Pair of sequences
• Local or global alignment
• Scoring:
   – Gap penalties: opening/extension
   – Scoring matrix

Multiple Sequence Alignment: Pairwise vs. Multiple Sequence Alignment
Alignments help to analyze sequence data: organize and visualize.
• Pairwise: for 2 sequences
• MSA: for more than 2 sequences
[Figure: an example pairwise alignment and an example MSA of short sequences built from F, G, K and Q]

Multiple Sequence Alignment: Rules For Choosing Sequences
• Very similar sequences carry little information
• Very different sequences cause trouble: <30% identical with more than half of the other sequences in the set
• Choose sequences as distantly related as possible
   – Sequences between 30-80% identical with more than half of the sequences in the set
• The more sequences the better

Multiple Sequence Alignment: Similarity Score of MSA
• Each position gets a value, depending on its identity.
• The similarity score of the alignment is the sum of all position values.
• A popular way to compute position values: SP (Sum of Pairs), where each pair gets the score from the similarity matrix (PAM, BLOSUM).
Goal: find the MSA with the maximum similarity score
Bad news: this problem is NP-hard

Multiple Sequence Alignment: More than a handful of MSA methods exist…
[Figure: MSA methods arranged from approximate and fast to accurate and slow]

Multiple Sequence Alignment: ClustalW (1994) - Introduction
• This heuristic approach works because it uses the biological meaning of MSA
• Based on the idea that the sequences we usually want to align are phylogenetically related: a pairwise alignment algorithm is used iteratively, first to align the most closely related pair of sequences, then to add the next most similar sequence to that pair.
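The progressive approach reuses a pairwise aligner at every step, so it may help to see the pairwise inputs (two sequences, local or global mode, scoring) in code. The following is a minimal sketch of the dynamic programming behind Needleman-Wunsch (global) and Smith-Waterman (local). It makes simplifying assumptions that are not in the slides: a flat match/mismatch score instead of a PAM or BLOSUM matrix, and a single linear gap penalty instead of separate opening and extension penalties. The function name `align` and its parameters are hypothetical.

```python
def align(a, b, match=1, mismatch=-1, gap=-1, local=False):
    """Return (score, aligned_a, aligned_b) for a global or local alignment."""
    n, m = len(a), len(b)
    # score[i][j] holds the best score for a[:i] against b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    if not local:
        # Global mode: leading gaps are penalised.
        for i in range(1, n + 1):
            score[i][0] = i * gap
        for j in range(1, m + 1):
            score[0][j] = j * gap

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cell = max(score[i - 1][j - 1] + sub(i, j),
                       score[i - 1][j] + gap,
                       score[i][j - 1] + gap)
            if local:
                # Local mode: never go below zero, remember the best cell.
                cell = max(cell, 0)
                if cell > best:
                    best, best_pos = cell, (i, j)
            score[i][j] = cell

    # Trace back from the end (global) or from the best cell (local).
    i, j = best_pos if local else (n, m)
    final = score[i][j]
    out_a, out_b = [], []
    while i > 0 or j > 0:
        if local and score[i][j] == 0:
            break
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return final, "".join(reversed(out_a)), "".join(reversed(out_b))


global_result = align("SIMILARITY", "PILLAR")
local_result = align("SIMILARITY", "PILLAR", local=True)
print(global_result[0], local_result[0])
```

Several alignments can tie for the best score, so the rows returned for SIMILARITY and PILLAR depend on the tie-breaking order in the traceback and on the assumed scores; they need not match the rows shown on the slides.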
• ClustalW rule "once a gap, always a gap": the gaps between more similar pairs of sequences should not be affected by more distantly related ones.
Thompson, J.D. et al, 1994

Multiple Sequence Alignment: ClustalW - Progressive Alignment
1. Quick pairwise alignment: calculate the distance matrix

                1    2    3    4    5
Hbb_Human  1    -
Hbb_Horse  2   17    -
Hba_Human  3   59   60    -
Hba_Horse  4   59   59   13    -
Myg_Whale  5   77   77   75   75    -

2. Build a guide tree using the NJ phylogenetic method
[Figure: guide tree with leaves Hbb_Human, Hbb_Horse, Hba_Human, Hba_Horse, Myg_Whale]
3. Progressive alignment following the guide tree

Multiple Sequence Alignment: ClustalW - Progressive Alignment

      A    B    C    D
A     -    -    -    -
B     1    -    -    -
C     7    8    -    -
D    11    5    2    -

[Figure: guide tree with leaves A, B, C, D]

Multiple Sequence Alignment: ClustalW - Additional Features
• Sequence weighting:
   – Each sequence gets a weight derived from the guide tree
   – Close sequences are down-weighted
   – Distant sequences receive high weights
   – The weights are normalized so that the highest is 1
[Figure: guide tree with sequence weights w1 to w7]
W(Hbb_Human) = 0.081 + 1/2*0.226 + 1/4*0.061 + 1/5*0.015 + 1/6*0.062 = 0.221

Multiple Sequence Alignment: ClustalW - Problems
• Sequences that are similar only in some smaller regions: ClustalW tries to find global alignments, not local ones.
• A sequence that contains a large insertion compared to the rest: again, the alignment is global, not local.
• A sequence that contains a repetitive element, while another sequence contains only one copy.

Multiple Sequence Alignment: MUSCLE - Introduction
• The most recent popular MSA software
• Considered to be the most accurate MSA software available today
• The basic idea: iterative progressive alignment
Edgar, R.C., 2004

Multiple Sequence Alignment: MUSCLE Innovations
• Faster distance estimation between the input sequences
• Faster construction of an evolutionary tree (UPGMA instead of NJ in ClustalW)
• Applying a new score function to the profile alignments
• Refinement of the initial results
Result: faster, more accurate
Edgar, R.C., 2004

Multiple Sequence Alignment: MUSCLE Innovations - Refinement Step
• An edge is chosen from the progressive alignment tree.
• The tree is divided into two subtrees by deleting this edge.
• The MSA from each subtree is computed by progressive alignment.
• The two MSAs are aligned, generating an entirely new MSA.
• If the new MSA achieves a higher score than the previous one, keep it.
[Figure: the old MSA is split into MSA1 and MSA2, which are realigned into a new MSA]

Multiple Sequence Alignment: MUSCLE - It's Even More Complicated…
[Figure]

Multiple Sequence Alignment: All Against All - SH2 domains
[Figure: T-Coffee vs MUSCLE]
Edgar, R.C., 2004

Multiple Sequence Alignment: All Against All - BaliBase 2005
MUSCLE is superior in some cases…
Edgar, R.C., 2004

Multiple Sequence Alignment: All Against All - PREFAB
T-Coffee in others…
Trial and error is the best approach
Edgar, R.C., 2004
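The refinement rule in the MUSCLE section, keep the new MSA only if it scores higher than the old one, needs an objective for comparing two MSAs. A minimal sketch follows, assuming the Sum of Pairs (SP) score described earlier with flat match/mismatch/gap values in place of a PAM or BLOSUM matrix. The function names, the scoring values and the toy alignments are hypothetical and are not taken from MUSCLE itself.

```python
from itertools import combinations


def pair_score(x, y, match=1, mismatch=-1, gap=-1):
    """Score one pair of characters from a column (assumed flat scores)."""
    if x == "-" and y == "-":
        return 0  # two gaps facing each other carry no information
    if x == "-" or y == "-":
        return gap
    return match if x == y else mismatch


def sum_of_pairs(msa):
    """SP score: sum pair scores over every pair of rows in every column."""
    if len({len(row) for row in msa}) > 1:
        raise ValueError("all rows of an MSA must have equal length")
    return sum(pair_score(x, y)
               for column in zip(*msa)
               for x, y in combinations(column, 2))


def keep_better(old_msa, new_msa):
    """Accept the new MSA only when its score is strictly higher."""
    return new_msa if sum_of_pairs(new_msa) > sum_of_pairs(old_msa) else old_msa


# Two toy alignments of the same three sequences (FGKG, FKG, FGK).
old_msa = ["FGKG", "FKG-", "FGK-"]
new_msa = ["FGKG", "F-KG", "FGK-"]
print(sum_of_pairs(old_msa), sum_of_pairs(new_msa))  # -1 4
print(keep_better(old_msa, new_msa))
```

Because the comparison is strict, a refinement pass of this kind can never lower the score, which is why repeating it over different tree edges is safe.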