Download Bioinformatics Sequencing

Sequence Alignment
Arun Goja
MITCON BIOPHARMA

Why do we want to compare sequences?
• Evolutionary relationships
  – Phylogenetic trees can be constructed by comparing the sequences of a molecule (example: 16S rRNA) taken from different species
  – Residues conserved during evolution play an important role
• Prediction of protein structure and function
  – Proteins that are very similar in sequence generally have similar 3D structure and function as well
  – By searching a sequence of unknown structure against a database of known proteins, the structure and/or function can in many cases be predicted

Why sequence alignment?
Sequence alignment is important for:
• prediction of function
• database searching
• gene finding
• sequence divergence
• sequence assembly

Over time, genes accumulate mutations
• Environmental factors
  – Radiation
  – Oxidation
• Mistakes in replication or repair
• Deletions, duplications
• Insertions
• Inversions
• Point mutations

Deletions
• Codon deletion: ACG ATA GCG TAT GTA TAG CCG…
  – Effect depends on the protein, position, etc.
  – Almost always deleterious
  – Sometimes lethal
• Frame shift mutation:
    ACG ATA GCG TAT GTA TAG CCG…
    ACG ATA GCG ATG TAT AGC CG?…
  – Almost always lethal

Indels
• When comparing two genes it is generally impossible to tell whether an indel is an insertion in one gene or a deletion in the other, unless ancestry is known:
    ACGTCTGATACGCCGTATCGTCTATCT
    ACGTCTGAT---CCGTATCGTCTATCT

Comparing two sequences
• Point mutations, easy:
    ACGTCTGATACGCCGTATAGTCTATCT
    ACGTCTGATTCGCCCTATCGTCTATCT
• Indels are difficult, must align sequences:
    ACGTCTGATACGCCGTATAGTCTATCT
    CTGATTCGCATCGTCTATCT

    ACGTCTGATACGCCGTATAGTCTATCT
    ----CTGATTCGC---ATCGTCTATCT

Causes for sequence (dis)similarity
• mutation: a nucleotide at a certain location is replaced by another nucleotide (e.g.: ATA → AGA)
• insertion: at a certain location one new nucleotide is inserted between two existing nucleotides (e.g.: AA → AGA)
• deletion: at a certain location one existing nucleotide is deleted (e.g.: ACTG → AC-G)
• indel: an insertion or a deletion

Definition
• Homology: related by descent
• Homologous sequence positions
[Figure: homologous positions illustrated with the sequences ATTGCGC and ATCCGC and the gapped form AT-CCGC]

Orthologous and paralogous
• Orthologous sequences differ because they are found in different species (a speciation event)
• Paralogous sequences differ due to a gene duplication event
• Sequences may be both orthologous and paralogous

Sequence alignment - meaning
Sequence alignment is used to study how sequences, such as protein or DNA sequences, evolved from a common ancestor. Mismatches in the alignment correspond to mutations, and gaps correspond to insertions or deletions. Sequence alignment also refers to the process of constructing significant alignments in a database of potentially unrelated sequences.

Sequence alignment - definition
Sequence alignment is an arrangement of two or more sequences, highlighting their similarity. The sequences are padded with gaps (dashes) so that, wherever possible, columns contain identical characters from the sequences involved:
    tcctctgcctctgccatcat---caaccccaaagt
    |||| ||| ||||| |||||   ||||||||||||
    tcctgtgcatctgcaatcatgggcaaccccaaagt

Pairwise alignment: the problem
The number of possible pairwise alignments increases explosively with the length of the sequences: two protein sequences of length 100 amino acids can be aligned in approximately 10^60 different ways. The time needed to test all possibilities is of the same order of magnitude as the entire lifetime of the universe.

Pairwise Alignment
• The alignment of two sequences (DNA or protein) is a relatively straightforward computational problem.
  – There are lots of possible alignments.
• Two sequences can always be aligned.
• Sequence alignments have to be scored.
• Often there is more than one solution with the same score.

Methods of Alignment
• By hand - slide sequences on two lines of a word processor
• Dot plot
  – with windows
• Rigorous mathematical approach
  – Dynamic programming (slow, optimal)
• Heuristic methods (fast, approximate)
  – BLAST and FASTA
  – Word matching and hash tables

Align by Hand
    GATCGCCTA_TTACGTCCTGGAC
    <--->
    AGGCATACGTA_GCCCTTTCGC
You still need some kind of scoring system to find the best alignment.

Percent Sequence Identity
• The extent to which two nucleotide or amino acid sequences are invariant
    ACCTGAG-AG
    ACGTG-GCAG
  One mismatch and two indels in ten columns: 70% identical

Dotplot
A dotplot gives an overview of all possible alignments.
[Figure: dotplot of Sequence 1 against Sequence 2]

Dotplot
In a dotplot each diagonal corresponds to a possible (ungapped) alignment.
[Figure: dotplot of Sequence 1 against Sequence 2 with one diagonal marked]
One possible alignment:
    T A C A T T A C G T A C
    A T A C A C T T A

Insertions / Deletions in a Dotplot
[Figure: dotplot of TACTGTCAT against TACTGTTCAT]
    TACTG-TCAT
    ||||| ||||
    TACTGTTCAT

Alignment methods
• Rigorous algorithms = Dynamic Programming
  – Needleman-Wunsch (global)
  – Smith-Waterman (local)
• Heuristic algorithms (faster but approximate)
  – BLAST
  – FASTA

Pairwise alignment
Pairwise sequence alignment methods are concerned with finding the best-matching piecewise (local) or global alignments of protein (amino acid) or DNA (nucleic acid) sequences. Typically, the purpose is to find homologues (relatives) of a gene or gene product in a database of known examples. This information is useful for answering a variety of biological questions:
1. The identification of sequences of unknown structure or function.
2. The study of molecular evolution.

Dynamic Programming Approach to Sequence Alignment
• The dynamic programming approach builds each new result on the best result obtained so far.
• It aligns two sequences by inserting gaps at different locations so as to maximize the score of the alignment.
• The score is determined by the "match award", "mismatch penalty" and "gap penalty". The higher the score, the better the alignment.
• If both penalties are set to 0, the method finds an alignment with the maximum number of matches.
• Maximum match = the largest number of matches obtainable for one sequence when all possible deletions are allowed in the other sequence.
• It is used to compare the similarity of two DNA or protein sequences, for example to predict whether they have similar functions.
• Examples: Needleman-Wunsch (1970), Sellers (1974), Smith-Waterman (1981)

Global alignment
A global alignment between two sequences is an alignment in which all the characters in both sequences participate. Global alignments are useful mostly for finding closely related sequences.
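
The scoring idea above (match award, mismatch penalty, gap penalty) and the percent identity measure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the default values match = 1, mismatch = -1, gap = -2 are borrowed from the worked examples in these slides, and percent identity is computed over the full alignment length, which is one of several conventions in use.

```python
def percent_identity(row_a: str, row_b: str, gap: str = "-") -> float:
    """Percentage of alignment columns holding the same residue in both rows.

    Assumption: the denominator is the alignment length (gap columns included).
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    identical = sum(1 for a, b in zip(row_a, row_b) if a == b and a != gap)
    return 100.0 * identical / len(row_a)


def alignment_score(row_a: str, row_b: str, match: int = 1,
                    mismatch: int = -1, gap_penalty: int = -2,
                    gap: str = "-") -> int:
    """Score an existing gapped alignment column by column."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    score = 0
    for a, b in zip(row_a, row_b):
        if a == gap or b == gap:
            score += gap_penalty      # column with an indel
        elif a == b:
            score += match            # identical residues
        else:
            score += mismatch         # substitution
    return score


if __name__ == "__main__":
    # The ten-column example from the Percent Sequence Identity slide.
    print(percent_identity("ACCTGAG-AG", "ACGTG-GCAG"))
    # Four matches and one gap.
    print(alignment_score("TCGCA", "TC-CA"))
```

Both functions only evaluate an alignment that already exists; finding the best-scoring alignment is the job of the dynamic programming methods.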
Global Alignment
• Global algorithms are often not effective for highly diverged sequences and do not reflect the biological reality that two sequences may share only limited regions of conserved sequence.
• Sometimes two sequences may be derived from ancient recombination events where only a single functional domain is shared.
• Global alignment is useful when you want to force two sequences to align over their entire length.

Global Alignment
Find the global best fit between two sequences.
Example: the sequences s = VIVALASVEGAS and t = VIVADAVIS align like:
    A(s,t) =
    VIVALASVEGAS
    |||| | |   |
    VIVADA-V--IS
The dashes mark indels.

The Needleman-Wunsch algorithm
• The Needleman-Wunsch algorithm performs a global alignment of two sequences (s and t) and is applied to protein or nucleotide sequences.
• It is an example of dynamic programming, a discipline invented by Richard Bellman (an American mathematician) in 1953, and is guaranteed to find the alignment with the maximum score.

Local alignment
Local alignment methods find related regions within sequences; these regions can consist of a subset of the characters within each sequence. For example, positions 20-40 of sequence A might be aligned with positions 50-70 of sequence B. This is a more flexible technique than global alignment and has the advantage that related regions which appear in a different order in the two proteins (known as domain shuffling) can be identified as being related. This is not possible with global alignment methods.

The Smith-Waterman algorithm
• The Smith-Waterman algorithm (1981) determines similar regions between two nucleotide or protein sequences.
• Smith-Waterman is also a dynamic programming algorithm, a variation of Needleman-Wunsch adapted to local alignment. As such, it is guaranteed to find the optimal local alignment with respect to the scoring system being used (which includes the substitution matrix and the gap-scoring scheme).
• However, the Smith-Waterman algorithm is demanding of time and memory resources: to align two sequences of lengths m and n, O(mn) time and space are required.

Global vs. Local Alignments
• Global alignment algorithms start at the beginning of two sequences and add gaps to each until the end of one is reached.
• Local alignment algorithms find the region (or regions) of highest similarity between two sequences and build the alignment outward from there.

Statistical analysis of alignments
This works the same way as in gene finding:
• Generate randomized sequences based on the second string
• Determine the optimal alignments of the first sequence with these randomized sequences
• Compute a histogram and rank the observed score in this histogram

The Needleman-Wunsch algorithm
• A smart way to reduce the massive number of possibilities that need to be considered while still guaranteeing that the best solution will be found (Saul Needleman and Christian Wunsch, 1970).
• The basic idea is to build up the best alignment by using optimal alignments of smaller subsequences.

Needleman & Wunsch
• Place each sequence along one axis
• Place score 0 at the upper-left corner
• Fill in the 1st row & column with gap penalty multiples
• Fill in the matrix with the max value of 3 possible moves:
  – Vertical move: Score + gap penalty
  – Horizontal move: Score + gap penalty
  – Diagonal move: Score + match/mismatch score
• The optimal alignment score is in the lower-right corner
• To reconstruct the optimal alignment, trace back where the max at each step came from; stop when you hit the origin.

Example
• Let gap = -2, match = 1, mismatch = -1.

             empty    A    A    A    C
    empty      0     -2   -4   -6   -8
    A         -2      1   -1   -3   -5
    G         -4     -1    0   -2   -4
    C         -6     -3   -2   -1   -1

    AAAC      AAAC
    A-GC      -AGC

Local Alignment
• Problem first formulated:
  – Smith and Waterman (1981)
• Problem:
  – Find an optimal alignment between a substring of s and a substring of t
• Algorithm:
  – is a variant of the basic algorithm for global alignment

Smith & Waterman
• Place each sequence along one axis
• Place score 0 at the upper-left corner
• Fill in the 1st row & column with 0s
• Fill in the matrix with the max value of 4 possible values:
  – 0
  – Vertical move: Score + gap penalty
  – Horizontal move: Score + gap penalty
  – Diagonal move: Score + match/mismatch score
• The optimal alignment score is the max in the matrix
• To reconstruct the optimal alignment, trace back where the MAX at each step came from; stop when a zero is hit.

Local Alignment
• Let gap = -2, match = 1, mismatch = -1.
• Sequences: GATCACCT and GATACCC

             empty   G   A   T   A   C   C   C
    empty      0     0   0   0   0   0   0   0
    G          0     1   0   0   0   0   0   0
    A          0     0   2   0   1   0   0   0
    T          0     0   0   3   1   0   0   0
    C          0     0   0   1   2   2   1   1
    A          0     0   1   0   2   1   1   0
    C          0     0   0   0   0   3   2   2
    C          0     0   0   0   0   1   4   3
    T          0     0   0   1   0   0   2   3

    GATCACCT
    GAT_ACCC

Pairwise alignment: the solution
“Dynamic programming” (the Needleman-Wunsch algorithm)

Alignment depicted as path in matrix
[Figure: two paths through the matrix of TCGCA against TCCA, one per alignment]
    TCGCA     TCGCA
    TC-CA     T-CCA

Alignment depicted as path in matrix
Meaning of a point in the matrix: all residues up to this point have been aligned (but there are many different possible paths).
[Figure: matrix of TCGCA against TCCA with one position labeled “x”]
Position labeled “x”: TC aligned with TC
    --TC     -TC     TC
    TC--     T-C     TC

Creation of an alignment path matrix
• If F(i-1,j-1), F(i-1,j) and F(i,j-1) are known, we can calculate F(i,j)
• Three possibilities:
  – xi and yj are aligned: F(i,j) = F(i-1,j-1) + s(xi,yj)
  – xi is aligned to a gap: F(i,j) = F(i-1,j) - d
  – yj is aligned to a gap: F(i,j) = F(i,j-1) - d
• The best score up to (i,j) will be the largest of the three options

Dynamic programming: computation of scores
• Any given point in the matrix can only be reached from three possible previous positions (you cannot “align backwards”).
• => The best-scoring alignment ending in any given point in the matrix can be found by choosing the highest scoring of the three possibilities.
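
The recurrence for F(i,j) and the trace-back procedure can be sketched as follows. This is a minimal illustration, not the slides' own implementation: the function name is hypothetical, a simple match/mismatch score stands in for a substitution matrix s(xi,yj), and the gap value is given as a negative number that is added (gap = -2), which is equivalent to subtracting a penalty d = 2. When several alignments tie, the sketch returns whichever one its trace-back order reaches first.

```python
def needleman_wunsch(x: str, y: str, match: int = 1,
                     mismatch: int = -1, gap: int = -2):
    """Global alignment; returns (score, aligned_x, aligned_y)."""
    n, m = len(x), len(y)
    # F[i][j] = best score for aligning x[:i] with y[:j]
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap              # first column: gap penalty multiples
    for j in range(1, m + 1):
        F[0][j] = j * gap              # first row: gap penalty multiples
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,   # diagonal move
                          F[i - 1][j] + gap,     # x[i-1] aligned to a gap
                          F[i][j - 1] + gap)     # y[j-1] aligned to a gap

    # Trace back from the lower-right corner to the origin.
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if x[i - 1] == y[j - 1] else mismatch
            if F[i][j] == F[i - 1][j - 1] + s:
                ax.append(x[i - 1])
                ay.append(y[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and F[i][j] == F[i - 1][j] + gap:
            ax.append(x[i - 1])
            ay.append("-")
            i -= 1
        else:
            ax.append("-")
            ay.append(y[j - 1])
            j -= 1
    return F[n][m], "".join(reversed(ax)), "".join(reversed(ay))


if __name__ == "__main__":
    score, top, bottom = needleman_wunsch("AAAC", "AGC")
    print(score)
    print(top)
    print(bottom)
```

Run on AAAC and AGC with the default scores, the lower-right cell holds -1, matching the worked matrix for that example.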
    score(x,y) = max of
        score(x,y-1) - gap-penalty
        score(x-1,y-1) + substitution-score(x,y)
        score(x-1,y) - gap-penalty

• Each new score is found by choosing the maximum of three possibilities.
• For each square in the matrix: keep track of where the best score came from.
• Fill in scores one row at a time, starting in the upper left corner of the matrix and ending in the lower right corner.

Dynamic programming: example
Substitution scores:

         A    C    G    T
    A    1   -1   -1   -1
    C   -1    1   -1   -1
    G   -1   -1    1   -1
    T   -1   -1   -1    1

Gaps: -2

    T C G C A
    : :   : :
    T C - C A
    1+1-2+1+1 = 2

Global versus local alignments
• Global alignment: align the full length of both sequences (the “Needleman-Wunsch” algorithm).
• Local alignment: find the best partial alignment of two sequences (the “Smith-Waterman” algorithm).
[Figure: Seq 1 and Seq 2 under global alignment and under local alignment]

Local alignment overview
• The recursive formula is changed by adding a fourth possibility: zero. This means local alignment scores are never negative.

    score(x,y) = max of
        score(x,y-1) - gap-penalty
        score(x-1,y-1) + substitution-score(x,y)
        score(x-1,y) - gap-penalty
        0

• Trace-back is started at the highest value rather than in the lower right corner
• Trace-back is stopped as soon as a zero is encountered

Local alignment: example
[Figure: worked local alignment example]

Alignments: things to keep in mind
• “Optimal alignment” means “having the highest possible score, given the substitution matrix and set of gap penalties”.
• This is NOT necessarily the biologically most meaningful alignment.
• Specifically, the underlying assumptions are often wrong: substitutions are not equally frequent at all positions, affine gap penalties do not model insertion/deletion well, etc.
• Pairwise alignment programs always produce an alignment, even when it does not make sense to align the sequences.
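
Because an alignment is always returned, the randomization procedure listed under "Statistical analysis of alignments" is one way to judge whether a score means anything. The sketch below is a rough illustration under stated assumptions: the function names, the number of trials, and the fixed random seed are all hypothetical choices, "randomized sequences based on the second string" is interpreted as shuffling its letters, and the scoring uses the simple match = 1, mismatch = -1, gap = -2 scheme from the examples rather than a real substitution matrix.

```python
import random


def smith_waterman_score(x: str, y: str, match: int = 1,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score (score only, no trace-back)."""
    prev = [0] * (len(y) + 1)          # previous row of the matrix
    best = 0
    for i in range(1, len(x) + 1):
        cur = [0] * (len(y) + 1)       # first column stays 0
        for j in range(1, len(y) + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            # Fourth possibility, zero, keeps local scores non-negative.
            cur[j] = max(0, prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best


def shuffle_rank(x: str, y: str, trials: int = 200, seed: int = 0):
    """Return (observed score, fraction of shuffles scoring at least as high)."""
    rng = random.Random(seed)          # fixed seed only for repeatability
    observed = smith_waterman_score(x, y)
    letters = list(y)
    hits = 0
    for _ in range(trials):
        rng.shuffle(letters)           # randomized version of the second string
        if smith_waterman_score(x, "".join(letters)) >= observed:
            hits += 1
    return observed, hits / trials


if __name__ == "__main__":
    observed, fraction = shuffle_rank("GATCACCT", "GATACCC")
    print(observed, fraction)
```

Keeping only two rows of the matrix is enough when just the score is needed, so this score-only version uses memory proportional to one sequence length; recovering the alignment itself still requires the full O(mn) matrix or a more elaborate scheme. For sequences as short as these the fraction of high-scoring shuffles is not very informative; the point is only to show the ranking step.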
