Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Multilocus sequence typing wikipedia , lookup
Community fingerprinting wikipedia , lookup
Promoter (genetics) wikipedia , lookup
Two-hybrid screening wikipedia , lookup
Molecular ecology wikipedia , lookup
Endogenous retrovirus wikipedia , lookup
Non-coding DNA wikipedia , lookup
Genetic code wikipedia , lookup
Protein structure prediction wikipedia , lookup
Artificial gene synthesis wikipedia , lookup
Pairwise Sequence Comparison Dong Xu Computer Science Department 271C Life Sciences Center E-mail: [email protected] 573-882-7064 http://digbio.missouri.edu Lecture Outline Introduction Scoring function Dynamic programming Confidence Assessment Heuristic alignment Introduction (1) Biological sequence comparison DNA-DNA, RNA-RNA Protein-protein DNA-protein Sequence comparison is the in most important and fundamental operation bioinformatics Key to understand evolution of an organism or a gene Introduction (2) Applications in most bioinformatics problems Sequence assembly Gene finding Protein structure prediction Phylogenic tree analysis THE most popular tool: BLAST Foundation of sequence database search Evolutionary Basis of Sequence Comparison Ancestor Orthologs (inheritance) Gene duplication Paralogs X Random mutation Evolutionary selection Recombination (fusion) Y 75%X 25%Y Mixed Homology An Example of Sequence Comparison TT....TGTGTGCATTTAAGGGTGATAGTGTATTTGCTCTTTAAGAGCTG || || || | | ||| | |||| ||||| ||| ||| TTGACAGGTACCCAACTGTGTGTGCTGATGTA.TTGCTGGCCAAGGACTG AGTGTTTGAGCCTCTGTTTGTGTGTAATTGAGTGTGCATGTGTGGGAGTG | | | | |||||| | |||| | || | | AAGGATC.............TCAGTAATTAATCATGCACCTATGTGGCGG AAATTGTGGAATGTGTATGCTCATAGCACTGAGTGAAAATAAAAGATTGT ||| | ||| || || ||| | ||||||||| || |||||| | AAA.TATGGGATATGCATGTCGA...CACTGAGTG..AAGGCAAGATTAT Alignement (1) a correspondence between elements of two sequences with order (topology) kept FSEYTTHRGHR : ::::: :: FESYTTHRPHR FESYTTHRGHR :::::::: :: FESYTTHRPHR pairwise alignment: 2 sequences aligned multiple alignment: alignment of 3 or more sequences Alignement (2) Similar to ”longest common subsequence” (LCS) problem for strings, (Robinson, 1938) LCS: define a set of operations (e.g. substitution, insertion or deletion) that transform the aligned elements of one sequence into the corresponding elements of the other and associate with each operation a cost or a score. Optimal alignment: the alignment that is associated with the lowest cost (or highest score). Between two sequences several optimal alignments can be constructed with the same optimal score. FSEY-THRGHR : : ::: :: FESYTTHRPHR FSEYT-HRGHR : :: :: :: FESYTTHRPHR Components of Sequence Alignment (2) An algorithm for alignment; (3) Confidence assessment of alignment result. Indel { (1) Scoring function: a measure of similarity between elements (nucleotides, amino acids, gaps); Insertion Deletion FDSK-THRGHR :.: :: ::: FESYWTH-GHR Match (:) Mismatch (substitution) Lecture Outline Introduction Scoring function Dynamic programming Confidence Assessment Heuristic alignment Edit Distance (Hamming Distance) Introduced by Levenshtein in 1966 Binary: match 1 / mismatch 0 (Identity Matrix) Definition: Minimum number of edit operations to transform one string to another Can be used for DNA/RNA Possible edit operations Symbol insertion Symbol deletion Symbol substitution Scoring Matrix amino acid substitution matrices (20X20) account for probability of one amino acid being substituted for another: frequency of substitution - genetic code tolerance for changes - natural selection penalize residues pairs with a low probability of mutation in evolution and rewards pairs with a high probability empirically derived from observed amino acid substitutions that occur between aligned residues in homologous sequences Physical Bases of Mutation Matrix Geometry nature Physical nature (charged or hydrophobic) Chemical nature Frequencies of amino acids physical property matrices PAM The first substitution matrices derived by Dayhoff et al. (1978) PAM (point accepted mutation) distance: Two sequences are defined to have diverged by one PAM unit if they show in average one accepted point mutation (i.e. one amino acid change) per hundred amino acids. Derived from the pairwise alignment of sequences less than 15% divergent. BLOSUM Block substitution matrices (Henikoff & Henikoff 1992) Blocks: highly conserved regions in a set of aligned protein sequences (local multiple alignment) Number of BLOSUM matrix (e.g. BLOSUM 62) indicates the cutoff of percent identity that defines the clusters - lower cutoffs allow more diverse sequences BLOSUM 62 Matrix A R N D C Q E G H I L K M F P S T W Y V 4 -1 -2 -2 0 -1 -1 0 -2 -1 -1 -1 -1 -2 -1 1 0 -3 -2 0 5 0 -2 -3 1 0 -2 0 -3 -2 2 -1 -3 -2 -1 -1 -3 -2 -3 6 1 -3 0 0 0 1 -3 -3 0 -2 -3 -2 1 0 -4 -2 -3 6 -3 0 2 -1 -1 -3 -4 -1 -3 -3 -1 0 -1 -4 -3 -3 9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1 5 2 -2 0 -3 -2 1 0 -3 -1 0 -1 -2 -1 -2 5 -2 0 -3 -3 1 -2 -3 -1 0 -1 -3 -2 -2 6 -2 -4 -4 -2 -3 -3 -2 0 -2 -2 -3 -3 8 -3 -3 -1 -2 -1 -2 -1 -2 -2 2 -3 4 2 -3 1 0 -3 -2 -1 -3 -1 3 4 -2 2 0 -3 -2 -1 -2 -1 1 5 -1 -3 -1 0 -1 -3 -2 -2 5 0 -2 -1 -1 -1 -1 1 6 -4 -2 -2 1 3 -1 7 -1 -1 -4 -3 -2 4 1 -3 -2 -2 5 -2 -2 0 11 2 -3 7 -1 4 A R N D C Q E G H I L K M F P S T W Y V What Matrices to Use Close homolog: high cutoffs for BLOSUM (up to BLOSUM 90) or lower PAM values BLAST default: BLOSUM 62 Remote homolog: lower cutoffs for BLOSUM (down to BLOSUM 10) or high PAM values (PAM 200 or PAM 250) A best performer in structure prediction: PAM 250 Gap Penalty Functions Corresponding to insertion/deletion in evolution Typically linear gap penalties Easy to implement in algorithms Satisfactory performance in alignment Can be derived from alignment Known alignments Performance-based (sequence comparison) Affine Gap Penalty Most w(k) commonly used model = h + gk , k 1 ,with w(0) = 0. h: gap opening penalty; g: gap extension penalty h > g > 0 (e.g., for PAM250, 10.8 + 0.6k) Non-linear form: h + g log (k) FDS--THRGHR :.: :::::: FESYTTHRGHR FDS-T-HRGHR :.: : ::::: FESYTTHRGHR Lecture Outline Introduction Scoring function Dynamic programming Confidence Assessment Heuristic alignment Global vs. Local Alignment Global alignment: the alignment of complete sequences Good for comparing members of same protein family Needleman & Wunsch 1970 J Mol Biol 48:443 Local alignment: the alignment of segments of sequences ignore areas that show little similarity Smith & Waterman 1981, J Mol Biol, 147:195 modified from Needelman-Wunsh algorithm can be done with heuristics (FASTA and BLAST) Dot Matrix and Alignment Dot matrix: Score between cross-elements path: Mapping to an alignment AACG - GT A TGC ATCG G GT - TGC A AC G G T A T G C A1 1 1 T 1 1 C 1 1 G 1 11 G 1 11 G 1 1 1 T 1 1 T 1 1 G 1 1 1 C 1 1 Dynamic Programming Steps 1. Assign scores between elements in dot matrix 2. For each cell in the dot matrix, check all possible pathways back to the beginning of the sequence (allowing insertions and deletions) and give that cell the value of the maximum scoring pathway 3. Construct an alignment (pathway) back from the last cell in the dot matrix (or the highest scoring) cell to give the highest scoring alignment Dynamic programming Foundation: any partial subpath ending at a point along the true optimal path must itself be an optimal path leading up to that point. So the optimal path can be found by incremental extensions of optimal subpaths One of the most fundemental algorithms in bioinfomatics Applications: sequence comparison, gene finding, mass-spec data analysis, ... Needleman-Wunsch Algorithm (1) Global alignment Elementary operations: Single insertion/deletions (s(ai,-) or s(-,bj)) Substitution (s(ai,bj)) Easy case: h=0 for gap penalty (h+gk) Needleman-Wunsch Algorithm (2) S00 = 0 i Si 0 = s(ak ,-) 1 i l1 k =1 j S0 j = s(-, bk ) 1 j l2 k =1 Si -1, j + s(ai ,-) Sij =max Si -1, j-1 + s (a i,bj) 1 i l1 , 1 j l2 Si , j -1 + s(-,bj ) The optimal score ending at i & j Needleman-Wunsch Algorithm (3) Calculate S(i,j) in three ways: •By adding a score s(ai,bj) to the score diagonally upwards, i.e. S(i-1,j-1); i A j S(i-1,j-1) S(i,j-1) A S(i-1,j) S(i,j) A T C •By adding a score s (-,bj) (which represents the introduction of a gap into the alignment) to the score vertically above (i.e. S(i,j-1); •Or by adding s(ai,-) to the score horizontally to the left (i.e. S(i-1,j) A Alignment Construction (1) Initialization: S(0,0) = 0 the outside row and column are given incrementally decreasing values 0 A -1 T -2 C -3 ... ... A A C ... -1 -2 -3 ... -AT (AAC) (C) Alignment Construction (2) S(1,1) : one of three values: (1) ai = bj, s = 1 S(i-1,j-1) + s(ai,bj) = 0+1 = 1 (2) add s(-,bj) to S(i,j-1) s(i,j-1) - s(-,bj) = -2 (3) add s(ai,-) to S(i-1,j) s(i-1,j) - s(ai,-) = -2 choose highest 1 in the cell. A A C ... 0 -1 -2 -3 ... A -1 1 T -2 C -3 ... ... A(AC) A-(AC) -A(AC) A(TC) -A(TC) A-(TC) Alignment Construction (3) For the next cell, as ai = bj again, s(ai,bj) = 1 and the three possible scores are: i,j i, j-1 i-1, j -1 + 1 = 0 -2 - 1 = -3 1-1=0 Two degenerate paths! (Max=3) A A C 0 -1 -2 -3 A -1 1 0 T -2 C -3 ... ... ... ... Alignment Construction (4) For the next cell, as ai bj, s(ai,bj) = 0. The three possible scores are: i,j -2 + 0 = -2 i, j-1 -3 - 1 = -4 i-1, j 0 - 1 = -1 A A C 0 -1 -2 -3 A -1 1 0 -1 T -2 C -3 ... ... ... ... Alignment Construction (5) A Trace back: C C AC TC AAC ATC A C 0 -1 -2 -3 A -1 1 0 -1 T -2 0 1 0 C -3 -1 0 2 . . . . . . . . . . . . Mathematical Representation Initialization Si 0 = -i S0 j = - j Scoring Length sequence 1 0 < i l1 0 < j l2 Length sequence 2 Si -1, j - 1 Sij = max Si -1, j -1 + s (ai , bj ) 0 < i l1 ,0 < j l2 Si , j -1 - 1 with 1 for ai = bj s (ai , bj ) = else 0 Computational Complexity of Dynamic Programming Computing time: O(nm), where n and m are sequence lengths). Retrieval time: O(Max (n,m)) [worst case: n+m; best case: Min(n,m)] Required memory: O(nm). Keeping in mind the computational complexity while programming Smith-Waterman Algorithm S0,j = Si,0 = 0 for 0 i l1 and 0 j l2 Si-1,j + s (ai , - ) Sij = max Set all values in top row and left column to 0 Si-1,j-1+ s (ai ,bj) 1 i l1 , 1 j l2 Si,j-1 + s ( - ,bj ) Set the value of Sij to 0 if it would otherwise be less than 0 0 Traceback from highest value of Sij, rather than from bottom right corner. Stop at 0 rather than top left corner. Smith-Waterman Algorithm With Affine Gap Penalties Dij = max Boundary conditions: S0,j = Si,0 = D0,j = Di,0 = I 0, j = Ii,0 = 0 Iij = max for 0 i l1 and 0 j l2 Si-1,j - h - g Di-1,j - g Si ,j-1 - h – g Ii,j-1 - g Si-1,j -1 + s(ai ,bj ) D i-1,j -1 + s(ai ,bj ) Sij = max I i-1,j -1 + s(ai ,bj ) 0 for 1 i l1 and 1 j l2 Gap penalty: w(k) = h + gk , k 1 Alignment retrieval for SmithWaterman algorithm GCC-UCG GCCAUUG A A U G C C A U U G A C G G C 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 A 1.0 1.0 0.0 0.0 0.0 0.7 2.0 0.7 0.3 0.0 1.0 0.0 0.7 0.0 G 0.0 0.7 0.8 1.0 0.0 0.0 0.7 1.7 0.3 1.3 0.0 0.7 1.0 1.7 C 0.0 0.0 0.3 0.3 2.0 1.0 0.3 0.3 1.3 0.0 1.0 1.0 0.3 0.7 C 0.0 0.0 0.0 0.0 1.3 3.0 1.7 1.3 1.0 1.0 0.3 2.0 0.7 0.3 U 0.0 0.0 0.0 0.0 0.3 1.7 2.7 2.7 2.3 1.0 0.7 0.7 1.7 0.3 C 0.0 0.0 0.0 0.7 1.0 1.3 1.3 2.3 2.3 2.0 0.7 1.7 0.3 1.3 G 0.0 0.0 0.0 1.0 0.3 1.0 1.0 1.0 2.0 3.3 2.0 1.7 2.7 1.3 C 0.0 0.0 0.0 0.0 2.0 1.3 0.7 0.7 0.7 2.0 3.0 3.0 1.7 2.3 U 0.0 0.0 1.0 0.0 0.7 1.7 1.0 1.7 1.7 1.7 1.7 2.7 2.7 1.3 U 0.0 0.0 1.0 0.7 0.3 0.3 1.3 2.0 2.7 1.3 1.3 1.3 2.3 2.3 A 1.0 1.0 0.0 0.7 0.3 0.0 1.3 1.0 1.7 2.3 2.3 1.0 1.0 2.0 G 0.0 0.7 0.7 1.0 0.3 0.0 0.0 1.0 1.0 2.7 2.0 2.0 2.0 2.0 Lecture Outline Introduction Scoring function Dynamic programming Confidence Assessment Heuristic alignment Confidence Assessment of Sequence Alignment Why confidence assessment is needed True homology or alignment by chance Expected probability by chance Statistical models Why not to use sequence identity as confidence measure What to Compare Real but non-homologous sequences Real sequences that are shuffled to preserve compositional properties Sequences that are generated randomly based upon a DNA or protein sequence model (analytic statistical results) P-value or E-value The probability that a variate would assume a value greater than or equal to the observed value strictly by chance P(z>zo) If the P-value found for an alignment is low (<0.001) then alignment is probably biologically meaningful. Pre-compute the parameters based on a statistical model Extremal Value Distribution The maximum scores of a large number of alignments between random sequences of equal length tends to have an extreme value distribution P(S’<x)=exp[-exp(-x)] P(S’>=x)= 1- exp[-exp(-x)] Bit scores: S’= (S-lnK)/ln2 and K: scaling factors (depending on composition, mutation matrix used etc.) An Example 100000 alignments of between unrelated proteins using Pam250 The tail of high SD scores Other Issues Gapless alignment vs. gapped alignment Low complexity regions Over-estimate or under-estimate of confidence level Scalability of Software The trend of genetic data growth 30 billion in year 2005 Genomes: yeast, human, rice, mouse, fly... Software must be (linearly or close to linearly) scalable to large datasets. Need for Heuristic Alignment Time complexity for optimal alignment: O(n2) , n -- sequence length Given the current size of sequence databases, use of optimal algorithms is not practical for database search Heuristic techniques: BLAST, FASTA, MUMmer, PatternHunter... 20 min (optimal alignment, SSearch) 2min (FASTA) 20 sec (BLAST) Ideas in Heuristics Search Indexing and filtering: Google search Good alignment includes short identical, or similar fragments break entire string into substrings, index the substrings Search for matching short substrings and use as seed for further analysis extend to entire string and find the most significant local alignment segment FASTA (1) Lipman & Pearson, 1985, Science 227, 1435-1441 Four steps Step 1: Identify regions of the sequences with the highest density of matches. In this step exact matches of a given length (by default 2 for proteins, 6 for nucleic acids) are determined and regions (fragments of diagonals) with a high number of matches selected. FASTA (2) FASTA (3) Step 2: Rescan 10 regions with the highest density of identities using the BLOSUM50 matrix. Trim the ends of region to include only those residues contributing to the highest score. Each region is a partial alignment without gaps. Step 3: If there are several regions with scores greater than a cut off value, try to join these regions. A score for the joined initial regions is calculated given a penalty for each gap. FASTA (4) FASTA (5) Step 4: Select the sequences in the database with the highest score. For each of those sequences construct a Smith-Waterman optimal alignment considering only the positions that lie in a band centred on the best initial region found. FASTA (6) A-FTFWSYAIGL--PSSSIVSWKSCHVLHKVLRDGHPNVLHDCQRYRSNI | |||||| || || |||| | | |... | : AIPQFWSYAIERPLNSSWIVVWKSCITTHHLMVYGNERFIQYLAS-RNTL Step 1 Step 2 Step 3 / Step 4 BLAST (1) Basic Alignment Search Tool (Altschul et al, 1990, J. Mol. Biol. 215, 403-410) Uses word matching like FASTA Similarity matching of words (3 aa’s, 11 bases) does not require identical words. If no words are similar, then no alignment won’t find matches for very short sequences BLAST (2) Detects alignments with optimal maximal segment pair (MSP) score. Gaps are not allowed. MSP: Highest scoring pair of segments of identical length. A segment pair is locally maximal if it cannot be improved by extending or shortening the segments. Homologous sequences will contain a MSP with a high score; others will be filtered out. BLAST (3) BLAST (4) BLAST (5) Does not handle gaps well Genome Alignment by PatternHunter (4 seconds) Comparison with Blastn and MegaBlast Blastn MegaBlast PatternHunter E.coli vs H.inf 716s 5s/561M 14s/68M Arabidopsis 2 vs 4 -21720s/1087M 498s/280M Human 21 vs 22 --5250s/417M Human (3G) vs Mouse 20 days Secret in PatternHunter Seeds (length of word used) Blastn finds a match of length 11, then extend from there. MegaBlast in order to increase speed, increases this to 28. Dilemma: increasing seed size speeds up but loses sensitivity; decreasing seed size gains sensitivity but loses speed. Spaced Seed: PatternHunter looks for matches of 11 nonconsecutive matches and optimized such seeding scheme. Super Seeds ATTTCCGACGCGAGGGGACTTTCAGGAGAG AGGGGACTTTC 11111111111 ATTTCCGACGCGAGGGGACTTTCAGGAGAG GTGATGGAACAATCGAGA 101101101100110011 G*GA*GG*AC**TC**GA Reading Assignments Suggested reading: Chapter 6 in “Pavel Pevzner: Computational Molecular Biology - An Algorithmic Approach. MIT Press, 2000.” Optional reading: http://www.people.virginia.edu/~wrp/papers/i smb2000.pdf Optional Assignment (1) 1. What does 200 mean in PAM 200? 2. What does 62 mean in BLOSUM 62? 3. Prove Needleman-Wunsch algorithm produces optimal global alignment. 4. Prove the computational complexity of SmithWaterman algorithm. 5. What is the relationship between the alignment score and statistical significance of the alignment? Optional Assignment (2) Explain the difference between PAM40 and PAM250. Why some elements in the matrix have different signs? Optional Assignment (3) Construct dynamic program matrix using edit distance and PAM250 distance, respectively. 1. Try different affine gap penalties, h=g or not 2. Try global and local alignment. M P R C L C T M P C L W C Q Project Assignment (1) Develop a program that can perform optimal global-local alignment for DNA sequences: 1. Fit a short sequence into a long sequence and output ALL optimal alignments. 2. No gap penalty when deleting terminal bases of the long sequence, but with gap penalty for deleting any base of the short sequence. 3. Use edit distance (match 1; otherwise 0) with gap penalty –1 – k (k is gap size). Project Assignment (2) 4. Use the FASTA format for input of each sequence (see http://www.g2l.bio.unigoettingen.de/blast/fastades.html). 5. Test on sample sequences, e.g., TTTGAGCCTCTGTTTGTGTGTAATTGAT-GTGCATGTGTGGG || |||||| || |||| ................TG-GTAATTAATCATGCAC....... The above format can be used for your outputs.