LSM3241: Bioinformatics and Biocomputing
Lecture 4: Sequence analysis methods revisited

Prof. Chen Yu Zong
Tel: 6874-6877
Email: [email protected]
http://xin.cz3.nus.edu.sg
Room 07-24, level 7, SOC1, National University of Singapore

Sequence Analysis Methods

Gene and Protein Sequence Alignment as a Mathematical Problem
Example:
Sequence a: ATTCTTGC
Sequence b: ATCCTATTCTAGC
Best alignment (one gap):
ATTCTTGC
ATCCTATTCTAGC
Bad alignment (two gaps):
AT TCTT GC
ATCCTATTCTAGC
What is a good alignment?

How to rate an alignment?
• Match: +8 (w(x, y) = 8, if x = y)
• Mismatch: -5 (w(x, y) = -5, if x ≠ y)
• Each gap symbol: -3 (w(-, x) = w(x, -) = -3)
[Figure: two sequences a1 a2 a3 ... and b1 b2 b3 ... written one above the other, with gap symbols (-) inserted so that each symbol x of a is paired with a symbol y of b or with a gap.]

Pairwise Alignment
Sequence a: CTTAACT
Sequence b: CGGATCAT
An alignment of a and b:
C---TTAACT
CGGATCA--T
[Figure labels: match, mismatch, insertion gap, deletion gap.]

Alignment Graph
Sequence a: CTTAACT
Sequence b: CGGATCAT
C---TTAACT
CGGATCA--T
[Figure: a grid with the letters of one sequence along one axis and the letters of the other along the second axis; the insertion gap and the deletion gap of the alignment are marked on the grid.]

Graphic representation of an alignment
Sequence a: CTTAACT
Sequence b: CGGATCAT
C---TTAACT
CGGATCA--T
[Figure, built up over several slides: the alignment is drawn column by column as a path through the grid.]

Pathway of an alignment
Sequence a: CTTAACT
Sequence b: CGGATCAT
C---TTAACT
CGGATCA--T
[Figure: the complete path of this alignment through the grid.]

Graphic representation of an alignment / Pathway of an alignment
Sequence a: CTTAACT
Sequence b: CGGATCAT
[Figure: the same construction for a second alignment of CTTAACT and CGGATCAT.]

Use of graph to generate alignments
Sequence a: 
CTTAACT
Sequence b: CGGATCAT
[Figure, three slides: three different paths through the grid, each spelling out a different gapped alignment of a and b.]
Path 1:
-CTTAACT
CGGATCAT
Path 2:
-C--TTAACT
CGGATC-AT-
Path 3: a path that places the gap symbols in a single run.

Which pathway is better?
Sequence a: CTTAACT
Sequence b: CGGATCAT
There are multiple pathways, each with its own alignment score.

Alignment Score
Sequence a: CTTAACT
Sequence b: CGGATCAT
C---TTAACT
CGGATCA--T
Running score along the path, column by column:
C/C: 8
-/G: 8 - 3 = 5
-/G: 5 - 3 = 2
-/A: 2 - 3 = -1
T/T: -1 + 8 = 7
T/C: 7 - 5 = 2
A/A: 2 + 8 = 10
A/-: 10 - 3 = 7
C/-: 7 - 3 = 4
T/T: 4 + 8 = 12
Alignment score: 12

An optimal alignment: the alignment of maximum score
• Let A = a1a2…am and B = b1b2…bn.
• S_{i,j}: the score of an optimal alignment between a1a2…ai and b1b2…bj.
• With proper initializations, S_{i,j} can be computed as follows:

$$ S_{i,j} = \max \begin{cases} S_{i-1,j} + w(a_i, -) \\ S_{i,j-1} + w(-, b_j) \\ S_{i-1,j-1} + w(a_i, b_j) \end{cases} $$

Computing S_{i,j}
[Figure: cell (i, j) of the table is reached from cell (i-1, j-1) with w(ai, bj), from cell (i-1, j) with w(ai, -), and from cell (i, j-1) with w(-, bj); the last cell filled is S_{m,n}.]

Initializations
Gap symbol: -3
Rows are labelled with sequence a (CTTAACT), columns with sequence b (CGGATCAT).
S0,0 = 0
S0,1 = -3, S0,2 = -6, S0,3 = -9, S0,4 = -12, S0,5 = -15, S0,6 = -18, S0,7 = -21, S0,8 = -24
S1,0 = -3, S2,0 = -6, S3,0 = -9, S4,0 = -12, S5,0 = -15, S6,0 = -18, S7,0 = -21

S1,1 = ?
Match: 8, Mismatch: -5, Gap symbol: -3
Option 1: S1,1 = S0,0 + w(a1, b1) = 0 + 8 = 8
Option 2: S1,1 = S0,1 + w(a1, -) = -3 - 3 = -6
Option 3: S1,1 = S1,0 + w(-, b1) = -3 - 3 = -6
Optimal: S1,1 = 8

S1,2 = ?
Match: 8
Mismatch: -5, Gap symbol: -3
Option 1: S1,2 = S0,1 + w(a1, b2) = -3 - 5 = -8
Option 2: S1,2 = S0,2 + w(a1, -) = -6 - 3 = -9
Option 3: S1,2 = S1,1 + w(-, b2) = 8 - 3 = 5
Optimal: S1,2 = 5

S2,1 = ?
Match: 8, Mismatch: -5, Gap symbol: -3
Option 1: S2,1 = S1,0 + w(a2, b1) = -3 - 5 = -8
Option 2: S2,1 = S1,1 + w(a2, -) = 8 - 3 = 5
Option 3: S2,1 = S2,0 + w(-, b1) = -6 - 3 = -9
Optimal: S2,1 = 5

S2,2 = ?
Match: 8, Mismatch: -5, Gap symbol: -3
Option 1: S2,2 = S1,1 + w(a2, b2) = 8 - 5 = 3
Option 2: S2,2 = S1,2 + w(a2, -) = 5 - 3 = 2
Option 3: S2,2 = S2,1 + w(-, b2) = 5 - 3 = 2
Optimal: S2,2 = 3

S3,5 = ?
S3,5 = max(S2,4 + w(a3, b5), S2,5 + w(a3, -), S3,4 + w(-, b5)) = max(-3 + 8, 7 - 3, -5 - 3) = 5

Completed table (rows: sequence a, columns: sequence b):

          C    G    G    A    T    C    A    T
     0   -3   -6   -9  -12  -15  -18  -21  -24
C   -3    8    5    2   -1   -4   -7  -10  -13
T   -6    5    3    0   -3    7    4    1   -2
T   -9    2    0   -2   -5    5    2   -1    9
A  -12   -1   -3   -5    6    3    0   10    7
A  -15   -4   -6   -8    3    1   -2    8    5
C  -18   -7   -9  -11    0   -2    9    6    3
T  -21  -10  -12  -14   -3    8    6    4   14

Optimal score: S7,8 = 14

Optimal alignment recovered from the table:
CTTAAC-T
CGGATCAT
8 - 5 - 5 + 8 - 5 + 8 - 3 + 8 = 14

Local vs. Global Sequence Alignment
Example:
DNA sequence a: ATTCTTGC
DNA sequence b: ATCCTATTCTAGC
Local alignment (one gap):
ATTCTTGC
ATCCTATTCTAGC
Global alignment (two gaps):
AT TCTT GC
ATCCTATTCTAGC
Gaps ignored in local alignments
Gaps counted in global alignments

Global Alignment vs.
Local Alignment
• global alignment: all sections are counted
• local alignment: only local sections (normally separated by gaps) are counted

An optimal local alignment
• S_{i,j}: the score of an optimal local alignment ending at ai and bj
• With proper initializations, S_{i,j} can be computed as follows:

$$ S_{i,j} = \max \begin{cases} 0 \\ S_{i-1,j} + w(a_i, -) \\ S_{i,j-1} + w(-, b_j) \\ S_{i-1,j-1} + w(a_i, b_j) \end{cases} $$

Initializations
Match: 8, Mismatch: -5, Gap symbol: -3
Rows are labelled with sequence a (CTTAACT), columns with sequence b (CGGATCAT).
S_{i,0} = 0 for i = 0, …, 7 and S_{0,j} = 0 for j = 0, …, 8 (the first row and the first column are all zeros).

S1,1 = ?
Match: 8, Mismatch: -5, Gap symbol: -3
Option 1: S1,1 = S0,0 + w(a1, b1) = 0 + 8 = 8
Option 2: S1,1 = S0,1 + w(a1, -) = 0 - 3 = -3
Option 3: S1,1 = S1,0 + w(-, b1) = 0 - 3 = -3
Option 4: S1,1 = 0
Optimal: S1,1 = 8

Local alignment
Match: 8, Mismatch: -5, Gap symbol: -3
Completed table (rows: sequence a, columns: sequence b):

          C    G    G    A    T    C    A    T
     0    0    0    0    0    0    0    0    0
C    0    8    5    2    0    0    8    5    2
T    0    5    3    0    0    8    5    3   13
T    0    2    0    0    0    8    5    2   11
A    0    0    0    0    8    5    3   13   10
A    0    0    0    0    8    5    2   11    8
C    0    8    5    2    5    3   13   10    7
T    0    5    3    0    2   13   10    8   18

The best score: 18

Local alignment recovered from the table:
A-C-T
ATCAT
8 - 3 + 8 - 3 + 8 = 18

BLAST: Basic Local Alignment Search Tool
Procedure:
• Divide all sequences into overlapping constituent words (size k).
• Build the hash table for Sequence a.
• Scan Sequence b for hits.
• Extend hits.

BLAST: Basic Local Alignment Search Tool
Step 1: Hash table for sequence A
[Figure: hash table of the words of sequence A.]

Amino acid similarity matrix PAM 120
Instead of using the simple values +8 and -5 for matches and mismatches, this statistically derived score matrix is used to rank the level of similarity between two amino acids.

Amino acid similarity matrix PAM 250
This is a more widely used score matrix for ranking the level of similarity of two amino acids. It is derived from more diverse sets of data and a larger number of statistical steps.
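Returning to the local alignment recurrence and the CTTAACT / CGGATCAT example: the following is a minimal Python sketch of the table fill and traceback, assuming the simple scores used in this lecture (match +8, mismatch -5, gap symbol -3) rather than a PAM matrix. The function name `local_alignment` and its parameters are our own labels, not part of the slides, and the sketch returns only one optimal local alignment.

```python
def local_alignment(a, b, match=8, mismatch=-5, gap=-3):
    """Return (best score, aligned part of a, aligned part of b)."""
    m, n = len(a), len(b)
    # S[i][j]: best score of a local alignment ending at a[i-1] and b[j-1].
    # Row 0 and column 0 stay at zero (the initialization from the slides).
    S = [[0] * (n + 1) for _ in range(m + 1)]
    best, best_i, best_j = 0, 0, 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            w = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(0,
                          S[i - 1][j] + gap,      # a[i-1] against a gap
                          S[i][j - 1] + gap,      # gap against b[j-1]
                          S[i - 1][j - 1] + w)    # a[i-1] against b[j-1]
            if S[i][j] > best:
                best, best_i, best_j = S[i][j], i, j

    # Trace back from the best cell until a zero cell is reached.
    top, bottom = [], []
    i, j = best_i, best_j
    while i > 0 and j > 0 and S[i][j] > 0:
        w = match if a[i - 1] == b[j - 1] else mismatch
        if S[i][j] == S[i - 1][j - 1] + w:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif S[i][j] == S[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


score, top, bottom = local_alignment("CTTAACT", "CGGATCAT")
print(score)   # 18
print(top)     # A-C-T
print(bottom)  # ATCAT
```

Replacing the `match`/`mismatch` test with a lookup in a similarity matrix such as PAM 250 would be the natural next step for protein sequences; that lookup is not shown here.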
Amino acid similarity matrix Blosum 45
The Blosum matrices were calculated using data from the BLOCKS database, which contains alignments of more distantly related proteins. In principle, Blosum matrices should be more realistic for comparing distantly related proteins, but may introduce error for conventional proteins.

BLAST: Basic Local Alignment Search Tool
Step 2: Use all of the 2-letter words in the query sequence to scan against the database sequence and mark those with score > 8.
Note: LN:LN = 9, NF:NY = 8, GW:PW = 10
Marked points can be on the diagonal and off-diagonal.

BLAST
Step 2: Scan sequence b for hits.
Step 3: Extend hits.
Terminate if the score of the extension fades away.
BLAST 2.0 saves the time spent in extension, and considers gapped alignments.

Multiple sequence alignment (MSA)
• The multiple sequence alignment problem is to simultaneously align more than two sequences.

Seq1: GCTC    aligned: GC-TC
Seq2: AC      aligned: A---C
Seq3: GATC    aligned: G-ATC

How to score an MSA?
• Sum-of-Pairs (SP-score): the score of the multiple alignment is the sum of the scores of all pairwise alignments it induces.

Score(GC-TC, A---C, G-ATC) = Score(GC-TC, A---C) + Score(GC-TC, G-ATC) + Score(A---C, G-ATC)

Score(GC-TC, A---C) = -5 - 3 + 8 - 3 + 8 = 5
Score(GC-TC, G-ATC) = 8 - 3 - 3 + 8 + 8 = 18
Score(A---C, G-ATC) = -5 + 8 - 3 - 3 + 8 = 5
SP-score = 5 + 18 + 5 = 28

In these sums a column in which both sequences carry a gap symbol is scored like a match (+8); many treatments of the SP-score instead score such a column as 0.

Position Specific Iterated BLAST
• PSI-BLAST is a rather permissive alignment tool and it can find more distantly related sequences than FASTA or BLAST.
• In particular, in many cases it is much more sensitive to weak but biologically relevant sequence similarities.

Position Specific Iterated BLAST
PSI-BLAST is used for:
• Distant homology detection
• Fold assignment: profile-profile comparison
• Domain identification
• Evolutionary analysis (e.g.
tree building)
• Sequence annotation / function assignment
• Profile export to other programs
• Sequence clustering
• Structural genomics target selection

Position Specific Iterated BLAST
• Collect all database sequence segments that have been aligned with the query sequence with an E-value below a set threshold (default 0.001, but all sequences with E < 10 are displayed for manual inclusion).
• Construct a position specific scoring matrix for the collected sequences. Rough idea:
– Align all sequences to the query sequence as the template.
– Assign weights to the sequences.
– Construct the position specific scoring matrix.
• Iterate.

How PSI-BLAST works?
• Take a sequence:
MGLLTREIF--ILQQ
• Search for similar sequences in a full sequence database.
• Sequences are multiply aligned:
MGLLTREIF--ILQQ
FGLGRT-I-T-YMTN
-GLVRT-I---LGLE
FGLLRT-I---YMTQ
• Construct a profile, and represent conservation in each position numerically:
A 029001100003200
C 000070000000000
. .
Y 002000080202000
• A profile holds more information than a single sequence: use the profile to retrieve additional sequences.
• New sequences in the multiple alignment:
FGLLRT-I-T-YMTN
-RLTRD-I---LGLY
FGLLRT-I---FMTS
• Construct a new profile:
A 027005101003200
C 000070000000000
. .
Y 202000060202000

After several iterations of this procedure we have:
• Sequence information, including links to annotation
• Several sets of multiple alignments
• Profiles, derived by us or by PSI-BLAST
• Threshold information (alignment statistics)

Consensus sequence
• A sequence where each position is defined by majority vote based on a multiple sequence alignment.
• Use the consensus sequence for database search.
Example: PEAINYGRFTPFS I KSDVW

Flow chart of PSI-BLAST
Query sequence:
MGLLTREIF--ILQQ
Multiple alignment:
MGLLTREIF--ILQQ
FGLGRT-I-T-YMTN
-GLVRT-I---LGLE
FGLLRT-I---YMTQ
Profile:
A 029001100003200
C 000070000000000
. .
Y 002000080202000
Steps:
• Take a sequence.
• Search for similar sequences in a full sequence database.
• Sequences are multiply aligned.
• Construct a profile, and represent conservation in each position numerically.
• A profile holds more information than a single sequence: use the profile to retrieve additional sequences.
New iteration:
• Using the profile to search for similar sequences in a full sequence database.
• New sequences in the multiple alignment:
FGLLRT-I-T-YMTN
-RLTRD-I---LGLY
FGLLRT-I---FMTS
• Construct a new profile:
A 027005101003200
C 000070000000000
. .
Y 202000060202000
Next new iteration ...

PSI-BLAST
NCBI PSI-BLAST tutorial: http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/psi1.html
[Slides: a series of screenshots from the tutorial.]
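To make the "construct a profile" and "consensus sequence" steps of the PSI-BLAST flow chart more concrete, here is a rough Python sketch that counts residues per column of a multiple alignment and takes a majority vote. It is only an illustration under simplifying assumptions: it uses raw, unweighted counts, whereas the procedure described in the lecture assigns weights to the sequences before building the position specific scoring matrix, and the function names `column_counts` and `consensus` are hypothetical. The input is the four-sequence alignment from the flow chart.

```python
from collections import Counter


def column_counts(alignment):
    """Count residues in each column of an alignment, ignoring gap symbols."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    return [Counter(row[k] for row in alignment if row[k] != "-")
            for k in range(length)]


def consensus(alignment):
    """Majority-vote consensus; a column containing only gaps yields '-'.

    Ties are broken in favour of the residue seen first in the column.
    """
    letters = []
    for counts in column_counts(alignment):
        letters.append(counts.most_common(1)[0][0] if counts else "-")
    return "".join(letters)


alignment = [
    "MGLLTREIF--ILQQ",
    "FGLGRT-I-T-YMTN",
    "-GLVRT-I---LGLE",
    "FGLLRT-I---YMTQ",
]
profile = column_counts(alignment)
print(profile[1])            # column 2 is fully conserved: G appears 4 times
print(consensus(alignment))
```

A real profile would convert such counts into log-odds scores against background amino acid frequencies; that step is deliberately left out of this sketch.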
Summary of Today’s lecture
• Sequence alignment methods revisited:
– Pair-wise alignment
– Multiple sequence alignment
– BLAST
– PSI-BLAST
• Use of PSI-BLAST to probe protein function
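As a recap of the pair-wise global alignment recurrence from this lecture, the sketch below fills the table S in Python for the example sequences CTTAACT and CGGATCAT, assuming the lecture's scores (match +8, mismatch -5, gap symbol -3). The function name `global_alignment_table` is our own label, and the sketch computes scores only, without the traceback that recovers the alignment itself.

```python
def global_alignment_table(a, b, match=8, mismatch=-5, gap=-3):
    """Return the table S, where S[i][j] is the optimal global alignment
    score between the first i symbols of a and the first j symbols of b."""
    m, n = len(a), len(b)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    # Initializations: a prefix aligned against nothing costs one gap per symbol.
    for i in range(1, m + 1):
        S[i][0] = S[i - 1][0] + gap
    for j in range(1, n + 1):
        S[0][j] = S[0][j - 1] + gap
    # Recurrence: best of the three ways to reach cell (i, j).
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            w = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j] + gap,      # a[i-1] against a gap
                          S[i][j - 1] + gap,      # gap against b[j-1]
                          S[i - 1][j - 1] + w)    # a[i-1] against b[j-1]
    return S


S = global_alignment_table("CTTAACT", "CGGATCAT")
print(S[1][1], S[1][2], S[2][1], S[2][2])  # 8 5 5 3
print(S[7][8])                             # optimal score: 14
```

The table has (m + 1)(n + 1) cells and each is filled in constant time, which is why this dynamic programming approach is practical for a single pair of sequences but too slow for scanning a whole database, the gap that BLAST is designed to fill.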