* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download La comparaison de séquence
Biochemistry wikipedia , lookup
Non-coding DNA wikipedia , lookup
Endogenous retrovirus wikipedia , lookup
Metalloprotein wikipedia , lookup
Artificial gene synthesis wikipedia , lookup
Proteolysis wikipedia , lookup
Genetic code wikipedia , lookup
Two-hybrid screening wikipedia , lookup
Protein structure prediction wikipedia , lookup
Sequence comparaison A little bit of theory does not hurt! Ahmed Rebaï Definitions Comparative method: analyze similarities and differences between sequence residues in order to infer structural, functionnal or evolutionary relationship Sequence alignment: procedure of comparing 2 or more sequences by searching for a series of individual residues or residue patterns that are in the same order in the sequences Global alignment: stretched over the entire sequence length to include as many matching residues as possible up to the sequence ends Local alignment: stops at the ends of regions of identity or strong similarity and priority is given to find conserved patterns than to extend the alignment to include neighbouring residues (we will see how!) Homology Measuring Similarity Measuring the extent of similarity between two sequences Based on percent sequence identity Based on conservation Percent Sequence Identity The extent to which two nucleotide or amino acid sequences are invariant AC C TG A G – AG AC G TG – G C AG mismatch indel 70% identical Making a Scoring Matrix Scoring matrices are created based on biological evidence. Alignments can be thought of as two sequences that differ due to mutations in the sequence. Some of these mutations have little effect on the organism’s function, therefore some penalties will be less harsh than others. Scoring Matrix: Example A R N K A 5 -2 -1 -1 R - 7 -1 3 N - - 7 0 K - - - 6 • Notice that although R and K are different amino acids, they have a positive score. • Why? They are both positively charged amino acids will not greatly change function of protein. Conservation Amino acid changes that preserve the physico-chemical properties of the original residue Polar to polar aspartate glutamate Nonpolar to nonpolar alanine valine Similarly behaving residues leucine to isoleucine Alignment scores A global score is calculated as: a score is attributed to each position of the alignment according to: Match/mismatch/gap in DNA sequences Substitution of an aminoacids by another gap for protein and DNA sequences We add scores for the whole alignment: S= S_elem + S_Gap Gap scores are negative because these are penalities By choosing these scores we can either favour alignments with few gaps (or ungapped) or with many gaps Scoring Indels: Naive Approach A fixed penalty σ is given to every indel: -σ when there is 1 indel, -2σ for 2 consecutive indels, -3σ for 3 consecutive indels, etc. Can be too severe penalty for a series of 100 consecutive indels Affine Gap Penalties In nature, very often indels come as a unit, not just at 1 nucleotide at a time. In nature, this is more likely. Normal scoring would give the same score for both alignments Gap scores The Gap score depends on its length P= x+ y l x: Gap-opening penality y: elementary gap extension penality y<< x (x/10) for example by default: x=-11, y=-1 (blastp) and x=-5, y=-2 (Blastn) By this choice a large gap is more likely than many small gaps of te same length (which is consistent with biological phenomena) Score matrix or substitution matrix for DNA sequences: 1 for a match, -2 for a mismatch for protein sequences: 20x20 matrix Based on model of protein evolution Based on analysis of highly conserved segments in proteins (Blocks) Based on physico-chemical or structural properties PAM Matrices (Dayhoff et al., 1978) list the likelihood of change from one aa to another in homologus protein sequences during evolution The change of an aa A by B is assumed the same regardless of any pevious change at that site and the position of A in the protein Changes are in closely related protein and are assumed to represent aa substitution tha do not change the protein function and are called «accepted mutations » PAM: point accepted mutation 1 PAM = PAM1 = 1% average change of all amino acid positions After 100 PAMs of evolution, not every residue will have changed some residues may have mutated several times some residues may have returned to their original state some residues may mot changed at all PAM matrix Relative mutability of aa is evaluated by counting (in a group of related sequences) the number of changes of each aa and dividing by the « exposure to mutation » of that aa Exposure is calculated as the product of the frequency of occurrence of the aa in the group of seq. and the total number of aa changes the occured in that group per 100 sites Calculated from etimations of aa susbtitution that occured in a group of evolving porteins using 1572 changes in 71 groups of proteins sequences that were at least 85% similar. PAM matrix The most mutable aa are Asn, Ser, Asp and Glu and the least mutable are Cys and Try One we get PAM1 matrix we should multiply it by itself N times if we want to predict changes for more distantly related proteins that have undergone N mutations The commonl used is PAM250 which represents a level of 250 changes for 100 aa expected in 2500 my. Such sequences are expected to have about 20% similarity PAM matrix 260 changes were observed between Phe and Tyr among the 1572 changes The scoe of changing Phe to Tyr wa 0.0021 (0.9946 for not changing Phe) PAM-250 the proba of Phe to Tyr chnage is 0.15 an that of no change is 0.32 The frequency of Phe was 0.04 so the relative frequency of of change is 0.15/0.04=3.75 . 10Log10(3.75)=5.7. The score for a change of Tyr to phe was found 8.3 so use the average value of these two which is 7 PAM250 PAM matrix Values are between -8 to 17 0 indicates the the frequency of the subsitution between these two aa is expected by chance negative means that the frequency of substitution is less than expected by chance For example no cnages between Gly (G) to Trp (W) resulting in a score of -7 BLOSUM (Henikoff et al, 1993) Blocks Substitution Matrix Scores derived from observations of the frequencies of substitutions in blocks of local alignments in related proteins Matrix name indicates evolutionary distance BLOSUM62 was created using sequences sharing no more than 62% identity BLOSUM matrix An alternative that does not involve an explicit model of evolutionary change is to just find blocks of similar sequence and count how often one amino acid is substituted by another Good for general comparisons without consideration of evolutionary distance Comparison PAM-BLOSUM Choice of a Matrix! BLOSUM90 PAM30 Rat versus mouse RBP BLOSUM80 PAM120 BLOSUM62 PAM180 BLOSUM45 PAM240 Rat versus bacterial lipocalin Methods for aligning 2 sequences Dot matrix analysis Dynamic programming algorithm Word or k-tuple methods such that used in the programs BLAST and FASTA Bayesian sequence alignment algorithms The Dot-plot: a first step ! Dot plot Dynamic programming Compoutational method that provides de very best (optimal) alignment between two sequences Compare every pair of residues in the two sequences and includes: matches, mismatches and gaps ATGT- AATGCATG |||| ||| | ATGTGAAT -- A Alignment as a Path in the Edit Graph 0 1 A A 0 1 2 T T 2 2 _ C 3 3 G G 4 (0,0) , (1,1) 4 T T 5 5 T _ 5 6 A A 6 7 T _ 6 7 _ C 7 Alignment as a Path in the Edit Graph 0 1 A A 0 1 2 T T 2 2 _ C 3 3 G G 4 4 T T 5 5 T _ 5 6 A A 6 (0,0) , (1,1) , (2,2) 7 T _ 6 7 _ C 7 Alignment as a Path in the Edit Graph 0 1 A A 0 1 2 T T 2 2 _ C 3 3 G G 4 4 T T 5 5 T _ 5 6 A A 6 7 T _ 6 7 _ C 7 (0,0) , (1,1) , (2,2), (2,3), (3,4) Alignment as a Path in the Edit Graph 0 1 A A 0 1 2 T T 2 2 _ C 3 3 G G 4 4 T T 5 5 T _ 5 6 A A 6 7 T _ 6 7 _ C 7 - End Result (0,0) , (1,1) , (2,2), (2,3), (3,4), (4,5), (5,5), (6,6), (7,6), (7,7) Alignments in Edit Graph and represent indels in v and w • Score 0. represent exact matches. • Score 1. Alignments in Edit Graph The score of the alignment path in the graph is 5. Alignment as a Path in the Edit Graph Every path in the edit graph corresponds to an alignment: Alignment as a Path in the Edit Graph Old Alignment 0122345677 v= AT_GTTAT_ w= ATCGT_A_C 0123455667 New Alignment 0122345677 v= AT_GTTAT_ w= ATCG_TA_C 0123445667 Alignment as a Path in the Edit Graph 0122345677 v= AT_GTTAT_ w= ATCGT_A_C 0123455667 (0,0) , (1,1) , (2,2), (2,3), (3,4), (4,5), (5,5), (6,6), (7,6), (7,7) Alignment: Dynamic Programming Use this scoring algorithm si,j = max si-1, j-1+1 if vi = wj si-1, j si, j-1 Dynamic Programming Example • There are no matches in the beginning of the sequence • Label column i=1 to be all zero, and row j=1 to be all zero Dynamic Programming Example Si,j = max Si-1, j-1 value from NW +1, if v = w i j Si-1, j value from North (top) Si, j-1 value from West (left) Alignment: Backtracking Arrows show where the score originated from. if from the top if from the left if vi = wj Backtracking Example Find a match in row and column 2. i=2, j=2,5 is a match (T). j=2, i=4,5,7 is a match (T). Since vi = wj, S(i,j) = S(i-1,j-1) +1 S(2,2) S(2,5) S(4,2) S(5,2) S(7,2) = = = = = [S(1,1) [S(1,4) [S(3,1) [S(4,1) [S(6,1) = = = = = 1] 1] 1] 1] 1] + + + + + 1 1 1 1 1 Backtracking Example Continuing with the scoring algorithm gives this result. Local vs. Global Alignment • The Global Alignment Problem tries to find the longest path between vertices (0,0) and (n,m) in the edit graph. • The Local Alignment Problem tries to find the longest path among paths between arbitrary vertices (i,j) and (i’, j’) in the edit graph. Local vs. Global Alignment (cont’d) Global Alignment --T—-CC-C-AGT—-TATGT-CAGGGGACACG—A-GCATGCAGA-GAC | || | || | | | ||| || | | | | |||| | AATTGCCGCC-GTCGT-T-TTCAG----CA-GTTATG—T-CAGAT--C Local Alignment—better alignment to find conserved segment tccCAGTTATGTCAGgggacacgagcatgcagagac |||||||||||| aattgccgccgtcgttttcagCAGTTATGTCAGatc Local Alignments: Why? Two genes in different species may be similar over short conserved regions and dissimilar over remaining regions. Example: Homeobox genes have a short region called the homeodomain that is highly conserved between species. A global alignment would not find the homeodomain because it would try to align the ENTIRE sequence The Local Alignment Problem Goal: Find the best local alignment between two strings Input : Strings v, w and scoring matrix δ Output : Alignment of substrings of v & w whose alignment score is maximum among all possible alignment of all possible substrings The Local Alignment Recurrence The largest value of si,j over the whole edit graph is the score of the best local alignment. The recurrence is shown below: 0 si,j = max si-1,j-1 + δ (vi, wj) s i-1,j + δ (vi, -) s i,j-1 + δ (-, wj) Notice there is only this change from the original recurrence of a Global Alignment Affine Gap Penalty Recurrences si,j = s i-1,j - σ s i-1,j –(ρ+σ) si,j = Continue Gap in v (insertion) s i,j-1 - σ s i,j-1 –(ρ+σ) Start Gap in v (insertion):from middle max max si,j = max Continue Gap in w (deletion) Start Gap in w (deletion): from middle Match or Mismatch si-1,j-1 + δ (vi, wj) End deletion: from top s i,j End insertion: from bottom s i,j