BLAST & Multiple Alignment

Scoring Alignments
• Quality = [10 × (matches)] + [-1 × (mismatches)] - [(gap creation penalty × number of gaps) + (gap extension penalty × total length of gaps)]
Scoring scheme incorporates an evolutionary model:
• Matches are conserved
• Mismatches are divergences
• Gaps are more likely to disrupt function, hence a greater penalty than for a mismatch. Introduction of a gap (indel) is penalized more than extension of a gap.

• Both global and local alignment programs will (almost) always give a match.
• It is important to determine if the match is biologically relevant.
• Not necessarily relevant: low-complexity regions.
  – Sequence repeats (glutamine runs)
  – Transmembrane regions (high in hydrophobic residues)
• If working with coding regions, you are typically better off comparing protein sequences: greater information content.

Substitution Matrices

Substitution Matrix
• Nucleic acid
  – Incorporates the observation that transitions (A<->G or C<->T) are more common than transversions
• Amino acid

Substitutions
• 20 different amino acids
  – Physical and chemical properties of some are similar.

A useful classification of amino acids
• Aliphatic - G, A, V, L, I, P
• Aromatic - F, Y, W
• Uncharged polar - S, T, N, Q
• Charged - D, E, H, K, R
• Sulfur-containing - C, M

Amino Acid Substitution Matrix
• Accounts for the observation that some amino acid substitutions are better tolerated than others.
• Other types of substitutions are rare.

      A    C    D    E
A     2   -2    0    0
C         12   -5   -5
D               4    3
E                    4

Two Main AA Substitution Matrices
• Dayhoff PAM Matrix
  – Aligned closely related proteins (orthologs) to identify amino acid changes that were acceptable for maintaining function.
• BLOSUM Matrix
  – Developed from a large number of conserved amino acid patterns, termed BLOCKS
  – BLOCKS: conserved, ungapped amino acids identified in related proteins

Dayhoff PAM Matrix
• Fundamental assumptions:
  – Mutation at each site is independent of previous changes and of other sites (Markov model)
  – Each site is equally mutable

Dayhoff PAM Matrix
• Acceptable mutations: conserved function
• Number of changes of each amino acid into every other amino acid was counted
  – Takes into account the frequency of occurrence of each amino acid (not all amino acids are equally abundant).

Amino acid   1978    1991
L            0.085   0.091
A            0.087   0.077
G            0.089   0.074
S            0.070   0.069
V            0.065   0.066
E            0.050   0.062
T            0.058   0.059
K            0.081   0.059
I            0.037   0.053
D            0.047   0.052
R            0.041   0.051
P            0.051   0.051
N            0.040   0.043
Q            0.038   0.041
F            0.040   0.040
Y            0.030   0.032
M            0.015   0.024
H            0.034   0.023
C            0.033   0.020
W            0.010   0.014

The frequencies in the middle column are taken from Dayhoff (1978); the frequencies in the right column are taken from the 1991 recompilation of the mutation matrices by Jones et al. (Jones, D.T., Taylor, W.R. & Thornton, J.M. (1991) CABIOS 8:275-282), representing a database of observations that is approximately 40 times larger than that available to Dayhoff.
From: http://www.lmb.uni-muenchen.de/Groups/Bioinformatics/04/ch_04_3.html

Dayhoff PAM Matrix
• Results in a 20 x 20 matrix of probabilities for each possible amino acid substitution.

Dayhoff PAM Matrix
• Initial matrix is derived from very similar proteins.
• However, not all homologs are very similar.
• Extrapolate to encompass greater divergence by multiplication of the original matrix
• Results in a series of PAM matrices representing different levels of similarity.
• PAM250, PAM120, PAM80 and PAM60 correspond to 20, 40, 50 and 60 percent similarity, respectively.
• As the proteins being compared decrease in similarity, the numerical value of the PAM matrix should increase.
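
The extrapolation of the PAM matrix by repeated multiplication can be illustrated with a small sketch. It assumes a toy three-residue alphabet and invented PAM1 probabilities (the real matrix is 20 x 20 and estimated from counted mutations), and `mat_mul` and `mat_pow` are hypothetical helper names. Multiplying the one-step matrix by itself n times gives the substitution probabilities after n PAM units, and the chance that a residue is unchanged drops as n grows.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, power):
    """Multiply m by itself `power` times (power = 0 gives the identity)."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(power):
        result = mat_mul(result, m)
    return result


# Toy PAM1 (assumption, not Dayhoff's data): row i holds the probabilities
# that residue i stays the same or changes to each other residue in one
# PAM unit. Each row sums to 1.
PAM1 = [
    [0.990, 0.007, 0.003],
    [0.006, 0.990, 0.004],
    [0.002, 0.008, 0.990],
]

PAM60 = mat_pow(PAM1, 60)
PAM250 = mat_pow(PAM1, 250)

for name, m in (("PAM1", PAM1), ("PAM60", PAM60), ("PAM250", PAM250)):
    print(name, "P(residue 0 unchanged) =", round(m[0][0], 3))
```

The higher the PAM number, the smaller the diagonal entries become, which is why high-numbered PAM matrices suit more divergent proteins.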
BLOSUM Matrix
– Developed from a large number of conserved amino acid patterns, termed BLOCKS
– BLOCKS: conserved, ungapped amino acids identified in related proteins
– ~2000 conserved blocks, thought to act as signatures for families of related proteins.
– BLOSUM62 matrix is derived from BLOCKS exhibiting 62% similarity
– The higher the number, the greater the similarity
  • opposite of PAM matrix.

Matrix Application
• Odds matrix:
  – Ratio that compares the chance that the mutation represents an authentic evolutionary change (pair found in related proteins) to the chance that the change occurred by random sequence variation (pair found in unrelated proteins).
• Log-odds matrix:
  – Odds ratios are converted to log scores to simplify score determination (log scores are added).

Matrix Application
• Practical consequence:
  – Typically you do not know the percent similarity until you have an alignment.
  – Use several different matrices and compare the output.

Substitution matrix
• Used to score alignments.
• Positive value: substitution is tolerated.
• Zero: substitution occurs with the same frequency as a random event.
• Negative value: substitution is typically selected against.
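
How an odds ratio becomes an additive score, and how such scores combine with gap penalties into an alignment score, can be sketched as follows. This is a minimal illustration, not any program's actual implementation: `log_odds`, `pair_score` and `score_alignment` are hypothetical names, the base-2 logarithm and the gap penalties of 8 and 2 are assumptions, and the small matrix reuses the A, C, D, E values from the example amino acid substitution matrix shown earlier in these slides.

```python
import math


def log_odds(pair_freq, freq_a, freq_b):
    """log2 of (pair frequency in related proteins) / (frequency expected by chance)."""
    return math.log2(pair_freq / (freq_a * freq_b))


# Upper triangle of the example matrix for residues A, C, D, E.
MATRIX = {
    ("A", "A"): 2, ("A", "C"): -2, ("A", "D"): 0, ("A", "E"): 0,
    ("C", "C"): 12, ("C", "D"): -5, ("C", "E"): -5,
    ("D", "D"): 4, ("D", "E"): 3,
    ("E", "E"): 4,
}


def pair_score(matrix, a, b):
    """Look up a symmetric score stored only once per residue pair."""
    return matrix[(a, b)] if (a, b) in matrix else matrix[(b, a)]


def score_alignment(row1, row2, matrix, gap_open, gap_extend):
    """Sum substitution scores; charge gap_open once per gap plus
    gap_extend for every gapped column ('-' marks a gap)."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have equal length")
    score = 0
    gapped_row = None  # which row (1 or 2) holds the gap currently open
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            row = 1 if a == "-" else 2
            if row != gapped_row:      # a new gap starts here
                score -= gap_open
                gapped_row = row
            score -= gap_extend
        else:
            score += pair_score(matrix, a, b)
            gapped_row = None
    return score


print(log_odds(0.25, 0.25, 0.25))      # pair seen more often than chance: positive
print(log_odds(0.0625, 0.25, 0.25))    # pair seen as often as chance: zero
print(score_alignment("ACDE", "ACDE", MATRIX, gap_open=8, gap_extend=2))
print(score_alignment("ACDE", "A--E", MATRIX, gap_open=8, gap_extend=2))
```

Because the scores are logarithms, adding them along the alignment corresponds to multiplying the underlying odds ratios, which is the simplification the log-odds form provides.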
Expect value (E-value)
• Expected number of hits, of equivalent or better score, found by random chance in a database of the size searched.

Conserved domains
Domain: sequence of amino acids that typically folds into a stable tertiary structure. Many proteins are multidomain.

BLAST to PSI-BLAST
• BLAST makes use of a scoring matrix derived from a large number of proteins.
• What if you want to find homologs based upon a specific gene product?
• Develop a position-specific scoring matrix (PSSM).

PSSM
     M  F  W  Y  G  A  P  V  I  L  C  R  K  E  N  D  Q  S  T  H
M    5  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
G    1  0  0  0  1  0  0  0  0  1  0  0  0  1  0  0  1  0  0  0
A    1  0  0  0  0  4  0  0  0  0  0  0  0  0  0  0  0  0  0  0
S    0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  3  2  0
F    0  4  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
Determine the frequency of each substitution and convert it to a log-odds score.

PSSM with INDEL
Same PSSM as above, with an added Indel row:
Indel 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
Can include a score for permitting insertions and deletions. Perhaps this position is at a turn, where indels are common.

PSSM
• In evaluating (scoring) alignments, PSSM approaches typically:
  – Reward matches to columns that have conserved amino acids
  – Penalize mismatches to columns with a conserved amino acid more than mismatches in a variable column

PSI-BLAST
• Input a single query sequence.
• Executes a BLAST run.
• Program takes significant hits and incorporates the matches into a PSSM.
• Sequences >98% similar are not included (to avoid biasing the PSSM).

Power of approach:
• PSI-BLAST is iterative.
• Takes best hits and improves the scoring matrix.
Original BLAST had 84 hits.

Utility of PSI-BLAST
• Identify distantly related proteins based upon the profile.
• These potential matches may suggest functions.
• Profile adds information only over the identified region of similarity.

Problem of approach:
• PSI-BLAST is iterative.
• Takes best hits and improves the scoring matrix.
• Investigator must be certain that new hits are correct.
• Investigator must be certain that the region of interest is included in the PSSM.

Multiple Sequence Alignment

Multiple Sequence Alignment (MSA)
• Can define the most similar regions in a set of proteins
  – functional domains
  – structural domains
• If the structure of one (or more) members is known, it may be possible to predict some structure of other members

MSA and Sequence Pair Alignment
• Dynamic programming (matrix approach) provides an optimal alignment between two sequences.
• Difficult for multiple alignment, because the number of comparisons grows exponentially with added sequences.
[Figure: dynamic programming matrix with Seq 1 and Seq 2 on the axes, showing the optimal alignment]

How to add a third sequence?
Complete all pair-wise comparisons. Each added alignment imposes boundaries on the final MSA.

Optimal Multiple Sequence Alignment
For more than three sequences, the problem extends into N-dimensional space.

Scoring MSA
• Add scores derived from pair-wise alignments.
• Sum of pairs (SP score).
• Gaps: constant penalty for any size of gap.

Progressive MSA
• Do pair-wise alignment
• Develop an evolutionary tree
• The most closely related sequences are aligned first, then more distant ones are added.
• Genetic distance: number of mismatched positions divided by the total number of matched positions (gaps not considered).

Example
• CARD domain

Gaps
• ClustalW attempts to place gaps between conserved domains.
• In known sequences, gaps are preferentially found between secondary structure elements (alpha helices, beta strands).

These are equivalent trees
[Figure: tree diagrams with leaves A, B and C drawn in different orders]

Problem with Progressive Alignment:
Errors made in early alignments are propagated throughout the MSA

Profiles & Gaps
• From an MSA, a conserved region is identified and a scoring matrix (profile) is constructed for that region.
• Each position has a score associated with an amino acid substitution or gap.
• Blocks: also extracted from an MSA, but no gaps are permitted.
• Block Server
• http://blocks.fhcrc.org/blocks/blocks_search.html
• Results

Hidden Markov Models
• Probabilistic model of a multiple sequence alignment.
• No indel penalties are needed
• Experimentally derived information can be incorporated
• Parameters are adjusted to represent observed variation.
• Requires at least 20 sequences

The Evolution of a Sequence
• Over long periods of time a sequence will acquire random mutations.
  – These mutations may result in a new amino acid at a given position, the deletion of an amino acid, or the introduction of a new one.
  – Over VERY long periods of time two sequences may diverge so much that their relationship cannot be seen through the direct comparison of their sequences.

Hidden Markov Models
• Pair-wise methods rely on direct comparisons between two sequences.
• In order to overcome the differences in the sequences, a third sequence is introduced, which serves as an intermediate.
• A high-scoring hit between the first and third sequences, as well as a high-scoring hit between the second and third sequences, implies a relationship between the first and second sequences.
Transitive relationship

Introducing the HMM
• The intermediate sequence is kind of like a missing link.
• The intermediate sequence does not have to be a real sequence.
• The intermediate sequence becomes the HMM.
• The HMM is a mix of all the sequences that went into its making.
• The score of a sequence against the HMM shows how well the HMM serves as an intermediate of the sequence.
  – How likely it is to be related to all the other sequences, which the HMM represents.

Match State with no Indels
MSGL
MTNL
[Diagram: B -> M1 -> M2 -> M3 -> M4 -> E]
Arrow indicates transition probability.
In this case, 1 for each step.

Match State with no Indels
MSGL
MTNL
[Diagram: B -> M1 -> M2 -> M3 -> M4 -> E, with residue probabilities M=1 at M1 and S=0.5, T=0.5 at M2]
Also have a probability of each residue at each position.
Typically want to incorporate a small probability for all other amino acids.

Permit insertion states
MS.GL
MT.NL
MSANI
[Diagram: B, insert states I1 to I4, match states M1 to M4, E]
Transition probabilities may not be 1

Permit insertion states
MS..GL
MT..NL
MSA.NI
MTARNL
[Diagram: B, insert states I1 to I4, match states M1 to M4, E]

MS..GL--
MT..NLAG
MSA.NIAG
MTARNLAG
Delete states permit incorporation of the last two sites of sequence 1.
[Diagram: B, delete states D1 to D6, insert states I1 to I6, match states M1 to M6, E]
• The bottom line of states are the main states (M). These model the columns of the alignment.
• The second row of diamond-shaped states are called the insert states (I). These are used to model the highly variable regions in the alignment.
• The top row of circles are delete states (D). These are silent or null states because they do not match any residues; they simply allow the skipping over of main states.

Dirichlet Mixtures
• Additional information to expand potential amino acids in individual sites.
• Observed frequency of amino acids seen in certain chemical environments
  – aromatic
  – acidic
  – basic
  – neutral
  – polar
The PSSM will skew towards this region
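
The match-state residue probabilities from the MSGL/MTNL example, and the idea of reserving a small probability for amino acids not yet observed at a site, can be sketched as follows. This is only an illustration under stated assumptions: a flat pseudocount is used as a much simpler stand-in for Dirichlet mixtures (which weight residues by chemical environment), the background frequency is assumed uniform at 1/20, and `column_probabilities` and `log_odds_score` are hypothetical names.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_probabilities(alignment, pseudocount=0.0):
    """Per-column residue probabilities for an ungapped alignment.

    `pseudocount` is added to every residue's count in every column so that
    residues not seen in the alignment still receive a small probability.
    """
    n_residues = len(AMINO_ACIDS)
    columns = []
    for pos in range(len(alignment[0])):
        observed = [seq[pos] for seq in alignment]
        total = len(observed) + pseudocount * n_residues
        columns.append({aa: (observed.count(aa) + pseudocount) / total
                        for aa in AMINO_ACIDS})
    return columns


def log_odds_score(sequence, columns, background=1.0 / 20):
    """Sum of log2(emission probability / background) over match states.

    Assumes a uniform background and one residue per match state; with no
    insert or delete states every transition probability is 1 and adds
    nothing to the score.
    """
    return sum(math.log2(columns[i][aa] / background)
               for i, aa in enumerate(sequence))


alignment = ["MSGL", "MTNL"]

raw = column_probabilities(alignment)                 # M=1 at M1; S=T=0.5 at M2
smoothed = column_probabilities(alignment, pseudocount=0.1)

# Scoring uses the smoothed model: with raw counts, any unseen residue has
# probability 0 and its log-odds score is undefined.
for candidate in ("MSGL", "MSNL", "MAGL", "WWWW"):
    print(candidate, round(log_odds_score(candidate, smoothed), 2))
```

Sequences that match the conserved columns score highest, a single unseen residue lowers the score without ruling the sequence out, and an unrelated sequence scores below zero.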