Sequence-comparison methods
Gerard Kleywegt
Uppsala University

Outline
Why compare sequences?
Dotplots
Pairwise sequence alignments
Multiple sequence alignments
Profile methods

Outline
Pairwise sequence alignments II
Scoring matrices
  – Dayhoff
  – BLOSUM
Fast pairwise methods
  – FASTA
  – BLAST
Databases
Assessing significance

LACTAL   KQFTKCELSQLLK--DIDGYGGIALPELICTMFHTSGYDTQAIVEN-DESTEYGLFQISN
|ID +SIM | | +|||+  +|   +| | | +|   +|     | ++|||  +| | || ||++||++
LYSOZY   KVFGRCELAAAMKRHGLDNYRGYSLGNWVCAAKFESNFNTQATNRNTDGSTDYGILQINS

LACTAL   KLWCKSSQVPQSRNICDISCDKFLDDDITDDIMCAKKIL-DIKGIDYWLAHKALCT-EKL
|ID +SIM + ||     | |||+|+| |  +|  |||  + |||||+ |  |++ |+| +  |    +
LYSOZY   RWWCNDGRTPGSRNLCNIPCSALLSSDITASVNCAKKIVSDGNGMNAWVAWRNRCKGTDV

LACTAL   EQWLCE-KL
|ID +SIM + |+   +|
LYSOZY   QAWIRGCRL

Scoring matrices
So far, we have used very simple scoring schemes such as:
  – Match = +3
  – Mismatch = -1
May be good enough for DNA, but not for proteins
  – Ex: leucine is much more likely to be replaced by an isoleucine than by a glutamate
  – Why is this???

Amino-acid properties
Ex: His = positive, charged, polar, aromatic, hydrophobic
The more barriers between two residues, the less similar (Taylor, 1986)

Scoring matrices
More sophisticated scoring matrices for amino acids have been based on:
  – Minimal number of base changes required to convert the codon of one amino acid into another (Genetic Code Matrix)
  – Similarity of their physico-chemical properties
  – Observed residue equivalences in aligned protein structures
  – Observed substitutions in aligned sequences (substitution matrices)

Substitution matrices
Given 20 amino acid types, a substitution matrix is a symmetric 20 x 20 matrix
Why is it symmetric?
  – Usually not known if X replaced Y or vice versa (or if both derive from an ancestral residue Z)
How many unique elements does a substitution matrix contain?
  – 20*19/2 = 190 off-diagonal
  – 20 diagonal
  – 210 in all
The matrix elements are the “log odds” ratios of substitution
Odds ratio R(x,y) = (probability that residues x and y are aligned given that they evolved from each other or a common ancestor) / (probability of aligning x and y by chance)
  – Ex: “x” = valine, “y” = leucine
R(x,y) = P(x,y|related) / P(x,y|chance)
P(x,y|related) = observed empirically
P(x,y|chance) = f(x) * f(y)
  – f(z) = frequency of residue type z in the population (database/proteome/…)
Actual matrix elements:
  – Take the base-10 logarithm, multiply by 10, round to nearest integer
  – S(x,y) = nint(10 * log10(R(x,y)))
Example:
  – P(x,y|related) = 0.03 (observed)
  – f(x) = 0.1 and f(y) = 0.05
  – R(x,y) = 0.03 / (0.1*0.05) = 0.03 / 0.005 = 6
  – Thus: for a pair of homologous proteins, x and y are 6 times more likely to be aligned than one would expect by chance
  – S(x,y) = nint(10 * log10(6)) = nint(10 * 0.778) = nint(7.78) = 8
  – S(y,x) = S(x,y) = 8
Another example:
  – P(x,y|related) = 0.0025 (observed)
  – f(x) = 0.1 and f(y) = 0.05
  – R(x,y) = 0.0025 / (0.1*0.05) = 0.0025 / 0.005 = 0.5
  – Thus: for a pair of homologous proteins, x and y are 2 times less likely to be aligned than one would expect by chance
  – S(x,y) = nint(10 * log10(0.5)) = nint(10 * -0.301) = nint(-3.01) = -3
  – S(y,x) = S(x,y) = -3

Dayhoff matrices
To generate a substitution matrix we need to measure P(x,y|related). Dayhoff et al.
were the first to do this (late 1960s)
Analysed alignments of closely related sequences (<1% mutations)
Resulting matrices are called:
  – Dayhoff matrices
  – MDM (mutation data matrix)
  – PAM (point/percent accepted mutation)
Analysis yielded PAM1 matrix
  – Suitable for comparing proteins with less than 1% mutations
PAM2 matrix is obtained by multiplying PAM1 with itself
  – PAM2 = PAM1 * PAM1 = (PAM1)^2
  – S(I,L) ~ S(I,A)S(A,L) + S(I,C)S(C,L) + ..
  – Suitable for comparing proteins with ~2% mutations
PAM3 = PAM2 * PAM1 = (PAM1)^3
PAM250 = (PAM1)^250
Appropriate PAM matrix to use depends on (expected) level of sequence divergence (identity)

  %SI   75   60   50   25   20
  PAM   30   80  110  200  250

PAM250 means 250 mutations per 100 residues - how can they still have %SI ~20%?
  – Ex: A → G → S → T → V (4 mutations, 1 difference)
  – Ex: I → L → I (2 mutations, no difference)

  PAM    LL    LI
    2     7    -9
    5     7    -6
   10     7    -4
   30     7    -1
   80     6     1
  250     6     2
  500     7     4

BLOSUM matrices
Dayhoff matrices
  – Based on explicit model of evolution
  – Based on global alignments of closely related proteins
  – Based on a very small sample of sequences (only ~1500 observed substitutions; few W)
BLOSUM matrices
  – Henikoff & Henikoff, 1992
  – Based on observed substitutions in conserved (gap-less) blocks of aligned sequences from many protein families (i.e., local)
  – Turn out to work better than Dayhoff matrices

BLOCKS example
Block PR00178A

  ID   FATTYACIDBP; BLOCK
  AC   PR00178A; distance from previous block=(2,27)
  DE   Fatty acid-binding protein signature
  BL   adapted; width=23; seqs=85; 99.5%=1111; strength=1324
  MYP2_BOVIN|P02690 (   4) FLGTWKLVSSENFDEYMKALGVG  12
  MYP2_RABIT|P02691 (   4) FLGTWKLVSSENFDDYMKALGVG  13
  FABH_BOVIN|P10790 (   4) FVGTWKLVDSKNFDDYMKSLGVG  16
  FABH_HUMAN|P05413 (   4) FLGTWKLVDSKNFDDYMKSLGVG  16
  […]
  FABI_MOUSE|P55050 (   2) FDGTWKVDRNENYEKFMEKMGIN  42
  FABI_XENLA|Q91775 (   2) FDGTWKVDRSENYEKFMEVMGVN  44
  FABL_HALBI|P81653 (   2) FSGTWQVYSQENIEDFLRALSLP  87
  FAB2_MANSE|P31417 (   3) LGKVYSLVKQENFDGFLKSAGLS  96
  FAB1_MANSE|P31416 (   4) LGKVYKFDREENFDGFLKSIGLS  59
  //

BLOSUM matrices
BLOSUM62
  – Good general-purpose matrix (see hand-out)
  – Derived from aligned blocks in which sequences at least 62% identical are clustered (counted as one)
  – Comparable to PAM120
  – Default for many programs
  – Gap penalty (example): -11 - L (L = gap length)
BLOSUM80 ~ PAM1
BLOSUM45 ~ PAM250

Substitution matrices
Percentage sequence similarity
  – Count number of aligned residues whose score in the substitution matrix is greater than zero
  – Divide by length of shortest sequence
What are the %-ages SI and sequence similarity (using BLOSUM62) for the following alignment?

  KQFTKCELSQLLK--DIDGYGG
  KVFGRCELAAAMKRHGLDNYRG

  KQFTKCELSQLLK--DIDGYGG
  = = +===+  +=   += = =
  KVFGRCELAAAMKRHGLDNYRG

Identities: 9
Similarities: 4 (+ 9)
Length of shortest sequence: 20
%SI = 100% * 9 / 20 = 45%
Similarity = 100% * (9+4) / 20 = 65%

Fast pairwise methods
Needleman-Wunsch-Sellers and Smith-Waterman are guaranteed to find an optimal alignment
But they are too slow if you want to compare a sequence against a database with thousands or millions of sequences
Faster (50-100x), heuristic methods have been developed for this purpose
  – FASTA (global sequence alignment)
  – BLAST (local sequence alignment)

FASTA
Pearson & Lipman, 1988
Method (grossly simplified!)
  – Find identical “k-tuples” (k=1 or 2 for proteins, 4-6 for nucleic acids) in both sequences
  – Extend these segments to include similar residues
  – Impose window to limit insertions and deletions
  – Select the 10 highest scoring segments within the window
  – Limited dynamic programming to join segments within the window
  – Cuts (reasonable) corners, so not guaranteed to find optimal solution

BLAST
Altschul et al., 1990
Basic Local Alignment Search Tool
The workhorse of bioinformatics
Various versions, e.g.
  – BLASTP = protein sequence versus protein database
  – BLASTN = nucleic acid sequence versus nucleic acid database
  – TBLASTX = translated nucleic acid sequence versus translated nucleic acid database

BLAST algorithm
Method (grossly simplified!)
  – Generate all 3-tuples of the sequence
  – For each of them, find all 3-tuples that are similar
      • Use BLOSUM62 and a cut-off value (e.g., 13)
      • Ex: sequence LSPDGHD…; LSP scores 15 with itself
          – ISP and MSP score 13 → include
          – LTP, VSP, LNP, LAP score 12 → ignore
  – Locate the 3-tuples in each database sequence
  – Find matching 3-tuples that lie nearby on the same diagonal
  – Extend these segments
  – Limited dynamic programming to join high-scoring segments

Databases
Large, central repositories of sequences
More and more sequence data published directly on the web
  – e.g., in organism-specific databases
Why is it not a good idea to make sequence data (or updates) only available through specialised websites?
Why is it important to also deposit sequences in central repositories?
  – Limits user base
      • Non-specialists may not know about all web-based resources
  – Limits lifetime of the data
      • Half-life of random websites is only ~2 years
  – Limits biological context of the data
      • No comparisons to other sequences possible
  – Limits quality/coverage of central database

Databases
Nucleotide sequences
  – GenBank
      • ~47 million sequences (August, 2005)
  – EMBL
      • ~55 million sequences (August, 2005)
  – DDBJ
  – Together > 100 gigabases (August, 2005)
Protein sequences
  – UniProt
      • Swiss-Prot + TrEMBL + PIR
      • 2,738,790 sequences (January, 2006)
  – GenPept
      • Translated GenBank, EMBL, …
      • 3,230,559 sequences (January, 2006)
“Data explosion”
  – Driven by large-scale sequencing efforts

GenBank
  – DNA sequences
  – NCBI (NIH)
  – ~150x growth from 1995 to 2005

Growth of UniProtKB/TrEMBL
  – Annotated/translated protein sequences (EBI)

Databases
When searching for related sequences
  – TP = true positive = related & retrieved
  – TN = true negative = not related & not retrieved
  – FN = false negative = related & not retrieved
  – FP = false positive = not related & retrieved
Same concepts used to assess performance of machine-learning methods

Performance measures (values 0…1)
  – Sensitivity = recall = TP / (TP + FN)
      • More sensitive if fewer FN (but maybe many FP)
      • Fraction of related sequences that is retrieved
  – Selectivity = precision = TP / (TP + FP)
      • More selective if fewer FP (but maybe many FN)
      • Fraction of retrieved sequences that is related
  – F-measure = 2 * Prec * Rec / (Prec + Rec)
      • Harmonic mean of the two measures

Example: search with sequence A

  Sequences       Retrieved   Not retrieved
  Related to A    123 (TP)    12 (FN)
  Not related     19 (FP)     8183 (TN)

Sensitivity = TP/(TP+FN) = 0.91
  – Probability that a related sequence will be retrieved
  – A.k.a. recall
Selectivity = TP/(TP+FP) = 0.87
  – Probability that a retrieved sequence is related
  – A.k.a. PPV (Positive Predictive Value)
  – A.k.a. precision
Specificity = TN/(TN+FP) = 0.998
  – Probability that a non-related sequence will not be retrieved
NPV = TN/(TN+FN) = 0.999
  – Negative predictive value
  – Probability that a non-retrieved sequence will be non-related
F-measure = 0.89

Summary

                   Test positive                 Test negative   Statistics
  Property true    TP                            FN              Sensitivity, Recall
  Property false   FP                            TN              Specificity
  Statistics       Selectivity, Precision, PPV   NPV             F-measure

Statistic = true / (true + false)

Example

                 Retrieved   Not retrieved   Statistics
  Related        9           1               ?
  Not related    11          99              ?
  Statistics     ?           ?               ?

Example

                 Retrieved      Not retrieved    Statistics
  Related        9              1                9/10 = 0.9
  Not related    11             99               99/110 = 0.9
  Statistics     9/20 = 0.45    99/100 = 0.99    (2 x 0.9 x 0.45) / (0.9 + 0.45) = (1.8 x 0.45) / (3 x 0.45) = 0.6

Significance
Alignment of the random sequences generated previously in class (2006)
What level of sequence identity do you think they have?

  PojK  CGAGTTTTCGGCGTCTATCTT
  TjeJ  TAAACACAAGGCTACACA

  PojK  RVFGVYL
  TjeJ  (STOP)TQG-YT

33 %?

  PojK  ------CGAGTTTTCGGCGTCTATCTT
              | ||  | |  |
  TjeJ  TAAACACAAGGCTAC-ACA--------

44 %?

  PojK  CGAGTTTTCGGCGT-CTATCTT
          |      ||| | | | |
  TjeJ  TAAACACAAGGC-TAC-A-C-A

How about 56 %?

  PojK  ----CGAGTTTTC--GGCGT-CTATC-TT
            | |     |  ||| | | | |
  TjeJ  TAAAC-A-----CAAGGC-TAC-A-CA--

Not bad for a pair of random sequences
Composition:
  – PojK: 2A, 5C, 5G, 9T
  – TjeJ: 9A, 5C, 2G, 2T
  – Common: 2A, 5C, 2G, 2T = 11/18 = 61%

Significance
Given two sequences and a scoring scheme, it is always possible to find the optimal alignment
  – But is it statistically significant?
  – Or biologically meaningful?
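
The identity percentages quoted for the three alignments of the random sequences PojK and TjeJ can be checked with a few lines of code. The Python sketch below is an illustration rather than part of the original slides: the function name `percent_identity` is hypothetical, and it assumes the convention used earlier in the lecture (number of identical aligned residues divided by the length of the shortest ungapped sequence).

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Return %SI = 100 * identities / length of the shorter ungapped sequence."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Count columns where both rows carry the same residue (gap columns never count).
    identities = sum(
        1 for a, b in zip(aligned_a, aligned_b) if a == b and a != gap
    )
    # Sequence lengths are measured without the gap characters.
    shortest = min(
        len(aligned_a.replace(gap, "")), len(aligned_b.replace(gap, ""))
    )
    return 100.0 * identities / shortest


# The three PojK/TjeJ alignments shown on the slides.
alignments = [
    ("------CGAGTTTTCGGCGTCTATCTT", "TAAACACAAGGCTAC-ACA--------"),
    ("CGAGTTTTCGGCGT-CTATCTT", "TAAACACAAGGC-TAC-A-C-A"),
    ("----CGAGTTTTC--GGCGT-CTATC-TT", "TAAAC-A-----CAAGGC-TAC-A-CA--"),
]

for row_a, row_b in alignments:
    print(f"{percent_identity(row_a, row_b):.0f}%")  # 33%, 44%, 56%
```

Allowing more gaps raises the identity of the same two random sequences from 33% to 56%, which is why a raw percentage alone says little about significance.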
Aligning two random DNA sequences without gaps, one would expect %SI ~25%, and as much as 50% with gaps
Ditto, for proteins, %SI ~5%

Z-scores
  – Align two sequences, note score
  – Randomise one of the sequences, align, note score (e.g., 100 or 1000 times, to get average and standard deviation)
  – Alternatively, align it to a random sample of sequences from a database
  – Z-score = (Score - Average) / St.dev.
Z>15: almost certainly homologous
Z=5-15: probably homologous
Z<5: ???

Significance
Optimal local alignment scores of unrelated protein sequences follow an extreme-value distribution (EVD)
Compare: the height of the tallest person in each house in a country
Different from normal distribution: right-hand tail decays more slowly
Assuming a normal distribution would over-estimate the significance of local alignments

EVD
  – Probability of obtaining an alignment score greater than “x” by chance is: P(S>x) = 1 - exp(-exp(-λ(x-u)))
  – u = characteristic value = ln(Kmn) / λ
  – λ = decay factor, K = constant
  – m, n = sequence lengths
  – K and λ can be calculated from the substitution matrix and the relative amino-acid frequencies

Significance
EVD allows analytical calculation of the probability of exceeding a certain alignment score by chance
  – p-values
  – p = 0.01 means: 1 in 100 unrelated sequences gives at least this score
  – In a database scan against 10^6 sequences, this would retrieve 10,000 false positives
Expectation value: how many matches with at least a certain score are expected by chance in a database of N sequences
  – E-values
  – E = p * N
  – Typical cut-off for BLAST searches: E < 0.01

Significance
A search in a database with 1,000 sequences gives two hits:
  – One has E = 10^-5
  – The other has p = 10^-6
Which hit is more significant? Why?
E = pN
E(other) = 10^-6 * 10^3 = 10^-3
The hit with E = 10^-5 is more significant
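
The relation between p-values and E-values, and the shape of the EVD tail, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not code from the lecture: the function names `evd_p_value` and `e_value` are hypothetical, the characteristic value is taken in the usual Karlin-Altschul form u = ln(Kmn)/λ, and the numerical values of λ, K, m and n are made up for demonstration.

```python
import math


def evd_p_value(score: float, lam: float, k: float, m: int, n: int) -> float:
    """P(S > score) = 1 - exp(-exp(-lam * (score - u))), with u = ln(k*m*n) / lam."""
    u = math.log(k * m * n) / lam
    # -expm1(-y) equals 1 - exp(-y) but stays accurate when y is very small.
    return -math.expm1(-math.exp(-lam * (score - u)))


def e_value(p: float, n_sequences: int) -> float:
    """Expected number of chance hits with at least this score: E = p * N."""
    return p * n_sequences


# Closing exercise: 1,000 sequences, one hit reported as p = 1e-6.
# Its E-value is about 1e-3, so the hit with E = 1e-5 is the more significant one.
print(e_value(1e-6, 1_000))

# p = 0.01 in a scan against 10^6 sequences gives about 10,000 chance hits.
print(e_value(0.01, 10**6))

# Illustrative (assumed) parameters, not taken from the slides.
lam, k, m, n = 0.267, 0.041, 250, 300
for s in (30, 40, 50):
    print(s, evd_p_value(s, lam, k, m, n))
```

Because the p-value decays only exponentially in the score, comparing hits across databases of different size is easier with E-values, which fold the database size N into the number.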