Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Protein homology detection by HMM–HMM comparison Johannes Söding A topic in Sequence analysis Presented by: Giriprasad Sridhara [email protected] CISC 841 Spring 2006 APR 20 2006 Organization of presentation • • • • Introduction Theory Results Conclusion Introduction • Paper Details: – Bioinformatics journal – Vol. 21 no. 7 2005, pages 951–960 • Author Details – Dr. Johannes Söding • Department of Protein Evolution, MaxPlanckInstitute for Developmental Biology, Spemannstrasse 35, D72076 Tübingen, Germany Introduction • Tool Details: – A tool HHPred has been developed. – Described in Nucleic Acid Research, 2005, Vol 33 – A web server is available at http://www.protevo.eb.tuebingen.mpg.de/toolk it/index.php?view=hhpred Introduction • Central theme in bioinformatics: – Homology and sequence alignment • Issues: – Problem of finding a close homolog with known function or structure which would allow to make inferences about the protein under observation. – New and highly sensitive methods could detect and align remotely homologous sequences that provide information about the protein’s function, structure or evolution. Introduction • Methods (Tools) of homology detection: (In increasing order of sensitivity) – Sequence - Sequence • BLAST • FASTA – Profile - Sequence • PSIBLAST – More sensitive since it uses a sequence profile – Profile – Profile • COMPASS • PROF_SIM – Profile - HMM • HMMER – HMM-HMM • HHPred Introduction • Sequence profiles – Built from a multiple alignment of homologous sequences – Contains more information about the sequence family than a single sequence. – Helps to distinguish between • conserved and non-conserved positions • Conserved are important for defining members of the family • Non-conserved are variable among the members of the family. – Describe exactly • what variation in amino acids is possible at each position • Done by recording the probability for the occurrence of each amino acid along the multiple alignment. Introduction • Profile Hidden Markov Models (Profile HMMs) – Similar to simple sequence profiles – have amino acid frequencies as in the columns of a MSA – Also have position specific probabilities for inserts and deletions along the alignment – logarithms of these probabilities =position specific gap penalties – Perform better than sequence profiles in the detection of homologs and in the quality of alignments – Why higher sensitivity? • Position specific gap penalties penalize chance hits much more than true positives – which tend to have insertions or deletions at the same positions as the sequences from which the HMM was built. Introduction • Pictorial representation of profile HMM With M, I and D states. Theory • Align 2 HMM by maximizing a score – Score is log-sum-of-odds score. • What does a path through the 2 HMMs Represent? • A sequence co-emitted by both HMMs • How do we find this path? – Use dynamic programing (Viterbi) – Find path that maximizes log-sum-of-odds score Theory • Advantages of HMM-HMM – Improves both sensitivity and alignment quality • Calibrate the score for additional sensitivity – Use scoring correlation function – Use secondary structure information • Even sequences that are distantly homologous will have similar secondary structures. • This can help distinguish real homologs from chance hits • Biologically, secondary structures diverge more slowly than sequences • This knowledge is utilized. Theory Theory • Additionally to enhance homology detection – Score secondary structure – Use other available additional information (like confidence – term covered later on in the slides) • Tool HHPred – Homology detection & structure prediction – Novelty • HMM-HMM comparison • Scores secondary structure • Reliability measured by – Probability of each match being a true positive – Used since e-values reported by most tools can be inaccurate Theory (Log-sum-of-odds score) • Defined as SLSO log (P (x1, ..., xL | emission on path) P(x1,..., xL | NULL)) x1,.. xL • Numerator – probability that x1,…xL is co-emitted by both HMMs along the alignment path • Denominator – Standard null model probability • Summation – Runs over all sequences of L residues that can be emitted along the alignment path by both HMMs Theory • How do we apply Viterbi algorithm? • Denote – 2 HMMs p and q – Probability of emitting amino acid a in match state i or j is qi(a) and pj(a) – Trans prob = qi(X, X’) and pj(Y,Y’) – X or Y can belong to {M, I or D} – f(a) = fixed background frequency – Let Xk and Yk be states in q and p in the k’th column of the alignment of q with p. – i(k) and j(k) be the corresponding columns from q and p. – qk(l)P (a) and pk(l)P(a) = emission prob from q and p. Theory • Ρ tr is the product of all transition probabilities for the path through p and q • qk(l)P (a) = qi(k) (a) for Xk = M • qk(l)P (a) = f (a) for Xk = I l L ( q SLSO log l 1 x1,.. xL 20 20 L log ... q x11 xL 1 l 1 L 20 log ( q a 1 l 1 S aa P k (l ) P k (l ) P k (l ) ( xl ) ( xl ) ( xl ) p p P k (l ) p lL P k (l ) P k (l ) ( xl ) Ptr ) /( f ( xl )) l 1 ( xl ) Ptr // f ( xl ) ( xl ) / f (a )) log Ptr (qi ( k ), pj ( k )) log Ptr k : XkYk MM 20 Saa(qi, pj ) ColumnScore log qi (a ) pj (a ) / f (a ) a 1 Theory • Column score properties: – Positive when 2 distributions are similar – Negative otherwise – Insert states have vanishing column score • Completely non-conserved, pj(a) = f(a) – 1/f(a) • Weight factor to co-emission probability • For a rare amino acid – f(a) will be low 1/f(a) will be high – Weight of rarer amino acids increases in the score calculation as compared to common amino acids. Theory • Pair-wise alignment of HMMs • Allowed transitions • Dynamic programing matrices for Viterbi Theory • We use 5 DP matrices S xy one for each pair state XY belonging to {MM, MI, IM, DG, GD} • SMM (i, j) = Saa(qi,pj) + max { SMM(i-1,j-1) + log[q i-1(M,M) p j-1(M,M)], SMI(i-1,j-1) + log[q i-1(M,M) p j-1(I,M)] SIM(i-1,j-1) + log[q i-1(I,M) p j-1(M,M)] SDG(i-1,j-1) + log[q i-1(D,M) p j-1(M,M)] SGD(i-1,j-1) + log[q i-1(M,M) p j-1(D,M)] } Theory • SMI (i, j) = max { SMM(i-1,j) + log[q i-1(M,M) p j-1(M,I)], SMI(i-1,j) + log[q i-1(M,M) p j-1(I,I)] } • SDG (i, j) = max { SMM(i-1,j) + log[q i-1(M,D)], SDG(i-1,j) + log[q i-1(D,D) } • Initialize SMM(I,0) = 0 = SMM(0,j) • S LSO = max over last row, col of S MM • Trace back from this cell. Theory • Scoring correlations – Clustering • In an alignment of 2 homologous HMMs – Expect high column scores in » Clusters along the sequence • In an alignment of non-homologous HMMs – Do not Expect any clustering. – The above can help • Differentiate homologous and non-homologous alignments – If l’th pair state of optimum path aligns columns i(l) of q and j(l) of p • Sl = SAA(qi(l), pj(l)) iff l’th pair state = MM, else 0. – Auto-correlation function L d g (d ) SlS l d l 1 Theory • Scoring correlations – Auto-correlation function describes correlation of Sl at a fixed sequence separation d – Expect • if 2 HMMs are homologous – A Positive g(d) for small d. – Add a correction factor S corr w g (d 4 corr ) d 1 – wcorr is found empirically to be 0.1 – The correction factor is added after the best alignment is found. Theory • Scoring secondary structure – Allows to score predicted secondary structure against • Another predicted secondary structure • Or a known secondary structure • Predicted secondary structure vs. known secondary structure. – DSSP used to assign 1 of 7 states of observed secondary structure – PSIPRED used to predict secondary structure states, H, E or C. – Predict secondary structure of all domains in SCOP (filtered to twilight zone) – Compare the PSIPRED predictions with DSSP – Get the count of combination of (σ;ρ,c). • σ belongs to {H,E,C,G,B,S,T} • ρ belongs to {H,E,C} • c belongs to {0,1,…,9} Theory • Scoring secondary structure – Derive 10 3*7 substitution matrices (one for each confidence value) Mss(σ;ρ,c) = log (P (σ;ρ,c)/P(σ)P(ρ,c)) • Let – Column i of HMM q have pred sec struct ρiq and confidence value ciq – Column j of HMM p have known sec struct σjp (Note: known sec struct secondary structure of seed seq of alignment) – Define • Sss(q I p j) = wss Mss(σjp;ρiq ciq) • Empirically Wss is 1/7. • This score is added to amino acid column score Saa(qi, pj) for use in the Viterbi algorithm. Theory • Scoring secondary structure (predicted vs predicted) Mss ( iq , ciq , jp , c jp ) log P( iq , ciq | ) P( jp , c jp | ) P( ) / P( iq , ciq ) P( jp , c jp ) • The above matrix informs – How much more probable is it to get the predictions ρiq ciq and ρjp cjp for a pair of aligned homologous residues than to get them independently of each other. • Sss(q I p j) = wss Mss(ρiq ciq ρjp cjp) – Empirically Wss is 1/7. – This score is added to amino acid column score Saa(qi, pj) for use in the Viterbi algorithm. Results and Discussion • All-against-All comparison with the following similarity search tools: – Sequence-Sequence • BLAST – Profile-Sequence • PSI-BLAST – HMM-Sequence • HMMER – Profile-Profile • COMPASS • PROF_SIM • Test – Input below the twilight zone – Ability to detect remote homologs – Ability to give high-quality alignments. Results and Discussion • Different versions of tool used for better juxtaposition of results – HHSearch 0 • Simple profile-profile comparison • Gap opening penalty = -3.5, Gap Extension = -0.2 • Above used instead of transition prob log – HHSearch 1 • Basic HMM-HMM version – HHSearch 2 • Version 1 + inclusion of correlation score – HHSearch 3 • Version 2 + usage of predicted vs predicted secondary structure – HHSearch 4 • Version 3 + usage of predicted vs known secondary structure Results and Discussion • SCOP (structural classification of proteins) database with filtering for twilight zone used. • Detection of homologs: – Domain in SCOP • Family or superfamily or fold or class – Pair of domains are homologous • If they are members of the same super family • Domains from different classes are classified as nonhomologous • We present a chart of TP vs FP – TP homologous pairs – FP non-homologous pairs. Results and Discussion • The figure shows classical sensitivity in the benchmark test. Results and Discussion • Alternative definition of TP and FP – A pair is a TP • If the domains belong to same SCOP super-family • Or if the seq based alignment gives structural alignment with a “maxSub” score of at least 0.1 – A pair is a FP • If it is from different classes and has 0 MaxSub score – What is MaxSub score? • Informally – Defined such that a value > 0 occurs very rarely by chance – It tells what fraction of the query residues can be superposed structurally with the aligned residues from the other structure. • Formally – Weighted number of aligned pairs that can be superimposed with a maximum distance per pair of 3.5 Angstrom units/number of residues in the query sequence – Pairs with 0 Angstorm deviation wieght 1 – Pairs with 3.5 Angstorm deviation wieght 0.5 Results and Discussion • Plot of TP vs FP with new definition of TP and FP Results and Discussion • Observation – More sensitive tools which use secondary structure (HHSearch 3, 4) improve – Reason • Reclassification of “harder to detect” ones as TP helps the more sensitive tools, since they would detect these. Results and Discussion (Alignment quality) • Sequence alignment assessed by – Looking at the spatial distances between aligned pair of residues • upon superposition of the 3D structures • • 2 scores used. maxSub score – Drawback • • Developer’s score – – – • S Dev = N correct/min (Lq, Lp) N Correct = No of residue pairs that are present in the max subset identified by maxSub Lq and Lp = No of residues in the 2 sequences to be aligned. Modeler’s score – – – • Does not penalize over-prediction S Mod = N correct / L ali L Ali = No of aligned residue pairs in the seq alignment. Does not penalize under-prediction. Balanced score – – S balanced = (S dev + S mod) / 2 Penalizes both under and over prediction Results and Discussion Results and Discussion • HHSearch3 performs the best – Family level • • • • • • Aligns 58% of all pairs with balanced score >= 0.3 1.23 times more than COMPASS 1.28 times more than PROF_SIM 1.34 times more than HMMER 1.57 times more than PSI_BLAST 4.4 times more than BLAST – Super family level • • • • • • Aligns 27% of all pairs with balanced score >= 0.3 1.7 times more than COMPASS 1.9 times more than PROF_SIM 2.2 times more than HMMER 2.9 times more than PSI_BLAST 14 times more than BLAST Results and Discussion • HHSearch3 performs the best – Fold level • • • • • • Aligns 4.5% of all pairs with balanced score >= 0.3 3.3 times more than COMPASS 6.0 times more than PROF_SIM 7.3 times more than HMMER 9.4 times more than PSI_BLAST 63 times more than BLAST – Actually 4.5% at fold level is a lot – Pairs aligned at fold level are deemed nonhomologous by SCOP – So we do not expect any good alignments at all Conclusion • A generalization of HMM – Sequence alignment – Pairwise alignment of profile HMMs • Algorithm to maximize log-sum-of-odds score – Generalization of log-odds score • Increased sensitivity of 5-10% – Due to derivation of novel correlation score • Statistical methods for – Scoring predicted vs known secondary structure – Predicted vs predicted secondary structure – Uses confidence values of secondary structure prediction Conclusion • HHPred – New tool based on the research paper • Benchmarking – With 5 other homology detection tools – Dataset in twilight zone • Results – Improvement in • Sensitivity • Alignment quality Thank you. Have a nice day!