Transcript
Sequence Comparison – Identification of remote homologues
Amir Harel, Moran Yassour

Overview
Homologous proteins
Protein sequence comparison
BLAST and its improvements
PSI-BLAST

Homologous Proteins
Proteins that share a common ancestor are called homologous.
Common three-dimensional folding structure

Homologous Proteins
Homology refers to a similarity that spans an entire folding domain.
The difficulty in defining homology

Why is homology important?
Prediction of a protein's properties
Classification of proteins into families
Evolutionary tree

How to identify homology?
Using sequence similarities
Aligning two proteins
Giving a score to the alignment

Global & Local Alignments
Global alignment: alignment of the entire sequence
Local alignment: alignment of a segment of the sequence

How to score an alignment
Substitution matrix: Sij = a value proportional to the probability that amino acid i mutated into amino acid j

Types of Substitution Matrices
PAM: comparison of closely related sequences
BLOSUM: multiple alignments of distantly related sequences

Substitution Matrices
Different matrices reflect different evolutionary distances:
1 PAM represents the evolutionary distance of 1 amino acid substitution per 100 amino acids.
BLOSUM X: all sequences with a similarity higher than X% were summarized into one

Gap costs
The most widely used gap score is -(a+bk) for a gap of length k.
Long gaps do not cost much more than short ones, since a single mutation may cause a large gap.

Basic Sequence Comparison
Smith & Waterman (1981): dynamic programming for sequence comparison
[Figure: m x n dynamic-programming matrix, O(mn)]

Complexity issue
When DBs become larger, m grows
Time complexity
Space complexity

Intuition to Solution
Go over less than the whole matrix
Put the spotlight on segments that can be part of the best path and extend them.
The best path is close to a diagonal
[Figure: m x n matrix with the search restricted to a region near the diagonal, less than O(mn)]

Heuristic procedures
Heuristic: an algorithm that usually, but not always, works, or that gives nearly the right answer.
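The intuition above (fill only the part of the matrix near a diagonal) can be sketched as a banded version of the Smith & Waterman recurrence. This is only an illustrative sketch, not BLAST itself: the function name, the match/mismatch scores, the linear gap cost (instead of -(a+bk)) and the `band` parameter are all assumptions chosen to keep the example short.

```python
def local_alignment_score(a, b, band=None, match=2, mismatch=-1, gap=-2):
    """Best local alignment score using the Smith-Waterman recurrence.

    Assumptions for illustration: simple match/mismatch scores instead of a
    PAM or BLOSUM matrix, and a linear gap cost instead of -(a+bk).
    If band is given, only cells with |i - j| <= band are filled; cells
    outside the band stay at 0, so the true optimum may be missed.
    """
    m, n = len(a), len(b)
    prev = [0] * (n + 1)  # previous row of the DP matrix
    best = 0
    for i in range(1, m + 1):
        curr = [0] * (n + 1)
        lo, hi = 1, n
        if band is not None:
            lo, hi = max(1, i - band), min(n, i + band)
        for j in range(lo, hi + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                 # start a new local alignment
                prev[j - 1] + s,   # align a[i-1] with b[j-1]
                prev[j] + gap,     # gap in b
                curr[j - 1] + gap, # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best


full = local_alignment_score("TTTTACGT", "ACGT")            # whole matrix, O(mn)
narrow = local_alignment_score("TTTTACGT", "ACGT", band=1)  # near-diagonal only
```

In this toy case the best alignment lies four cells off the main diagonal, so the narrow band scores lower than the full matrix. That is the trade-off a heuristic accepts in exchange for filling fewer cells.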
There is no guarantee of finding the best match.

BLAST – Basic Local Alignment Search Tool
BLAST first scans the DB for words that score at least T when aligned with some word within the query sequence; these are called hits. O(n)
Each hit is extended in both directions as long as the score hasn't dropped too much.

BLAST
[Figure: dot matrix of word hits between the query and a database sequence]

A word about the parameter T
Small T: greater sensitivity, more hits to expand
Large T: lower sensitivity, fewer hits to expand

Gapped BLAST
The original BLAST was ungapped
Soon after came gapped BLAST

BLAST - Results
P value: the probability of an alignment occurring with score S or better.
E value: expectation value. The number of different alignments with scores S or better that are expected to occur in this DB search by chance.
Lower E value -> more significant score.

E-value and Homology
A non-significant score does not necessarily imply non-homology:

E-value and Homology
Use it wisely
Choose your substitution matrix
Choose your DB

Example 1 – remote homology
Frequently, identification of a remote homology will require several database searches.
The glutathione transferase family

Remote homology
Testing the possibility that elongation factors share homology with glutathione S-transferases:
There is a clear relationship between this elongation factor and the class-theta glutathione transferases.

Example 2 - mapping
Three different families of G-protein coupled receptors:
the R family (the largest)
the C/S family
the G receptor family

Finding links between families
Name         Description                                   Score  E-value
OPSD_HUMAN   RHODOPSIN.                                     2347  0
OPSG_CHICK   GREEN-SENSITIVE OPSIN (GREEN CONE PHOTO        1791  0
OPSG_HUMAN   GREEN-SENSITIVE OPSIN (GREEN CONE PHOTO        1002  0
OPS1_DROME   OPSIN RH1 (OUTER R1-R6 PHOTORECEPTOR CE         527  3.10E-30
NK2R_MOUSE   SUBSTANCE-K RECEPTOR (SKR) (NEUROKININ          435  1.10E-23
SSR5_HUMAN   SOMATOSTATIN RECEPTOR TYPE 5.                   431  1.50E-23
TXKR_HUMAN   PUTATIVE TACHYKININ RECEPTOR.                   419  3.50E-22
5H7_HUMAN    5-HYDROXYTRYPTAMINE 7 RECEPTOR (5-HT-7)         283  6.40E-14
CKR1_HUMAN   C-C CHEMOKINE RECEPTOR TYPE 1 (C-C CKR          280  8.50E-14
ETBR_RAT     ENDOTHELIN B RECEPTOR PRECURSOR (ET-B) (E       278  1.50E-13
AA2B_RAT     ADENOSINE A2B RECEPTOR.                         276  1.60E-13
MAS_MOUSE    MAS PROTO-ONCOGENE.                             133  0.007
PAFR_MACMU   PLATELET ACTIVATING FACTOR RECEPTOR (PA         130  0.007
OLF2_RAT     OLFACTORY RECEPTOR-LIKE PROTEIN F12.            135  0.009
MAS_RAT      MAS PROTO-ONCOGENE.                             131  0.01
CAR1_DICDI   CYCLIC AMP RECEPTOR 1                           130  0.01
OLF2_CHICK   OLFACTORY RECEPTOR-LIKE PROTEIN COR2.           129  0.02
CAR3_DICDI   CYCLIC AMP RECEPTOR 3.                          124  0.05
MAS_HUMAN    MAS PROTO-ONCOGENE.                             120  0.06
OLF1_CHICK   OLFACTORY RECEPTOR-LIKE PROTEIN COR1.           117  0.17
PER2_MOUSE   PROSTAGLANDIN E RECEPTOR, EP2 SUBTYPE.          121  0.23

Finding links between families
Name         Description                                   Family  Score  E-value
CAR1_DICDI   CYCLIC AMP RECEPTOR 1.                                2678   0
CAR3_DICDI   CYCLIC AMP RECEPTOR 3.                                1524   0
CAR2_DICDI   CYCLIC AMP RECEPTOR 2.                                1497   0
CALR_HUMAN   CALCITONIN RECEPTOR PRECURSOR (CT-R).         C/S     167    0.00042
IL8B_HUMAN   HIGH AFFINITY INTERLEUKIN-8 RECEPTOR B        R       161    0.00073
CLRA_RAT     CALCITONIN RECEPTOR A PRECURSOR (CT-R-A)      C/S     162    0.00087
CLRB_RAT     CALCITONIN RECEPTOR B PRECURSOR (CT-R-B)      C/S     162    0.00095
DIHR_MANSE   DIURETIC HORMONE RECEPTOR (DH-R).             C/S     150    0.0045
CALR_PIG     CALCITONIN RECEPTOR PRECURSOR (CT-R).         C/S     145    0.012
GLR_RAT      GLUCAGON RECEPTOR PRECURSOR (GL-R).           C/S     145    0.012
IL8B_RABIT   HIGH AFFINITY INTERLEUKIN-8 RECEPTOR B        R       141    0.016
RDC1_HUMAN   G PROTEIN-COUPLED RECEPTOR RDC1 HOMOLOG       R       139    0.022
G10D_RAT     PROBABLE G PROTEIN-COUPLED RECEPTOR G10D      R       133    0.061
OPSD_HUMAN   RHODOPSIN.                                    R       130    0.085
VIPR_HUMAN   VASOACTIVE INTESTINAL POLYPEPTIDE RECEP       C/S     131    0.098
OPSD_SPHSP   OPSIN.                                        R       129    0.11
SCRC_RAT     SECRETIN RECEPTOR PRECURSOR.                  C/S     129    0.13
IL8A_HUMAN   HIGH AFFINITY INTERLEUKIN-8 RECEPTOR A        R       127    0.14
GLPR_RAT     GLUCAGON-LIKE PEPTIDE 1 RECEPTOR PRECURSO     C/S     143.1  0.16
AG2S_XENLA   TYPE-1-LIKE ANGIOTENSIN II RECEPTOR 2         R       126    0.16

Building a protein tree

Conclusions
Searching with high-scoring sequences, related or unrelated, is a very important tool.
Homology is a transitive relation…

BLAST – Pros & Cons
Pros: It works
Cons:
Statistical evaluation rather than a biological one.
Convergent evolution
Weak but biologically relevant similarities may be overlooked (PSI-BLAST will improve this issue)

BLAST improvements
Running time improvements:
Two-hit method
Seed extension
PSI-BLAST

The two-hit method
The extension step accounts for more than 90% of BLAST's execution time
Invoke an extension only when two nonoverlapping hits are found within a certain distance of one another

The two-hit method
[Figure: dot matrix of word hits; labels: first hit, second hit, two-hit extension]

Seed Extension

PSI-BLAST
Evolutionary pressure
Needle in a haystack
PSI-BLAST comes to solve this problem

Evolution reveals itself
Giving more significance to the conserved areas and ignoring the background noise
PSI-BLAST = Position-Specific Iterated BLAST; it shifts our view to these areas using the Position-Specific Score Matrix (PSSM)

Position-Specific Matrix - PSSM
Pij = proportional to the probability of finding the ith amino acid in the jth position in these sequences
The PSSM represents the distribution of the amino acids in each position in a collection of sequences

Steps in PSI-BLAST
Initiation: running gapped BLAST on the query, outputting a collection of matching sequences
Iteration: constructing the PSSM based on the best sequences in this collection
The PSSM is compared to the protein DB, again seeking alignments

PSI-BLAST Example
We start with an uncharacterized protein: MJ0414
When submitting the query we set the E-value threshold to 0.01 (higher than usual)
Result of initial gapped BLAST
First iteration: iterating the search using the derived profile uncovers DNA ligase II with an E-value of 0.005
Second iteration

Interpretation of the results
Including a strong but unrelated protein will shift the PSSM in its direction
E-values retrieved in later iterations should not be taken as automatic proof of homology
Was the ligase the right choice?

PSI-BLAST Conclusions
Uncovers protein relationships missed by single-pass database-search methods
Errors are easily amplified by iterations.
PSI-BLAST increases rather than removes the need for expertise, because there is more to interpret

Running time evaluation
Method           Normalized running time
Smith-Waterman   36
Original BLAST   1.0
Gapped BLAST     0.34
PSI-BLAST        0.87
Running time can be highly influenced by modifying parameters

Future Improvements
Accepting a PSSM as input from other programs
Realignment: improve the alignment before going over the DB
Automatic domain recognition

Summary
In BLAST, use multiple searches for maximum knowledge
The BLAST improvements are considerably faster and significantly enhance the abilities of DB search
For many queries, PSI-BLAST can greatly increase sensitivity to weak but biologically relevant sequence relationships

Questions time
Thank You

References
Pearson WR. (1997) Identifying distantly related protein sequences.
Comput Appl Biosci., 13, 325-332.
Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389-3402.
Altschul SF, Koonin EV. (1998) Iterated profile searches with PSI-BLAST – a tool for discovery in protein databases. Trends Biochem Sci., 23, 444-447.

Sites
http://www.ncbi.nlm.nih.gov/BLAST
http://www.cs.huji.ac.il/~cbio
http://www.people.virginia.edu/~wrp/
http://www-lmmb.ncifcrf.gov/

Appendix - Statistics
$S' = \dfrac{\lambda S - \ln K}{\ln 2}$
$E = \dfrac{N}{2^{S'}}$
$N = nm$
$S' = \log_2 \dfrac{N}{E}$
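
The relations in the statistics appendix can be tried out numerically. The sketch below assumes the standard forms S' = (λS - ln K) / ln 2, E = N / 2^S' and N = nm, where S is the raw score, S' the bit score, and n and m the query and database lengths. The function names are hypothetical, and the λ and K values in the usage lines are illustrative assumptions, not values taken from the slides.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalized (bit) score S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(bits, query_len, db_len):
    """Expected number of chance alignments: E = N / 2**S', with N = n * m."""
    search_space = query_len * db_len
    return search_space / 2 ** bits


def bits_needed(target_e, query_len, db_len):
    """Bit score required to reach a given E-value: S' = log2(N / E)."""
    return math.log2(query_len * db_len / target_e)


# Illustrative (assumed) parameters; real values depend on the scoring system.
lam, k = 0.267, 0.041
bits = bit_score(100, lam, k)
expect = e_value(bits, query_len=300, db_len=10**8)
threshold_bits = bits_needed(0.01, query_len=300, db_len=10**8)
```

Because E depends on N = nm, the same raw score becomes less significant as the database grows, which is one reason the choice of DB matters when interpreting E-values.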