Sequence analysis: Macromolecular motif recognition
Sylvia Nagl

[Flowchart] DNA sequence → automatic translation → amino acid primary sequence
1. Search for sequence homologue(s) and construct an alignment
2. Homologue(s) with known 3D structure?
3. Motif recognition: search secondary databases
Secondary structure prediction
Fold assignment
Physico-chemical properties (e.g., using the EMBOSS suite)
Primary db searches: FASTA, BLAST
Homology modelling available

Terminology
• Motif: the biological object one attempts to model - a functional or structural domain, active site, phosphorylation site etc.
• Pattern: a qualitative motif description based on a regular expression-like syntax
• Profile: a quantitative motif description - assigns a degree of similarity to a potential match

Active site recognition
Example: cathepsin A, peptidase family S10, EC 3.4.16.5
3-D representation: 3D profile (PROCAT)
Active site motifs: conserved sequence patterns
  1ac5  438 LTFVSVYNASHMVPFDKS 455
  1ivy  419 IAFLTIKGAGHMVPTDKP 436

Domain recognition
• Kringle domain from plasminogen protein
• EGF-like domain from coagulation factor X

Macromolecular motif recognition
Why search for motifs?
• to find “homologous” sequences
  - apply existing information to new sequence
  - find functionally important sites
• to find templates for homology modelling
  - lecture on homology modelling

Different analysis methods
[Figure: analysis methods plotted against percent identity (100 down to 0). From high to low identity: automatic pairwise alignment (BLAST, FASTA); macromolecular motif recognition; twilight zone; midnight zone; structure prediction.]

Macromolecular motif recognition
What do we need?
• Method for defining motifs
• Algorithm for finding them
• Statistics to evaluate matches

Macromolecular motif recognition
Methods for defining motifs:
• Regular expression (patterns)
• Profiles
• Hidden Markov Model (HMM)

Macromolecular motif recognition
1-D representation: primary amino acid sequence
  MIRAAPPPLFLLLLLLLLLVSWASRGEAAPDQDEIQRLPGLAKQPSFRQYSGYLKSSGSKHLHYWFVESQKDPE
  NSPVVLWLNGGPGCSSLDGLLTEHGPFLVQPDGVTLEYNPYSWNLIANVLYLESPAGVGFSYSDDKFYATNDTE
  VAQSNFEALQDFFRLFPEYKNNKL...

Computational sequence analysis
Query secondary databases over the Internet
http://www.ebi.ac.uk/interpro/

Macromolecular motif recognition
• single motif: exact regular expression (PROSITE)
• multiple motifs: residue frequency matrices (PRINTS)
• full domain alignment: profile (PROSITE); Hidden Markov Model (Pfam, PROSITE)

Active site motifs: conserved sequence patterns
  1ac5  438 LTFVSVYNASHMVPFDKS 455
  1ivy  419 IAFLTIKGAGHMVPTDKP 436

Motif modelling methods
Prosite: Regular expressions
CARBOXYPEPT_SER_HIS
  [LIVF]-x(2)-[LIVSTA]-x-[IVPST]-x-[GSDNQL]-[SAGV]-[SG]-H-x-[IVAQ]-P-x(3)-[PSA]
Regular expressions represent features by logical combinations of characters. A regular expression defines a sequence pattern to be matched.

Regular expressions contd.
Basic rules for regular expressions
• Each position is separated by a hyphen “-”
• A symbol X is a regular expression matching itself
• x means ‘any residue’
• [ ] surround ambiguities - a string [XYZ] matches any of the enclosed symbols
• A string [R]* matches any number of strings that match [R]
• { } surround forbidden residues
• ( ) surround repeat counts

Model formation
• Restricted to key conserved features in order to reduce the “noise” level
• Built by hand in a stepwise fashion from multiple alignments

Regular expressions contd.
Regular expressions, such as PROSITE patterns, are matched to primary amino acid sequences using finite state automata.
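The pattern rules listed above can be sketched in a few lines of Python. This is only an illustration, not PROSITE's own scanner: the function name `prosite_to_regex` and the constant names are made up for this example, the CARBOXYPEPT_SER_HIS pattern is written with a hyphen between every position, only the syntax elements named in the rules are handled, and Python's `re` module stands in as the matching engine.

```python
import re

# One pattern element: x, a single residue, [allowed] or {forbidden},
# optionally followed by a repeat count such as (2) or (2,4).
ELEMENT = re.compile(r"^(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?$")


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.split("-"):      # positions are separated by "-"
        match = ELEMENT.match(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = match.groups()
        if core == "x":
            core = "."                      # x means any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"  # { } surround forbidden residues
        # [ ] ambiguities and single residues are already valid regex syntax
        if low is not None:                 # ( ) surround repeat counts
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return "".join(parts)


CARBOXYPEPT_SER_HIS = (
    "[LIVF]-x(2)-[LIVSTA]-x-[IVPST]-x-[GSDNQL]-[SAGV]-[SG]-H-x-[IVAQ]-P-x(3)-[PSA]"
)
# The two active-site fragments quoted in the slides.
FRAGMENTS = {"1ac5": "LTFVSVYNASHMVPFDKS", "1ivy": "IAFLTIKGAGHMVPTDKP"}

regex = re.compile(prosite_to_regex(CARBOXYPEPT_SER_HIS))
for name, fragment in FRAGMENTS.items():
    hit = regex.search(fragment)
    print(name, hit.group(0) if hit else "no match")
```

Changing the conserved histidine in either fragment to any other residue makes the search return no match at all, which is the behaviour the regular-expression approach trades for its simplicity.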
Matching is “all-or-none”: a sequence either matches the pattern or it does not.

Motif modelling methods
Prints: Residue frequency matrices

Motif 1
  NPESWTNFANMLW
  NPYSWVNLTNVLW
  REYSWHQNHHMIY
  NEGSWISKGDLLF
  NPYSWTNLTNVVY
  NEYSWNKMASVVY
  NDFGWDQESNLIY
  NENSWNNYANMIY
  NEYGWDQVSNLLY
  NPYAWSKVSTMIY
  NPYSWNGNASIIY
  NEYAWNKFANVLF
  NPYSWNRVSNILY
  NPYSWNLIANVLY
  NEYRWNKVANVLF

Motif 2
  LDQPFGTGYSQ
  VDNPVGAGFSY
  VDQPVGTGFSL
  VDQPGGTGFSS
  IDNPVGTGFSF
  IDQPTGTGFSV
  VDQPLGTGYSY
  IDQPAGTGFSP
  LESPIGVGFSY
  LDQPVGSGFSY
  LDQPVGSGFSY
  LDQPINTGFSN
  LDQPIGAGFSY
  LDAPAGVGFSY
  LDQPVGAGFSY

Motif 3
  FFQHFPEYQTNDFHIAGESYAGHYIP
  FFNKFPEYQNRPFYITGESYGGIYVP
  WVERFPEYKGRDFYIVGESYAGNGLM
  FLSKFPEYKGRDFWITGESYAGVYIP
  WFQLYPEFLSNPFYIAGESYAGVYVP
  FFEAFPHLRSNDFHIAGESYAGHYIP
  FFRLFPEYKDNKLFLTGESYAGIYIP
  FLTRFPQFIGRETYLAGESYGGVYVP
  FFNEFPQYKGNDFYVTGESYGGIYVP
  WMSRFPQYQYRDFYIVGESYAGHYVP
  FFRLFPEYKNNKLFLTGESYAGIYIP
  FFRLFPEYKNNKLFLTGESYAGIYIP
  WLERFPEYKGREFYITGESYAGHYVP
  WMSRFPQYRYRDFYIVGESYAGHYVP
  WFEKFPEHKGNEFYIAGESYAGIYVP

Motif 4
  LAFTLSNSVGHMAP
  LQFWWILRAGHMVA
  LMWAETFQSGHMQP
  LTYVRVYNSSHMVP
  LQEVLIRNAGHMVP
  LTFVSVYNASHMVP
  LTFARIVEASHMVP
  LTFSSVYLSGHEIP
  IDVVTVKGSGHFVP
  MTFATIKGSGHTAE
  MTFATIKGGGHTAE
  FGYLRLYEAGHMVP
  MTFATVKGSGHTAE
  ITLISIKGGGHFPA
  MTFATVKGSGHTAE

• a collection of protein “fingerprints” that exploit groups of motifs to build characteristic family signatures
• motifs are encoded in ungapped “raw” sequence format
• different scoring methods may be superimposed onto the data, e.g. BLAST
• improved diagnostic reliability
• mutual context provided by motif neighbours

Motif modelling methods
Prosite: Profiles
Feature is represented as a matrix with a score for every possible character. Matrix is derived from a sequence alignment, e.g.:
  F K L L S H C L L V
  F K A F G Q T M F Q
  Y P I V G Q E L L G
  F P V V K E A I L K
  F K V L A A V I A D
  L E F I S E C I I Q
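As a rough sketch of how position-specific scores can be obtained from such an alignment, the code below counts residues per column and converts the counts to log-odds scores. Several things here are assumptions made for illustration: the alignment letters in the slide are read as six sequences of ten residues, the background is uniform over the 20 amino acids, and a pseudocount of 1 is added to every count. PROSITE profiles are additionally based on amino acid substitution matrices, so the numbers produced here will not reproduce the derived matrix shown in the slides.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Assumed row-wise reading of the example alignment (6 sequences x 10 positions).
ALIGNMENT = [
    "FKLLSHCLLV",
    "FKAFGQTMFQ",
    "YPIVGQELLG",
    "FPVVKEAILK",
    "FKVLAAVIAD",
    "LEFISECIIQ",
]


def log_odds_matrix(alignment, pseudocount=1.0):
    """Return one {residue: score} dict per alignment position."""
    n_seqs = len(alignment)
    background = 1.0 / len(AMINO_ACIDS)        # assumption: uniform background
    total = n_seqs + pseudocount * len(AMINO_ACIDS)
    matrix = []
    for column in zip(*alignment):             # iterate over alignment columns
        counts = Counter(column)
        row = {}
        for aa in AMINO_ACIDS:
            freq = (counts[aa] + pseudocount) / total
            row[aa] = math.log2(freq / background)
        matrix.append(row)
    return matrix


def score(matrix, segment):
    """Sum the position-specific scores of an ungapped segment."""
    return sum(row[aa] for row, aa in zip(matrix, segment))


matrix = log_odds_matrix(ALIGNMENT)
best_first = max(matrix[0], key=matrix[0].get)
print("highest-scoring residue at position 1:", best_first)
print("score of FKLLSHCLLV:", round(score(matrix, "FKLLSHCLLV"), 2))
```

Unlike a regular expression, a matrix like this gives every candidate segment a graded score, so near-misses rank below true members instead of being rejected outright.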
Derived matrix (rows: alignment positions; columns: amino acids):
Pos   A   C   D   E   F   G   H   I   K   L   M   N   P   Q   R   S   T   V   W   Y
  1 -18 -22 -35 -27  60 -30 -13   3 -26  14   3 -22 -30 -32 -18 -22 -10   0   9  34
  2 -10 -33   0  15 -30 -20 -12 -27  25 -28 -15  -6  24   5   9  -8 -10 -25 -25 -18
  3  -1 -18 -32 -25  12 -28 -25  21 -25  19  10 -24 -26 -25 -22 -16  -6  22 -18  -1
  4  -8 -18 -33 -26  14 -32 -25  25 -27  27  14 -27 -28 -26 -22 -21  -7  25 -19   1
  5   8 -22  -7  -9 -26  28 -16 -29  -6 -27 -17   1 -14  -9 -10  11  -5 -19 -25 -23
  6  -3 -26   6  23 -29 -14  14 -23   4 -20 -10   8 -10  24   0   2  -8 -26 -27 -12
  7   3  22 -17  -9 -15 -23 -22  -8 -15  -9  -9 -15 -22 -16 -18  -1   2   6 -34 -19
  8 -10 -24 -34 -24   4 -33 -22  33 -27  33  25 -24 -24 -17 -23 -24 -10  19 -20   0
  9  -2 -19 -31 -23  12 -27 -23  19 -26  26  12 -24 -26 -23 -22 -19  -7  16 -17   0
 10  -8  -7   0  -1 -29  -5 -10 -23   0 -21 -11  -4 -18   7  -4  -4 -11 -16 -28 -18

Profiles contd.
• inclusion of all possible information to maximise overall signal of protein/domain, i.e., a full representation of features in the aligned sequences
• can detect distant relationships with only a few well-conserved residues
• position-dependent weights/penalties for all 20 amino acids (based on amino acid substitution matrices) and for gaps and insertions
• dynamic programming algorithms for scoring hits

Macromolecular motif recognition
Pfam and Prosite: Hidden Markov Models (HMMs)
• Feature is represented by a probabilistic model of interconnecting match, delete or insert states
• contains statistical information on observed and expected positional variation - “platonic ideal of protein family”
[Diagram: HMM states B (begin), Mi (match), Ii (insert), Di (delete), E (end)]

Macromolecular motif recognition
Pfam and Prosite: Hidden Markov Models (HMMs)
• Probability that a given amino acid occurs in a particular state (M, I, D) at a particular position in the sequence (for all 20 amino acids, profile-like)
• Probability of a transition between states

Statistical significance
• Statistical tests aim to assess the likelihood that a match of a query sequence to a profile, regular expression, HMM, etc., is the result of chance.
• They control for such factors as sequence (match) length, amino acid composition and size of the database searched.

Statistical significance
• log-odds score: the log of the ratio between two probabilities - the probability that the sequence belongs to the positive set, and the probability that the result was obtained by chance due to the amino acid distribution in the positive set (random model).
• Z-score: one needs to estimate an average score and a standard deviation as a function of sequence length. Then, one uses the number of standard deviations each sequence is away from the average as the score.
• e-value (Expect value): given a database search result with alignment score S, the e-value is the expected number of sequences of score >= S that would be found by random chance.
• p-value: the probability that one or more sequences of score >= S would have been found randomly.

INTERPRO
• The InterPro database allows efficient searching
• An integrated annotation resource for protein families, domains and functional sites that amalgamates the efforts of the PROSITE, PRINTS, Pfam, ProDom, SMART and TIGRFAMs secondary database projects.
http://www.ebi.ac.uk/interpro
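The score definitions in the Statistical significance slides can be made concrete with a small sketch. The function names are illustrative only. The Z-score here uses a single mean and standard deviation from a list of background scores, ignoring the dependence on sequence length that the slides mention. The conversion from e-value to p-value assumes the number of chance hits follows a Poisson distribution, which is a common convention and is not stated in the slides.

```python
import math
from statistics import mean, stdev


def log_odds(p_model, p_random, base=2.0):
    """Log of the ratio between the model and random-model probabilities."""
    return math.log(p_model / p_random, base)


def z_score(score, background_scores):
    """Number of standard deviations a score lies above the background mean."""
    return (score - mean(background_scores)) / stdev(background_scores)


def p_value_from_e_value(e_value):
    """P(at least one chance hit), assuming Poisson-distributed hit counts."""
    return -math.expm1(-e_value)   # equals 1 - exp(-E), stable for small E


print(log_odds(0.5, 0.125))
print(z_score(10.0, [2.0, 4.0, 6.0]))
print(p_value_from_e_value(0.01))
```

Under the Poisson assumption the p-value and e-value are nearly equal for small values, and they diverge as the e-value approaches or exceeds 1, where the p-value saturates towards 1.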