Download Motif recognition - www.bioinf.org.uk

Sequence analysis:
Macromolecular motif recognition
Sylvia Nagl
DNA sequence -> automatic translation -> amino acid primary sequence
1. Search for sequence homologue(s) and construct an alignment
   Primary db searches: FASTA, BLAST
2. Homologue(s) with known 3D structure?
   If available: homology modelling
3. Motif recognition: search secondary databases
Further analyses: secondary structure prediction, fold assignment,
physico-chemical properties (e.g., using the EMBOSS suite)
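The automatic translation step at the top of this flow can be sketched as follows. This is a minimal illustration only: it assumes the standard genetic code, a single forward reading frame starting at the first base, and an upper-case DNA string. The name `translate` and the use of "X" for unrecognised codons are choices made for this sketch, not something taken from the slides.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate frame 1 of a DNA string, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE.get(dna[i:i + 3], "X")  # "X" for unknown codons
        if aa == "*":  # stop codon
            break
        protein.append(aa)
    return "".join(protein)


print(translate("ATGATTCGTTAA"))
```

The resulting amino acid string is what the homologue searches and motif searches in steps 1 to 3 take as input.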
Terminology
•Motif: the biological object one attempts to model - a
functional or structural domain, active site,
phosphorylation site etc.
•Pattern: a qualitative motif description based on a
regular expression-like syntax
•Profile: a quantitative motif description - assigns a
degree of similarity to a potential match
Active site recognition
EXAMPLE: CATHEPSIN A, PEPTIDASE FAMILY S10, EC 3.4.16.5
3-D representation: 3D profile (PROCAT)
Active site motifs: conserved sequence patterns
1ac5   438 LTFVSVYNASHMVPFDKS 455
1ivy   419 IAFLTIKGAGHMVPTDKP 436
Domain recognition
Kringle domain from
plasminogen protein
EGF-like domain from
coagulation factor X
Macromolecular motif recognition
Why search for motifs?
•to find “homologous” sequences
   apply existing information to a new sequence
   find functionally important sites
•to find templates for homology modelling (see the lecture on homology modelling)
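Different analysis methods
Figure: percent identity scale (100 down to 0) against the applicable method. Automatic pairwise alignment (BLAST, FASTA) sits at the high-identity end, macromolecular motif recognition in the middle of the scale, and structure prediction at the low-identity end, which is labelled “Twilight zone” (around 20%) and “Midnight zone” (towards 0%).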
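To make the percent identity scale concrete, the sketch below computes it for the two cathepsin A active-site segments shown earlier (1ac5 and 1ivy). It assumes the two segments are already aligned without gaps and have equal length; the function name `percent_identity` is illustrative only.

```python
def percent_identity(a, b):
    """Percent of identical residues between two equal-length, ungapped segments."""
    if len(a) != len(b):
        raise ValueError("segments must have the same length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


seg_1ac5 = "LTFVSVYNASHMVPFDKS"  # residues 438-455
seg_1ivy = "IAFLTIKGAGHMVPTDKP"  # residues 419-436
print(round(percent_identity(seg_1ac5, seg_1ivy), 1))  # 8 of 18 identical
```

Even within this conserved active-site region fewer than half of the positions are identical, which is the situation where motif descriptions become more useful than plain pairwise alignment.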
Macromolecular motif recognition
What do we need?
•Method for defining motifs
•Algorithm for finding them
•Statistics to evaluate matches
Macromolecular motif recognition
Methods for defining motifs:
•Regular expression (patterns)
•Profiles
•Hidden Markov Model (HMM)
Macromolecular motif recognition
1-D representation: Primary amino acid sequence
MIRAAPPPLFLLLLLLLLLVSWASRGEAAPDQDEIQRLPGLAKQPSFRQYSGYLKSSGSKHLHYWFVESQKDPE
NSPVVLWLNGGPGCSSLDGLLTEHGPFLVQPDGVTLEYNPYSWNLIANVLYLESPAGVGFSYSDDKFYATNDTE
VAQSNFEALQDFFRLFPEYKNNKL...
Computational sequence analysis: query secondary databases over the Internet
http://www.ebi.ac.uk/interpro/
Macromolecular motif recognition
Single motif: exact regular expression (PROSITE)
Multiple motifs: residue frequency matrices (PRINTS)
Full domain alignment: profile (PROSITE), Hidden Markov Model (Pfam, PROSITE)
Active site motifs: conserved sequence patterns
1ac5   438 LTFVSVYNASHMVPFDKS 455
1ivy   419 IAFLTIKGAGHMVPTDKP 436
Motif modelling methods
Prosite: Regular expressions
CARBOXYPEPT_SER_HIS
[LIVF]-x(2)-[LIVSTA]-x-[IVPST]-x-[GSDNQL]-[SAGV]-[SG]-H-x-[IVAQ]-P-x(3)-[PSA]
Regular expressions represent features by logical
combinations of characters. A regular expression defines
a sequence pattern to be matched.
Regular expressions contd.
Basic rules for regular expressions
• Each position is separated by a hyphen “-”
• An upper-case residue symbol X is a regular expression matching itself
• x means ‘any residue’
• [ ] surround ambiguities: a string [XYZ] matches any one of the enclosed symbols
• A string [R]* matches any number of consecutive strings that match [R]
• { } surround forbidden residues
• ( ) surround repeat counts, e.g. x(2) matches any two residues
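The rules above map almost directly onto the regular expression syntax of general-purpose languages. The sketch below converts a PROSITE-style pattern into a Python `re` expression and tries it on the two cathepsin A segments from the active site slide. The function `prosite_to_regex` is a hypothetical helper written for this illustration; it covers only the hyphen-separated elements listed above (residue, x, [ ], { } and repeat counts) and assumes the pattern string has a hyphen between every position.

```python
import re

# One pattern element: x, a residue, [allowed] or {forbidden}, optional (n) or (n,m).
ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Convert a hyphen-separated PROSITE-style pattern to a Python regex string."""
    parts = []
    for element in pattern.split("-"):
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError("unsupported pattern element: " + element)
        core, low, high = m.groups()
        if core == "x":
            rx = "."                        # any residue
        elif core.startswith("{"):
            rx = "[^" + core[1:-1] + "]"    # forbidden residues
        else:
            rx = core                       # residue or [ambiguity] as-is
        if low:
            rx += "{" + low + ("," + high if high else "") + "}"
        parts.append(rx)
    return "".join(parts)


CARBOXYPEPT_SER_HIS = ("[LIVF]-x(2)-[LIVSTA]-x-[IVPST]-x-[GSDNQL]-[SAGV]-"
                       "[SG]-H-x-[IVAQ]-P-x(3)-[PSA]")
regex = prosite_to_regex(CARBOXYPEPT_SER_HIS)
for name, seq in [("1ac5", "LTFVSVYNASHMVPFDKS"), ("1ivy", "IAFLTIKGAGHMVPTDKP")]:
    print(name, bool(re.search(regex, seq)))
```

Both segments satisfy the pattern at every position, so both are reported as matches.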
Model formation
•Restricted to key conserved features in order to reduce the “noise” level
•Built by hand in a stepwise fashion from multiple alignments
Regular expressions contd.
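Regular expressions, such as PROSITE patterns, are matched to
primary amino acid sequences using finite state automata.
Matching is “all-or-none”: a sequence either satisfies every position of the pattern or it does not match at all.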
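The all-or-none behaviour can be seen in a hand-written matcher. The sketch below is a simplified stand-in for a finite state automaton: it only handles fixed-length patterns, where state i simply means "the first i positions have matched". The pattern is CARBOXYPEPT_SER_HIS written out position by position; the names `PATTERN` and `find_matches` are illustrative.

```python
# One entry per pattern position: a set of allowed residues, or None for x.
PATTERN = [set("LIVF"), None, None, set("LIVSTA"), None, set("IVPST"), None,
           set("GSDNQL"), set("SAGV"), set("SG"), set("H"), None,
           set("IVAQ"), set("P"), None, None, None, set("PSA")]


def find_matches(sequence, pattern=PATTERN):
    """Return every start index at which the whole pattern matches."""
    hits = []
    for start in range(len(sequence) - len(pattern) + 1):
        state = 0
        while state < len(pattern):
            allowed = pattern[state]
            if allowed is not None and sequence[start + state] not in allowed:
                break  # a single mismatch rejects this start position
            state += 1
        if state == len(pattern):  # accepting state reached
            hits.append(start)
    return hits


print(find_matches("LTFVSVYNASHMVPFDKS"))  # the 1ac5 segment
print(find_matches("LTFVSVYNASAMVPFDKS"))  # same segment with H replaced by A
```

Replacing the single conserved histidine is enough to lose the match completely. There is no notion of a near miss, which is the limitation that the quantitative methods (frequency matrices, profiles, HMMs) address.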
Motif modelling methods
Prints: Residue frequency matrices
Motif 1
NPESWTNFANMLW
NPYSWVNLTNVLW
REYSWHQNHHMIY
NEGSWISKGDLLF
NPYSWTNLTNVVY
NEYSWNKMASVVY
NDFGWDQESNLIY
NENSWNNYANMIY
NEYGWDQVSNLLY
NPYAWSKVSTMIY
NPYSWNGNASIIY
NEYAWNKFANVLF
NPYSWNRVSNILY
NPYSWNLIANVLY
NEYRWNKVANVLF
Motif 2
LDQPFGTGYSQ
VDNPVGAGFSY
VDQPVGTGFSL
VDQPGGTGFSS
IDNPVGTGFSF
IDQPTGTGFSV
VDQPLGTGYSY
IDQPAGTGFSP
LESPIGVGFSY
LDQPVGSGFSY
LDQPVGSGFSY
LDQPINTGFSN
LDQPIGAGFSY
LDAPAGVGFSY
LDQPVGAGFSY
Motif 3
FFQHFPEYQTNDFHIAGESYAGHYIP
FFNKFPEYQNRPFYITGESYGGIYVP
WVERFPEYKGRDFYIVGESYAGNGLM
FLSKFPEYKGRDFWITGESYAGVYIP
WFQLYPEFLSNPFYIAGESYAGVYVP
FFEAFPHLRSNDFHIAGESYAGHYIP
FFRLFPEYKDNKLFLTGESYAGIYIP
FLTRFPQFIGRETYLAGESYGGVYVP
FFNEFPQYKGNDFYVTGESYGGIYVP
WMSRFPQYQYRDFYIVGESYAGHYVP
FFRLFPEYKNNKLFLTGESYAGIYIP
FFRLFPEYKNNKLFLTGESYAGIYIP
WLERFPEYKGREFYITGESYAGHYVP
WMSRFPQYRYRDFYIVGESYAGHYVP
WFEKFPEHKGNEFYIAGESYAGIYVP
Motif 4
LAFTLSNSVGHMAP
LQFWWILRAGHMVA
LMWAETFQSGHMQP
LTYVRVYNSSHMVP
LQEVLIRNAGHMVP
LTFVSVYNASHMVP
LTFARIVEASHMVP
LTFSSVYLSGHEIP
IDVVTVKGSGHFVP
MTFATIKGSGHTAE
MTFATIKGGGHTAE
FGYLRLYEAGHMVP
MTFATVKGSGHTAE
ITLISIKGGGHFPA
MTFATVKGSGHTAE
•a collection of protein “fingerprints” that exploit groups of motifs to build
characteristic family signatures
•motifs are encoded in ungapped “raw” sequence format
•different scoring methods may be superimposed onto the data, e.g. BLAST
•improved diagnostic reliability
•mutual context provided by motif neighbours
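As an illustration of a residue frequency matrix, the sketch below counts residues per column for Motif 2 above and uses the counts to score a candidate window. This is a deliberately simple, unweighted scoring scheme assumed for the example; it is not claimed to be the scoring used by the PRINTS search tools, and the function names are hypothetical.

```python
from collections import Counter

MOTIF_2 = [
    "LDQPFGTGYSQ", "VDNPVGAGFSY", "VDQPVGTGFSL", "VDQPGGTGFSS", "IDNPVGTGFSF",
    "IDQPTGTGFSV", "VDQPLGTGYSY", "IDQPAGTGFSP", "LESPIGVGFSY", "LDQPVGSGFSY",
    "LDQPVGSGFSY", "LDQPINTGFSN", "LDQPIGAGFSY", "LDAPAGVGFSY", "LDQPVGAGFSY",
]


def frequency_matrix(aligned):
    """One Counter per motif column, holding raw residue counts."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("motif sequences must be ungapped and of equal length")
    return [Counter(seq[i] for seq in aligned) for i in range(length)]


def frequency_score(window, matrix):
    """Sum of the observed counts of each window residue in its column."""
    return sum(column[residue] for column, residue in zip(matrix, window))


matrix = frequency_matrix(MOTIF_2)
print(matrix[1])  # column 2 is almost always D
print(frequency_score("LDQPVGSGFSY", matrix), frequency_score("AAAAAAAAAAA", matrix))
```

Unlike a regular expression, the matrix gives every window a graded score, so a sequence with one unusual residue is penalised rather than rejected outright.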
Motif modelling methods
Prosite: Profiles
Feature is represented as a matrix with a score for every
possible character.
Matrix is derived from a sequence alignment, e.g.:
F K L L S H C L L V
F K A F G Q T M F Q
Y P I V G Q E L L G
F P V V K E A I L K
F K V L A A V I A D
L E F I S E C I I Q
Profiles contd.
Derived matrix (rows: amino acids; columns: alignment positions 1-10):
      1    2    3    4    5    6    7    8    9   10
A   -18  -10   -1   -8    8   -3    3  -10   -2   -8
C   -22  -33  -18  -18  -22  -26   22  -24  -19   -7
D   -35    0  -32  -33   -7    6  -17  -34  -31    0
E   -27   15  -25  -26   -9   23   -9  -24  -23   -1
F    60  -30   12   14  -26  -29  -15    4   12  -29
G   -30  -20  -28  -32   28  -14  -23  -33  -27   -5
H   -13  -12  -25  -25  -16   14  -22  -22  -23  -10
I     3  -27   21   25  -29  -23   -8   33   19  -23
K   -26   25  -25  -27   -6    4  -15  -27  -26    0
L    14  -28   19   27  -27  -20   -9   33   26  -21
M     3  -15   10   14  -17  -10   -9   25   12  -11
N   -22   -6  -24  -27    1    8  -15  -24  -24   -4
P   -30   24  -26  -28  -14  -10  -22  -24  -26  -18
Q   -32    5  -25  -26   -9   24  -16  -17  -23    7
R   -18    9  -22  -22  -10    0  -18  -23  -22   -4
S   -22   -8  -16  -21   11    2   -1  -24  -19   -4
T   -10  -10   -6   -7   -5   -8    2  -10   -7  -11
V     0  -25   22   25  -19  -26    6   19   16  -16
W     9  -25  -18  -19  -25  -27  -34  -20  -17  -28
Y    34  -18   -1    1  -23  -12  -19    0    0  -18
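Scoring a sequence against this matrix amounts to looking up one value per position and adding them. The sketch below does this for an ungapped 10-residue window, using the matrix values from the slide stored one alignment position per row. It is a simplified illustration: the slide gives no gap or insertion penalties, so gapped matching is left out, and the names `POSITION_SCORES` and `profile_score` are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# One row per alignment position (1-10); values in AMINO_ACIDS order.
POSITION_SCORES = [
    [-18, -22, -35, -27, 60, -30, -13, 3, -26, 14, 3, -22, -30, -32, -18, -22, -10, 0, 9, 34],
    [-10, -33, 0, 15, -30, -20, -12, -27, 25, -28, -15, -6, 24, 5, 9, -8, -10, -25, -25, -18],
    [-1, -18, -32, -25, 12, -28, -25, 21, -25, 19, 10, -24, -26, -25, -22, -16, -6, 22, -18, -1],
    [-8, -18, -33, -26, 14, -32, -25, 25, -27, 27, 14, -27, -28, -26, -22, -21, -7, 25, -19, 1],
    [8, -22, -7, -9, -26, 28, -16, -29, -6, -27, -17, 1, -14, -9, -10, 11, -5, -19, -25, -23],
    [-3, -26, 6, 23, -29, -14, 14, -23, 4, -20, -10, 8, -10, 24, 0, 2, -8, -26, -27, -12],
    [3, 22, -17, -9, -15, -23, -22, -8, -15, -9, -9, -15, -22, -16, -18, -1, 2, 6, -34, -19],
    [-10, -24, -34, -24, 4, -33, -22, 33, -27, 33, 25, -24, -24, -17, -23, -24, -10, 19, -20, 0],
    [-2, -19, -31, -23, 12, -27, -23, 19, -26, 26, 12, -24, -26, -23, -22, -19, -7, 16, -17, 0],
    [-8, -7, 0, -1, -29, -5, -10, -23, 0, -21, -11, -4, -18, 7, -4, -4, -11, -16, -28, -18],
]


def profile_score(window):
    """Ungapped score of a 10-residue window against the profile."""
    if len(window) != len(POSITION_SCORES):
        raise ValueError("window must be exactly 10 residues long")
    return sum(row[AMINO_ACIDS.index(residue)]
               for row, residue in zip(POSITION_SCORES, window))


print(profile_score("FKLLSHCLLV"))  # first sequence of the alignment
print(profile_score("WWWWWWWWWW"))  # unrelated window
```

Note that residues never seen in a column still receive a score (for example M at position 8), which is how a profile can recognise family members that a pattern built from the same alignment would miss.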
Profiles contd.
•inclusion of all possible information to maximise the overall
signal of the protein/domain, i.e., a full representation of
features in the aligned sequences
•can detect distant relationships with only a few well-conserved
residues
•position-dependent weights/penalties for all 20 amino
acids (based on amino acid substitution matrices) and for
gaps and insertions
•dynamic programming algorithms for scoring hits
Macromolecular motif recognition
Pfam and Prosite: Hidden Markov Models
(HMMs)
•Feature is represented by a probabilistic model of
interconnecting match, delete or insert states
•contains statistical information on observed and expected
positional variation - “platonic ideal of protein family”
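Figure: HMM architecture with a begin state (B), match (Mi), insert (Ii) and delete (Di) states at each position i, and an end state (E).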
Macromolecular motif recognition
Pfam and Prosite: Hidden Markov Models (HMMs)
•P of a given amino acid occurring in a particular state (M, I, D) at a
particular position in the sequence (for all 20 amino acids, profile-like)
•P of a transition between states
Figure: the same HMM architecture (B, Mi, Ii, Di, E), annotated with these two kinds of probability.
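How the two kinds of probability combine can be shown on a toy model. The sketch below multiplies transition and emission probabilities along one given state path through a two-position HMM. Every number here is hypothetical and chosen only so that the probabilities leaving each state sum to 1; the alphabet is cut down to three residues, and delete states are treated as silent (emitting no residue), as in standard profile HMMs. It computes the probability of a single path, not the best path or the total over all paths.

```python
# Hypothetical emission probabilities for match (M) and insert (I) states.
EMIT = {
    "M1": {"A": 0.8, "C": 0.1, "G": 0.1},
    "M2": {"A": 0.1, "C": 0.8, "G": 0.1},
    "I1": {"A": 1 / 3, "C": 1 / 3, "G": 1 / 3},
}

# Hypothetical transition probabilities; B = begin, E = end, D = delete.
TRANS = {
    ("B", "M1"): 0.9, ("B", "D1"): 0.1,
    ("M1", "M2"): 0.8, ("M1", "I1"): 0.1, ("M1", "D2"): 0.1,
    ("I1", "I1"): 0.3, ("I1", "M2"): 0.7,
    ("D1", "M2"): 1.0,
    ("M2", "E"): 1.0, ("D2", "E"): 1.0,
}


def path_probability(path, residues):
    """Joint probability of a state path (without B and E) and the residues it emits."""
    if sum(state in EMIT for state in path) != len(residues):
        raise ValueError("number of emitting states must equal number of residues")
    emitted = iter(residues)
    prob = 1.0
    previous = "B"
    for state in path:
        prob *= TRANS.get((previous, state), 0.0)
        if state in EMIT:  # delete states are silent
            prob *= EMIT[state].get(next(emitted), 0.0)
        previous = state
    return prob * TRANS.get((previous, "E"), 0.0)


print(path_probability(["M1", "M2"], "AC"))         # two match states
print(path_probability(["M1", "I1", "M2"], "AGC"))  # one inserted residue
print(path_probability(["D1", "M2"], "C"))          # first position deleted
```

In a real search the path is not known in advance, and dynamic programming is used to find the most probable path or to sum over all of them.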
Statistical significance
•Statistical tests aim to assess the likelihood that a match of a
query sequence to a profile, regular expression, HMM, etc, is
the result of chance.
•They control for such factors as sequence (match) length,
amino acid composition and size of the database searched.
Statistical significance
•log-odds score: this number is the log of the ratio between two
probabilities - P that the sequence belongs to the positive set, and P
that the result was obtained by chance due to the amino acid
distribution in the positive set (random model).
•Z-score: one needs to estimate an average score and a standard
deviation as a function of sequence length. Then, one uses the
number of standard deviations each sequence is away from the
average as the score.
•e-value (Expect value): given a database search result with
alignment score S, the e-value is the expected number of sequences
of score >= S that would be found by random chance.
•p-value: the probability that one or more sequences of score >= S
would have been found randomly.
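The four measures can be written down in a few lines. The sketch below is illustrative only and makes several assumptions that the slides do not state: the log-odds score is taken in base 2 (bits), the Z-score uses a single background sample rather than length-dependent estimates, and the p-value is derived from the e-value by assuming that the number of chance hits follows a Poisson distribution. All function names are hypothetical.

```python
import math
from statistics import mean, stdev


def log_odds(p_model, p_random):
    """Log (base 2) of the ratio between model and random-model probabilities."""
    return math.log2(p_model / p_random)


def z_score(score, background_scores):
    """Number of standard deviations the score lies above the background mean."""
    return (score - mean(background_scores)) / stdev(background_scores)


def p_from_e(e_value):
    """P(at least one chance hit with score >= S), assuming Poisson-distributed hits."""
    return 1.0 - math.exp(-e_value)


print(log_odds(0.8, 0.05))            # positive: more likely under the model
print(z_score(30, [10, 12, 8, 10]))   # far above the background scores
print(p_from_e(0.01), p_from_e(5.0))  # p is close to E only when E is small
```

For small values the e-value and p-value are nearly equal, but a p-value can never exceed 1, whereas an e-value grows with the size of the database searched.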
INTERPRO
•The InterPro database allows efficient searching
•An integrated annotation resource for protein families,
domains and functional sites that amalgamates the efforts of
the PROSITE, PRINTS, Pfam, ProDom, SMART and
TIGRFAMs secondary database projects.
http://www.ebi.ac.uk/interpro