Protein Sequence Analysis

Protein Sequence Motifs
Aalt-Jan van Dijk
Plant Research International, Wageningen UR
Biometris, Wageningen UR
www.bioinformatics.nl
Plant Bioinformatics
Genomics
Next Generation Sequencing
Genome assembly & annotation
(Comparative) genome analysis
SNP analysis, marker development
Computational infrastructure
Database development
Web-based analysis tools
Software development
Workflow management systems
Machine learning
Data (pre-)processing pipelining
Alternative splicing
Protein interaction networks
Metabolomics
Alternative splicing
EST analysis
Proteomics
Technology
Integrated analysis of omics datasets
Transcriptomics
Database development
Data (pre-)processing pipelining
Metabolite and pathway identification
Systems biology
Network modelling (bottom-up)
Protein interaction networks
My research
Protein complex structures
• Protein-protein docking
• Correlated mutations
Interaction site prediction/analysis
• Protein-protein interactions
• Protein-DNA interactions
Motif search
• Enzyme active sites
Overview
• Protein Motif Searching
• Hydrophobicity & Transmembrane Domains
• Protein Interactions
• Sequence-motifs to predict interaction sites
• Secondary Structure Prediction
Protein Motif Searching
What is a motif?
• A motif is a description of a particular element of a protein that contains a specific sequence pattern
• Motifs are identified by
  • 3D structural alignment
  • Multiple sequence alignment
  • Pattern searching programs
Protein Motif Searching
Strict consensus pattern
• use only strictly conserved residues
But what about:
• variable residues?
• gaps?
C--QASCDGIPLKMNDC
C---VTCEGLPMRMDQC
CERTLGCQPMPVH---C
CxxxxxCxxxPxxxxxC
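The strict consensus for the three aligned sequences above can be derived column by column. The following is a minimal sketch, assuming that a gap counts as "not conserved"; the function name `strict_consensus` is made up for this illustration.

```python
def strict_consensus(alignment):
    """Return a strict consensus: the residue where a column is fully
    conserved, 'x' everywhere else (including columns that contain gaps)."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("aligned sequences must have equal length")
    consensus = []
    for column in zip(*alignment):
        residues = set(column)
        if len(residues) == 1 and "-" not in residues:
            consensus.append(column[0])   # identical residue in every sequence
        else:
            consensus.append("x")         # variable position or gap
    return "".join(consensus)


alignment = [
    "C--QASCDGIPLKMNDC",
    "C---VTCEGLPMRMDQC",
    "CERTLGCQPMPVH---C",
]
print(strict_consensus(alignment))  # CxxxxxCxxxPxxxxxC
```

Only the three cysteines and one proline survive, which is why a strict consensus cannot express alternative residues or flexible spacing.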
Protein Motif Searching
Strict consensus patterns contain
• no alternative residues
• no flexible regions
• no mismatches
• no gaps
Protein Motif Searching
• Most motifs are defined as regular expressions
• Motifs can contain
  • alternative residues
  • flexible regions
C-x(2,5)-C-x-[GP]-x-P-x(2,5)-C
  CXXXCXGXPXXXXXC
  |   | | |     |
FGCAKLCAGFPLRRLPCFYG
The PROSITE Syntax
A-[BC]-X-D(2,5)-{EFG}-H
• A: A
• [BC]: B or C
• X: anything
• D(2,5): 2-5 D’s
• {EFG}: not E, F, or G
• H: H
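Because PROSITE patterns map almost one-to-one onto regular expressions, the syntax above can be translated mechanically. The sketch below is an illustration only: `prosite_to_regex` is a hypothetical helper, it covers just the elements listed above plus the `<` and `>` terminus anchors, and it does not handle rarer constructs such as anchors inside square brackets.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE pattern into a Python regular expression."""
    regex = pattern.strip().rstrip(".")                  # drop the trailing period
    regex = regex.replace("-", "")                       # '-' only separates elements
    regex = regex.replace("x", ".").replace("X", ".")    # wildcard position
    regex = regex.replace("{", "[^").replace("}", "]")   # exclusions first ...
    regex = regex.replace("(", "{").replace(")", "}")    # ... then repeat counts
    regex = regex.replace("<", "^").replace(">", "$")    # sequence termini
    return regex


print(prosite_to_regex("A-[BC]-X-D(2,5)-{EFG}-H"))  # A[BC].D{2,5}[^EFG]H

cys_motif = re.compile(prosite_to_regex("C-x(2,5)-C-x-[GP]-x-P-x(2,5)-C"))
hit = cys_motif.search("FGCAKLCAGFPLRRLPCFYG")
print(hit.start() + 1, hit.group())  # 3 CAKLCAGFPLRRLPC
```

The order of the replacements matters: curly braces are turned into negated character classes before parentheses are turned into curly-brace repeat counts.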
PROSITE entries
Mandatory motifs characterise a protein (super-)family
ID SUBTILASE_ASP; PATTERN.
DE Serine proteases, subtilase family, aspartic acid active site.
PA [STAIV]-x-[LIVMF]-[LIVM]-D-[DSTA]-G-[LIVMFC]-x(2,3)-[DNH].
ID SUBTILASE_HIS; PATTERN.
DE Serine proteases, subtilase family, histidine active site.
PA H-G-[STM]-x-[VIC]-[STAGC]-[GS]-x-[LIVMA]-[STAGCLV]-[SAGM].
ID SUBTILASE_SER; PATTERN.
DE Serine proteases, subtilase family, serine active site.
PA G-T-S-x-[SA]-x-P-x(2)-[STAVC]-[AG].
Exercise
• Find the three subtilase motifs in PROSITE (prosite.expasy.org)
• Compare the lists of proteins in which the motifs occur. What does this tell you?
• Similarly, compare protein structures in which the motifs occur
• Have a look at the “sequence logo”
Protein Motif Searching
Some motifs match frequently in proteins, but a match does not mean that the site is actually present and functional, such as
• Post-translational modification sites
ID ASN_GLYCOSYLATION; PATTERN.
DE N-glycosylation site.
PA N-{P}-[ST]-{P}.
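Scanning for the N-glycosylation pattern shows how easily such a short motif matches. The sketch below hard-codes the regular expression equivalent of N-{P}-[ST]-{P} and uses a lookahead so that overlapping hits are also reported; the toy sequence and the name `find_sequons` are invented for illustration.

```python
import re

# N-{P}-[ST]-{P}; the lookahead keeps overlapping matches
SEQUON = re.compile(r"(?=(N[^P][ST][^P]))")


def find_sequons(sequence):
    """Return (1-based position of the N, matched tetrapeptide) for every hit."""
    return [(m.start() + 1, m.group(1)) for m in SEQUON.finditer(sequence)]


toy = "MNGTANPSLNNSTQ"
print(find_sequons(toy))  # [(2, 'NGTA'), (10, 'NNST'), (11, 'NSTQ')]
```

Every hit is only a candidate site. Whether it is actually glycosylated cannot be decided from the pattern alone, which is the point of the exercise that follows.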
Exercise
• Use a glycosylation site predictor such as http://www.cbs.dtu.dk/services/NetNGlyc/
• Input: your favorite set of sequences
• Do you observe that some N-{P}-[ST] sites are likely to be glycosylated and others not?
Profiles
• Many motifs cannot be easily defined using simple patterns
• Such motifs can be defined using profiles
• A profile is constructed from a multiple sequence alignment. For each position, each amino acid is given a score depending on how likely it is to occur
Calculating a Profile
• For each alignment position: take the (weighted) average of the appropriate rows from the scoring matrix
• An (extremely simple) example:
seq_01  AAAAAAAAAAW
seq_02  AAAAAAAAAWW
seq_03  AAAAAAAAWWW
seq_04  AAAAAAAWWWW
seq_05  AAAAAAWWWWW
seq_06  AAAAAWWWWWW
seq_07  AAAAWWWWWWW
seq_08  AAAWWWWWWWW
seq_09  AAWWWWWWWWW
seq_10  AWWWWWWWWWW
Excerpt from the EBLOSUM62 matrix:
    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
A   4  -1  -2  -2   0  -1  -1   0  -2  -1  -1  -1  -1  -2  -1   1   0  -3  -2   0
W  -3  -3  -4  -4  -2  -2  -3  -2  -2  -3  -2  -3  -1   1  -4  -3  -2  11   2  -3
Profile scores for three alignment columns (prophecy (EMBOSS), using Henikoff profile type, and BLOSUM62 matrix):
      10A  5A+5W    10W
A     4.0    1.0   -3.0
C     0.0   -2.0   -2.0
D    -2.0   -6.0   -4.0
E    -1.0   -4.0   -3.0
F    -2.0   -1.0    1.0
G     0.0   -2.0   -2.0
H    -2.0   -4.0   -2.0
I    -1.0   -4.0   -3.0
K    -1.0   -4.0   -3.0
L    -1.0   -3.0   -2.0
M    -1.0   -2.0   -1.0
N    -2.0   -6.0   -4.0
P    -1.0   -5.0   -4.0
Q    -1.0   -3.0   -2.0
R    -1.0   -4.0   -3.0
S     1.0   -2.0   -3.0
T     0.0   -2.0   -2.0
V     0.0   -3.0   -3.0
W    -3.0    8.0   11.0
Y    -2.0    0.0    2.0
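The averaging step can be sketched with the two BLOSUM62 rows from the excerpt. This is a plain, unweighted mean over the residues in one alignment column and is not a reimplementation of prophecy: for a 5A+5W column it gives 0.5 for A and 4.0 for W, whereas the prophecy output above lists 1.0 and 8.0, presumably because prophecy applies its own weighting and scaling. The name `column_profile` is an assumption made for this example.

```python
from collections import Counter

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"   # column order of the matrix excerpt

# Only the two rows shown in the excerpt are needed for the A/W toy alignment.
BLOSUM62_ROWS = {
    "A": [4, -1, -2, -2, 0, -1, -1, 0, -2, -1, -1, -1, -1, -2, -1, 1, 0, -3, -2, 0],
    "W": [-3, -3, -4, -4, -2, -2, -3, -2, -2, -3, -2, -3, -1, 1, -4, -3, -2, 11, 2, -3],
}


def column_profile(column):
    """Average the scoring-matrix rows of the residues in one alignment column."""
    counts = Counter(column)
    total = sum(counts.values())
    profile = {}
    for j, aa in enumerate(AMINO_ACIDS):
        weighted_sum = sum(n * BLOSUM62_ROWS[res][j] for res, n in counts.items())
        profile[aa] = weighted_sum / total
    return profile


all_a = column_profile("AAAAAAAAAA")    # first alignment column
mixed = column_profile("AAAAAWWWWW")    # sixth alignment column
print(all_a["A"], all_a["W"])           # 4.0 -3.0
print(mixed["A"], mixed["W"])           # 0.5 4.0
```

A position-specific profile is then simply this calculation repeated for every column of the alignment.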
Pattern Searching
• Short linear motifs: e.g. http://dilimot.russelllab.org/
• Profiles: MEME http://meme.sdsc.edu/meme/cgi-bin/meme.cgi
Exercise
Use a number of sequences which contain the PROSITE subtilase motif and find motifs in those sequences with MEME
Hydropathy Plot
Prediction of hydrophobic and hydrophilic regions in a protein
Partition Coefficients
[Figure: partitioning between oil and water; labels: Hydrophilic, Hydrophobic, Oil, Water]
Hydrophobicity/Hydrophilicity Values
(residues listed from hydrophilic at the top to hydrophobic at the bottom)
     Fauchere & Pliska   Kyte & Doolittle   Hopp & Woods   Eisenberg
R         -1.37               -4.50             3.00         -2.53
K         -1.35               -3.90             3.00         -1.50
D         -1.05               -3.50             3.00         -0.90
Q         -0.78               -3.50             0.20         -0.85
N         -0.85               -3.50             0.20         -0.78
E         -0.87               -3.50             3.00         -0.74
H         -0.40               -3.20            -0.50         -0.40
S         -0.18               -0.80             0.30         -0.18
T         -0.05               -0.70            -0.40         -0.05
P          0.12               -1.60             0.00          0.12
Y          0.26               -1.30            -2.30          0.26
C          0.29                2.50            -1.00          0.29
G          0.48               -0.40             0.00          0.48
A          0.62                1.80            -0.50          0.62
M          0.64                1.90            -1.30          0.64
W          0.81               -0.90            -3.40          0.81
L          1.06                3.80            -1.80          1.06
V          1.08                4.20            -1.50          1.08
F          1.19                2.80            -2.50          1.19
I          1.38                4.50            -1.80          1.38
Hydrophobicity Plot
• Sum amino acid hydrophobicity values in a given window
• Plot the value in the middle of the window
• Shift the window one position
$$H_i = \frac{1}{2k+1} \sum_{n=i-k}^{i+k} H_n$$
Sliding Window Approach
• Calculate property for first sub-sequence
• Use the result (plot/print/store)
• Move to next residue position, and repeat
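The sliding-window steps can be written out in a few lines. The sketch below averages Kyte & Doolittle values over a window of 2k+1 residues and reports each value at the window centre. Two choices are assumptions made here, not something the slides specify: residues without a value in the scale (such as Z) are scored as 0.0, and positions closer than k residues to either end get no value. The name `hydropathy_profile` is invented for this example.

```python
KYTE_DOOLITTLE = {
    "R": -4.5, "K": -3.9, "D": -3.5, "Q": -3.5, "N": -3.5, "E": -3.5,
    "H": -3.2, "S": -0.8, "T": -0.7, "P": -1.6, "Y": -1.3, "C": 2.5,
    "G": -0.4, "A": 1.8, "M": 1.9, "W": -0.9, "L": 3.8, "V": 4.2,
    "F": 2.8, "I": 4.5,
}


def hydropathy_profile(sequence, k=2, scale=KYTE_DOOLITTLE, unknown=0.0):
    """Return (1-based centre position, window mean) for a window of 2k+1 residues."""
    values = [scale.get(res, unknown) for res in sequence.upper()]
    window = 2 * k + 1
    profile = []
    for i in range(k, len(values) - k):            # centre of the current window
        mean = sum(values[i - k:i + k + 1]) / window
        profile.append((i + 1, mean))              # store, then shift by one residue
    return profile


for position, value in hydropathy_profile("MEZCALTASTESVERYNICE", k=2):
    print(position, round(value, 2))
```

Swapping in another scale only requires passing a different dictionary as `scale`.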
Hydrophobicity Plot
[Figure: hydrophobicity plot for the sequence MEZCALTASTESVERYNICE; axis ticks run from 0 to 21 (horizontal) and from -2 to 1.5 (vertical)]
Transmembrane Regions
• Rotation is 100 degrees per amino acid
• Climb is 1.5 Angstrom per amino acid residue
• Membrane: 30 Angstrom
• So we need approx. 30 / 1.5 = 20 amino acids to span the membrane
Adapting the window size to the size of the membrane spanning segment makes the picture easier to interpret
[Figure: hydrophobicity plots with window = 1, window = 9, window = 19 and window = 121]
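A window matched to the membrane-spanning length can be turned into a rough scan for candidate transmembrane segments. The sketch below is an illustration under stated assumptions: it uses the Kyte & Doolittle scale with the 19-residue window shown above, the cutoff of 1.6 is the value commonly quoted for that scale and does not come from these slides, unknown residues are scored 0.0, and the synthetic test sequence and the name `tm_candidates` are made up.

```python
KYTE_DOOLITTLE = {
    "R": -4.5, "K": -3.9, "D": -3.5, "Q": -3.5, "N": -3.5, "E": -3.5,
    "H": -3.2, "S": -0.8, "T": -0.7, "P": -1.6, "Y": -1.3, "C": 2.5,
    "G": -0.4, "A": 1.8, "M": 1.9, "W": -0.9, "L": 3.8, "V": 4.2,
    "F": 2.8, "I": 4.5,
}

MEMBRANE_THICKNESS = 30.0   # Angstrom
RISE_PER_RESIDUE = 1.5      # Angstrom per residue in an alpha helix
residues_to_span = round(MEMBRANE_THICKNESS / RISE_PER_RESIDUE)   # 20


def tm_candidates(sequence, window=19, threshold=1.6):
    """Return 1-based (first, last) runs of window centres whose mean
    hydropathy is at or above the threshold. The window must be odd."""
    half = window // 2
    values = [KYTE_DOOLITTLE.get(res, 0.0) for res in sequence.upper()]
    last_centre = len(values) - half - 1            # 0-based
    segments, start = [], None
    for centre in range(half, last_centre + 1):
        mean = sum(values[centre - half:centre + half + 1]) / window
        if mean >= threshold:
            if start is None:
                start = centre                      # a run of high values begins
        elif start is not None:
            segments.append((start + 1, centre))    # run ended at the previous centre
            start = None
    if start is not None:
        segments.append((start + 1, last_centre + 1))
    return segments


synthetic = "K" * 15 + "L" * 21 + "K" * 15          # hydrophobic core, charged flanks
print(residues_to_span, tm_candidates(synthetic))   # 20 [(20, 32)]
```

The reported positions are window centres, so the underlying hydrophobic stretch extends several residues beyond both ends of each run.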
Protein Interactions
• Obligatory: hemoglobin
• Transient: mitochondrial Cu transporters
Experimental approaches (1)
Yeast two-hybrid (Y2H)
Experimental approaches (2)
Affinity Purification + mass spectrometry (AP-MS)
Interaction Databases
• STRING http://string.embl.de/
• HPRD http://www.hprd.org/
• InteroPorc http://biodev.extra.cea.fr/interoporc/Default.aspx
• Many others, e.g. see http://nar.oxfordjournals.org./content/39/suppl_1.toc
Yeast protein interaction network
Sequence-based Protein Binding Site Prediction
Binding site
Predefined motifs
Motif search in groups of proteins
• Group proteins which have the same interaction partner
• Use motif search, e.g. find PWMs
Neduva et al., PLoS Biol 2005
Correlated Motif Search
Interactors
AARLL PLTEQ
MARLT DLTEP
VVRLM MMTER
Non-interactors
AARLL MARLT
VVRLM MARLT
PLTEQ DLTEP
Correlated Motif Pair: (RL,TE)
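The toy example above is small enough to solve by exhaustive search. The sketch below enumerates all pairs of 2-residue motifs and keeps those found in every interacting pair (one motif in each partner, in either order) and in no non-interacting pair. It only illustrates the idea of a correlated motif pair and is not the mining algorithm used in the cited work; the function names are invented.

```python
from itertools import product


def kmers(sequence, k):
    """All length-k substrings of a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def pair_matches(protein_pair, motif_a, motif_b, k):
    """True if one protein contains motif_a and the other contains motif_b."""
    first, second = (kmers(p, k) for p in protein_pair)
    return (motif_a in first and motif_b in second) or \
           (motif_b in first and motif_a in second)


def correlated_motif_pairs(interactors, non_interactors, k=2):
    # Candidates: every motif combination seen in at least one interacting pair.
    candidates = set()
    for a, b in interactors:
        for m1, m2 in product(kmers(a, k), kmers(b, k)):
            candidates.add(tuple(sorted((m1, m2))))
    selected = []
    for m1, m2 in sorted(candidates):
        in_all = all(pair_matches(p, m1, m2, k) for p in interactors)
        in_none = not any(pair_matches(p, m1, m2, k) for p in non_interactors)
        if in_all and in_none:
            selected.append((m1, m2))
    return selected


interactors = [("AARLL", "PLTEQ"), ("MARLT", "DLTEP"), ("VVRLM", "MMTER")]
non_interactors = [("AARLL", "MARLT"), ("VVRLM", "MARLT"), ("PLTEQ", "DLTEP")]
print(correlated_motif_pairs(interactors, non_interactors))  # [('RL', 'TE')]
```

Real interaction data are noisy, so requiring a motif pair in all interactors and in no non-interactor would be far too strict in practice; a scoring function over both sets would replace the two hard conditions.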
Experimental validation
Van Dijk et al., PLoS Comput Biol 2010
New approach: SLIDER
• Faster approach: genome-wide searching for interaction motifs
• Improve mining algorithm with a priori biological knowledge (conservation score, surface accessibility)
Boyen et al., IEEE/ACM Trans Comput Biol Bioinform. 2011
THE END
Questions?
Secondary Structure Prediction
Traditional methods (statistical and/or rule-based)
• E.g. Garnier, Osguthorpe & Robson (GOR)
  • Statistical method
• Accuracy ~ 60%
GOR Helix Parameters
     i-8  i-7  i-6  i-5  i-4  i-3  i-2  i-1    i  i+1  i+2  i+3  i+4  i+5  i+6  i+7  i+8
Gly   -5  -10  -15  -20  -30  -40  -50  -60  -86  -60  -50  -40  -30  -20  -15  -10   -5
Ala    5   10   15   20   30   40   50   60   65   60   50   40   30   20   15   10    5
Val    0    0    0    0    0    0    5   10   14   10    5    0    0    0    0    0    0
Leu    0    5   10   15   20   25   28   30   32   30   28   25   20   15   10    5    0
Ile    5   10   15   20   25   20   15   10    6    0  -10  -15  -20  -25  -20  -10   -5
Ser    0   -5  -10  -15  -20  -25  -30  -35  -39  -35  -30  -25  -20  -15  -10   -5    0
Thr    0    0    0   -5  -10  -15  -20  -25  -26  -25  -20  -15  -10   -5    0    0    0
Asp    0   -5  -10  -15  -20  -15  -10    0    5   10   15   20   20   20   15   10    5
Glu    0    0    0    0   10   20   60   70   78   78   78   78   78   70   60   40   20
Asn    0    0    0    0  -10  -20  -30  -40  -51  -40  -30  -20  -10    0    0    0    0
Gln    0    0    0    0    5   10   20   20   10  -10  -20  -20  -10   -5    0    0    0
Lys   20   40   50   55   60   60   50   30   23   10    5    0    0    0    0    0    0
His   10   20   30   40   50   50   50   30   12  -20  -10    0    0    0    0    0    0
Arg    0    0    0    0    0    0    0    0   -9  -15  -20  -30  -40  -50  -50  -30  -10
Phe    0    0    0    0    0    5   10   15   16   15   10    5    0    0    0    0    0
Tyr   -5  -10  -15  -20  -25  -30  -35  -40  -45  -40  -35  -30  -25  -20  -15  -10   -5
Trp  -10  -20  -40  -50  -50  -10    0   10   12   10    0  -10  -50  -50  -40  -20  -10
Cys    0    0    0    0    0    0   -5  -10  -13  -10   -5    0    0    0    0    0    0
Met   10   20   25   30   35   40   45   50   53   50   45   40   35   30   25   20   10
Pro  -10  -20  -40  -60  -80 -100 -120 -140  -77  -60  -30  -20  -10    0    0    0    0
I S G A R N I E R H E L I X P R E D I C T
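Applying the helix parameters to a sequence amounts to summing, for every position, the contributions of all residues within eight positions. The sketch below makes one assumption about how to read the table that the slides do not state explicitly: the row of the residue at position i gives its contribution to the helix score of positions i-8 to i+8. If the columns are instead meant as the position of the contributing residue, the rows would need to be reversed. Residues without a row (such as X in the example sequence) contribute nothing. A real GOR prediction also needs the sheet, turn and coil tables, which are not shown here, so this computes a helix score only.

```python
WINDOW = 8   # the table covers positions i-8 ... i+8

# One row per residue, columns i-8 ... i+8, copied from the helix parameter table.
HELIX = {
    "G": [-5, -10, -15, -20, -30, -40, -50, -60, -86, -60, -50, -40, -30, -20, -15, -10, -5],
    "A": [5, 10, 15, 20, 30, 40, 50, 60, 65, 60, 50, 40, 30, 20, 15, 10, 5],
    "V": [0, 0, 0, 0, 0, 0, 5, 10, 14, 10, 5, 0, 0, 0, 0, 0, 0],
    "L": [0, 5, 10, 15, 20, 25, 28, 30, 32, 30, 28, 25, 20, 15, 10, 5, 0],
    "I": [5, 10, 15, 20, 25, 20, 15, 10, 6, 0, -10, -15, -20, -25, -20, -10, -5],
    "S": [0, -5, -10, -15, -20, -25, -30, -35, -39, -35, -30, -25, -20, -15, -10, -5, 0],
    "T": [0, 0, 0, -5, -10, -15, -20, -25, -26, -25, -20, -15, -10, -5, 0, 0, 0],
    "D": [0, -5, -10, -15, -20, -15, -10, 0, 5, 10, 15, 20, 20, 20, 15, 10, 5],
    "E": [0, 0, 0, 0, 10, 20, 60, 70, 78, 78, 78, 78, 78, 70, 60, 40, 20],
    "N": [0, 0, 0, 0, -10, -20, -30, -40, -51, -40, -30, -20, -10, 0, 0, 0, 0],
    "Q": [0, 0, 0, 0, 5, 10, 20, 20, 10, -10, -20, -20, -10, -5, 0, 0, 0],
    "K": [20, 40, 50, 55, 60, 60, 50, 30, 23, 10, 5, 0, 0, 0, 0, 0, 0],
    "H": [10, 20, 30, 40, 50, 50, 50, 30, 12, -20, -10, 0, 0, 0, 0, 0, 0],
    "R": [0, 0, 0, 0, 0, 0, 0, 0, -9, -15, -20, -30, -40, -50, -50, -30, -10],
    "F": [0, 0, 0, 0, 0, 5, 10, 15, 16, 15, 10, 5, 0, 0, 0, 0, 0],
    "Y": [-5, -10, -15, -20, -25, -30, -35, -40, -45, -40, -35, -30, -25, -20, -15, -10, -5],
    "W": [-10, -20, -40, -50, -50, -10, 0, 10, 12, 10, 0, -10, -50, -50, -40, -20, -10],
    "C": [0, 0, 0, 0, 0, 0, -5, -10, -13, -10, -5, 0, 0, 0, 0, 0, 0],
    "M": [10, 20, 25, 30, 35, 40, 45, 50, 53, 50, 45, 40, 35, 30, 25, 20, 10],
    "P": [-10, -20, -40, -60, -80, -100, -120, -140, -77, -60, -30, -20, -10, 0, 0, 0, 0],
}


def helix_scores(sequence):
    """Sum the helix contributions of all residues within +/- 8 positions."""
    scores = [0] * len(sequence)
    for pos, residue in enumerate(sequence.upper()):
        row = HELIX.get(residue)
        if row is None:                    # unknown residue: no contribution
            continue
        for offset in range(-WINDOW, WINDOW + 1):
            target = pos + offset
            if 0 <= target < len(sequence):
                scores[target] += row[offset + WINDOW]
    return scores


for residue, score in zip("ISGARNIERHELIXPREDICT", helix_scores("ISGARNIERHELIXPREDICT")):
    print(residue, score)
```

Positions near the termini receive fewer contributions because part of their window falls outside the sequence.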
GOR Prediction
[Figure: GOR prediction, with labels "beta sheet" and "helix"]
Secondary Structure Prediction
Recent methods
• Neural networks = flexible statistics
• Multiple alignments = variability
• Heuristics = common sense
• Or a combination of the above
Accuracy ~ 70%
Heuristics
• Conserved parts are structurally and/or functionally important
• Segments with many gaps must be in loop regions
Secondary Structure Prediction
Strategy
• Use as many methods as possible
• Use homologous sequences
• Combine predictions into consensus prediction
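Combining predictions can be as simple as a per-residue majority vote. The sketch below assumes each method returns a string over the states H (helix), E (strand) and C (coil) of the same length as the sequence; falling back to coil on a tie is a choice made for this example, and the input strings are invented.

```python
from collections import Counter


def consensus_prediction(predictions, tie_state="C"):
    """Majority vote per position; ties fall back to tie_state."""
    if len({len(p) for p in predictions}) != 1:
        raise ValueError("all predictions must have the same length")
    consensus = []
    for column in zip(*predictions):
        ranked = Counter(column).most_common()
        top_state, top_votes = ranked[0]
        tied = len(ranked) > 1 and ranked[1][1] == top_votes
        consensus.append(tie_state if tied else top_state)
    return "".join(consensus)


methods = ["HHHEEC", "HHCEEC", "HCCEHC"]   # three hypothetical predictions
print(consensus_prediction(methods))        # HHCEEC
```

The same vote can be taken over predictions for homologous sequences once they are mapped onto a common alignment.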
Why can’t it be 100% correct?
• All current 2D prediction schemes are based upon observation of occurrence of 2D elements in 3D structures
• Deduction of 2D elements from structures is ambiguous!
• DSSP, Stride, and the PDB (human) annotation do not always agree upon the assigned elements
Do these residues still belong to the helix?