Linear motifs and phosphorylation sites
What is a linear motif?
(in molecular biology)
…a first taste
Short sequence of amino acids encoding
a particular molecular function
Linear Motifs
Functional sites
We need a more accurate definition!
What are you going to learn about Linear Motifs?
Where can we find them?
Why are they important?
Can we classify them?
How can we represent them?
How can we discover them?
When and how can we use them?
What are tools and resources to handle them?
Tyrosine kinase Src has several functional sites
CSK phosphorylation (Y527) & SH2 ligand
Myristoylation site
SH3 ligand
Autophosphorylation site (Y416)
p53 is full of functional sites
CYCLIN
MDM2
NES
TAFII31
P300
P300
Pin1 P-Ser-Pro isomerisation
Acetylation
SUMO
Ubiquitinylation
phosphorylation
NLS
CBP
S100B
SIR2
The sequences of many proteins contain short,
conserved motifs that are involved in recognition
and targeting activities, often separate from other
functional properties of the molecule in which they
occur.
These motifs are linear, in the sense that three-dimensional organization is not required to bring
distant segments of the molecule together to make
the recognizable unit.
Tim Hunt (TIBS 1990)
The conservation of these motifs varies: some are
highly conserved while others, for example, allow
substitutions that retain only a certain pattern of
charge across the motif.
Tim Hunt (TIBS 1990)
A more accurate definition
• short, common stretches of polypeptide chains (~ 3-10 amino acid
residues long)
• embody a distinct molecular function independent of a larger
sequence/structure context.
• bind with low affinity (1.0-150 µM) and mediate transient interactions
• are nearly always involved in regulation
• are involved in protein/domain-protein/domain interactions
• often reside in disordered or low-complexity regions
• often become ordered upon binding to another protein or domain
• occurrences of LMs seem to arise or disappear as a result of point mutations
Why are they important?
Evolutionarily unrelated proteins sharing a functional feature are likely to
contain similar linear motifs
This may be the result of
- convergent evolution
- evolutionary conservation in a divergent evolution process
In any case, linear motifs are indicative of functions
In other words…
They are made up of the amino acid residues encoding
a functional site
With the appropriate tools, they can be used to identify:
•protein functions
•functional regions (in a protein sequence and on its three-dimensional structure, if available)
Can we classify LMs? How?
Functional group
Functional site (Linear Motif)
PRACTICE: Let’s find linear motifs in human p53…
Go to the UniProt website: http://www.uniprot.org/
Type p53 in the Query text box and select P04637
or
Type directly either P04637 or P53_HUMAN in the Query text box
Work in groups and analyse the p53 entry record:
- how many LMs can you identify?
- which function(s) are they indicative of?
- are they always annotated as “motif”?
- can you classify them according to the 4 categories?
How can we represent LMs?
Alignment of cyclin ligands
inhibitors
Regular expression: [RK].L.{0,1}[FLIV]
Regular Expression (regexp)
L: single amino acid “L” = Leucine
[KR]: different amino acids allowed at this position
x or .: wildcard
{0,1}: variable length
Regular Expression: Examples
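The regexp syntax above maps directly onto the regular expression modules of scripting languages. As a minimal sketch, the cyclin ligand expression from the earlier slide can be tried in Python; the three test peptides are made up for illustration and are not real cyclin ligands.

```python
import re

# Cyclin ligand motif from the slide: [RK].L.{0,1}[FLIV]
CYCLIN_MOTIF = re.compile(r"[RK].L.{0,1}[FLIV]")

def find_matches(sequence):
    """Return (start, end, matched_text) for each non-overlapping match.

    Positions are 1-based and inclusive, as is usual for protein sequences.
    """
    return [(m.start() + 1, m.end(), m.group())
            for m in CYCLIN_MOTIF.finditer(sequence)]

# Hypothetical peptides, invented only to exercise the pattern
print(find_matches("AAKRRLFGAA"))   # [(4, 7, 'RRLF')]   .{0,1} matches zero residues
print(find_matches("AAKALAFAA"))    # [(3, 7, 'KALAF')]  .{0,1} matches one residue
print(find_matches("AAKALAAFAA"))   # []                 two residues after L is too many
```

The last example shows what {0,1} means in practice: at most one residue is allowed between the Leu and the final hydrophobic position.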
Before we describe what regexps are useful for, let’s briefly see how
to discover motifs de novo
In some cases, the structure and function of
an unknown protein which is too distantly
related to any protein of known structure to
detect its affinity by overall sequence
alignment may be identified by its possession
of a particular cluster of residue types
classified as a motif. The motifs, or
templates, or fingerprints, arise because of
particular requirements of binding sites that
impose very tight constraints on the evolution
of portions of a protein sequence
Arthur Lesk, 1988
In contrast to domains, which are readily
detectable by sequence comparison, linear
motifs are difficult to discover due to
their short length, a tendency to reside in
disordered regions in proteins, and limited
conservation outside of closely related
species.
Neduva et al. PLoS Biology 2005
De novo Linear Motif discovery
• Study literature paper(s)/review(s) on a group of unrelated proteins
sharing a function
• Build an alignment of these proteins
• Add to the alignment other sequences relevant to the subject under
consideration
• Pay attention to the residues and regions thought or proved to be
important to the biological function of that group of proteins:
- enzyme catalytic sites
- PTM sites
- regions involved in binding
• Try to find a short conserved sequence which includes functionally
important residues
Discovery of de novo Linear Motif
There are algorithms that do it automatically
Neduva et al. PLoS Biology 2005
Discovery of de novo Linear Motif
Our central hypothesis is that proteins with
a common interaction partner will share a
feature that mediates binding, either a
domain or a linear motif. In the absence of a
shared domain, a linear motif could well be
the only common sequence feature and might
thus be detectable simply by virtue of overrepresentation, which is the basis of our
approach.
Neduva et al. PLoS Biology 2005
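The idea of detecting a motif "by virtue of over-representation" can be illustrated with a deliberately naive sketch: count, for every short peptide (k-mer), how many sequences in a set contain it. This is not the algorithm of any published tool; real methods also deal with homologous sequences, degenerate positions and statistical significance. The function names and the toy sequences are hypothetical.

```python
from collections import Counter

def kmer_support(sequences, k=3):
    """For each k-mer, count how many sequences contain it at least once."""
    support = Counter()
    for seq in sequences:
        # A set, so that a k-mer repeated within one sequence counts once
        support.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    return support

def shared_kmers(sequences, k=3, min_fraction=1.0):
    """Return the k-mers present in at least min_fraction of the sequences."""
    support = kmer_support(sequences, k)
    threshold = min_fraction * len(sequences)
    return sorted(kmer for kmer, n in support.items() if n >= threshold)

# Toy, made-up sequences that share nothing except an RGD tripeptide
toy = ["MSTRGDLLA", "PPRGDWQE", "KKLRGDAA"]
print(shared_kmers(toy))  # ['RGD']
```

With unrelated proteins, a k-mer shared by all of them is a candidate motif; with related proteins it would mostly reflect common ancestry, which is why the choice of input set matters.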
A probabilistic method for identifying over-represented, convergently
evolved, short linear motifs in proteins.
Edwards et al. PLoS ONE 2007
PRACTICE: Discovery of de novo Linear Motifs
DILIMOT
http://dilimot.russelllab.org/
SLiMFinder
http://www.southampton.ac.uk/~re1u06/software/slimfinder/
Linear Motif Databases
ELM
R.[RK]{1,2}.R
174 manually annotated motifs
PROSITE
R-x-[RK]-x(1,2)-R
1632 documentation entries (domains and functional sites)
16-03-2012
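The two databases write patterns in different syntaxes: ELM uses regular expressions, PROSITE uses its own pattern notation. As a minimal sketch, a simple PROSITE pattern can be translated mechanically into a regular expression. The function below is hypothetical and handles only the basic elements (single residues, x, [..], {..} and repeat counts); it is not part of either resource.

```python
import re

# One PROSITE element: a core (x, residue, [set] or {excluded set})
# optionally followed by a repeat count such as (2) or (1,2)
ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+(?:,\d+)?)\))?")

def prosite_to_regex(pattern):
    """Translate a simple PROSITE pattern into a Python regular expression.

    Anchors (< and >) and other PROSITE features are deliberately not handled.
    """
    parts = []
    for element in pattern.strip(".").split("-"):
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported element: {element!r}")
        core, repeat = m.groups()
        if core == "x":
            core = "."                        # wildcard
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"    # residues NOT allowed here
        parts.append(core + ("{" + repeat + "}" if repeat else ""))
    return "".join(parts)

print(prosite_to_regex("R-x-[RK]-x(1,2)-R"))  # R.[RK].{1,2}R
print(prosite_to_regex("R-G-D"))              # RGD
```

Note that the literal translation of the PROSITE pattern on this slide, R.[RK].{1,2}R, is not character-for-character the same as the ELM-style expression shown next to it, so the two slide examples are best read as illustrations of syntax.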
What are regular expressions useful for?
How can we use regular expressions?
Regular expressions can be used to search for motif occurrences in
(uncharacterised) protein sequences
There are algorithms that do this for us
We call the occurrence of a motif in a sequence an INSTANCE of that motif
A motif (a regexp) can have many instances
SH3 ligand motif: [RKY]..P..P
KKVAVVRTPPKSPSSAKSRL   TAU_HUMAN
ISPPTPKPRPPRPLPVAPGS   P85A_HUMAN
EDQILKKPLPPEPAAAPVST   BTK_HUMAN
SHRKTKKPLPPTPEEDQILK   BTK_HUMAN
TRICKIYDSPCLPEAEAMFA   RAD51_HUMAN
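The five peptides on this slide can be checked against the SH3 ligand expression with a few lines of Python. This is only a sketch: the pairing of each peptide with a protein name assumes that the peptides and the names are listed in the same order on the slide.

```python
import re

SH3_LIGAND = re.compile(r"[RKY]..P..P")

# (protein, peptide) pairs; the pairing assumes matching order on the slide
PEPTIDES = [
    ("TAU_HUMAN",   "KKVAVVRTPPKSPSSAKSRL"),
    ("P85A_HUMAN",  "ISPPTPKPRPPRPLPVAPGS"),
    ("BTK_HUMAN",   "EDQILKKPLPPEPAAAPVST"),
    ("BTK_HUMAN",   "SHRKTKKPLPPTPEEDQILK"),
    ("RAD51_HUMAN", "TRICKIYDSPCLPEAEAMFA"),
]

def first_instance(peptide):
    """Return (1-based start, matched text) of the first match, or None."""
    m = SH3_LIGAND.search(peptide)
    return (m.start() + 1, m.group()) if m else None

for name, peptide in PEPTIDES:
    print(name, first_instance(peptide))
# TAU_HUMAN (7, 'RTPPKSP')
# P85A_HUMAN (7, 'KPRPPRP')
# BTK_HUMAN (7, 'KPLPPEP')
# BTK_HUMAN (7, 'KPLPPTP')
# RAD51_HUMAN (7, 'YDSPCLP')
```

Each peptide contains one INSTANCE of the motif, and in every case it starts at position 7 of the 20-residue window.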
Prediction of new instances of Linear Motifs
ScanProsite
INPUT: a protein sequence
OUTPUT: PROSITE or user-defined motif matches in the input sequence
Allows the search for user-defined regular expressions
Scansite
INPUT: a protein sequence
OUTPUT: Scansite motif matches in the input sequence
ELM
INPUT: a protein sequence
OUTPUT: ELM motif matches in the input sequence
MiniMotifMiner
INPUT: a protein sequence
OUTPUT: MiniMotifMiner motif matches in the input sequence
PRACTICE: Prediction of new instances of Linear Motifs
Go to the ScanProsite website and search for the RGD motif in the SwissProt database
http://prosite.expasy.org/scanprosite/
R-G-D
Select database
How many hits?
How many hits are expected by chance?
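A rough way to estimate the number of hits expected by chance is to multiply the number of sequence windows by the probability that a random window matches the motif. The sketch below assumes independent positions and, for simplicity, equal frequencies (1/20) for all amino acids; real amino acid frequencies differ, and the database size used in the example is a made-up toy value, not the size of SwissProt.

```python
def expected_random_hits(seq_lengths, position_probs):
    """Expected number of chance matches of a fixed-length motif.

    seq_lengths    : lengths of the sequences in the database
    position_probs : for each motif position, the probability that a
                     random residue is allowed at that position
    """
    p_match = 1.0
    for p in position_probs:
        p_match *= p
    k = len(position_probs)
    # A sequence of length n offers n - k + 1 windows of length k
    windows = sum(max(n - k + 1, 0) for n in seq_lengths)
    return windows * p_match

# Toy database (hypothetical): 1000 sequences of 402 residues each.
# R-G-D fixes one residue at each of 3 positions: (1/20)**3 = 1/8000 per window.
toy_lengths = [402] * 1000
print(round(expected_random_hits(toy_lengths, [1 / 20] * 3), 2))  # 50.0
```

Comparing the number of hits actually returned with such an estimate gives a feeling for how much of the output could be noise.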
Regular expression pros and cons
Unfortunately, matches to these motifs are not statistically significant, which creates
a signal-to-noise problem for bioinformatics tools
Advantages
• Memorable to humans
• Computationally fast
• Standardised in scripting languages (Python, Perl)
• Often, they can describe a motif very well
Disadvantages
• Over-determined
• Motif may vary in other lineages
• Do not capture weaker preferences
• Easy to make a poor representation
Overprediction and context information
Functional sites only work in proper context
The cell knows how to discriminate true positives (TP) from false positives (FP)!
The site must be in the correct
cellular context
(subcellular localisation)
The site must be in correct molecular
context
- accessible
- usually not in globular domains,
- often together with certain types of co-domains
The site is only relevant in a specific
taxonomy range
Knowledge of context can provide the basis
for filters for improved prediction of
functional sites
For example…
Globular domain filter
Motifs are mostly found in disordered regions
The disordered regions are proving to be rich in Linear Motifs
Src kinase
We can exploit this observation and filter out motif matches inside
domains
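A globular domain filter of this kind can be sketched in a few lines: find the motif matches, then discard those that overlap an annotated domain. The function name, the toy sequence and the domain boundaries below are hypothetical; in practice the boundaries would come from a domain database.

```python
import re

def matches_outside_domains(sequence, motif, domains):
    """Keep only motif matches that do not overlap any domain.

    domains: list of (start, end) pairs, 1-based and inclusive.
    Returns (start, end, matched_text) tuples, also 1-based inclusive.
    """
    kept = []
    for m in re.finditer(motif, sequence):
        start, end = m.start() + 1, m.end()
        overlaps = any(start <= d_end and end >= d_start
                       for d_start, d_end in domains)
        if not overlaps:
            kept.append((start, end, m.group()))
    return kept

# Toy sequence with two RGD matches: residues 3-5 and 16-18
seq = "AA" + "RGD" + "A" * 10 + "RGD" + "AA"
# Hypothetical domain covering residues 10-20
print(matches_outside_domains(seq, "RGD", [(10, 20)]))  # [(3, 5, 'RGD')]
```

Such a filter is strict: it also throws away matches in exposed loops of domains, which can be genuine.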
Structural Filter
Motif matches are not ALWAYS outside domains
Inside domains they are unlikely unless in surface loops
When inside a domain, a motif match is more likely to be a True
Positive (TP) if it occurs in a flexible (i.e. loop, turn or linker) and
accessible region of the domain
The RGD motif is recognized by different members of the integrin family
An exposed instance of
the RGD motif in a domain
An instance of the RGD
motif in a region outside
a domain
MOD_N-GLC_1 (.(N)[^P][ST]..) is a motif for N-glycosylation sites
Two MOD_N-GLC_1
motifs in a domain
Structural Filter
We can implement a filter based on the three-dimensional
features of motifs (i.e. their accessibility and secondary structure types)
If the match is not accessible → low score
If the match is in an α-helix → low score
If the match is in a β-strand → low score
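The three rules above can be written as a small decision function. This is only a sketch of the logic on the slide, not the scoring scheme of any real tool: the one-letter secondary structure codes ('H' helix, 'E' strand, 'L' loop/turn/linker) and the "high" label for matches that pass all rules are assumptions made for the example.

```python
def structural_score(accessible, secondary_structure):
    """Classify a motif match inside a domain as 'low' or 'high' scoring.

    accessible          : True if the match is exposed to the solvent
    secondary_structure : 'H' (alpha-helix), 'E' (beta-strand) or
                          'L' (loop, turn or linker); codes are an assumption
    """
    if not accessible:
        return "low"        # buried matches cannot be reached by a partner
    if secondary_structure in ("H", "E"):
        return "low"        # helices and strands are rigid
    return "high"           # accessible and flexible

print(structural_score(True, "L"))    # high
print(structural_score(True, "H"))    # low
print(structural_score(False, "L"))   # low
```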
Other features that can be used to filter out FPs:
•Taxonomy
•Cellular compartment
•Evolutionary conservation
Davey NE et al. Mol Biosyst 2011
Why is a Conservation Score useful for linear motif prediction?
Improve the prediction of LM instances by discarding those
matches that are unlikely to be functional because they
have not been conserved during the evolution of the protein
sequences
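As a very simple illustration of the idea, conservation of a match can be measured as the fraction of orthologous sequences in which the aligned segment still matches the motif. This naive score is an assumption made for the example and is not the conservation score of the cited paper; the orthologue segments are invented.

```python
import re

def conservation_fraction(aligned_segments, motif):
    """Fraction of aligned segments that still match the motif.

    aligned_segments: for each orthologue, the stretch of the alignment
    that lines up with the motif match ('-' marks a gap).
    """
    if not aligned_segments:
        return 0.0
    pattern = re.compile(motif)
    hits = sum(1 for seg in aligned_segments
               if pattern.fullmatch(seg.replace("-", "")))
    return hits / len(aligned_segments)

# Hypothetical aligned segments from four orthologues
segments = ["KPLPPEP", "KPLPPTP", "RPLPPAP", "KPL-PEP"]
print(conservation_fraction(segments, "[RKY]..P..P"))  # 0.75
```

A match with a low fraction would be a candidate for discarding; a regexp-based score has the advantage that allowed substitutions (here K to R) still count as conserved.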
There is a resource which implements these filters
It associates a score to occurrences of motifs based on
•Cellular context
•Molecular context
•Domain context
•Disorder
•Taxonomy
•Evolutionary conservation
The Eukaryotic Linear Motif (ELM) Resource implements
a logical filtering system to reduce false matches
The Eukaryotic Linear Motif (ELM) Resource
• Repository of information about functional sites
(including experimentally reported instances)
• A motif-based query tool to find possible new functional
sites
• A logical filtering system to reduce false matches
The ELM Resource - An overview
[Figure: ELM workflow diagram. Labels: Query sequence; User data; ELM search engine; Scientific literature; Candidate motifs; ELMdb; Instance data; Filters; Contextual information; Filter information; Rejected predictions; Retained predictions]
PRACTICE: The ELM server (http://elm.eu.org/)
Go to the ELM server
Search for motif matches in the EH domain-binding mitotic phosphoprotein
Output 1
[Figure: ELM server output. Callouts: annotated instance; highly conserved instance; instance in unfavourable context; instance in structurally unfavourable context]
Output 2
Browse the ELMs page for the Clathrin Box motif in Endocytosis cargo adaptor
proteins (ELM: LIG_AP2alpha_2)
Link to reported
instances
Exploring unknown
protein sequences
Phosphorylation sites
Phosphorylation is the addition of a phosphate group (PO4) to a protein
molecule or small molecule.
The side-chain hydroxyl groups (-OH) of Ser, Thr or Tyr residues are
the most common targets
Reversible protein phosphorylation
A protein kinase moves a phosphate group from ATP to the protein
ATP (adenosine triphosphate) is the energy currency of the living
world. Most cellular processes that require energy get it from ATP
A protein phosphatase removes the phosphate and the protein reverts to
its original state.
• It is rapid (few seconds)
• It is easily reversible
Reversible protein phosphorylation regulates most aspects of cell life
~ one third of cellular proteins could undergo phosphorylation
It is involved in regulation of
metabolism, motility, growth,
division, differentiation,
trafficking, membrane transport,
learning, memory
Even subtle changes in the activity
of protein kinases can lead to a
variety of diseases (e.g. cancer)
Phosphorylation is a Post Translational Modification (PTM)
A kinase recognises its substrate and adds a phosphate
group (PO4) to one of its residues, typically a Serine (Ser, S),
Threonine (Thr, T), or Tyrosine (Tyr, Y)
Amino acid phosphorylation is probably the most
abundant of the intracellular PTMs used to regulate the
state of eukaryotic cells, with estimates ranging up to
500,000 phosphorylation sites in the human proteome
Nevertheless…
Substrate recognition is specific
In other words…
Each kinase is capable of recognising its substrate(s) in the cell
In fact, the enzymes must be specific and act only on a defined subset of
cellular targets to ensure signal fidelity.
Even though the determinants of specificity are still unclear,
substrate recruitment is one of the known specificity mechanisms
The sequence composition around the phosphorylatable site is another factor
Kinases are capable of recognising the region surrounding
the phosphoacceptor residue (in sequence and/or in structure)
In fact, kinases do not phosphorylate every Ser, Thr, Tyr they encounter
in the cell
Kreegipuu et al, NAR 1998
A phosphorylation site can be represented by a phosphorylation motif
Experimentally verified phosphorylation motifs can
be used to predict new phosphorylation sites and
characterise kinase substrates
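As a minimal sketch of motif-based P-site prediction, a phosphorylation motif can be written as a regular expression and scanned over a sequence. The pattern used here, a Ser or Thr immediately followed by Pro, is a simplified consensus for proline-directed kinases (such as CDKs and MAPKs) chosen for this example; it is not taken from the slides. The test sequence is the first line of the unannotated sequence shown later in this deck.

```python
import re

# Ser/Thr followed by Pro. The lookahead (?=P) checks for the Pro without
# consuming it, so only the phosphoacceptor residue is reported.
PRO_DIRECTED = re.compile(r"[ST](?=P)")

def candidate_sites(sequence, pattern=PRO_DIRECTED):
    """Return (1-based position, residue) for each candidate P-site."""
    return [(m.start() + 1, m.group()) for m in pattern.finditer(sequence)]

fragment = "MEESQSDISLELPLSQETFSGLWKLLPPEDILPSPH"
print(candidate_sites(fragment))  # [(34, 'S')]
```

Of the seven Ser/Thr residues in this fragment only one is followed by Pro. A match is a candidate only: as with any short motif, it still needs the context checks discussed for linear motifs.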
There are many resources collecting P-sites and many tools to
predict P-sites in user-defined protein sequences
Collection of instances of P-sites
• Phospho.ELM: phospho.elm.eu.org/
• PhosphoSitePlus: www.phosphositePlus.org/
• PHOSIDA: www.phosida.com/
• PHOSPHORYLATION SITE DATABASE: www.phosphorylation.biochem.vt.edu/
• Phospho.3D: www.phospho3d.org/
Prediction of new instances of P-sites
• Scansite: scansite.mit.edu/
• NetPhos: www.cbs.dtu.dk/services/NetPhos/
• NetPhosK: www.cbs.dtu.dk/services/NetPhos/
• NetworKIN: networkin.info/search.php
• KinasePhos: KinasePhos.mbc.nctu.edu.tw/
• Predikin: predikin.biosci.uq.edu.au/
Phospho.ELM
phospho.elm.eu.org
Database of experimentally verified phosphorylation sites
in eukaryotic proteins
Current release contains:
•42,914 instances (fully linked to literature references)
• 299 kinases
• 11,224 sequences
• 8,698 substrates
PRACTICE
Go to the Phospho.ELM website
and search P-sites for p53
ELM and Phospho.ELM are interconnected
PhosphoBlast
Structural information on P-sites and 3D scan
Phospho.3D
http://www.phospho3d.org/
PRACTICE
Go to the Phospho.3D
website and search all the
substrates of the Src
kinase
Suggestions to predict P-sites in unknown sequences
MEESQSDISLELPLSQETFSGLWKLLPPEDILPSPH
CMDDLLLPQDVEEFFEGPSEALRVSGAPAAQDPVTE
TPGPVAPAPATPWPLSSFVPSQKTYQGNYGFHLGFL
QSGTAKSVMCTYSPPLNKLFCQLAKTCPVQLWVSAT
PPAGSRVRAMAIYKKSQHMTEVVRRCPHHERCSDGD
GLAPPQHLIRVEGNLYPEYLEDRQTFRHSVVVPYEP
PEAGSEYTTIHYKYMCNSSCMGGMNRRPILTIITLE
DSSGNLLGRDSFEVRVCACPGRDRRTEEENFRKKEV
LCPELPPGSAKRALPTCTSASPPQKKKPLDGEYFTL
KIRGRKRFEMFRELNEALELKDAHATEESGDSRAHS
SYLKTKKGQSTSRHKKTMVKKVGPDSD
?
Exploring unknown protein sequences
• Go to UniProt (or BLAST your sequence against the UniProt database)
and explore the sequence annotation
• Go to Phospho.ELM and scan the sequence
• Go to PHOSIDA and PhosphoSitePlus and do the same
• Use different predictors and select only high scoring sites
• Use evolutionary information:
- is the site conserved?
• Use domain (SMART and Pfam) databases:
- is the site inside a domain?
• Use structural information if available:
- is the site exposed?
- is it in a flexible region?
Exploring unknown protein sequences
When all information is collected, only retain sites predicted by more
than one tool
Amongst these, for further experimental tests, preferably choose sites
that are:
•Not inside domain(s)
•Not in secondary structure elements (helices and strands)
•Accessible to the solvent
•Evolutionarily conserved
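The selection procedure above (keep sites predicted by more than one tool, then prefer sites with a favourable context) can be sketched as two small functions. Everything in the example is hypothetical: the tool names, the predicted positions and the annotation keys are placeholders for results that would be collected by hand from the resources listed earlier.

```python
from collections import Counter

def consensus_sites(predictions, min_tools=2):
    """Positions predicted by at least min_tools different tools.

    predictions: dict mapping a tool name to a set of 1-based positions.
    """
    counts = Counter(pos for sites in predictions.values() for pos in set(sites))
    return sorted(pos for pos, n in counts.items() if n >= min_tools)

def prioritise(sites, annotations):
    """Keep sites that are outside domains, outside helices and strands,
    solvent accessible and conserved. Sites without annotation are dropped."""
    kept = []
    for pos in sites:
        a = annotations.get(pos)
        if (a and not a["in_domain"] and not a["in_helix_or_strand"]
                and a["accessible"] and a["conserved"]):
            kept.append(pos)
    return kept

# Hypothetical predictions from three unnamed tools
predictions = {"tool_A": {15, 34, 99}, "tool_B": {34, 99}, "tool_C": {15, 120}}
shared = consensus_sites(predictions)
print(shared)  # [15, 34, 99]

# Hypothetical context annotation for the shared sites
annotations = {
    15: {"in_domain": True,  "in_helix_or_strand": False, "accessible": True,  "conserved": True},
    34: {"in_domain": False, "in_helix_or_strand": False, "accessible": True,  "conserved": True},
    99: {"in_domain": False, "in_helix_or_strand": False, "accessible": False, "conserved": True},
}
print(prioritise(shared, annotations))  # [34]
```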