An evolutionary proteomics approach to identify short linear motifs in intrinsically disordered regions
MOSES, Alan
Research Proposal
Requested: $137775
I. Overview
It is increasingly appreciated that many disease associated proteins contain regions of intrinsic
disorder. However, relatively little is understood about the functions of these regions and it is not
currently possible to predict the impact of mutations in these regions. We propose to analyze the
function of disordered regions using a new systematic computational approach. We will then focus on
specific examples to experimentally test the mechanisms of function of the identified disordered regions.
This proposal represents a new approach to attack a difficult problem in protein biochemistry: the
function of intrinsically disordered proteins.
II. Background and Motivation
As many as 50% of human proteins are thought to contain intrinsically disordered regions [1,2],
including many important disease associated proteins such as p53 [3], BRCA1 [4] and CFTR [5].
Although important biological functions have been described for specific examples of disordered
regions, little is known in general about the sequence-function relationship for most residues in these
regions [6-10]. One established model is that disordered regions are important for protein regulation [1],
and contain short linear motifs (short peptide sequences important for protein interaction [11]).
We propose to apply this model systematically using the ‘comparative genomics’ paradigm, which
exploits the observation that functional sequences are preferentially preserved over evolution [12-14].
Because the short linear motifs within disordered regions are important for function, they are expected to
be preferentially conserved relative to flanking residues. We and others have successfully exploited such
evolutionary conservation to identify thousands of conserved regions in non-coding DNA [15,16], and
these have been demonstrated to have important functions in transcriptional regulation [17].
Evolutionary methods have also been widely applied to protein sequences to detect remote homologues
and to identify critical functional residues and motifs [18-21]. Here we propose to develop an
evolutionary method to identify short conserved segments within disordered regions of proteins, as a
means to identify the short linear motifs important for function. This application of ‘comparative
genomics’ is (to our knowledge) novel. We will therefore devote considerable attention to confirming
the effectiveness of the method in this new context. If successful, the methods we develop will
represent a substantial contribution and will be generally applicable beyond the scope of this project.
Why an evolutionary approach for disordered regions? For the vast majority of intrinsically
disordered regions, specific biological functions remain unknown. Computational methods can readily
identify the disordered regions based on amino acid sequence composition [22] (see [23] for review) and
in principle, function could be predicted by previously developed computational methods to recognize
short linear motifs [24-28] (reviewed in [29,30]). Unfortunately, these motifs, typically only 3-8 amino
acids long, are not statistically significant when search algorithms are applied at the proteome scale [31];
recent approaches have therefore resorted to relying on external functional data [32,33].
More general computational methods to predict interaction sites have also been developed [34-39] (reviewed in [40,41]), but these methods show little predictive power when structural information is
not available [40]. They are therefore not applicable to the majority of disordered regions. For example,
ANCHOR [37] was developed to identify binding sites in intrinsically disordered regions, and classifies
between disordered binding sites and globular proteins with high accuracy [37]. Nevertheless, its
predictions of binding sites are not specific enough to identify short functional motifs (see below).
Evolutionary approaches can predict functional residues with great power [19,42], and given the
extensive sequence databases becoming available for closely related species, their application to
disordered regions is timely. An evolutionary approach will also yield hypotheses about the regulatory
function of intrinsically disordered regions. Although some disordered regions become ordered upon
binding, many are thought to remain in predominantly disordered conformations in vivo [43] and their
functions often take advantage of this structural flexibility [44]. In addition, disordered regions may act
as scaffolds for signaling proteins [45,46], switches for regulation of protein stability or may fine-tune
biochemical activity [47-49]. These functions will influence the composition and evolution of the short
linear motifs within the disordered region.
We will analyze the functions of the disordered domains based on the short linear motifs they contain.
Focusing on phosphorylation sites, we will confirm the functions of conserved motifs using site-specific
mutagenesis and in vitro kinase assays. We will also explore the mechanisms of function of three
selected disordered regions using NMR. We and others have previously applied NMR to explore the
structural features of disordered regions in several proteins, including CFTR, Sic1, I-2, spinophilin and
others [47-49] (reviewed in [50]). The combination of detailed mechanistic studies with the genome-wide unbiased analysis of disordered regions will allow us to generalize beyond the small number of
examples that can be characterized in detail.
III. Specific aims and research plan
Aim 1: Use evolutionary conservation to identify functional elements within disordered regions.
Aim 2: Experimentally analyze disordered regions containing predicted short linear motifs.
Please see Appendix 2 for a schematic outline of the proposal and Appendix 3 for information about
completion of specific tasks.
Specific Aim 1. Identify conserved segments within disordered regions
Natural selection is expected to remove mutations in functionally important sequences, leading to slower
evolution in short linear motifs. Indeed, we and others have recently shown that characterized short
linear motifs are conserved relative to the flanking amino acid sequences [51-55], and conservation
of short linear motifs has already been used to search for examples of known motifs [20,21].
To exploit evolutionary conservation to systematically identify functional elements within
disordered regions, we propose to use a state-of-the-art probabilistic model known as a 'phyloHMM'
[56] to identify short stretches of amino acid sequence that are evolving more slowly than the
immediately flanking sequences. Briefly, our phyloHMM follows previous work [56] by assuming that
each column in a multiple sequence alignment (Figure 1a) can be classified into a ‘conserved state’ or a
‘background’ state; it then reports for each residue the posterior probability that it falls in the conserved
state (Figure 1b). PhyloHMMs explicitly account for the similarity of sequences related by a
phylogenetic tree and therefore can extract the maximum signal from the multiple alignments. Because
insertions and deletions are prevalent in disordered regions, we will modify the standard probabilistic
models underlying the phyloHMM [57,58] to include an insertion/deletion process (in addition to the
standard substitution process). These phyloHMMs will be implemented by Alex Nguyen Ba, a PhD
student in the Moses lab.
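
The two-state classification described above can be illustrated with a minimal forward-backward sketch. It assumes that the likelihood of each alignment column under the conserved and background states has already been computed (in the proposal these would come from phylogenetic models on the tree, which are not reproduced here). The function name, the transition probability `stay`, the prior and the toy likelihoods are all hypothetical and are not taken from the proposed implementation.

```python
def posterior_conserved(lik_cons, lik_bg, stay=0.9, prior_cons=0.1):
    """Posterior probability of the 'conserved' state at each alignment column.

    lik_cons[i], lik_bg[i]: likelihood of column i under each state
    (assumed precomputed; hypothetical inputs).
    stay: assumed probability of remaining in the same state between columns.
    """
    n = len(lik_cons)
    emit = list(zip(lik_cons, lik_bg))  # state 0 = conserved, state 1 = background
    trans = [[stay, 1 - stay], [1 - stay, stay]]
    init = [prior_cons, 1 - prior_cons]

    # Forward pass, rescaled at every column to avoid numerical underflow.
    prev = [init[s] * emit[0][s] for s in (0, 1)]
    total = sum(prev)
    fwd = [[p / total for p in prev]]
    for i in range(1, n):
        cur = [emit[i][s] * sum(fwd[-1][r] * trans[r][s] for r in (0, 1))
               for s in (0, 1)]
        total = sum(cur)
        fwd.append([c / total for c in cur])

    # Backward pass, also rescaled at every column.
    bwd = [[1.0, 1.0] for _ in range(n)]
    for i in range(n - 2, -1, -1):
        cur = [sum(trans[s][r] * emit[i + 1][r] * bwd[i + 1][r] for r in (0, 1))
               for s in (0, 1)]
        total = sum(cur)
        bwd[i] = [c / total for c in cur]

    # Combine and normalise per column; the scaling constants cancel.
    post = []
    for f, b in zip(fwd, bwd):
        joint = [f[0] * b[0], f[1] * b[1]]
        post.append(joint[0] / sum(joint))
    return post


# Toy example: three central columns look conserved, the flanks do not.
toy_cons = [0.1, 0.1, 0.8, 0.9, 0.8, 0.1, 0.1]
toy_bg = [0.5, 0.5, 0.1, 0.1, 0.1, 0.5, 0.5]
toy_posterior = posterior_conserved(toy_cons, toy_bg)
```

In this toy input the posterior is high for the central columns and low for the flanks, which is the per-residue quantity plotted in Figure 1b.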
Choosing species for the analysis: We will start with analysis of the budding yeast proteome. Because
the computational methods rely on alignments of orthologous sequences, we will take advantage of the high-quality genome
annotations and syntenic orthologs available for budding yeast from YGOB [59]. Proteins in budding
yeast and its relatives usually contain only a single exon, and a single transcript. This makes gene
prediction and the identification and assignment of 1-to-1 relationships between protein coding sequences from
multiple species relatively straightforward. Because of the much greater complexity of vertebrate gene
structures and large transcript numbers, the bioinformatics analysis is more time-consuming. A graduate
student will be recruited to the Moses Lab to extend and apply these methods to the human proteome.
Meanwhile, we have focused our energy on the budding yeast model system.
Confirming the phyloHMM methodology: To our knowledge the application of phyloHMMs to
disordered regions in proteins is novel. We will therefore perform two important tests of the
methodology in application to alignments of yeast and vertebrate proteins: (i) confirm that “conserved”
short segments are accurately aligned, (ii) confirm that when short linear motifs are actually conserved,
the phyloHMM has the power to detect them. We will compute ROC curves showing true positive and
false-positive rates as we vary the assumptions about the evolution of the short linear motifs and the
surrounding sequences.
(i) Confirm alignment accuracy: Due to the rapid evolution of intrinsically disordered regions, truly
conserved short linear motifs may be aligned incorrectly in the multiple sequence alignments. We will
perform a series of simulations of molecular evolution of disordered regions. In these simulated proteins,
we can insert conserved protein domains, as well as conserved short linear motifs. Because we know the
short linear motifs in these simulated proteins are conserved, we can test whether the aligner can
correctly align them. Preliminary results: In our simulations, for typical motif evolutionary rates, e.g.,
25% of background, we find that ~95% of the artificial motifs are correctly aligned (Figure 2a).
(ii) Confirm phyloHMM power: We will run our phyloHMM on the simulated alignments and test
whether it can identify the short linear motifs that we know are conserved. For a more realistic measure
of power, we will also curate a set of experimentally characterized short linear motifs from the literature
and identify which of these are conserved. We will then run our phyloHMM method on the proteins that
contain these bona fide short linear motifs and test whether it can correctly identify them.
Preliminary results: Our preliminary analysis based on simulated proteins indicates that using the
phyloHMM we can identify ~50% of conserved motifs as short as 2 amino acids long (false negative
rate = 50%, Figure 2b), with false positive rates of less than 1 per 2500 amino acids (Figure 2c). By
comparison, we also applied ANCHOR [37] to simulated disordered proteins. Even when we planted no
short linear motifs in these sequences, it predicted 47% (±9%) of amino acids to be within protein
binding sites. Figure 1b, Figure 6 and Figure 10 show a direct comparison of the phyloHMM and
ANCHOR. To test the predictive power of the phyloHMM on real short linear motifs, we curated from
the yeast literature 106 experimentally characterized short linear motifs that are preserved in at least
90% of the species we used for multiple alignments [59]. When we analyze this data set, we find that
68% are identified by the phyloHMM at a posterior probability threshold of 0.6, for a false negative rate
of 32%. Despite the high false negative rate, these preliminary results indicate that it is possible to
identify short conserved segments in unstructured regions with the characteristics of short linear motifs.
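
The true-positive and false-positive rates used above to assess power can be tabulated from per-residue posterior probabilities as sketched below. This is an illustration only: it assumes each residue of a simulated protein carries a label saying whether it lies inside a planted motif, and the function names and toy numbers are hypothetical.

```python
def roc_points(scores, labels):
    """Return (false_positive_rate, true_positive_rate) pairs, one per threshold.

    scores: per-residue posterior probabilities from the classifier.
    labels: 1 if the residue lies in a planted (truly conserved) motif, else 0.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    # Lower the threshold step by step, from the highest score to the lowest.
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoid rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Toy data: two motif residues scored above two background residues.
toy_points = roc_points([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0])
toy_auc = auc(toy_points)
```

Repeating this calculation while varying the simulated motif and background evolutionary rates gives one curve per set of assumptions.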
Genome-wide identification of conserved segments in yeast and human disordered regions: Once we
have confirmed that the phyloHMM method can be applied to identify conserved segments in disordered
regions, we will systematically apply it to alignments of all proteins from budding yeast (and related
fungi [59]) and human (and related vertebrates [60]). We will then identify the disordered regions and
protein domains in the human and yeast proteins (using DISOPRED [23] and Pfam [61]) and collect the
short conserved segments that fall within disordered regions, but do not match any known protein
domains. To estimate the expected number of false positives identified in the cluster analysis we will
repeat the clustering on the motifs identified in simulated disordered proteins. These computational
studies will be performed by a PhD student in the Moses lab in the first years of the project. To our
knowledge, this analysis will represent a totally novel approach to identify (and quantify) the functional
content of disordered regions.
Preliminary results: Applying our phyloHMM to alignments of the entire yeast proteome yields >7000
short conserved segments, of which we expect only ~ 200 to be false positives. This provides (to our
knowledge) the first unbiased characterization of the amount of functional amino acid sequence within
intrinsically disordered regions. At the residue level, this indicates that at least ~5% of the amino acids
in intrinsically disordered regions are under specific evolutionary constraints and are therefore very
likely to have biological functions.
To confirm that the short conserved segments are indeed short linear motifs, we have searched for
conserved segments that match known short linear motifs. For example, the FG motif is found in
disordered regions of nuclear pore complex (NPC) proteins [67]. Of the 30 components of the NPC, 13
are known to contain FG repeats which are thought to be biologically important for the nuclear import
and export of proteins [67]. To test whether the phyloHMM approach can identify these, we searched
the conserved segments for those that matched the FG-motif. We found only 59 proteins that contained
conserved FG-motifs, including 12 of the 13 previously known examples (Figure 3b). Since there are
3438 proteins for which alignments are available, this represents a highly significant enrichment: 12/59
vs 13/3438, P-value = 7.21 x 10^-16. In another test of our method, we applied a similar statistical
analysis to the set of conserved elements containing the canonical phosphorylation site consensus
sequence (S/T-P-x-R/K) of the Cdc28 kinase. Of 695 proteins tested in a high-throughput in vitro kinase
assay [68], our phyloHMM identifies 40 proteins containing a short conserved sequence which matches
the Cdc28 consensus sequence. Of those, 32 (80%) were found to be positive in their assay which is a
highly significant enrichment (32/40 vs 185/695, P-value = 1.4 x 10^-11). Of the 8 remaining proteins that
contain conserved consensus sites, but were not identified as positives in the assay, one is a known
substrate of Cdc28p (Cdc15p [69], but was negative in the assay). Two others are known substrates of
the Pho85p kinase (Rim15p [70]) and the Fus3p kinase (Fus2p [71]), both of which can phosphorylate
the canonical Cdc28p consensus sequence, indicating that the phyloHMM can identify bona fide
phosphorylation sites that are missed in the kinase assays. Similarly, we have been able to identify new
examples of other known short linear motifs, such as phosphorylation sites for Cbk1 (see Specific Aim 2)
and the KEN box (Figure 3d) using the phyloHMM method. These results are very encouraging, and
indicate that evolutionary conservation in disordered regions is a strong predictor of biological function.
This preliminary data indicates that our phyloHMM can identify thousands of short conserved
segments in the disordered regions of the budding yeast proteome, and when these conserved segments
match known short linear motifs they are likely to perform the predicted biological function.
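
Enrichment comparisons of the form "12/59 vs 13/3438" can be scored in several ways. The sketch below uses a one-sided hypergeometric tail as one common choice. The proposal does not state which test produced the reported P-values, so this sketch is an assumption and is not expected to reproduce them exactly. The function name is hypothetical.

```python
from math import comb


def hypergeom_tail(k, n, K, N):
    """P(X >= k) for a hypergeometric draw.

    N: proteins in the background (e.g. all proteins with alignments).
    K: background proteins carrying the annotation (e.g. known FG-repeat proteins).
    n: proteins in the selected set (e.g. those with a conserved FG-motif).
    k: selected proteins that carry the annotation.
    """
    total = comb(N, n)
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / total


# Counts taken from the FG-motif example in the text.
p_fg = hypergeom_tail(k=12, n=59, K=13, N=3438)
```

Exact integer arithmetic via `math.comb` avoids the underflow that floating-point factorials would cause at these sizes.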
Identification of novel short linear motifs: Interestingly, despite the strong statistical results regarding
previously known short linear motifs, we note that most of the identified conserved segments do not
match any known short linear motif [11]. We hypothesize that these conserved segments represent
examples of previously unrecognized short linear motifs. We will therefore attempt to discover these
motifs by identifying families of conserved segments with similar amino acid sequences. We will use
graph-based clustering methods to identify groups of conserved segments based on sequence similarity.
These “clusters” may represent known and novel short linear motifs. To test whether they are associated
with biological function, we will compare the number of proteins in the cluster with a given functional
annotation to the number of proteins with that annotation expected by chance [63-65]. Statistical overrepresentation (or enrichment) of a particular function leads us to propose that function for the novel
motif. Please see preliminary results below for examples of this analysis, which we consider a
particularly exciting aspect of our proposal -- it will allow us to assign putative functions to completely
novel short linear motifs. As above, we will first apply this to budding yeast, and then extend the
analysis to alignments of vertebrate proteins.
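
The idea of grouping conserved segments by sequence similarity can be illustrated with a deliberately simple stand-in: link two segments when their fraction of identical positions reaches a threshold, then take connected components of the resulting graph. This is a simplification we assume for illustration, not the graph-clustering algorithm used for the preliminary results, and the threshold, function names and toy segments are hypothetical.

```python
from itertools import combinations


def hamming_similarity(a, b):
    """Fraction of identical positions; segments of unequal length score 0."""
    if len(a) != len(b):
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cluster_segments(segments, min_similarity=0.75):
    """Group segments into connected components of the similarity graph."""
    parent = list(range(len(segments)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Add an edge (merge components) for every sufficiently similar pair.
    for i, j in combinations(range(len(segments)), 2):
        if hamming_similarity(segments[i], segments[j]) >= min_similarity:
            parent[find(i)] = find(j)

    clusters = {}
    for idx, seg in enumerate(segments):
        clusters.setdefault(find(idx), []).append(seg)
    return sorted(clusters.values(), key=len, reverse=True)


# Toy segments (invented for illustration, not taken from the yeast data).
toy_clusters = cluster_segments(["FAFP", "FSFP", "FTFP", "KENL", "KENI", "DSFA"])
```

Each resulting cluster can then be summarized as a consensus and tested for over-represented functional annotations.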
Preliminary results: To associate the thousands of novel conserved segments with functions, we have
performed a graph-based clustering of conserved segments identified in budding yeast alignments using
MCODE [62]. In addition to identifying many known motifs (Figure 3), the cluster analysis revealed
hundreds of new consensus sequences that were not previously recognized as short linear motifs.
Within these clusters, we have 30 unknown consensus sequences with 20 or more conserved examples
in the yeast proteome. Although many of the novel consensus sequences identified by the cluster
analysis might be due to random noise, several of these new putative consensus sequences are
statistically associated with biological functions and are therefore very likely to represent novel,
previously unrecognized short linear motifs (see Specific Aim 2 below). These include a previously
unreported DSF motif that is associated with amino acid permeases (6/8 vs. 36/3438, P-value =
2.4 x 10^-11), an NPY motif associated with vesicle and nuclear membrane proteins (7/12 vs. 419/3438, P-value = 1.7 x 10^-5) and an FxFP motif statistically enriched in proteins that physically interact with Cbk1
(Figure 4). Preliminary data (see Specific Aim 2 below) indicates that this motif is a bona fide
interaction motif for Cbk1.
Beyond the phyloHMM: While phyloHMMs represent a novel and potentially powerful approach to
identify functional elements in disordered regions, they are limited to those conserved segments that are
preserved at the same location in all species. In fact, preliminary data indicates this is not the case for
many bona fide short linear motifs. For example, we have performed extensive searches of the literature
to identify 530 experimentally characterized short linear motifs in budding yeast. Of these, 305 were
found in disordered regions, but only 106 were conserved in multiple alignments when inspected by eye
(e.g., Figure 5a). Therefore, the “conserved” segments in disordered regions will represent only a
subset of the functional elements in these regions. While our phyloHMM methods can identify this
subset with great power, they are limited to this ‘low-hanging fruit’ of alignment conservation.
To address this, we will also develop methods to identify conserved motifs that are not conserved in the
alignments. We refer to this type of conservation as “alignment-free conservation”. To detect it, we
will consider matches to a consensus motif occurring according to a birth-death process with rates
specified by a “background” amino acid substitution process that has no specific selection to retain
matches to the consensus. If motif matches are retained over evolution beyond what is expected based
on this process, we can infer that selection has acted to preserve them (Figure 5b). To apply this
approach when we do not know the consensus in advance we will scan along the alignment and test each
short sequence as a potential motif.
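
The logic of testing for retention beyond a no-selection background can be sketched with a Monte Carlo stand-in. Instead of the birth-death model on the phylogenetic tree described above, this simplified version assumes a star phylogeny and a uniform substitution model: each ortholog is simulated by mutating the reference sequence independently, and the observed number of orthologs containing a match anywhere is compared with the simulated distribution. All names, the substitution probability `p_sub` and the toy sequences are hypothetical.

```python
import random
import re

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def count_orthologs_with_match(seqs, pattern):
    """Number of sequences containing at least one match, regardless of position."""
    return sum(1 for s in seqs if re.search(pattern, s))


def mutate(seq, p_sub, rng):
    """Replace each residue by a uniformly drawn amino acid with probability p_sub."""
    return "".join(rng.choice(AMINO_ACIDS) if rng.random() < p_sub else aa
                   for aa in seq)


def alignment_free_pvalue(reference, orthologs, pattern, p_sub, n_sim=2000, seed=0):
    """Empirical P(simulated retention >= observed retention) with no selection."""
    rng = random.Random(seed)
    observed = count_orthologs_with_match(orthologs, pattern)
    hits = 0
    for _ in range(n_sim):
        simulated = [mutate(reference, p_sub, rng) for _ in orthologs]
        if count_orthologs_with_match(simulated, pattern) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_sim + 1)


# Toy data: the S/T-P-x-R/K consensus is present in every ortholog,
# but at a different position in each (invented sequences).
consensus = r"[ST]P.[RK]"
toy_reference = "MAGSPTRLLDNNQ"
toy_orthologs = ["MSPAKGGLLDNNQ", "MAGGLLSPQRDNQ", "GGTPLKMALDNNQ",
                 "MAGSPTRLIDNNQ", "LLTPARMAGDNNQ"]
toy_p = alignment_free_pvalue(toy_reference, toy_orthologs, consensus, p_sub=0.6)
```

Because matches are counted without reference to alignment columns, motifs that have moved within the disordered region still contribute to the signal.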
Preliminary results: We have implemented an algorithm that searches for “alignment-free
conservation” of matches to consensus motifs. We compute the distribution of the number of motif matches in
orthologous sequences based on the extant sequence in a reference species (budding yeast) under a
background model with no selection pressure to retain them (Figure 5b). When we search the yeast
proteome using this algorithm and known phosphorylation consensus sequences, we can find significant
enrichment of substrates for multiple kinases (Figure 6b). In addition, we have identified proteins
where consensus sites are not conserved in the alignments, but are very likely to represent novel
substrates. For example, at least 4 proteins in the DNA damage signaling pathway contain highly
conserved clusters of non-aligned matches to the Mec1 consensus motif (Figure 6a). Since Mec1
directly targets 4 other proteins in this pathway, we believe our novel predictions are very promising
(Figure 6a). Indeed proteins with evidence for alignment-free conservation of the Mec1 consensus are
significantly enriched for “response to DNA damage stimulus” proteins (P-value = 2.1 x 10^-9).
Anticipated problems and solutions: To our knowledge phyloHMMs have not yet been employed for
analysis of disordered regions, but they have been successfully applied to DNA sequences, leading to
the widely used phastCons 'conservation tracks' at the UCSC genome browser [66]. Since protein
evolution is much more heterogeneous than DNA evolution, if the method employed for DNA
sequences (i.e., searching for the most highly conserved regions [56]) were applied to proteins, it would
identify slowly evolving regions (or structural domains) rather than short linear motifs. Similarly, if
applied naively, our methods that do not rely on strict positional conservation would simply identify
motif matches that happen to occur in highly conserved regions of proteins.
To address this potential problem, we use a ‘local’ rate of evolution against which we compare
the expected pattern of evolution. For the phyloHMM, the background rate is estimated using a 20
amino acid sliding window across the alignment, and the conserved rate is estimated by taking the
maximum likelihood estimate of the rate at that position up to 1/3 the background rate. For our
alignment-free method, we use a maximum likelihood estimate of the local evolutionary distance in a
window surrounding the consensus sequence we are testing (Figure 5b). For the phyloHMM we also
filter out known protein domains (using Pfam [61]).
Another important technical challenge is the large numbers of insertions and deletions that are
found within alignments of disordered regions. Because insertions and deletions violate the assumption
that each column in an alignment is independent, standard probabilistic phylogenetic models treat
substitutions only. We sought to model the “gaps” with a compromise between computational feasibility
and biological realism. To do so, we divide the multiple sequence alignment into “blocks” of constant
gap size (illustrated as black vertical lines around grey shaded areas in Figure 1a,c). While each block
does not necessarily only include one insertion and deletion event, this is a much better approximation
than treating the columns independently. We can then compute the probability of each block based on the
gap size and its distribution on the phylogenetic tree (Figure 1d, eq. 1). We will assume the gap
process and substitution process are independent, so that the likelihood is simply the product of the two
(Figure 1d, eq. 2). This model allows relatively simple parameter estimation by numerically
maximizing the likelihood.
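
The division of an alignment into gap blocks can be sketched as follows. We assume here that a "block" is a maximal run of adjacent columns in which the same set of species carries a gap. This is one reading of the description above, and the function name and toy alignment are hypothetical.

```python
def gap_blocks(alignment):
    """Split alignment columns into maximal runs sharing the same gap pattern.

    alignment: list of equal-length aligned sequences, with '-' marking gaps.
    Returns (start, end, pattern) tuples with end exclusive; pattern[i] is True
    where sequence i has a gap throughout the block.
    """
    n_cols = len(alignment[0])
    blocks = []
    start = 0
    prev = tuple(seq[0] == "-" for seq in alignment)
    for col in range(1, n_cols):
        cur = tuple(seq[col] == "-" for seq in alignment)
        if cur != prev:
            # The gap pattern changed, so close the current block.
            blocks.append((start, col, prev))
            start, prev = col, cur
    blocks.append((start, n_cols, prev))
    return blocks


# Toy alignment of three species (invented sequences).
toy_alignment = ["MKT--SPAR",
                 "MKTGGSPAR",
                 "MK---SPAR"]
toy_blocks = gap_blocks(toy_alignment)
```

Under the independence assumption stated above, a block's log-likelihood is the sum of its gap log-likelihood and its substitution log-likelihood, so the two components can be computed separately and added.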
Specific Aim 2. Experimental analysis of predicted short linear motifs and disordered regions
Because application of the phyloHMM to disordered regions is novel, to demonstrate the power of the
methodology we will confirm several predictions using site-specific mutagenesis, in vitro kinase assays
and fluorescence microscopy. We will then turn to more detailed NMR analysis to test hypotheses
about the mechanisms of regulation mediated by the disordered regions.
2a) Experimental confirmation of computational predictions. To show that the evolutionarily
conserved segments in disordered regions actually represent bona fide short linear motifs, we will test
specific examples experimentally in budding yeast, including (i) novel examples of previously known
consensus sequences and (ii) novel consensus sequences identified in our cluster analysis. These
experiments will provide empirical support for the evolutionary proteomics approach to identifying
functional elements within intrinsically disordered regions. Indeed, our preliminary data below indicates
that when tested, the short conserved segments identified by the phyloHMM are likely to function as
short linear motifs.
(i) New examples of known motifs. To demonstrate that our methods can predict novel phosphorylation
sites for known protein kinases we will perform in vitro kinase assays. We will focus on substrates of
the NDR/LATS family protein kinase Cbk1 in budding yeast. This family of kinases is important for
determination of cell-growth and development and is conserved between yeast and humans [72], but few
direct substrates have been identified. We will search the short conserved segments and identify
matches to the Cbk1 consensus. The disordered regions that contain these consensus sites will be tested
in in vitro kinase assays with Cbk1 purified from yeast cells in collaboration with Eric Weiss’ lab at
Northwestern University (Figure 7). To confirm that the in vitro phosphorylation is due to the Cbk1
kinase, we will repeat these experiments in cells without the Cbk1 kinase activity (Figure 7). We aim to
identify and test 5 novel Cbk1 substrates in collaboration with the Weiss lab in the second year of the
proposal. While this will not be an exhaustive enumeration of substrates, it will be a large enough
number to demonstrate that our methods can identify new substrates, and we will not be particularly
wedded to any individual protein if we find that it is difficult to purify or work with experimentally.
Preliminary data: We identified conserved consensus matches to the Cbk1 consensus [80] in the N-terminus of Sec3p, which is predicted to be disordered. In collaboration with Eric Weiss’ group at
Northwestern University, we made alanine mutations in the predicted phosphorylation sites (Figure 7d)
and subjected the Sec3p N-terminus to an in vitro phosphorylation assay. Preliminary results (Figure
7e) indicate that the N-terminus of Sec3p is a very good substrate for Cbk1 in vitro and that
phosphorylation of this protein is also important in vivo (data not shown). We have also identified
additional candidate Cbk1 substrates: Fir1 and Tao3 which were reported to physically interact with
Cbk1 (refs), and Mpt5, an RNA-binding protein (Figure 7f-h) like Ssd1, a known Cbk1 substrate.
In our preliminary cluster analysis of conserved segments in disordered regions (Figure 3a), we
identified a cluster corresponding to the KEN-box degradation signal recognized by the APC^Cdc20. Only
10 proteins had a short conserved segment matching the KEN sequence. Eight of those contained an
experimentally verified KEN degradation signal [81-83], were characterized targets of the APC^Cdc20
[84,85] or were cyclins, one of which contains a verified KEN sequence [83]. The two remaining
conserved segments matching the KEN signal (in Spt21p and Sgd1p) have not been associated with the
APC or known to show cell-cycle regulated degradation. Spt21p is a protein involved in regulating
histone transcription and its transcription is cell-cycle regulated [86,87]. Furthermore, over-expression
of Spt21p has been shown to be toxic [88], suggesting a requirement for tight control on protein levels.
We therefore decided to test if the identified KEN sequence in Spt21p was truly a degradation signal.
We first tested if protein abundance was cell-cycle regulated and found that Spt21p protein levels
coincide with Clb2p protein levels (Figure 8b) indicating that, as at the level of mRNA, Spt21p protein
levels vary over the cell cycle. Given the toxicity of over-expression, we reasoned that if the KEN
sequence is a biologically relevant degradation signal, then over-expression of a KEN-mutant form of
Spt21p would be more toxic than a wt form. We therefore mutated the KEN-box to alanines and
observed that growth was more severely affected by over-expression of the mutant than of the wt (Figure 8d).
This phenotypic effect is consistent with the hypothesis that the evolutionarily conserved sequence is
important. Finally, to test the stability of the Spt21p KEN-mutant protein, we assayed protein levels of
wt Spt21p and mutant by over-expressing the protein with the GAL promoter and then shutting off both
transcription and translation. Indeed, we observed that the mutant protein levels remained high, while
the wt degraded over time (Figure 8e).
(ii) Assign functions to novel motifs. Our cluster analysis will identify a large number of new
consensus sequences that have not been previously recognized. In Specific Aim 1 we will associate
these with putative functions through statistical overrepresentation of functional annotation in the cluster
[63-65]. To test these putative functions, we will make site-specific mutations in the novel motifs. For
example, proteins containing conserved FxFP motifs were associated with physical interactions with
Cbk1. We will therefore test whether mutations in this short peptide can disrupt interactions with this
kinase. Similarly, a previously unreported DSF motif was associated with amino acid permeases. We
hypothesize that this motif will be important for targeting of the permeases to the correct localization.
We will therefore make N-terminal GFP fusion proteins and follow their localization using fluorescence
microscopy before and after mutagenesis of this motif. We will also test whether mutations in this motif
impair permease function by measuring cell growth in auxotrophic strains. These experiments will be
performed by a technician in the Moses Lab throughout the project.
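The overrepresentation step that links a cluster of motifs to a putative function can be illustrated with a one-sided hypergeometric test. This is only a minimal sketch: the proposal does not state which test will be used, and the function name, argument names and counts are hypothetical placeholders.

```python
from math import comb


def hypergeom_enrichment_p(n_total, n_annotated, n_cluster, n_overlap):
    """Return P(X >= n_overlap) under a hypergeometric null.

    n_total     : proteins in the background set (assumed input)
    n_annotated : background proteins carrying the annotation
    n_cluster   : proteins in the motif cluster
    n_overlap   : cluster proteins carrying the annotation
    """
    if not (0 <= n_annotated <= n_total and 0 <= n_cluster <= n_total):
        raise ValueError("annotated and cluster counts must lie in [0, n_total]")
    upper = min(n_annotated, n_cluster)
    if not (0 <= n_overlap <= upper):
        raise ValueError("overlap cannot exceed the annotated or cluster counts")

    # Sum the upper tail of the hypergeometric distribution exactly.
    tail = sum(
        comb(n_annotated, i) * comb(n_total - n_annotated, n_cluster - i)
        for i in range(n_overlap, upper + 1)
    )
    return tail / comb(n_total, n_cluster)
```

A small p-value from a call such as `hypergeom_enrichment_p(6000, 40, 25, 6)` (illustrative numbers, not data) would mark the annotation as a candidate function for follow-up mutagenesis. Correction for testing many annotations is not shown.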
Preliminary data: One of the novel motifs identified in our cluster analysis was an FxFP motif that was
statistically enriched in proteins that were found to interact with Cbk1 in high-throughput studies
(Figure 4d “x” [89,90]). We noted that this motif resembled the docking motif that has been reported
for MAPKs, so we decided to test whether this motif represented a novel docking site for the Cbk1
kinase. To test this, in collaboration with Brian Yeh in Eric Weiss’ lab, we expressed short peptide
fragments containing the conserved segment and tested whether they could bind the Cbk1 kinase domain
in an amylose resin pull-down assay (Figure 4e). Amazingly, all 6 of these peptides showed binding in
this assay, with 4/6 showing strong binding. This confirms that this short motif can mediate direct
interactions with the Cbk1 kinase domain. These preliminary data suggest that this motif is a novel
docking site for this kinase and support the idea that the novel patterns identified through cluster
analysis represent bona fide short linear motifs.
2b) Testing models for the mechanisms of regulation. Several models have been proposed for the
mechanistic function of disordered regions. Perhaps most familiar is the function of disordered proteins
as scaffolds for signaling and protein complex formation (Figure 9a [44]). Scaffolds are an important
mechanism to ensure specificity in cell-signaling as they bring otherwise potentially promiscuous
enzymes in close physical proximity. For example, MAPKs form canonical three-kinase signaling
cascades, where each kinase has great specificity for the next. However, this specificity is not always
encoded in the direct interactions between kinases, but rather by scaffold proteins that physically link
each kinase to the next. This is one important mechanism by which kinases can be reused in multiple
signaling pathways, but avoid “cross-talk”.
A second well-characterized mechanism for disordered regions is the “multisite regulation”
model where multiple regulatory sites in disordered regions are important for ultrasensitive “switch-like” responses (Figure 9b [49,73]). The paradigmatic example of this mechanism is the cell-cycle
regulator Sic1p, which undergoes multi-site phosphorylation to ensure switch-like onset of S phase.
Recently, we have demonstrated that although the N-terminus of Sic1p is disordered, it adopts a very
compact three-dimensional conformation and interacts with its binding partner (Cdc4) in many rapidly
alternating states (Figure 10 [49]). Thus, in the case of Sic1, the flexibility of the disordered protein
seems critical to allow all of the sites to interact with Cdc4, thus providing the mechanistic basis for the
switch-like regulation [74].
Finally, disordered regions can integrate signals from multiple binding partners (Figure 9c,
[47]). Regulatory proteins and internal binding domains can compete for binding sites within a protein
to activate or inhibit the biochemical function. For example, CFTR, the protein mutated in cystic fibrosis
[47], contains a large disordered region known as the R-region that is important for controlling the
activity of the CFTR channel [47]. The R-region contains multiple binding sites for interaction partners
(such as NBD1) whose binding propensity is modulated by phosphorylation. Depending on the
propensity of these binding sites for their interacting proteins, they will be either highly bound or largely
free. These multiple binding sites in the disordered region therefore integrate multiple signals to control
activation of the channel. Because of their intrinsic structural flexibility, disordered proteins can have
many more binding partners than ordered proteins and therefore are well-suited to these types of
functions [44].
All of these models make predictions about the organization and evolution of the functional
sequences within the disordered regions. Based on the patterns of conserved segments within the
disordered regions identified by the phyloHMM (Specific Aim 1) we will analyze the mechanism of
function by (i) integrating high-throughput functional data with our phyloHMM predictions to test these
models systematically and (ii) testing examples of specific proposed mechanisms using NMR studies.
(i) Statistical tests of models for disordered region function. The scaffold model predicts that
disordered proteins will contain large numbers of short conserved protein binding motifs and show a
large number of protein-protein interactions with related biochemical functions. For example, Las17p
contains 7 conserved putative SH3 binding sites, and binds to 9 SH3-containing proteins in high-throughput studies [75], many of which have functions related to actin assembly. To test this
statistically, we will randomly permute the large-scale protein interaction data for each protein and test
for an excess of proteins with many conserved binding motifs associated with a large number of protein
interactions. The multisite regulation model predicts that proteins will have a large number of
conserved motifs that match the same consensus sequence, and interact with only a single regulator that
recognizes that consensus. For example, the N-terminus of Sic1p contains 9 weak binding sites for
Cdc4 [73,76] that contribute to switch-like degradation of this protein. Most of these weak sites are
highly conserved, and therefore readily identified by the phyloHMM (Figure 1). To test for this
statistically, we will compare the number of proteins with a large number of the same conserved motifs
and a small number of protein interactions to that expected if the protein motifs were distributed
randomly amongst the disordered regions. Finally, the integrator model predicts that disordered regions will
contain many different conserved segments and interact with multiple regulators that correspond to those short
linear motifs. To test this statistically, we will count the number of proteins with multiple different
conserved motifs that interact with multiple different corresponding regulators and compare this to the
number expected in permuted data. For each statistical test above we will identify whether the observed
distribution provides support for that model. In addition, we will identify the examples of unstructured
regions that show the properties expected under each of the models. A graduate student in the Moses
Lab will work on this analysis throughout the duration of the project.
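The permutation logic behind these tests can be sketched for the scaffold model as follows. This is an illustration under stated assumptions, not the analysis pipeline itself: here the interaction counts are shuffled across proteins, the thresholds `motif_min` and `interaction_min` are arbitrary, and all names are hypothetical.

```python
import random


def permutation_p_value(motif_counts, interaction_counts,
                        motif_min=5, interaction_min=5,
                        n_perm=10000, seed=0):
    """One-sided permutation test for an excess of proteins that have both
    many conserved motifs and many interaction partners.

    Returns (observed_count, p_value). Both inputs hold one entry per protein.
    """
    if len(motif_counts) != len(interaction_counts):
        raise ValueError("inputs must have one entry per protein")

    def statistic(interactions):
        # Number of proteins that pass both (assumed) thresholds.
        return sum(1 for m, k in zip(motif_counts, interactions)
                   if m >= motif_min and k >= interaction_min)

    observed = statistic(interaction_counts)
    rng = random.Random(seed)
    shuffled = list(interaction_counts)
    hits = 0
    for _ in range(n_perm):
        # Break the motif/interaction pairing while keeping both marginals.
        rng.shuffle(shuffled)
        if statistic(shuffled) >= observed:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return observed, (hits + 1) / (n_perm + 1)
```

The multisite regulation and integrator models could be tested with the same loop by swapping the statistic, for example counting proteins with many copies of one motif and few interaction partners.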
(ii) NMR studies of key examples. The statistical analysis will identify candidate regulatory mechanisms
for many disordered regions and we will choose three to be tested further in NMR studies. We will
determine which proteins are of most interest based on the short conserved segments within the
disordered regions, and analysis of other functional data regarding each protein. In particular we will
choose proteins where regulatory interactions with short linear motifs in the disordered regions have
been established or can be confidently inferred, and can be reconstituted in vitro.
Previously, we have performed extensive NMR studies on the intrinsically disordered N-terminus of
Sic1p [49,74,76]. In those studies we used purified Cln2-Cdc28 to test the effects of phosphorylation,
and found that both phosphorylated and unphosphorylated forms are clearly disordered [76]. Our
studies revealed that this protein interacts with Cdc4 through a large number of alternating
conformations (Figure 10), and that this regulation is controlled by multi-site phosphorylation of the
Cdc4 binding sites by Cdk1 [76]. Interestingly, even though Sic1p is intrinsically disordered, we have
found that it is fairly compact [76,79]. We hypothesize that compactness is important for the
electrostatic component of the interaction with Cdc4, and we will test whether additional disordered
proteins also are more compact than expected, particularly when they appear consistent with the multisite regulation model (Figure 9b).
We propose to study the mechanism of function of the disordered N-terminus of Pds1p, which is critical
for degradation of this protein by the APC^Cdc20 [77] and is regulated by the kinases Cdk1 and Chk1
[77,78]. Pds1p (known as securin in mammals) is a conserved regulator of anaphase entry, which
prevents cell cycle progression in the presence of spindle checkpoint activation [91]. It is targeted by
the APC for degradation at the onset of anaphase, and the N-terminus of Pds1p contains a D-box that is
important for regulation by APC [91]. The APC is a large, multi-subunit ubiquitin ligase that regulates
the cell cycle using two activating subunits, Cdc20 and Cdh1/Hct1. Cdc20 recruits substrates to the APC
by binding directly to degradation motifs (D-box and KEN box). Though the APC is too large for NMR
studies, the Cdc20 subunit which itself can bind Pds1p is a good candidate to investigate the role of the
short linear motifs identified on the N-terminus of Pds1p (see below). It has recently been shown that
degradation of Pds1p is controlled by phosphorylation [77], and the N-terminus of Pds1p also contains
two phosphorylation sites for Cdk1, only one of which is conserved over evolution (Figure 11a),
indicating that the multisite regulation model developed for Sic1p [49] is unlikely to apply. The N-terminus of Pds1p also contains a phosphorylation site for Chk1 [78], which prevents cell-cycle
progression in the presence of DNA damage by preventing APC-dependent degradation of Pds1.
Therefore, we hypothesize that this disordered region integrates two signals in an ‘OR’ logic: prevent
Pds1 degradation due to spindle checkpoint activation or due to DNA damage (Figure 11a).
We will determine the mechanistic basis by which the disordered region can integrate the signals
from the two kinases, by performing NMR studies on the interaction of the N-terminus of Pds1 with
Cdc20 when either, neither or both of the Cdk1 and Chk1 sites are phosphorylated. Similarly, we will
perform these experiments when these signals have been removed by site-specific mutagenesis.
We have also identified the unstructured N-terminus of Mps1 (a highly conserved mitotic kinase)
as a second candidate for NMR analysis. This protein is a known target of the APC, and contains a D-box in yeast and human (refs) but these appear at very different locations in the protein (Figure 12c). In
this case we are interested in how the interaction of the unstructured region with the structured binding
partner changes when the short motif has changed location from one end of the unstructured region to
the other. To test this we will make MPS1 N-terminus constructs that contain KEN-boxes at each
location, as well as constructs that have no KEN-boxes and characterize these using NMR.
To confirm the disorder of the putative regions (as these have been identified as disordered by
DISOPRED [23]) we will use NMR spectroscopy to measure the 1HN-15N correlation spectra, which
indicate disorder when the amide proton chemical shift dispersion is narrow and line widths of peaks are
sharp. To test the effects of post-translational modification on the disordered regions, we will
reconstitute the modifications in vitro, and then measure the spectra of the unmodified and modified
forms. To determine which regions of the disordered protein interact with the regulator, we will use
titration experiments in which we will follow the changes in the intensity of the peaks. The transferred
cross-saturation (TCS) can also be used to map sites of interaction. We will repeat these experiments
after site-specific mutagenesis of the short conserved segments to understand their impact on the
regulatory interaction. Once the experimental set up has been developed, these experiments will be
performed by a postdoctoral fellow in the Forman-Kay lab over the second two years of the project.
Preliminary data: we have identified the N-terminus of Pds1p as a first candidate for NMR analysis.
Pds1p contains a highly conserved KEN-box and we have made site-specific mutations in it to test its
function (Figure 11). This construct will also be used to test the importance of the KEN-box for
interaction with the APC in vitro. We have also begun experiments on the N-terminus of Mps1. We
noticed that it also contains a KEN box, but that this motif is not conserved in its location and therefore
could not be detected by the phyloHMM. Nevertheless, we sought to confirm whether the putative
KEN-box is important for regulation of Mps1 stability. We have performed site-specific mutagenesis
experiments, and our preliminary results indicate that it leads to stabilization of the protein in a shut-off
experiment (Figure 12a,b). Fascinatingly, a KEN-box is found in a different location in this
unstructured region in Drosophila and vertebrates, but not in mammals (Figure 12c).
Anticipated problems and solutions:
It may not be possible to associate functions with all novel consensus sequences identified in the cluster
analysis. First, many of the new consensus sequences may be statistical artifacts of the clustering
procedure. To rule this out, we will perform the cluster analysis using multiple different
algorithms and parameter settings. Clusters that appear consistently are more likely to represent bona
fide patterns in the data. Second, because the cluster analysis of short conserved segments is unbiased,
some identified motifs will have no functions that have been studied in the lab and therefore we will
have no hypothesis to direct our tests of these motifs. We will therefore focus on motifs that do show a
statistical association with some known function. Thus, we are once again limited to the ‘low-hanging
fruit’ of novel consensus sequences that do show statistical enrichments. Encouragingly, in our
preliminary data we have already identified several good candidates (Figure 4) and therefore we will
have more than enough to analyze within the period of this proposal.
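One way to make "clusters that appear consistently" concrete is to keep only the clusters from a reference run that have a close match in every other run. The sketch below uses Jaccard overlap for this. The 0.5 cutoff, the function names and the representation of a cluster as a set of segment identifiers are all assumptions made for illustration.

```python
def jaccard(a, b):
    """Jaccard overlap between two collections of segment identifiers."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def consistent_clusters(reference, other_runs, min_jaccard=0.5):
    """Return the reference clusters that are matched in every other run.

    reference  : list of clusters (each a set of segment identifiers)
    other_runs : list of runs, each itself a list of clusters
    A cluster counts as matched in a run if at least one cluster in that
    run has Jaccard overlap >= min_jaccard with it (assumed criterion).
    """
    kept = []
    for cluster in reference:
        if all(any(jaccard(cluster, c) >= min_jaccard for c in run)
               for run in other_runs):
            kept.append(cluster)
    return kept
```

Clusters returned by `consistent_clusters` would then be the ones carried forward to the enrichment tests; other agreement measures could be substituted for Jaccard overlap.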
Not all types of interactions involving disordered regions will be amenable to in vitro analysis
using NMR. In particular, scaffold proteins that have large numbers of binding partners will not be
feasible. We will therefore focus on proteins where the interaction can be reconstituted in vitro.
Similarly, disordered proteins can be difficult to work with experimentally, and therefore we have
designed the research proposal so that we are not specifically tied to any single protein for our
experiments. For example, if we cannot purify the Pds1 N-terminus, or obtain reliable NMR data, we
will move on to the Mps1 N-terminus, or other interesting candidates identified from the bioinformatics
analysis.
IV. Significance
This project will systematically identify post-translational regulatory sequences (short linear motifs)
within intrinsically disordered regions (Specific Aim 1). This represents a new approach to cracking the
“code” that controls protein regulation. Furthermore, we will test whether the current models of
disordered region function can explain regulatory functions and characterize several examples in
mechanistic detail (Specific Aim 2). For example, we aim to understand how disordered regions can
integrate information, and how they can retain their function when short linear motifs change location
over evolution. Our studies will provide insight into the functions of disordered regions, which are found
in many important disease genes.