An evolutionary proteomics approach to identify short linear motifs in intrinsically disordered regions MOSES, Alan Research Proposal Requested: $137775

I. Overview

It is increasingly appreciated that many disease associated proteins contain regions of intrinsic disorder. However, relatively little is understood about the functions of these regions and it is not currently possible to predict the impact of mutations in these regions. We propose to analyze the function of disordered regions using a new systematic computational approach. We will then focus on specific examples to experimentally test the mechanisms of function of the identified disordered regions. This proposal represents a new approach to attack a difficult problem in protein biochemistry: the function of intrinsically disordered proteins.

II. Background and Motivation

As many as 50% of human proteins are thought to contain intrinsically disordered regions [1,2], including many important disease associated proteins such as p53 [3], BRCA1 [4] and CFTR [5]. Although important biological functions have been described for specific examples of disordered regions, little is known in general about the sequence-function relationship for most residues in these regions [6-10]. One established model is that disordered regions are important for protein regulation [1], and contain short linear motifs (short peptide sequences important for protein interaction [11]). We propose to apply this model systematically using the ‘comparative genomics’ paradigm, which exploits the observation that functional sequences are preferentially preserved over evolution [12-14]. Because the short linear motifs within disordered regions are important for function, they are expected to be preferentially conserved relative to flanking residues.
We and others have successfully exploited such evolutionary conservation to identify thousands of conserved regions in non-coding DNA [15,16], and these have been demonstrated to have important functions in transcriptional regulation [17]. Evolutionary methods have also been widely applied to protein sequences to detect remote homologues and to identify critical functional residues and motifs [18-21]. Here we propose to develop an evolutionary method to identify short conserved segments within disordered regions of proteins, as a means to identify the short linear motifs important for function. This application of ‘comparative genomics’ is (to our knowledge) novel. We will therefore devote considerable attention to confirming the effectiveness of the method in this new context. If successful, the methods we develop will represent a substantial contribution and will be generally applicable beyond the scope of this project.

Why an evolutionary approach for disordered regions?

For the vast majority of intrinsically disordered regions, specific biological functions remain unknown. Computational methods can readily identify the disordered regions based on amino acid sequence composition [22] (see [23] for review) and in principle, function could be predicted by previously developed computational methods to recognize short linear motifs [24-28] (reviewed in [29,30]). Unfortunately, matches to these motifs, typically only 3-8 amino acids long, are not statistically significant when search algorithms are applied at the proteome scale [31]; recent approaches have therefore resorted to relying on external functional data [32,33]. More general computational methods to predict interaction sites have also been developed [34-39] (reviewed in [40,41]), but these methods show little predictive power when structural information is not available [40]. They are therefore not applicable to the majority of disordered regions.
For example, ANCHOR [37] was developed to identify binding sites in intrinsically disordered regions, and discriminates between disordered binding sites and globular proteins with high accuracy [37]. Nevertheless, its predictions of binding sites are not specific enough to identify short functional motifs (see below). Evolutionary approaches can predict functional residues with great power [19,42], and given the extensive sequence databases becoming available for closely related species, their application to disordered regions is timely. An evolutionary approach will also yield hypotheses about the regulatory function of intrinsically disordered regions. Although some disordered regions become ordered upon binding, many are thought to remain in predominantly disordered conformations in vivo [43] and their functions often take advantage of this structural flexibility [44]. In addition, disordered regions may act as scaffolds for signaling proteins [45,46], switches for regulation of protein stability or may fine-tune biochemical activity [47-49]. These functions will influence the composition and evolution of the short linear motifs within the disordered region. We will analyze the functions of the disordered domains based on the short linear motifs they contain. Focusing on phosphorylation sites, we will confirm the functions of conserved motifs using site-specific mutagenesis and in vitro kinase assays. We will also explore the mechanisms of function of three selected disordered regions using NMR. We and others have previously applied NMR to explore the structural features of disordered regions in several proteins, including CFTR, Sic1, I-2, spinophilin and others [47-49] (reviewed in [50]).
The combination of detailed mechanistic studies with the genome-wide unbiased analysis of disordered regions will allow us to generalize beyond the small number of examples that can be characterized in detail.

III. Specific aims and research plan

Aim 1: Use evolutionary conservation to identify functional elements within disordered regions.
Aim 2: Experimentally analyze disordered regions containing predicted short linear motifs.

Please see Appendix 2 for a schematic outline of the proposal and Appendix 3 for information about completion of specific tasks.

Specific Aim 1. Identify conserved segments within disordered regions

Natural selection is expected to remove mutations in functionally important sequences, leading to slower evolution in short linear motifs. Indeed, we and others have recently shown that characterized short linear motifs are conserved relative to the flanking amino acid sequences [51-55]. Moreover, conservation of short linear motifs has already been used to search for examples of known motifs [20,21]. To exploit evolutionary conservation to systematically identify functional elements within disordered regions, we propose to use a state-of-the-art probabilistic model known as a 'phyloHMM' [56] to identify short stretches of amino acid sequence that are evolving more slowly than the immediately flanking sequences. Briefly, our phyloHMM follows previous work [56] by assuming that each column in a multiple sequence alignment (Figure 1a) can be classified into a ‘conserved’ state or a ‘background’ state; it then reports for each residue the posterior probability that it falls in the conserved state (Figure 1b). PhyloHMMs explicitly account for the similarity of sequences related by a phylogenetic tree and therefore can extract the maximum signal from the multiple alignments.
Because insertions and deletions are prevalent in disordered regions, we will modify the standard probabilistic models underlying the phyloHMM [57,58] to include an insertion/deletion process (in addition to the standard substitution process). These phyloHMMs will be implemented by Alex Nguyen Ba, a PhD student in the Moses lab.

Choosing species for the analysis: We will start with analysis of the budding yeast proteome. Because the computational methods rely on alignments of orthologous sequences, we will take advantage of the high-quality genome annotations and syntenic orthologs available for budding yeast from YGOB [59]. Proteins in budding yeast and its relatives usually contain only a single exon, and a single transcript. This makes gene prediction, identification and assignment of 1 to 1 relationships between protein coding sequences from multiple species relatively straightforward. Because of the much greater complexity of vertebrate gene structures and the large numbers of transcripts, the bioinformatics analysis of vertebrate proteins is more time-consuming. A graduate student will be recruited to the Moses Lab to extend and apply these methods to the human proteome. Meanwhile, we have focused our energy on the budding yeast model system.

Confirming the phyloHMM methodology: To our knowledge the application of phyloHMMs to disordered regions in proteins is novel. We will therefore perform two important tests of the methodology in application to alignments of yeast and vertebrate proteins: (i) confirm that “conserved” short segments are accurately aligned, (ii) confirm that when short linear motifs are actually conserved, the phyloHMM has the power to detect them. We will compute ROC curves showing true positive and false-positive rates as we vary the assumptions about the evolution of the short linear motifs and the surrounding sequences.
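The two-state classification described above (a ‘conserved’ state and a ‘background’ state, with a posterior probability reported per residue) can be illustrated with a generic forward-backward calculation. This is only a minimal sketch and not the proposed phyloHMM: the per-column likelihoods are taken as given inputs (in a phyloHMM they would come from the phylogenetic substitution and gap models), and the function name, the symmetric `stay` probability and the `prior_conserved` value are hypothetical choices made for illustration.

```python
def conserved_posteriors(lik_conserved, lik_background,
                         stay=0.9, prior_conserved=0.1):
    """Posterior probability of the 'conserved' state for each column.

    lik_conserved[i] / lik_background[i]: likelihood of alignment column i
    under each state (assumed to be computed elsewhere).
    stay: hypothetical probability of keeping the same state between columns.
    prior_conserved: hypothetical probability of starting in the conserved state.
    """
    n = len(lik_conserved)
    # State 0 = conserved, state 1 = background.
    emit = list(zip(lik_conserved, lik_background))
    trans = [[stay, 1.0 - stay], [1.0 - stay, stay]]
    init = [prior_conserved, 1.0 - prior_conserved]

    # Forward pass, rescaled at every column to avoid numerical underflow.
    fwd = []
    for i in range(n):
        if i == 0:
            cur = [init[s] * emit[0][s] for s in (0, 1)]
        else:
            prev = fwd[-1]
            cur = [emit[i][s] * sum(prev[r] * trans[r][s] for r in (0, 1))
                   for s in (0, 1)]
        total = sum(cur)
        fwd.append([c / total for c in cur])

    # Backward pass, rescaled in the same way.
    bwd = [[1.0, 1.0] for _ in range(n)]
    for i in range(n - 2, -1, -1):
        cur = [sum(trans[s][r] * emit[i + 1][r] * bwd[i + 1][r] for r in (0, 1))
               for s in (0, 1)]
        total = sum(cur)
        bwd[i] = [c / total for c in cur]

    # Combine both passes and normalise per column.
    posteriors = []
    for f, b in zip(fwd, bwd):
        conserved = f[0] * b[0]
        background = f[1] * b[1]
        posteriors.append(conserved / (conserved + background))
    return posteriors


# Toy input: three central columns fit the conserved state much better.
toy_conserved = [0.05, 0.05, 0.9, 0.9, 0.9, 0.05, 0.05]
toy_background = [0.5, 0.5, 0.05, 0.05, 0.05, 0.5, 0.5]
toy_posteriors = conserved_posteriors(toy_conserved, toy_background)
```

Columns whose posterior exceeds a chosen threshold would then be reported as a short conserved segment; the threshold itself is a separate choice.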
(i) Confirm alignment accuracy: Due to the rapid evolution of intrinsically disordered regions, truly conserved short linear motifs may be aligned incorrectly in the multiple sequence alignments. We will perform a series of simulations of molecular evolution of disordered regions. In these simulated proteins, we can insert conserved protein domains, as well as conserved short linear motifs. Because we know the short linear motifs in these simulated proteins are conserved, we can test whether the aligner can correctly align them. Preliminary results: In our simulations, for typical motif evolutionary rates, e.g., 25% of background, we find that ~95% of the artificial motifs are correctly aligned (Figure 2a).

(ii) Confirm phyloHMM power: We will run our phyloHMM on the simulated alignments and test whether it can identify the short linear motifs that we know are conserved. For a more realistic measure of power, we will also curate a set of experimentally characterized short linear motifs from the literature and identify which of these are conserved. We will then run our phyloHMM method on the proteins that contain these bona fide short linear motifs and test whether it can correctly identify them. Preliminary results: Our preliminary analysis based on simulated proteins indicates that using the phyloHMM we can identify ~50% of conserved motifs as short as 2 amino acids long (false negative rate = 50%, Figure 2b), with false positive rates of less than 1 per 2500 amino acids (Figure 2c). By comparison, we also applied ANCHOR [37] to simulated disordered proteins. Even when we planted no short linear motifs in these sequences, it predicted 47% (±9%) of amino acids to be within protein binding sites. Figure 1b, Figure 6 and Figure 10 show direct comparisons of the phyloHMM and ANCHOR.
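The power tests on simulated proteins amount to comparing per-residue scores against known planted motifs. A minimal sketch of how ROC points could be computed from such data is given below. It assumes one posterior score per residue and a label of 1 for residues inside a planted motif and 0 otherwise; the function names and the toy data are hypothetical and are not taken from the proposal.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs.

    scores: per-residue scores, e.g. posterior probabilities.
    labels: 1 if the residue lies in a planted motif, else 0.
    One point is produced per distinct score used as a threshold.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        true_pos = sum(1 for s, y in zip(scores, labels)
                       if s >= threshold and y == 1)
        false_pos = sum(1 for s, y in zip(scores, labels)
                        if s >= threshold and y == 0)
        points.append((false_pos / negatives, true_pos / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under a ROC curve given as ordered (fpr, tpr) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy data: both motif residues score above both background residues.
toy_scores = [0.9, 0.8, 0.3, 0.1]
toy_labels = [1, 1, 0, 0]
toy_curve = roc_points(toy_scores, toy_labels)
toy_auc = area_under_curve(toy_curve)
```

Repeating this for simulations run under different assumed motif and background rates gives one curve per set of evolutionary assumptions.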
To test the predictive power of the phyloHMM on real short linear motifs, we curated from the yeast literature 106 experimentally characterized short linear motifs that are preserved in at least 90% of the species we used for multiple alignments [59]. When we analyze this data set, we find that 68% are identified by the phyloHMM at a posterior probability threshold of 0.6, for a false negative rate of 32%. Despite the high false negative rate, these preliminary results indicate that it is possible to identify short conserved segments in unstructured regions with the characteristics of short linear motifs.

Genome-wide identification of conserved segments in yeast and human disordered regions: Once we have confirmed that the phyloHMM method can be applied to identify conserved segments in disordered regions, we will systematically apply it to alignments of all proteins from budding yeast (and related fungi [59]) and human (and related vertebrates [60]). We will then identify the disordered regions and protein domains in the human and yeast proteins (using DISOPRED [23] and Pfam [61]) and collect the short conserved segments that fall within disordered regions, but do not match any known protein domains. To estimate the expected number of false positives identified in the cluster analysis we will repeat the clustering on the motifs identified in simulated disordered proteins. These computational studies will be performed by a PhD student in the Moses lab in the first years of the project. To our knowledge, this analysis will represent a totally novel approach to identify (and quantify) the functional content of disordered regions. Preliminary results: Applying our phyloHMM to alignments of the entire yeast proteome yields >7000 short conserved segments, of which we expect only ~200 to be false positives.
This provides (to our knowledge) the first unbiased characterization of the amount of functional amino acid sequence within intrinsically disordered regions. At the residue level, this indicates that at least ~5% of the amino acids in intrinsically disordered regions are under specific evolutionary constraints and are therefore very likely to have biological functions. To confirm that the short conserved segments are indeed short linear motifs, we have searched for conserved segments that match known short linear motifs. For example, the FG motif is found in disordered regions of nuclear pore complex (NPC) proteins [67]. Of the 30 components of the NPC, 13 are known to contain FG repeats, which are thought to be biologically important for the nuclear import and export of proteins [67]. To test whether the phyloHMM approach can identify these, we searched the conserved segments for those that matched the FG motif. We found only 59 proteins that contained conserved FG motifs, including 12 of the 13 previously known examples (Figure 3b). Since there are 3438 proteins for which alignments are available, this represents a highly significant enrichment: 12/59 vs 13/3438, P-value = 7.21 x 10^-16. In another test of our method, we applied a similar statistical analysis to the set of conserved elements containing the canonical phosphorylation site consensus sequence (S/T-P-x-R/K) of the Cdc28 kinase. Of 695 proteins tested in a high-throughput in vitro kinase assay [68], our phyloHMM identifies 40 proteins containing a short conserved sequence which matches the Cdc28 consensus sequence. Of those, 32 (80%) were found to be positive in that assay, which is a highly significant enrichment (32/40 vs 185/695, P-value = 1.4 x 10^-11). Of the 8 remaining proteins that contain conserved consensus sites, but were not identified as positives in the assay, one (Cdc15p [69]) is a known substrate of Cdc28p that was negative in the assay.
Two others are known substrates of the Pho85p kinase (Rim15p [70]) and the Fus3p kinase (Fus2p [71]), both of which can phosphorylate the canonical Cdc28p consensus sequence, indicating that the phyloHMM can identify bona fide phosphorylation sites that are missed in the kinase assays. Similarly, we have been able to identify new examples of other known short linear motifs, such as phosphorylation sites for Cbk1 (see Specific Aim 2) and the KEN box (Figure 3d), using the phyloHMM method. These results are very encouraging, and indicate that evolutionary conservation in disordered regions is a strong predictor of biological function. These preliminary data indicate that our phyloHMM can identify thousands of short conserved segments in the disordered regions of the budding yeast proteome, and when these conserved segments match known short linear motifs they are likely to perform the predicted biological function.

Identification of novel short linear motifs: Interestingly, despite the strong statistical results regarding previously known short linear motifs, we note that most of the identified conserved segments do not match any known short linear motif [11]. We hypothesize that these conserved segments represent examples of previously unrecognized short linear motifs. We will therefore attempt to discover these motifs by identifying families of conserved segments with similar amino acid sequences. We will use graph-based clustering methods to identify groups of conserved segments based on sequence similarity. These “clusters” may represent known and novel short linear motifs. To test whether they are associated with biological function, we will compare the number of proteins in the cluster with a given functional annotation to the number of proteins with that annotation expected by chance [63-65].
Statistical overrepresentation (or enrichment) of a particular function leads us to propose that function for the novel motif. Please see preliminary results below for examples of this analysis, which we consider a particularly exciting aspect of our proposal: it will allow us to assign putative functions to completely novel short linear motifs. As above, we will first apply this to budding yeast, and then extend the analysis to alignments of vertebrate proteins. Preliminary results: To associate the thousands of novel conserved segments with functions, we have performed a graph-based clustering of conserved segments identified in budding yeast alignments using MCODE [62]. In addition to identifying many known motifs (Figure 3), the cluster analysis revealed hundreds of new consensus sequences that were not previously recognized as short linear motifs. Within these clusters, we have 30 unknown consensus sequences with 20 or more conserved examples in the yeast proteome. Although many of the novel consensus sequences identified by the cluster analysis might be due to random noise, several of these new putative consensus sequences are statistically associated with biological functions and are therefore very likely to represent novel, previously unrecognized short linear motifs (see Specific Aim 2 below). These include a previously unreported DSF motif that is associated with amino acid permeases (6/8 vs. 36/3438, P-value = 2.4 x 10^-11), an NPY motif associated with vesicle and nuclear membrane proteins (7/12 vs. 419/3438, P-value = 1.7 x 10^-5) and an FxFP motif statistically enriched in proteins that physically interact with Cbk1 (Figure 4). Preliminary data (see Specific Aim 2 below) indicate that this motif is a bona fide interaction motif for Cbk1.
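Enrichment comparisons of the form "k/n vs. K/N" used above can be evaluated with an upper-tail hypergeometric probability. The proposal does not state which test produced the reported P-values, so the following is only a sketch under the assumption that a one-sided hypergeometric test is appropriate; it need not reproduce the reported numbers. The function and parameter names are hypothetical.

```python
from math import comb


def enrichment_p_value(hits_in_cluster, cluster_size,
                       hits_in_proteome, proteome_size):
    """Upper-tail hypergeometric probability.

    Probability of drawing at least `hits_in_cluster` annotated proteins when
    `cluster_size` proteins are sampled without replacement from a proteome of
    `proteome_size` proteins, of which `hits_in_proteome` carry the annotation.
    """
    all_draws = comb(proteome_size, cluster_size)
    max_hits = min(cluster_size, hits_in_proteome)
    tail = sum(
        comb(hits_in_proteome, k)
        * comb(proteome_size - hits_in_proteome, cluster_size - k)
        for k in range(hits_in_cluster, max_hits + 1)
    )
    # Python integers are exact, so only the final division is rounded.
    return tail / all_draws


# Counts from the DSF motif example: 6 of 8 cluster proteins are amino acid
# permeases, against 36 permeases among 3438 aligned proteins.
dsf_p = enrichment_p_value(6, 8, 36, 3438)
```

When many clusters and annotations are tested together, a multiple-testing correction would also be needed; that step is not shown here.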
Beyond the phyloHMM: While phyloHMMs represent a novel and potentially powerful approach to identify functional elements in disordered regions, they are limited to those conserved segments that are preserved at the same location in all species. In fact, preliminary data indicate this is not the case for many bona fide short linear motifs. For example, we have performed extensive searches of the literature to identify 530 experimentally characterized short linear motifs in budding yeast. Of these, 305 were found in disordered regions, but only 106 were conserved in multiple alignments when inspected by eye (e.g., Figure 5a). Therefore, the “conserved” segments in disordered regions will represent only a subset of the functional elements in these regions. While our phyloHMM methods can identify this subset with great power, they are limited to this ‘low-hanging fruit’ of alignment conservation. To address this, we will also develop methods to identify motifs that are retained over evolution but are not conserved in the alignments. We refer to this type of conservation as “alignment-free conservation”. To detect it, we will consider matches to a consensus motif occurring according to a birth-death process with rates specified by a “background” amino acid substitution process that has no specific selection to retain matches to the consensus. If motif matches are retained over evolution beyond what is expected based on this process, we can infer that selection has acted to preserve them (Figure 5b). To apply this approach when we do not know the consensus in advance, we will scan along the alignment and test each short sequence as a potential motif. Preliminary results: We have implemented an algorithm that searches for “alignment-free conservation” of matches to consensus motifs.
We compute the distribution of the number of motif matches in orthologous sequences based on the extant sequence in a reference species (budding yeast) under a background model with no selection pressure to retain them (Figure 5b). When we search the yeast proteome using this algorithm and known phosphorylation consensus sequences, we can find significant enrichment of substrates for multiple kinases (Figure 6b). In addition, we have identified proteins where consensus sites are not conserved in the alignments, but are very likely to represent novel substrates. For example, at least 4 proteins in the DNA damage signaling pathway contain highly conserved clusters of non-aligned matches to the Mec1 consensus motif (Figure 6a). Since Mec1 directly targets 4 other proteins in this pathway, we believe our novel predictions are very promising (Figure 6a). Indeed, proteins with evidence for alignment-free conservation of the Mec1 consensus are significantly enriched for “response to DNA damage stimulus” proteins (P-value = 2.1 x 10^-9).

Anticipated problems and solutions: To our knowledge phyloHMMs have not yet been employed for analysis of disordered regions, but they have been successfully applied to DNA sequences, leading to the widely used phastCons 'conservation tracks' at the UCSC genome browser [66]. Since protein evolution is much more heterogeneous than DNA evolution, if the method employed for DNA sequences (i.e., searching for the most highly conserved regions [56]) were applied to proteins, it would identify slowly evolving regions (or structural domains) rather than short linear motifs. Similarly, if applied naively, our methods that do not rely on strict positional conservation would simply identify motif matches that happen to occur in highly conserved regions of proteins.
To address this potential problem, we use a ‘local’ rate of evolution against which we compare the expected pattern of evolution. For the phyloHMM, the background rate is estimated using a 20 amino acid sliding window across the alignment, and the conserved rate is estimated by taking the maximum likelihood estimate of the rate at that position, up to a maximum of 1/3 of the background rate. For our alignment-free method, we use a maximum likelihood estimate of the local evolutionary distance in a window surrounding the consensus sequence we are testing (Figure 5b). For the phyloHMM we also filter out known protein domains (using Pfam [61]). Another important technical challenge is the large number of insertions and deletions that are found within alignments of disordered regions. Because insertions and deletions violate the assumption that each column in an alignment is independent, standard probabilistic phylogenetic models treat substitutions only. We sought to model the “gaps” with a compromise between computational feasibility and biological realism. To do so, we divide the multiple sequence alignment into “blocks” of constant gap size (illustrated as black vertical lines around grey shaded areas in Figure 1a,c). While each block does not necessarily include only one insertion or deletion event, this is a much better approximation than treating the columns independently. We can then compute the probability of each block based on the size of the gap and its distribution on the phylogenetic tree (Figure 1d, eq. 1). We will assume the gap process and substitution process are independent, so that the likelihood is simply the product of the two (Figure 1d, eq. 2). This model allows relatively simple parameter estimation by numerically maximizing the likelihood.

Specific Aim 2.
Experimental analysis of predicted short linear motifs and disordered regions

Because application of the phyloHMM to disordered regions is novel, to demonstrate the power of the methodology we will confirm several predictions using site-specific mutagenesis, in vitro kinase assays and fluorescence microscopy. We will then turn to more detailed NMR analysis to test hypotheses about the mechanisms of regulation mediated by the disordered regions.

2a) Experimental confirmation of computational predictions.

To show that the evolutionarily conserved segments in disordered regions actually represent bona fide short linear motifs, we will test specific examples experimentally in budding yeast, including (i) novel examples of previously known consensus sequences and (ii) novel consensus sequences identified in our cluster analysis. These experiments will provide empirical support for the evolutionary proteomics approach to identifying functional elements within intrinsically disordered regions. Indeed, our preliminary data below indicate that when tested, the short conserved segments identified by the phyloHMM are likely to function as short linear motifs.

(i) New examples of known motifs. To demonstrate that our methods can predict novel phosphorylation sites for known protein kinases we will perform in vitro kinase assays. We will focus on substrates of the NDR/LATS family protein kinase Cbk1 in budding yeast. This family of kinases is important for determination of cell-growth and development and is conserved between yeast and humans [72], but few direct substrates have been identified. We will search the short conserved segments and identify matches to the Cbk1 consensus.
The disordered regions that contain these consensus sites will be tested in in vitro kinase assays with Cbk1 purified from yeast cells in collaboration with Eric Weiss’ lab at Northwestern University (Figure 7). To confirm that the in vitro phosphorylation is due to the Cbk1 kinase, we will repeat these experiments in cells without the Cbk1 kinase activity (Figure 7). We aim to identify and test 5 novel Cbk1 substrates in collaboration with the Weiss lab in the second year of the proposal. While this will not be an exhaustive enumeration of substrates, it will be a large enough number to demonstrate that our methods can identify new substrates, and we will not be particularly wedded to any individual protein if we find that it is difficult to purify or work with experimentally. Preliminary data: We identified conserved matches to the Cbk1 consensus [80] in the N-terminus of Sec3p, which is predicted to be disordered. In collaboration with Eric Weiss’ group at Northwestern University, we made alanine mutations in the predicted phosphorylation sites (Figure 7d) and subjected the Sec3p N-terminus to an in vitro phosphorylation assay. Preliminary results (Figure 7e) indicate that the N-terminus of Sec3p is a very good substrate for Cbk1 in vitro and that phosphorylation of this protein is also important in vivo (data not shown). We have also identified additional candidate Cbk1 substrates: Fir1 and Tao3, which were reported to physically interact with Cbk1 (refs), and Mpt5, an RNA-binding protein (Figure 7f-h) like Ssd1, a known Cbk1 substrate.

In our preliminary cluster analysis of conserved segments in disordered regions (Figure 3a), we identified a cluster corresponding to the KEN-box degradation signal recognized by the APC^Cdc20. Only 10 proteins had a short conserved segment matching the KEN sequence.
Eight of those contained an experimentally verified KEN degradation signal [81-83], were characterized targets of the APC^Cdc20 [84,85] or were cyclins, one of which contains a verified KEN sequence [83]. The two remaining conserved segments matching the KEN signal (in Spt21p and Sgd1p) have not been associated with the APC or known to show cell-cycle regulated degradation. Spt21p is a protein involved in regulating histone transcription and its transcription is cell-cycle regulated [86,87]. Furthermore, over-expression of Spt21p has been shown to be toxic [88], suggesting a requirement for tight control on protein levels. We therefore decided to test if the identified KEN sequence in Spt21p was truly a degradation signal. We first tested if protein abundance was cell-cycle regulated and found that Spt21p protein levels coincide with Clb2p protein levels (Figure 8b), indicating that, as at the level of mRNA, Spt21p protein levels vary over the cell cycle. Given the toxicity of over-expression, we reasoned that if the KEN sequence is a biologically relevant degradation signal, then over-expression of a KEN-mutant form of Spt21p would be more toxic than a wt form. We therefore mutated the KEN-box to alanines and observed that growth was more severely affected by over-expression of the mutant than of the wt (Figure 8d). This phenotypic effect is consistent with the hypothesis that the evolutionarily conserved sequence is important. Finally, to test the stability of the Spt21p KEN-mutant protein, we assayed protein levels of wt and mutant Spt21p by over-expressing the protein with the GAL promoter and then shutting off both transcription and translation. Indeed, we observed that the mutant protein levels remained high, while the wt degraded over time (Figure 8e).

(ii) Assign functions to novel motifs.
Our cluster analysis will identify a large number of new consensus sequences that have not been previously recognized. In Specific Aim 1 we will associate these with putative functions through statistical overrepresentation of functional annotation in the cluster [63-65]. To test these putative functions, we will make site-specific mutations in the novel motifs. For example, proteins containing conserved FxFP motifs were associated with physical interactions with Cbk1. We will therefore test whether mutations in this short peptide can disrupt interactions with this kinase. Similarly, a previously unreported DSF motif was associated with amino acid permeases. We hypothesize that this motif will be important for targeting of the permeases to the correct localization. We will therefore make N-terminal GFP fusion proteins and follow their localization using fluorescence microscopy before and after mutagenesis of this motif. We will also test whether mutations in this motif impair permease function by measuring cell growth in auxotrophic strains. These experiments will be performed by a technician in the Moses Lab throughout the project. Preliminary data: One of the novel motifs identified in our cluster analysis was an FxFP motif that was statistically enriched in proteins that were found to interact with Cbk1 in high-throughput studies (Figure 4d “x” [89,90]). We noted that this motif resembled the docking motif that has been reported for MAPKs, so we decided to test whether this motif represented a novel docking site for the Cbk1 kinase. To test this, in collaboration with Brian Yeh in Eric Weiss’ lab, we expressed short peptide fragments containing the conserved segment and tested whether they could bind the Cbk1 kinase domain in an amylose resin pull-down assay (Figure 4e). Amazingly, all 6 of these peptides showed binding in this assay, with 4/6 showing strong binding. This confirms that this short motif can mediate direct interactions with the Cbk1 kinase domain. 
These preliminary data suggest that this motif is a novel docking site for this kinase and support the idea that the novel patterns identified through cluster analysis represent bona fide short linear motifs. 2b) Testing models for the mechanisms of regulation. Several models have been proposed for the mechanistic function of disordered regions. Perhaps most familiar is the function of disordered proteins as scaffolds for signaling and protein complex formation (Figure 9a [44]). Scaffolds are an important mechanism to ensure specificity in cell signaling because they bring otherwise potentially promiscuous enzymes into close physical proximity. For example, MAPKs form canonical three-kinase signaling cascades, where each kinase has great specificity for the next. However, this specificity is not always encoded in the direct interactions between kinases, but rather by scaffold proteins that physically link each kinase to the next. This is one important mechanism by which kinases can be reused in multiple signaling pathways while avoiding “cross-talk”. A second well-characterized mechanism for disordered regions is the “multisite regulation” model, in which multiple regulatory sites in disordered regions are important for ultrasensitive “switch-like” responses (Figure 9b [49,73]). The paradigmatic example of this mechanism is the cell-cycle regulator Sic1p, which undergoes multi-site phosphorylation to ensure a switch-like exit from G1 phase. Recently, we have demonstrated that although the N-terminus of Sic1p is disordered, it adopts a very compact three-dimensional conformation and interacts with its binding partner (Cdc4) in many rapidly alternating states (Figure 10 [49]). Thus, in the case of Sic1p, the flexibility of the disordered protein seems critical to allow all of the sites to interact with Cdc4, providing the mechanistic basis for the switch-like regulation [74].
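To illustrate why several weak sites can produce a switch-like response, the following is a minimal sketch and not the model used in [49,73]. It assumes that each site is phosphorylated independently with the same probability p, and that recognition by the binding partner requires at least a threshold number of phosphorylated sites. The independence assumption, the threshold of 6 and the function name are illustrative assumptions; the default of 9 sites follows the number of weak Cdc4 binding sites reported for the Sic1p N-terminus [73,76].

```python
from math import comb


def fraction_recognised(p, n_sites=9, threshold=6):
    """Probability that at least `threshold` of `n_sites` sites are phosphorylated.

    Assumes (for illustration only) that every site is phosphorylated
    independently with the same probability `p`, so the number of
    phosphorylated sites follows a binomial distribution.
    """
    return sum(
        comb(n_sites, k) * p**k * (1 - p) ** (n_sites - k)
        for k in range(threshold, n_sites + 1)
    )


if __name__ == "__main__":
    # Compare a single-site response (proportional to p) with the
    # thresholded multisite response as kinase activity (p) increases.
    for p in (0.2, 0.4, 0.6, 0.8):
        print(f"p={p:.1f}  single site={p:.3f}  "
              f"multisite={fraction_recognised(p):.3f}")
```

Under these assumptions a single site responds linearly to p, whereas the thresholded response stays close to zero at low p and rises steeply at higher p, which is the qualitative behaviour the multisite regulation model describes.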
Finally, disordered regions can integrate signals from multiple binding partners (Figure 9c, [47]). Regulatory proteins and internal binding domains can compete for binding sites within a protein to activate or inhibit its biochemical function. For example, CFTR, the protein mutated in cystic fibrosis [47], contains a large disordered region known as the R-region that is important for controlling the activity of the CFTR channel [47]. The R-region contains multiple binding sites for interaction partners (such as NBD1) whose binding propensity is modulated by phosphorylation. Depending on the propensity of these binding sites for their interacting proteins, they will be either highly bound or largely free. These multiple binding sites in the disordered region therefore integrate multiple signals to control activation of the channel. Because of their intrinsic structural flexibility, disordered proteins can have many more binding partners than ordered proteins and are therefore well suited to these types of functions [44]. All of these models make predictions about the organization and evolution of the functional sequences within the disordered regions. Based on the patterns of conserved segments within the disordered regions identified by the phyloHMM (Specific Aim 1), we will analyze the mechanism of function by (i) integrating high-throughput functional data with our phyloHMM predictions to test these models systematically and (ii) testing examples of specific proposed mechanisms using NMR studies. (i) Statistical tests of models for disordered region function. The scaffold model predicts that disordered proteins will contain large numbers of short conserved protein binding motifs and show a large number of protein-protein interactions with related biochemical functions.
For example, Las17p contains 7 conserved putative SH3 binding sites, and binds to 9 SH3-containing proteins in high-throughput studies [75], many of which have functions related to actin assembly. To test this statistically, we will randomly permute the large-scale protein interaction data for each protein and test for an excess of proteins with many conserved binding motifs associated with a large number of protein interactions. The multisite regulation model predicts that proteins will have a large number of conserved motifs that match the same consensus sequence, and interact with only a single regulator that recognizes that consensus. For example, the N-terminus of Sic1p contains 9 weak binding sites for Cdc4 [73,76] that contribute to switch-like degradation of this protein. Most of these weak sites are highly conserved, and therefore readily identified by the phyloHMM (Figure 1). To test for this statistically, we will compare the number of proteins with a large number of the same conserved motif and a small number of protein interactions to that expected if the protein motifs were distributed randomly amongst the disordered regions. Finally, the integrator model predicts that disordered regions will contain many different conserved segments and interact with multiple regulators that correspond to those short linear motifs. To test this statistically, we will count the number of proteins with multiple different conserved motifs that interact with multiple different corresponding regulators and compare this to the number expected in permuted data. For each statistical test above we will identify whether the observed distribution provides support for that model. In addition, we will identify the examples of unstructured regions that show the properties expected under each of the models. A graduate student in the Moses Lab will work on this analysis throughout the duration of the project. (ii) NMR studies of key examples.
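As a reference point for the statistical analysis in (i), which supplies the candidates for the NMR work, the permutation test for the scaffold model could be sketched as follows. This is a minimal sketch under stated assumptions: the function names, the cutoffs that define "many" motifs and "many" interactions, the number of permutations and the toy counts are all hypothetical and are not taken from our data. The interaction counts are shuffled across proteins, and the statistic is the number of proteins that exceed both cutoffs.

```python
import random


def count_joint(motifs, interactions, motif_cut, interaction_cut):
    """Number of proteins with many conserved motifs AND many interactions."""
    return sum(
        1
        for m, i in zip(motifs, interactions)
        if m >= motif_cut and i >= interaction_cut
    )


def permutation_test(motifs, interactions, motif_cut=5, interaction_cut=5,
                     n_perm=2000, seed=0):
    """Return (observed count, empirical one-sided p-value).

    The null distribution is built by shuffling interaction counts across
    proteins, which breaks any association with motif counts while keeping
    both marginal distributions fixed.
    """
    rng = random.Random(seed)
    observed = count_joint(motifs, interactions, motif_cut, interaction_cut)
    shuffled = list(interactions)
    as_extreme = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if count_joint(motifs, shuffled, motif_cut, interaction_cut) >= observed:
            as_extreme += 1
    # Add-one correction so the empirical p-value is never exactly zero.
    return observed, (as_extreme + 1) / (n_perm + 1)


if __name__ == "__main__":
    # Hypothetical counts for 12 proteins (not real data).
    toy_motifs = [7, 6, 8, 1, 0, 2, 1, 0, 1, 2, 0, 1]
    toy_interactions = [9, 8, 7, 1, 2, 0, 1, 1, 0, 2, 1, 0]
    obs, p_value = permutation_test(toy_motifs, toy_interactions)
    print(f"observed = {obs}, empirical p = {p_value:.4f}")
```

The same skeleton would apply to the multisite regulation and integrator models by swapping the statistic computed in `count_joint`, for example counting proteins with many copies of one motif and few interaction partners.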
The statistical analysis will identify candidate regulatory mechanisms for many disordered regions and we will choose three to be tested further in NMR studies. We will determine which proteins are of most interest based on the short conserved segments within the disordered regions, and on analysis of other functional data regarding each protein. In particular, we will choose proteins where regulatory interactions with short linear motifs in the disordered regions have been established or can be confidently inferred, and can be reconstituted in vitro. Previously, we have performed extensive NMR studies on the intrinsically disordered N-terminus of Sic1p [49,74,76]. In those studies we used purified Cln2-Cdc28 to test the effects of phosphorylation, and found that both phosphorylated and unphosphorylated forms are clearly disordered [76]. Our studies revealed that this protein interacts with Cdc4 through a large number of alternating conformations (Figure 10), and that this regulation is controlled by multi-site phosphorylation of the Cdc4 binding sites by Cdk1 [76]. Interestingly, even though Sic1p is intrinsically disordered, we have found that it is fairly compact [76,79]. We hypothesize that compactness is important for the electrostatic component of the interaction with Cdc4, and we will test whether additional disordered proteins are also more compact than expected, particularly when they appear consistent with the multisite regulation model (Figure 9b). We propose to study the mechanism of function of the disordered N-terminus of Pds1p, which is critical for degradation of this protein by APC-Cdc20 [77] and is regulated by the kinases Cdk1 and Chk1 [77,78].
Pds1p (known as securin in mammals) is a conserved regulator of anaphase entry, which prevents cell cycle progression in the presence of spindle checkpoint activation [91]. It is targeted by the APC for degradation at the onset of anaphase, and the N-terminus of Pds1p contains a D-box that is important for regulation by the APC [91]. The APC is a large, multi-subunit ubiquitin ligase that regulates the cell cycle using two activating subunits, Cdc20 and Cdh1/Hct1. Cdc20 recruits substrates to the APC by binding directly to degradation motifs (D-box and KEN-box). Though the APC is too large for NMR studies, the Cdc20 subunit, which can itself bind Pds1p, is a good candidate for investigating the role of the short linear motifs identified in the N-terminus of Pds1p (see below). It has recently been shown that degradation of Pds1p is controlled by phosphorylation [77]. The N-terminus of Pds1p contains two phosphorylation sites for Cdk1, only one of which is conserved over evolution (Figure 11a), indicating that the multisite regulation model developed for Sic1p [49] is unlikely to apply. The N-terminus of Pds1p also contains a phosphorylation site for Chk1 [78], which prevents cell-cycle progression in the presence of DNA damage by preventing APC-dependent degradation of Pds1p. Therefore, we hypothesize that this disordered region integrates two signals in an ‘OR’ logic: prevent Pds1p degradation due to spindle checkpoint activation or due to DNA damage (Figure 11a). We will determine the mechanistic basis by which the disordered region integrates the signals from the two kinases by performing NMR studies on the interaction of the N-terminus of Pds1p with Cdc20 when either, neither or both of the Cdk1 and Chk1 sites are phosphorylated. Similarly, we will perform these experiments when these signals have been removed by site-specific mutagenesis.
We have also identified the unstructured N-terminus of Mps1 (a highly conserved mitotic kinase) as a second candidate for NMR analysis. This protein is a known target of the APC, and contains a D-box in yeast and human (refs), but these appear at very different locations in the protein (Figure 12c). In this case we are interested in how the interaction of the unstructured region with the structured binding partner changes when the short motif has moved from one end of the unstructured region to the other. To test this we will make Mps1 N-terminus constructs that contain KEN-boxes at each location, as well as constructs that have no KEN-boxes, and characterize these using NMR. To confirm that the regions predicted to be disordered by DISOPRED [23] are indeed disordered, we will use NMR spectroscopy to measure the 1HN-15N correlation spectra, which indicate disorder when the amide proton chemical shift dispersion is narrow and the peaks are sharp. To test the effects of post-translational modification on the disordered regions, we will reconstitute the modifications in vitro, and then measure the spectra of the unmodified and modified forms. To determine which regions of the disordered protein interact with the regulator, we will use titration experiments in which we will follow the changes in the intensity of the peaks. Transferred cross-saturation (TCS) experiments can also be used to map sites of interaction. We will repeat these experiments after site-specific mutagenesis of the short conserved segments to understand their impact on the regulatory interaction. Once the experimental setup has been developed, these experiments will be performed by a postdoctoral fellow in the Forman-Kay lab over the second two years of the project.
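The analysis of the titration experiments can be sketched as follows. This is a minimal sketch under stated assumptions rather than our analysis pipeline: it assumes that per-residue peak intensities have already been extracted from the spectra, that a loss of intensity on adding the binding partner marks residues involved in the interaction, and that a ratio below 0.5 is a useful cutoff. The function names, the cutoff and the toy intensities are hypothetical.

```python
def intensity_ratios(free, bound):
    """Per-residue ratio of peak intensity with partner to intensity without.

    `free` and `bound` map residue number -> peak intensity. Residues that
    are missing from either spectrum, or have zero reference intensity,
    are skipped.
    """
    return {
        res: bound[res] / free[res]
        for res in free
        if res in bound and free[res] > 0
    }


def attenuated_segments(ratios, cutoff=0.5):
    """Group residues with ratio below `cutoff` into runs of consecutive residues.

    Returns a list of (first_residue, last_residue) tuples.
    """
    hits = sorted(res for res, ratio in ratios.items() if ratio < cutoff)
    segments = []
    for res in hits:
        if segments and res == segments[-1][1] + 1:
            segments[-1][1] = res  # extend the current run
        else:
            segments.append([res, res])  # start a new run
    return [tuple(seg) for seg in segments]


if __name__ == "__main__":
    # Hypothetical intensities for residues 10-17 (not real data).
    free = {10: 100.0, 11: 95.0, 12: 110.0, 13: 105.0,
            14: 98.0, 15: 102.0, 16: 99.0, 17: 101.0}
    bound = {10: 96.0, 11: 90.0, 12: 99.0, 13: 40.0,
             14: 22.0, 15: 35.0, 16: 88.0, 17: 97.0}
    print(attenuated_segments(intensity_ratios(free, bound)))
```

Segments identified in this way could then be compared with the short conserved segments predicted by the phyloHMM, and the comparison repeated for the mutant constructs.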
Preliminary data: We have identified the N-terminus of Pds1p as a first candidate for NMR analysis. Pds1p contains a highly conserved KEN-box and we have made site-specific mutations in it to test its function (Figure 11). This construct will also be used to test the importance of the KEN-box for interaction with the APC in vitro. We have also begun experiments on the N-terminus of Mps1. We noticed that it also contains a KEN-box, but that this motif is not conserved in its location and therefore could not be detected by the phyloHMM. Nevertheless, we sought to test whether the putative KEN-box is important for regulation of Mps1 stability. We have performed site-specific mutagenesis experiments, and our preliminary results indicate that mutation of the KEN-box leads to stabilization of the protein in a shut-off experiment (Figure 12a,b). Fascinatingly, a KEN-box is found in a different location in this unstructured region in Drosophila and vertebrates, but not in mammals (Figure 12c). Anticipated problems and solutions: It may not be possible to associate functions with all novel consensus sequences identified in the cluster analysis. First, many of the new consensus sequences may be statistical artifacts of the clustering procedure. To rule this out, we will perform the cluster analysis using multiple different algorithms and parameter settings. Clusters that appear consistently are more likely to represent bona fide patterns in the data. Second, because the cluster analysis of short conserved segments is unbiased, some identified motifs will have no functions that have been studied in the lab, and we will therefore have no hypothesis to direct our tests of these motifs. We will therefore focus on motifs that do show a statistical association with some known function. Thus, we are once again limited to the ‘low-hanging fruit’ of novel consensus sequences that do show statistical enrichments.
Encouragingly, in our preliminary data we have already identified several good candidates (Figure 4) and therefore we will have more than enough to analyze within the period of this proposal. Not all types of interactions involving disordered regions will be amenable to in vitro analysis using NMR. In particular, scaffold proteins that have large numbers of binding partners will not be feasible to study in this way. We will therefore focus on proteins where the interaction can be reconstituted in vitro. Similarly, disordered proteins can be difficult to work with experimentally, and therefore we have designed the research proposal so that we are not tied to any single protein for our experiments. For example, if we cannot purify the Pds1 N-terminus, or obtain reliable NMR data, we will move on to the Mps1 N-terminus, or other interesting candidates identified from the bioinformatics analysis. IV. Significance This project will systematically identify post-translational regulatory sequences (short linear motifs) within intrinsically disordered regions (Specific Aim 1). This represents a new approach to cracking the “code” that controls protein regulation. Furthermore, we will test whether the current models of disordered region function can explain regulatory functions and characterize several examples in mechanistic detail (Specific Aim 2). For example, we aim to understand how disordered regions can integrate information, and how they can retain their function when short linear motifs change location over evolution. Our studies will provide insight into the functions of disordered regions, which are found in many important disease genes.