Prediction of amyloid aggregation in vivo
Mattia Belli, Matteo Ramazzotti and Fabrizio Chiti
SUPPLEMENTARY INFORMATION
Supplementary Table 1: The modus operandi of the algorithms listed in Table 1
Name (Reference): Modus operandi
Chiti and Dobson (Chiti et al, 2003): It is based on an empirical equation that yields the change in the rate of protofibril formation of an unstructured peptide or protein following mutation, as a function of the change, at the site of mutation, in both hydrophobicity and propensity to convert from α-helical to β-sheet structure of the chain, as well as the change in net charge of the entire protein.
TANGO (Fernandez-Escamilla et al, 2004): TANGO is a statistical mechanics algorithm able to predict β-sheet aggregation of proteins (not amyloid formation, although a correlation exists between the two phenomena). While scanning a given amino acid sequence, TANGO evaluates the probability, for each residue, of adopting one of the major conformational states, including the β-aggregate, by estimating the energy from statistical and empirical considerations and assuming that in β-aggregates the core regions are buried. The algorithm considers protein stability and intrinsic factors of the polypeptide chain (hydrophobicity, electrostatic interactions, hydrogen-bonding contributions, structural conformation propensities), as well as extrinsic factors (pH, protein concentration, ionic strength, TFE concentration). According to TANGO, a peptide segment has β-aggregation tendency if it includes at least five consecutive residues with a probability to populate the β-aggregate state higher than 5% per residue. TANGO was thus the first algorithm proposed to predict, in addition to the effect of a given mutation in a given sequence, the sequence segments that promote aggregation and form the core regions in the resulting β-aggregates.
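The TANGO segment criterion described above (at least five consecutive residues, each with a probability higher than 5% of populating the β-aggregate state) can be illustrated with a minimal sketch. This is not the TANGO code: the function name, its parameters and the input format (a list of per-residue percentages, assumed to have been obtained already, for example from a TANGO output) are hypothetical and used for illustration only.

```python
def aggregating_segments(beta_agg_percent, threshold=5.0, min_length=5):
    """Find runs of consecutive residues above a per-residue threshold.

    beta_agg_percent: hypothetical list of per-residue beta-aggregation
    scores expressed as percentages (one value per residue).
    Returns a list of (start, end) pairs, 0-based, end exclusive, for every
    run of at least `min_length` residues whose score is strictly greater
    than `threshold`.
    """
    segments = []
    run_start = None  # index where the current above-threshold run began
    for i, score in enumerate(beta_agg_percent):
        if score > threshold:
            if run_start is None:
                run_start = i
        else:
            # The run (if any) ended at residue i - 1; keep it if long enough.
            if run_start is not None and i - run_start >= min_length:
                segments.append((run_start, i))
            run_start = None
    # Handle a run that extends to the last residue of the sequence.
    if run_start is not None and len(beta_agg_percent) - run_start >= min_length:
        segments.append((run_start, len(beta_agg_percent)))
    return segments


if __name__ == "__main__":
    # Made-up scores: one run of five residues and one run of two residues.
    scores = [0, 6, 7, 8, 9, 10, 0, 6, 6, 0]
    print(aggregating_segments(scores))
```

In this made-up example only the first run is reported, because the second one is shorter than five residues.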
DuBay et al (DuBay et al, 2004): It extends the Chiti and Dobson equation to predict the absolute rate of fibril formation (elongation phase) of polypeptide chains from fully or partially unfolded states under different conditions and without any experimental information on the regions most sensitive to aggregation. It accounts for extrinsic factors such as pH, ionic strength and peptide concentration, and replaces the secondary structure propensity factor with one that accounts for the presence of patterns of alternating hydrophobic and hydrophilic residues (PATs), considered to be ideal for adopting β-sheet structure. The algorithm takes the solution conditions and the sequence as input and yields the rate constant for aggregation (elongation phase when a lag phase is present) after calculating the charge and average hydrophobicity of the sequence and counting the number and lengths of the PATs.
Pawar et al (Pawar et al, 2005): This algorithm identifies the regions of unstructured peptides or naturally unfolded proteins that are most important for promoting the formation of amyloid aggregates. It uses the DuBay et al algorithm, but considers only the intrinsic factors (charge, hydrophobicity and PATs) to calculate the amyloid aggregation propensities for all the naturally occurring amino acid residues. While the algorithm scans a polypeptide sequence, it assigns to each residue i the average aggregation propensity of the residues within a window of a few residues centred on residue i and generates the aggregation propensity profile (aggregation propensity plotted versus residue number). Aggregation propensities are standardized by calculating the standard scores (Z-scores) relative to a reference set of random polypeptides with the same length as the analysed sequence and with a residue composition related to the amino acid frequencies in the Swiss-Prot database.
Aggregation-prone regions can thus be identified as those presenting consecutive residues with scores higher than a threshold of 1 (1 standard deviation from the mean aggregation propensity of random sequences).
Tartaglia et al (Tartaglia et al, 2004; Tartaglia et al, 2005): Like the algorithm from Chiti and Dobson, it is able to predict the change in the rate of protofibril formation of unstructured peptides or proteins following mutation. In addition to charge and β-propensity, the algorithm also considers water-accessible surface area, dipole moment and π-stacking interaction to describe the aggregation propensity of the 20 naturally occurring amino acid residues. Moreover, experimentally determined coefficients were avoided with the aim of generalizing the model as much as possible. The method was improved in 2005 to predict the absolute aggregation propensity and the β-aggregating regions together with their preference to adopt a parallel or antiparallel configuration within fibrils. The new method takes into account mainly aromaticity (π-stacking), β-propensity and charge, but also water-accessible surface area, water solubility and extrinsic factors such as protein concentration and temperature.
Zyggregator (Tartaglia et al, 2008): Zyggregator is a development of the Pawar et al algorithm. It is able to determine the aggregation propensity profile of a given polypeptide in any conformational state. The algorithm predicts which regions of the sequence aggregate into fibrillar or protofibrillar structures from disordered or folded states under physiological conditions. It calculates the aggregation propensity for each residue relying on the same intrinsic properties as the previous model plus an additional factor accounting for the presence of gatekeeper residues, i.e. residues that prevent aggregation and usually flank aggregation-prone regions. The parametric nature of Zyggregator allows re-calibrations depending on different purposes (e.g.
prediction under nonphysiological conditions or of the formation of protofibrillar species as opposed to fibrils). Finally, in combination with a local stability score derived from the CamP method (Tartaglia et al, 2007), Zyggregator is able to take into account the influence of secondary and tertiary structure formation within folded states on the aggregation propensity of individual amino acid residues. This allows the generation of an aggregation propensity profile where the contribution of each residue is weighted by the level of structure formation. Zyggregator is thus highly versatile and adaptable to various conditions and purposes.
SALSA (Zibaee et al, 2007): Simple ALgorithm for Sliding Averages (SALSA) was developed to locate regions with a high propensity for β-strand structure (fibrillogenic hotspots) within polypeptide sequences, assuming a strong correlation between β-strand propensity and formation of fibrillar aggregates. It calculates a mean β-strand propensity (MβP) for each residue in a polypeptide. Several sliding windows of different lengths containing a given residue are evaluated, their MβP is determined by averaging Chou and Fasman secondary structure propensities, and those with an MβP score below 1.2 are discarded. Finally, all remaining windows are summed to obtain a specific score per residue.
AGGRESCAN (Conchillo-Solé et al, 2007): It predicts the aggregation-prone regions in polypeptide sequences without reference to a particular type of aggregate morphology. By replacing residue 19 of the amyloid β peptide (Aβ42) with all possible natural amino acids and by estimating the aggregation propensity of each of the resulting 20 variants through a GFP reporter after expression in E. coli, a scale of aggregation propensity for the 20 naturally occurring residues was compiled. AGGRESCAN uses a sliding window procedure to assign to each residue a score by averaging the values, taken from the new scale, of the surrounding residues.
From the resulting aggregation propensity profile a set of mathematical descriptors is provided to help identify the aggregation hot spots. It is the first predictive method based entirely on empirical information obtained in the cellular context and provides predictions of aggregation hot spots of protein sequences and of mutational effects in vitro.
Net-CSSP (Yoon & Welsh, 2004; Yoon et al, 2007): It is the first method derived from a structural analysis of folded proteins. It is based on the observation that native β-strands and α-helices occur more frequently in regions with high and low numbers of tertiary contacts (TCs), respectively, and that some sequences having α-helical or random coil conformations in native folds can form β-strands and promote amyloid fibril formation under certain conditions. Amyloidogenic segments in polypeptide sequences can be identified by searching for regions that adopt a non-β-strand conformation within folded proteins, yet have high values of TCs, namely a hidden β-strand propensity (HβP). An early version (2004) was based on the average number of TCs for each amino acid, calculated from the SCOP20 dataset. The later implementation of 2007 (Net-CSSP) relies on pairwise potential energy and uses artificial neural networks to improve the prediction capability.
PASTA (Trovato et al, 2006): Prediction of Amyloid STructure Aggregation (PASTA) predicts the regions of a polypeptide sequence involved in the formation of ordered cross-β structure. It is based on the assumption that β-strands involved in fibril formation adopt β-pairings with minimum energies and with a preference for a parallel or antiparallel in-register arrangement. A dataset of globular proteins with strictly defined secondary structures was investigated in order to calculate the pairing energies for each pair of residues facing one another on parallel or antiparallel neighbouring strands within a β-sheet.
Assuming that amyloid fibrils originate from the stacking of β-strands belonging to different polypeptide molecules with identical sequence, PASTA scans an input polypeptide sequence, determining both parallel and antiparallel pairing energies for each possible stretch of a given length. Finally, the predicted β-pairing involved in aggregation and the related orientation correspond to those with the lowest pairing energy. Similarly to other algorithms, PASTA identifies aggregation-prone regions. However, unlike most other methods, it also aims at determining whether the identified β-strands in the fibrils adopt a parallel or antiparallel orientation.
FoldAmyloid (Galzitskaya et al, 2006): It is software that aims at predicting the amyloid fibril-forming regions as well as the intrinsically disordered regions of polypeptide sequences through the estimation of the mean packing density. By investigating the SCOP database, the observed mean packing density for each of the 20 amino acid residues was derived, i.e. the mean number of “close” residues around each of the 20 amino acid residues (where “close” means that any pair of the heavy atoms is within a distance of 8 Å). A polypeptide segment within a protein sequence appears amyloidogenic if it includes more than 5 consecutive residues with a strong packing density; it is considered intrinsically disordered when more than 11 consecutive residues show a weak packing density.
3D profile (Thompson et al, 2006): This work exploited the available crystallographic structural details of the amyloid fibril-forming hexapeptide NNQQNY to predict the regions of polypeptide sequences forming amyloid fibrils. The crystal structure of NNQQNY was used to create in silico a collection of near-native templates, or 3D profile. The various templates differ by small atomic displacements along the 3 orthogonal axes.
The algorithm scans a given sequence by sliding a window of 6 residues, threading the resulting hexapeptide onto each of the templates from the 3D profile and then evaluating the energetic fit by using the ROSETTADESIGN software (Kuhlman et al, 2000). Hexapeptides that yield a minimum energy score lower than a defined threshold have the potential to form amyloid aggregates. The 3D profile method is based on the novel idea of analysing the structure of a peptide in a fibril-like conformation.
BETASCAN (Bryan et al, 2009): Following the evidence that the core structure of amyloid is based on β-strand pairing, the BETASCAN algorithm was designed to determine the most likely β-strands within a polypeptide and the preferred β-strand pairings. For an input sequence, BETASCAN estimates the probability for every possible stretch of 2-13 residues of pairing with any other stretch of the same length, relying on pairwise probability tables compiled by the authors. The preference for each pair of amino acids to be hydrogen bonded in a β-sheet was derived from a selected subset of the PDB database. All the possible β-strands (starting point and length) are visualized in a lattice where each node represents the relative propensity score and predicted strands appear as triangular signals. In addition, the most likely β-strand pairs are depicted in a scatter plot.
Waltz (Maurer-Stroh et al, 2010): It was developed to better distinguish between ordered amyloid aggregates and amorphous β-sheet aggregates with the aim of predicting the most likely amyloid-forming regions. By assuming that peptides involved in amyloid fibril formation show amino acid preferences in key positions, the authors explored the sequence diversity of several amyloid hexapeptides from the AmylHex dataset, which includes examples of hexapeptides positive or negative for fibril formation.
This work allowed the construction of a composite scoring function that includes a position-specific scoring matrix (PSSM), a set of physicochemical properties and a position-specific pseudoenergy matrix. This tool allows the identification of amyloid-forming regions in sequences. According to the authors, the PSSM that summarizes the amino acid preferences for distinct positions in amyloid-forming hexapeptides is responsible for the predictive power of Waltz.
Supplementary Section 1
Criteria adopted for estimating the correlation between predicted and observed changes in amyloid aggregation following mutation.
The available experimental data concerning amyloid aggregation in vivo do not allow the determination of absolute values of aggregation rate/propensity of different proteins or the aggregation-promoting regions within them. They only allow a quantitative estimate of the change in the aggregation rate/propensity of a given protein following a given mutation, with this parameter estimated repeatedly using a number of single or multiple mutations. Our review aimed at evaluating the correlation between such experimental data and those estimated in silico by several algorithms. The algorithms differ, however, in the type of predictions they provide. Two of the predictive methods directly provide the change in the aggregation rate upon mutation (Chiti & Dobson and Tartaglia et al), while the others provide aggregation propensity profiles, that is, the aggregation propensity as a function of residue number along the sequence (see Table 1 in the main text). Nevertheless, even for an algorithm of this type, it is possible to determine the change in the aggregation rate/propensity upon mutation by comparing the aggregation propensity profiles generated using the wild-type and mutant sequences.
As described previously, the change in the aggregation propensity following mutation can be derived by calculating the difference between the aggregation propensity of the mutant sequence and the aggregation propensity of the wild-type sequence (Pawar et al, 2005, see “Determination of the log(kmut/kwt) profiles” in the Materials and methods section). Therefore, for the profile-generating algorithms, the change in the aggregation propensity of a given polypeptide sequence following a given mutation(s) can be assessed by determining one of the following: (i) Pmut-Pwt, where P is the value of the profile at the site of mutation (Pmut-Pwt corresponds to the change in the aggregation propensity at the site of mutation); (ii) Amut-Awt, where A is the average of all the values of the profile (Amut-Awt corresponds to the change in the average aggregation propensity); (iii) Smut-Swt, where S is the sum of all the values of the profile (Smut-Swt is the change in the total aggregation propensity). Since Amut-Awt appeared to be redundant with respect to Smut-Swt, the former was not considered.
The 4 datasets of experimental data in vivo considered in our analysis included either amino acid substitutions at single positions or substitutions at multiple positions. To calculate the effect of single substitutions on the aggregation propensity using profile-generating algorithms we used Pmut-Pwt values. To calculate the effect of multiple substitutions we used Smut-Swt values. The algorithms that directly calculate relative aggregation rates (Chiti & Dobson and Tartaglia et al) directly provide the effect of single substitutions on the aggregation propensity, whereas we summed the contributions of single substitutions to calculate the effect of multiple mutations.
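The three quantities defined above can be illustrated with a short sketch. It assumes that the wild-type and mutant profiles are already available as lists of per-residue scores of equal length (substitutions do not change the sequence length); the function names and the example numbers are hypothetical and not taken from the original scripts. For profiles of equal length N, Amut-Awt equals (Smut-Swt)/N, which is consistent with the redundancy noted above.

```python
def delta_site(profile_wt, profile_mut, site):
    """Pmut - Pwt: change of the profile value at the mutated position.

    `site` is a 0-based index into the per-residue profiles.
    """
    return profile_mut[site] - profile_wt[site]


def delta_sum(profile_wt, profile_mut):
    """Smut - Swt: change of the sum of all profile values."""
    return sum(profile_mut) - sum(profile_wt)


def delta_average(profile_wt, profile_mut):
    """Amut - Awt: change of the average of all profile values."""
    return (sum(profile_mut) / len(profile_mut)
            - sum(profile_wt) / len(profile_wt))


if __name__ == "__main__":
    # Made-up profiles for a 4-residue sequence mutated at position 1.
    wt = [0.1, 0.5, 0.2, 0.4]
    mut = [0.1, 0.9, 0.3, 0.4]
    print("Pmut-Pwt:", delta_site(wt, mut, 1))
    print("Smut-Swt:", delta_sum(wt, mut))
    print("Amut-Awt:", delta_average(wt, mut))
```

Note that a single substitution can also change the profile values of neighbouring residues when the algorithm uses a sliding window, which is why the sum can differ from the value at the mutated site.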
The correlation between the change in the aggregation propensity following mutation(s) predicted by a given algorithm and the corresponding change in the aggregation propensity (or solubility) observed in vivo was evaluated by applying a linear model. The correlation between the predicted change in the aggregation propensity and the change in solubility or in relative GFP fluorescence was observed to be linear in previous studies (for example Figure 5 in Winkelmann et al 2010 and Figure 4 in de Groot et al 2006). Even though non-linear or multi-modal trends were sporadically observed, more complex analyses were not performed.
Supplementary Section 2
Explanation of the parameters used for each algorithm.
Here we describe for each algorithm the procedure used to obtain the predicted change in the aggregation propensity upon mutation(s). We will refer to <P> as the aggregation propensity score at the site of mutation and to <S> as the total aggregation propensity score.
The Chiti & Dobson and Tartaglia et al methods directly provide the predicted change in the aggregation rate upon mutation, ln(Pwt/Pmut), so further steps were not necessary. The equations described in the original papers were implemented in two distinct scripts using the Perl programming language in order to automate the computation procedure.
Net-CSSP estimates three propensities: “helix propensity” (Pα), “beta propensity” (Pβ) and “coil propensity” (Pcoil). These scores were used to calculate the HβP according to the following equation: HβP = ln[Pβ / (Pα + Pcoil)] (Yoon & Welsh, 2005). HβP for individual amino acids (<P>) is calculated by taking the residue-specific propensities, whereas total HβP for the entire sequence (<S>) is calculated by taking the overall propensities. Net-CSSP is available at the following address: http://cssp2.sookmyung.ac.kr/.
TANGO provides a “beta aggregation” score for each residue within a sequence in addition to the total tendency for β-sheet aggregation (“AGG score”).
We used the beta aggregation score at the site of mutation as the individual residue propensity (<P>) and the AGG score as the total aggregation propensity (<S>). TANGO is available at the following address: http://tango.crg.es/.
The algorithm from Pawar et al provides a propensity score for each residue within a sequence. We used the score at the site of mutation as the individual residue propensity (<P>) and the sum of all the profile scores as the total propensity (<S>). Software that uses the Pawar algorithm was written by M. Ramazzotti and is available upon request.
The ZipperDB database collects all the aggregation propensity profiles calculated using the 3D profile method. For a given sequence an energy score is provided for each residue, the energy score being related to the hexapeptide starting at that position. We took the energy score at the site of mutation as the individual residue propensity (<P>) and the sum of all the profile scores as the total propensity (<S>). ZipperDB is available at the following address: http://services.mbi.ucla.edu/zipperdb/.
PASTA gives an aggregation propensity profile showing the “normalized per-residue probability h(k)”. The score at the site of mutation was used as the individual residue propensity (<P>) and the sum of all the profile scores as the total propensity (<S>). PASTA is available at the following address: http://protein.cribi.unipd.it/pasta/.
The FoldAmyloid server calculates profiles with different scales; we chose the default scale (“expected number of contacts 8Å”). From the resulting profile we took the value at the site of mutation as the individual residue propensity (<P>) and the sum as the total propensity (<S>). FoldAmyloid is available at the following address: http://antares.protres.ru/fold-amyloid/oga.cgi.
AGGRESCAN provides an aggregation profile in addition to several mathematical descriptors.
From the profile we used the a4v score at the site of mutation as the individual residue propensity (<P>) and the Na4vSS score as the total propensity (<S>). AGGRESCAN is available at the following address: http://bioinf.uab.es/aggrescan/.
The SALSA algorithm computes the aggregation propensity profile of a given sequence. The residue score at the site of mutation was used as the individual residue propensity (<P>) and the sum of all the profile scores as the total propensity (<S>). Since the original software is not available online, new software was implemented using the Perl programming language.
Zyggregator computes the aggregation profile of a given sequence. The residue score at the site of mutation was used as the individual residue propensity (<P>) and the sum of all the profile scores as the total propensity (<S>). Zyggregator is available at the following address: http://www-vendruscolo.ch.cam.ac.uk/zyggregator.php/.
For the Waltz algorithm the “detailed with graphics” output was chosen with a threshold of 0 in order to obtain complete profiles. The value on the graph corresponding to the residue at the site of mutation was used as the individual residue propensity (<P>) and the sum of all the values as the total propensity (<S>). Waltz is available at the following address: http://waltz.vub.ac.be/.
Experimental conditions such as temperature, pH and ionic strength were set, when requested by the algorithms, according to the information reported in the original experimental papers, while other parameters were left at their default values. The software we implemented in-house for the algorithms by Chiti & Dobson, Tartaglia et al, Pawar et al and SALSA was checked before use to ensure that the results published in the original papers were reproduced.
Supplementary References
Bryan AW, Menke M, Cowen LJ, Lindquist SL, Berger B (2009) BETASCAN: probable amyloids identified by pairwise probabilistic analysis. PLoS Comput Biol 5: e1000333.
Chiti F, Stefani M, Taddei N, Ramponi G, Dobson CM (2003) Rationalization of the effects of mutations on peptide and protein aggregation rates. Nature 424: 805-8.
Conchillo-Solé O, de Groot NS, Avilés FX, Vendrell J, Daura X, Ventura S (2007) AGGRESCAN: a server for the prediction and evaluation of "hot spots" of aggregation in polypeptides. BMC Bioinformatics 8: 65.
de Groot NS, Aviles FX, Vendrell J, Ventura S (2006) Mutagenesis of the central hydrophobic cluster in Abeta42 Alzheimer's peptide. Side-chain properties correlate with aggregation propensities. FEBS J 273: 658-68.
DuBay KF, Pawar AP, Chiti F, Zurdo J, Dobson CM, Vendruscolo M (2004) Prediction of the absolute aggregation rates of amyloidogenic polypeptide chains. J Mol Biol 341: 1317-26.
Fernandez-Escamilla AM, Rousseau F, Schymkowitz J, Serrano L (2004) Prediction of sequence-dependent and mutational effects on the aggregation of peptides and proteins. Nat Biotechnol 22: 1302-6.
Galzitskaya OV, Garbuzynskiy SO, Lobanov MY (2006) Prediction of amyloidogenic and disordered regions in protein chains. PLoS Comput Biol 2: e177.
Kuhlman B, Baker D (2000) Native protein sequences are close to optimal for their structures. Proc Natl Acad Sci U S A 97: 10383-8.
Maurer-Stroh S et al (2010) Exploring the sequence determinants of amyloid structure using position-specific scoring matrices. Nat Methods 7: 237-42.
Monsellier E, Ramazzotti M, de Laureto PP, Tartaglia GG, Taddei N, Fontana A, Vendruscolo M, Chiti F (2007) The distribution of residues in a polypeptide sequence is a determinant of aggregation optimized by evolution. Biophys J 93: 4382-91.
Pawar AP, DuBay KF, Zurdo J, Chiti F, Vendruscolo M, Dobson CM (2005) Prediction of "aggregation-prone" and "aggregation-susceptible" regions in proteins associated with neurodegenerative diseases. J Mol Biol 350: 379-92.
Tartaglia GG, Cavalli A, Pellarin R, Caflisch A (2004) The role of aromaticity, exposed surface, and dipole moment in determining protein aggregation rates. Protein Sci 13: 1939-41.
Tartaglia GG, Cavalli A, Pellarin R, Caflisch A (2005) Prediction of aggregation rate and aggregation-prone segments in polypeptide sequences. Protein Sci 14: 2723-34.
Tartaglia GG, Cavalli A, Vendruscolo M (2007) Prediction of local structural stabilities of proteins from their amino acid sequences. Structure 15: 139-43.
Tartaglia GG, Vendruscolo M (2008) The Zyggregator method for predicting protein aggregation propensities. Chem Soc Rev 37: 1395-401.
Thompson MJ, Sievers SA, Karanicolas J, Ivanova MI, Baker D, Eisenberg D (2006) The 3D profile method for identifying fibril-forming segments of proteins. Proc Natl Acad Sci U S A 103: 4074-8.
Trovato A, Chiti F, Maritan A, Seno F (2006) Insight into the structure of amyloid fibrils from the analysis of globular proteins. PLoS Comput Biol 2: e170.
Winkelmann J, Calloni G, Campioni S, Mannini B, Taddei N, Chiti F (2010) Low-level expression of a folding-incompetent protein in Escherichia coli: search for the molecular determinants of protein aggregation in vivo. J Mol Biol 398: 600-13.
Yoon S, Welsh WJ (2004) Detecting hidden sequence propensity for amyloid fibril formation. Protein Sci 13: 2149-60.
Yoon S, Welsh WJ (2005) Rapid assessment of contact-dependent secondary structure propensity: relevance to amyloidogenic sequences. Proteins 60: 110-7.
Yoon S, Welsh WJ, Jung H, Yoo YD (2007) CSSP2: an improved method for predicting contact-dependent secondary structure propensity. Comput Biol Chem 31: 373-7.
Zibaee S, Makin OS, Goedert M, Serpell LC (2007) A simple algorithm locates β-strands in the amyloid fibril core of α-synuclein, Aβ, and tau using the amino acid sequence alone. Protein Sci 16: 906-18.
Figure S1. Predicted change in the aggregation propensity upon mutation versus experimental (in vivo) solubility for Aβ42 variants.
Each graph reports the predicted change in the aggregation propensity upon mutation (calculated according to the algorithm indicated in each graph and the procedure described in the suppl. info.) versus experimental relative fluorescence of GFP fused to Aβ42 mutants, as described (Wurth et al 2002). Scales on the y-axis have been adjusted to show, for each plot, the full dataset. The lines represent the best fits of the data to linear functions. For each plot the name of the algorithm is reported, as well as the absolute value of the Pearson linear correlation coefficient (r) and the statistical significance of the slope (p).
Figure S2. Predicted change in the aggregation propensity upon mutation versus experimental (in vivo) solubility in E. coli cytosol for Aβ42 variants.
Each graph reports the predicted change in the aggregation propensity upon mutation (according to the algorithm indicated in each graph and following the procedure described in the suppl. info.) versus experimental relative fluorescence of GFP fused to Aβ42 mutants, as described (Kim et al 2006). Scales on the y-axis have been adjusted to show, for each plot, the full dataset. The lines represent the best fits of the data to linear functions. For each plot the name of the algorithm is reported, as well as the absolute value of the Pearson linear correlation coefficient (r) and the statistical significance of the slope (p).
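The quantities reported in each plot (best-fit line and absolute value of the Pearson coefficient r) can be sketched as follows. This is an illustration under stated assumptions, not the analysis code used for the figures: the function name, argument names and example numbers are hypothetical. The sketch also returns the t statistic of the slope, which in a simple linear regression equals r·sqrt((n-2)/(1-r²)); the p value would follow from a Student's t distribution with n-2 degrees of freedom, and is left here to a statistics library.

```python
import math


def linear_fit(x, y):
    """Least-squares line y = slope * x + intercept with Pearson statistics.

    Returns (slope, intercept, abs_r, t_slope), where abs_r is the absolute
    value of the Pearson correlation coefficient and t_slope is the t
    statistic for the null hypothesis of zero slope (n - 2 degrees of
    freedom).
    """
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("x and y must have equal length of at least 3")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must each contain at least two distinct values")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    if r * r >= 1.0:
        # Perfect correlation: the t statistic diverges.
        t_slope = math.copysign(math.inf, r)
    else:
        t_slope = r * math.sqrt((n - 2) / (1.0 - r * r))
    return slope, intercept, abs(r), t_slope


if __name__ == "__main__":
    # Made-up data: predicted changes (x) versus observed values (y).
    predicted = [1.0, 2.0, 3.0, 4.0]
    observed = [2.0, 4.0, 5.0, 9.0]
    slope, intercept, abs_r, t_slope = linear_fit(predicted, observed)
    print(f"slope={slope:.3f} intercept={intercept:.3f} |r|={abs_r:.3f} t={t_slope:.3f}")
```

If SciPy is available, `scipy.stats.linregress` returns the slope, intercept, r and the two-sided p value of the slope in one call.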