Supplemental Results

Spatial dispersion often indicates enrichment for protein surface residues

We then evaluated whether missense variants were generally more solvent accessible than expected (Figure S2). We found that the relative solvent accessibility (RSA) of all missense variants (Nall,missense=209,841) was significantly greater than the RSA of all residues (Nall,residue=972,121; P ≈ 0, Mann–Whitney U test) and that the RSA of missense variants in significantly dispersed proteins (Ndispersed,missense=2,253) was significantly greater than the RSA of missense variants over all proteins (P = 5.0x10^-54, Mann–Whitney U test). These results agree with previously observed patterns of 1000 Genomes missense variants1 and suggest that the spatial dispersion of missense variants in many proteins may reflect an intolerance for missense variation in the protein core.

To further test the hypothesis that missense dispersion is correlated with surface exposure, we performed a weighted, univariate K analysis of RSA (Figure S3). If collections of surface-exposed residues yield high dispersion values, then these analyses should yield highly significant dispersion across all structures. To determine RSA, we calculated the solvent accessible surface area of each residue with DSSP2 and normalized each value by the maximum solvent accessible surface area of the corresponding amino acid in an Ala-X-Ala tripeptide. We identified significant spatial dispersion in univariate K tests weighted by RSA for 4,114 of 4,495 proteins (92%, FDR<10%).

Evolutionarily conserved residues are spatially constrained and generally clustered

Evolutionary conservation is a predictor of functionally important amino acids, which are hypothesized to cluster within protein structure at functionally important sites3–6.
To evaluate this hypothesis, we quantified evolutionary conservation by calculating the Jensen-Shannon divergence5 of amino acids using multiple sequence alignments from HSSP2 and performed a weighted, univariate analysis of the conservation scores. We identified significant clustering in 3,752 of 4,286 proteins (88%, FDR<10%) and significant dispersion in 101 proteins (2%) (Figure S4). These results suggest strong spatial constraint on protein function and that functionally important amino acids are commonly clustered within protein structure.

Dominant missense variants form smaller clusters than recessive variants

Having demonstrated that pathogenic variants trend towards spatial clustering, we next evaluated whether the mode of inheritance for pathogenic variants influences their spatial constraint. Missense variants causing protein loss-of-function (LoF) may disrupt numerous critical elements of a protein structure, but the opportunity for gain-of-function (GoF) is likely limited to a small subset of regions with functional potential. Previous work by Turner et al.7 investigated these spatial patterns in protein sequence using autosomal dominant (AD, typically gain-of-function) and autosomal recessive (AR, typically loss-of-function) missense variants from the Human Gene Mutation Database8 (HGMD). Turner et al. demonstrated a significant global trend for dominant variants to be more clustered than recessive variants, which in turn were more clustered than neutral variants from the 1000 Genomes Project, with dominant variants in 9 proteins and recessive variants in 5 proteins significantly more clustered than neutral variants (FDR<5%). The functional impact of gain- and loss-of-function missense variants is derived from their effect on protein structure. Thus, the spatial distributions derived from these effects are perhaps more accurately evaluated within that context.
Using the HGMD dataset curated by Turner et al., we performed two bivariate analyses evaluating dominant and recessive missense variants relative to ExAC missense variants (Figure S7). We identified 27 (of 69, 39%) and 16 (of 47, 34%) structures in which dominant and recessive variants, respectively, were significantly more clustered than variants from ExAC (FDR<10%). Additionally, we found that univariate scores for both dominant and recessive variants were significantly higher (more clustered) than those for ExAC variants (AD: P = 3.53x10^-30, AR: P = 6.97x10^-20, Mann–Whitney U test), but found no significant difference between dominant and recessive variants (P = 0.274). However, within proteins with significantly clustered variation, dominant variants (NAD=35) formed significantly smaller clusters (median peak significance: 10 Å) than recessive variants (NAR=16; median peak significance: 13.5 Å; P = 0.014, Mann–Whitney U test). These findings support previous conclusions that both gain- and loss-of-function variants are more clustered than neutral variants. The smaller clusters formed by dominant variants additionally support the hypothesis that gain-of-function mutations are localized to specific sites with functional potential, while loss-of-function mutations more generally disrupt regions of functional importance.

Quantitative spatial analysis of variants in the RTEL1 C-terminal model

The C-terminal model of RTEL1 (residues 881-1151) contained only five known pathogenic variants, for which we observed no spatial clustering. However, a variant located near even one known pathogenic variant could have a higher likelihood of being pathogenic. As described for the N-terminal model, we used leave-one-out cross-validation to evaluate the predictive performance of pathogenic proximity in the C-terminal model. We obtained a low ROC AUC of 0.47 (Figure S12), indicating very poor predictive performance in the absence of pathogenic clustering.
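
The leave-one-out evaluation described above can be illustrated with a minimal Python sketch. It is not the published pathogenic proximity score: the `proximity_score` function below is a simplified, hypothetical stand-in (distance to the nearest neutral residue minus distance to the nearest pathogenic residue), and all function names and the coordinate format are assumptions made for illustration. The ROC AUC is computed as the probability that a randomly chosen pathogenic variant receives a higher score than a randomly chosen neutral variant.

```python
import math


def nearest_distance(point, others):
    """Euclidean distance from `point` to the closest coordinate in `others`."""
    return min(math.dist(point, other) for other in others)


def proximity_score(point, pathogenic, neutral):
    """Hypothetical stand-in score (an assumption, not the published method).

    Positive when the nearest pathogenic residue is closer than the
    nearest neutral residue, negative otherwise.
    """
    return nearest_distance(point, neutral) - nearest_distance(point, pathogenic)


def leave_one_out_scores(pathogenic, neutral):
    """Score each labelled variant with itself removed from its own reference set.

    Requires at least two variants in each set, so that a reference
    set is never empty after the held-out variant is removed.
    """
    scores, labels = [], []
    for i, variant in enumerate(pathogenic):
        rest = pathogenic[:i] + pathogenic[i + 1:]
        scores.append(proximity_score(variant, rest, neutral))
        labels.append(1)
    for i, variant in enumerate(neutral):
        rest = neutral[:i] + neutral[i + 1:]
        scores.append(proximity_score(variant, pathogenic, rest))
        labels.append(0)
    return scores, labels


def roc_auc(scores, labels):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for s, label in zip(scores, labels) if label == 1]
    negatives = [s for s, label in zip(scores, labels) if label == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

Calling `leave_one_out_scores` on two lists of (x, y, z) residue coordinates and passing the result to `roc_auc` gives a single summary value. Under this sketch, an AUC near 0.5 means the score does not separate the two classes, which is the situation the text reports for the C-terminal model.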
Neither VUS in the C-terminal model (F1110L and P1107L) segregated with disease, and both were predicted neutral by pathogenic proximity.

Identification of rare missense variants of unknown significance in RTEL1 in patients with Familial Interstitial Pneumonia

The use of next-generation sequencing to study families with pulmonary diseases has led to the identification of novel genes and mechanisms associated with the inherited forms of pulmonary fibrosis9–11 (familial interstitial pneumonia, FIP). Genetic variation in telomere-related genes is the predominant cause of disease in families with known genetic etiology, and telomere shortening in peripheral blood mononuclear cells (PBMC)12–14 and type II alveolar epithelial cells9,14 is commonly observed in patients and families with idiopathic pulmonary fibrosis. The most commonly mutated genes in FIP patients are TERT (10–15% of cases)15,16, RTEL1, and PARN (3–4% of cases each)9,10. Most newly identified missense variants in FIP-associated genes are considered variants of unknown significance (VUS).

Three hundred and forty-six (346) subjects from 189 families were screened by whole-exome sequencing (WES), and probands from an additional 184 FIP kindreds underwent RTEL1 targeted Sanger sequencing. After quality control, WES identified 6 loss-of-function (LoF) rare variants (RVs, MAF<0.001) in RTEL1 that co-segregated with FIP and 8 missense RVs, 4 of which fully segregated with FIP9. Sanger sequencing identified 4 additional LoF variants that fully segregated with FIP, and 10 additional missense RVs (which were used as variants of unknown significance in our spatial analysis of RTEL1), 5 of which fully segregated with disease. In total, we identified 19 RTEL1 RVs (9 missense) that fully segregated with FIP in 373 families (5.6% of families); 14 RTEL1 RVs (10 missense) were not previously associated with FIP.

Structural interpretations for FIP-segregating RTEL1 variant pathogenicity

We constructed a homology model of the N-terminal structure of RTEL1 and analyzed 9 novel missense VUS relative to the spatial distribution of known pathogenic and neutral variation. The five VUS that segregated with FIP were all predicted to be pathogenic by our method. Below, we outline potential mechanisms of action for each segregating VUS, ranging from disruption of protein-protein or protein-DNA interactions to destabilization of the tertiary structure of the protein.

W512C: W512 is a bulky aromatic residue found on the surface of the structural model (Figure S11a). Surface-exposed aromatic side chains are uncommon, and are often found to be important anchors for protein-protein binding surfaces. Replacing the tryptophan side chain with the smaller, less hydrophobic cysteine may alter the shape and physicochemical character of a critical protein-binding surface of RTEL1, compromising its ability to perform its normal physiological function. This hypothesis is bolstered by the observation that this variant is ranked highest by our proximity score, indicating that other mutations found in close proximity to W512C (i.e., on or adjacent to the surface and likely to act through a common mechanism) are disease-linked. The importance of protein-protein interactions to RTEL1 function is underscored by the 46 unique interactions reported for RTEL1 by the BioGRID database17.

F559I: F559 is a bulky aromatic residue found in the interior of the protein model, within 9 Å of the predicted DNA-binding interface (Figure S11b). Replacement of the large phenylalanine side chain with the smaller isoleucine side chain could alter the geometry of the DNA-binding cavity sufficiently to disrupt that interaction.
Notably, while F559 is in the second shell of residues responsible for DNA contact, it is predicted to be directly adjacent to two first-shell residues, E591 and A621, which have been previously reported as disease-associated18.

S688C: S688 is located on a buried helix one turn (5.9 Å) away from disease-associated residue R684. The mutation of serine to cysteine does not result in major changes in bulk, branching, charge, or hydrophobicity. However, the sulfhydryl group of the cysteine could promote misfolding and aggregation through incorrect formation of disulfide bonds if exposed to oxidation.

D719G: D719 is located on a surface-exposed helix near the pathogenic cluster (Figure S11c). Replacing the large charged aspartate side chain with the single hydrogen of a glycine removes a bulky charge from the protein surface and likely disrupts the helix in that region.

T55S: T55 is a polar residue predicted to lie at the interface between alpha helices 1 and 2 (Figure S11d). Relative to the other segregating variants, T55S is distal to the pathogenic cluster and is roughly equidistant from pathogenic and neutral variation. Both threonine and serine are unusual residues to find in a helix-helix interface, which suggests that this position may be functionally important. Replacement of a threonine side chain with that of serine does not alter the hydroxyl character of the residue, though it reduces the steric bulk by one methyl group. This is not a major volumetric change, but the removal of a beta-branched amino acid could affect inter-helical packing. This steric change could result in a relative repacking of the helix-helix interface, or could change the strength of interaction between the helices. Although T55 is evolutionarily conserved, SIFT, PolyPhen2, and CADD all confidently predict the serine substitution to be benign.
Ultimately, there is no obvious structural basis for the pathogenicity of T55S, and its distance from the pathogenic cluster suggests that any functional effects likely act through alternative mechanisms.

RTEL1 pathogenic proximity correlates with XPD ATPase activity

Many FIP-causing variants in RTEL1 occupy domains that are homologous with the protein XPD (ERCC2). Mutations in XPD are associated with xeroderma pigmentosum (XP), Cockayne syndrome (CS), and trichothiodystrophy. We hypothesized that despite differences in phenotypic presentation, pathogenic variants in both proteins likely disrupt shared mechanisms, leading to similar biochemical effects. To investigate this, we mapped XPD mutagenesis and biochemical activity data from Fan et al.19 (N=15) and Kuper et al.20 (N=9) into RTEL1 (Figure S13). For each XPD mutation, we calculated the pathogenic proximity score with respect to RTEL1 pathogenic and neutral variants and measured the correlation with change (% wild-type) in XPD helicase and ATPase activity (Figure S10). Proximity to pathogenic variants in RTEL1 was significantly correlated with reduced ATPase activity relative to wild-type (Spearman rho=–0.62, P = 0.001), but not with helicase activity (Spearman rho=–0.09, P = 0.7).

Supplementary Tables

All of the tables below are provided as separate files.

Table S1: Results of the univariate spatial analysis of ExAC synonymous variants.
Table S2: Results of the univariate spatial analysis of ExAC missense variants.
Table S3: Results of the univariate and bivariate spatial analyses of ClinVar pathogenic missense variants.
Table S4: Results of the univariate and bivariate spatial analyses of COSMIC recurrent somatic missense variants.

Supplementary Figures

Figure S1: Synonymous variants display similar spatial patterns across proteins in different CATH (Class Architecture Topology Homology) domains. The number of proteins analyzed from each class is provided to the right of the class label.
Despite the significant difference in the size of proteins from each class (ANOVA P = 1.3x10^-101), the difference in Z-scores was less significant (ANOVA P = 0.034).

Figure S2: Comparison of relative solvent accessibility for residue subsets across all protein structures. Missense variants are more solvent accessible than all residues (Median RSAmissense=0.22, Median RSAall=0.18, P ≈ 0, Mann–Whitney U test). Dispersed missense variants are significantly more solvent accessible than all missense variants (Median RSAdispersed=0.37, P = 5x10^-54). This is consistent with constraint against missense mutations in the protein core. Missense variants in proteins that exhibit focal clustering have similar solvent accessibility patterns to all residues (Median RSAclustered=0.19, P = 0.39), suggesting that missense variant clusters commonly occur both in the protein core and on the surface. Solvent accessibility was calculated with DSSP2 and normalized by the total surface area of each amino acid.

Figure S3: Distribution of protein Z-scores for the weighted univariate analysis of relative solvent accessibility (RSA). The significant dispersion of RSA in 92% of proteins suggests that spatial dispersion correlates with surface residues.

Figure S4: Distribution of protein Z-scores for the weighted univariate analysis of evolutionary conservation as measured by Jensen-Shannon divergence. Evolutionary conservation is significantly clustered in 88% of protein structures.

Figure S5: Autosomal dominant and recessive missense variants from the Human Gene Mutation Database (HGMD) are both spatially clustered in protein structure, consistent with ClinVar pathogenic variation. No significant difference in the degree of clustering was identified between the two groups, but dominant mutations did on average form smaller clusters (AD=11 Å, AR=14 Å).

Figure S6: Comparison of our findings with previous studies of somatic mutation in cancer.
The overlap between related studies in genes found to harbor significantly clustered somatic missense variation is shown. CBL, DICER1, and TET2 were uniquely identified by our univariate analysis of recurrent somatic missense variation.

Figure S7: Distribution of Z-scores for each pathogenic dataset relative to ExAC missense variants.

Figure S8: PathProx performance (ROC AUC) stratified by CATH domain. There were no significant differences in prediction performance between CATH domains (ANOVA P = 0.28).

Figure S9: Comparing the performance (ROC AUC) of pathogenic proximity with SIFT (a) and PolyPhen2 (b) across proteins demonstrates that different methods perform better for different proteins.

Figure S10: Pathogenic proximity scores were calculated for each missense mutation using position relative to pathogenic and neutral missense variants in RTEL1. (A) Pathogenic proximity was significantly correlated with a decrease in ATPase activity (Spearman rho=–0.62, P = 0.001), but (B) not significantly correlated with changes in helicase activity (Spearman rho=–0.09, P = 0.7).

Figure S11: Structural hypotheses about the effects of four novel segregating RTEL1 mutations. (A) Tryptophan 512 is predicted to lie on the surface of the protein. A mutation to cysteine has the potential to interfere with functionally important protein-protein interactions. (B) Phenylalanine 559 is buried in the core of the protein, in close proximity to residues predicted to form part of the DNA-binding cavity, including Alanine 621 and Glutamic Acid 591. Mutation to isoleucine removes steric bulk and is likely to leave a void in the hydrophobic core of the protein, disrupting structure and reducing stability. (C) Aspartic Acid 719 is predicted to fall in a surface-exposed helix. Mutation to glycine drastically reduces both the bulk and charge of the protein's surface, and likely disrupts the helix at that point. (D) Threonine 55 is predicted to form part of the interface between helices 1 and 2 in RTEL1.
Mutation to a serine would reduce the steric bulk and alter the packing between the two helices.

Figure S12: Receiver operating characteristic (ROC) curves for variants in the C-terminal model of RTEL1. The three pathogenic variants in the C-terminal model were not spatially clustered and predictive performance was notably worse than the N-terminal model. This result illustrates the relevance of this approach only in situations where nominal evidence for pathogenic clustering is present.

Figure S13: ATPase and helicase reported activity (as percentage of wild type) for missense mutations in saXPD and taXPD.

References

1. de Beer, T. A. P. et al. Amino acid changes in disease-associated variants differ radically from variants observed in the 1000 genomes project dataset. PLoS Comput. Biol. 9, e1003382 (2013).
2. Touw, W. G. et al. A series of PDB-related databanks for everyday needs. Nucleic Acids Res. 43, D364–D368 (2015).
3. Schueler-Furman, O. & Baker, D. Conserved residue clustering and protein structure prediction. 235, 225–235 (2003).
4. Madabushi, S. et al. Structural clusters of evolutionary trace residues are statistically significant and common in proteins. J. Mol. Biol. 316, 139–154 (2002).
5. Capra, J. A. & Singh, M. Predicting functionally important residues from sequence conservation. Bioinformatics 23, 1875–1882 (2007).
6. Capra, J. A., Laskowski, R. A., Thornton, J. M., Singh, M. & Funkhouser, T. A. Predicting protein ligand binding sites by combining evolutionary sequence conservation and 3D structure. PLoS Comput. Biol. 5, (2009).
7. Turner, T. N. et al. Proteins linked to autosomal dominant and autosomal recessive disorders harbor characteristic rare missense mutation distribution patterns. Hum. Mol. Genet. 24, 5995–6002 (2015).
8. Stenson, P. D. et al. Human Gene Mutation Database (HGMD): 2003 update. Hum. Mutat. 21, 577–581 (2003).
9. Cogan, J. D. et al. Rare variants in RTEL1 are associated with familial interstitial pneumonia. Am. J. Respir. Crit. Care Med. 191, 646–655 (2015).
10. Stuart, B. D. et al. Exome sequencing links mutations in PARN and RTEL1 with familial pulmonary fibrosis and telomere shortening. Nat. Genet. 47, 512–517 (2015).
11. Kannengiesser, C. et al. Heterozygous RTEL1 mutations is a major cause of familial pulmonary fibrosis. Eur. Respir. J. (2015).
12. Diaz de Leon, A. et al. Telomere lengths, pulmonary fibrosis and telomerase (TERT) mutations. PLoS One 5, (2010).
13. Cronkhite, J. T. et al. Telomere shortening in familial and sporadic pulmonary fibrosis. Am. J. Respir. Crit. Care Med. 178, 729–737 (2008).
14. Armanios, M. et al. Short telomeres are a risk factor for idiopathic pulmonary fibrosis. Proc. Natl. Acad. Sci. U. S. A. 105, 13051–13056 (2008).
15. Armanios, M. et al. Telomerase mutations in families with idiopathic pulmonary fibrosis. N. Engl. J. Med. 356, 1317–1326 (2007).
16. Tsakiri, K. D. et al. Adult-onset pulmonary fibrosis caused by mutations in telomerase. Proc. Natl. Acad. Sci. U. S. A. 104, 7552–7557 (2007).
17. Stark, C. et al. BioGRID: a general repository for interaction datasets. Nucleic Acids Res. 34, D535–D539 (2006).
18. Ballew, B. J. et al. Germline mutations of regulator of telomere elongation helicase 1, RTEL1, in Dyskeratosis congenita. Hum. Genet. 132, 473–480 (2013).
19. Fan, L., Fuss, J., Cheng, Q., Arvai, A. & Hammel, M. XPD helicase structures and activities: insights into the cancer and aging phenotypes from XPD mutations. Cell (2008).
20. Kuper, J., Wolski, S., Michels, G. & Kisker, C. Functional and structural studies of the nucleotide excision repair helicase XPD suggest a polarity for DNA translocation. EMBO J. (2012).