Supplemental Results
Spatial dispersion often indicates enrichment for protein surface residues
We then evaluated whether missense variants were generally more solvent accessible than
expected (Figure S2). We found that the relative solvent accessibility (RSA) of all missense variants
(N_all,missense = 209,841) was significantly greater than the RSA of all residues
(N_all,residue = 972,121; P ≈ 0, Mann–Whitney U test) and that the RSA of missense variants in
significantly dispersed proteins (N_dispersed,missense = 2,253) was significantly greater than the RSA of
missense variants over all proteins (P = 5.0 × 10^-54, Mann–Whitney U test). These results agree with previously observed
patterns of 1000 Genomes missense variants [1] and suggest that the spatial dispersion of missense
variants in many proteins may reflect an intolerance for missense variation in the protein core.
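The comparisons above rely on the Mann–Whitney U test. As a minimal sketch of how such a comparison works, the following pure-Python function computes the U statistic with a tie-corrected normal approximation and no continuity correction. The function names and the toy RSA values are assumptions made for illustration; they are not the authors' code or data.

```python
import math

def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks

def mann_whitney_u(x, y):
    """Return (U for x, z-score, two-sided p) using a normal approximation."""
    pooled = list(x) + list(y)
    n1, n2, n = len(x), len(y), len(pooled)
    ranks = average_ranks(pooled)
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2.0
    # Tie correction for the variance of U.
    tie_counts = {}
    for v in pooled:
        tie_counts[v] = tie_counts.get(v, 0) + 1
    tie_term = sum(c ** 3 - c for c in tie_counts.values())
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:  # every value identical: no evidence of a shift
        return u1, 0.0, 1.0
    z = (u1 - n1 * n2 / 2.0) / math.sqrt(var_u)
    return u1, z, math.erfc(abs(z) / math.sqrt(2.0))

# Hypothetical RSA values, for illustration only.
rsa_dispersed = [0.40, 0.50, 0.60, 0.70]
rsa_all = [0.10, 0.20, 0.30, 0.35]
u, z, p = mann_whitney_u(rsa_dispersed, rsa_all)
```

With sample sizes in the hundreds of thousands, as reported above, the normal approximation is appropriate and very small p-values are expected even for modest shifts in the median.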
To further test the hypothesis that missense dispersion is correlated with surface exposure, we
performed a weighted, univariate K analysis of relative solvent accessibility (RSA)
(Figure S3). If collections of surface-exposed residues yield high dispersion values, then these
analyses should yield highly significant dispersion across all structures. To determine RSA, we
calculated the solvent accessible surface area with DSSP [2] and normalized all values by the
maximum solvent accessible surface area of each amino acid in an Ala-X-Ala tripeptide. We
identified significant spatial dispersion in univariate K tests weighted by RSA for 4,114 of 4,495
proteins (92%, FDR<10%).
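The RSA normalization described above can be sketched as a simple ratio. This is an illustrative example only: the `MAX_ASA` values below are placeholders, not the Ala-X-Ala reference table used in the analysis, and the function name is an assumption.

```python
# Placeholder maximum solvent-accessible surface areas (in square angstroms).
# These numbers are illustrative only; substitute the Ala-X-Ala tripeptide
# reference table that was actually used.
MAX_ASA = {"ALA": 110.0, "GLY": 85.0, "ASP": 160.0, "TRP": 250.0}

def relative_solvent_accessibility(residue_name, asa):
    """Divide a residue's absolute accessible surface area by its reference maximum."""
    return asa / MAX_ASA[residue_name]

# Hypothetical per-residue surface areas, e.g. as reported by a tool such as DSSP.
residues = [("ALA", 55.0), ("TRP", 25.0), ("ASP", 120.0)]
rsa_values = [relative_solvent_accessibility(name, asa) for name, asa in residues]
```

Values near 0 indicate buried residues and values near 1 indicate fully exposed residues. Depending on the reference table, a ratio can slightly exceed 1, and whether to cap it is a design choice not specified here.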
Evolutionarily conserved residues are spatially constrained and generally clustered
Evolutionary conservation is a predictor of functionally important amino acids, which are
hypothesized to cluster within protein structure at functionally important sites [3–6]. To evaluate this
hypothesis, we quantified evolutionary conservation by calculating the Jensen-Shannon
divergence [5] of amino acids using multiple sequence alignments from HSSP [2] and performed a
weighted, univariate analysis of the conservation scores. We identified significant clustering in
3,752 of 4,286 proteins (88%, FDR<10%) and significant dispersion in 101 proteins (2%) (Figure
S4). These results suggest that protein function imposes strong spatial constraint and that functionally
important amino acids are commonly clustered within protein structure.
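To make the conservation score concrete, the sketch below computes the Jensen-Shannon divergence between the amino acid distribution of one alignment column and a background distribution. It is a simplified illustration: the uniform background, the pseudocount, and the function names are assumptions, and refinements such as sequence weighting or gap penalties are not reproduced.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def column_distribution(column, pseudocount=1e-6):
    """Amino acid frequencies for one alignment column; gaps and unknowns are ignored."""
    counts = Counter(aa for aa in column if aa in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return [(counts.get(aa, 0) + pseudocount) / total for aa in AMINO_ACIDS]

def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded by 1 bit."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def conservation_score(column, background=None):
    """Higher scores mean the column departs more strongly from the background."""
    if background is None:
        # Assumption: uniform background. A substitution-matrix-derived
        # background is a common alternative.
        background = [1.0 / len(AMINO_ACIDS)] * len(AMINO_ACIDS)
    return js_divergence(column_distribution(column), background)

conserved = conservation_score("AAAAAAAA")  # fully conserved toy column
variable = conservation_score("ACDEFGHI")   # highly variable toy column
```

A fully conserved column scores higher than a variable one, and these per-residue scores are the weights that a weighted spatial analysis would then operate on.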
Dominant missense variants form smaller clusters than recessive variants
Having demonstrated that pathogenic variants trend toward spatial clustering, we next evaluated
whether the mode of inheritance for pathogenic variants influences their spatial constraint.
Missense variants causing protein loss-of-function (LoF) may disrupt numerous critical elements
of a protein structure, but the opportunity for gain-of-function (GoF) is likely limited to a small
subset of regions with functional potential. Previous work by Turner et al. [7] investigated these
spatial patterns in protein sequence using autosomal dominant (AD, typically gain-of-function)
and autosomal recessive (AR, typically loss-of-function) missense variants from the Human
Gene Mutation Database [8] (HGMD). Turner et al. demonstrated a significant global trend for
dominant variants to be more clustered than recessive variants, which in turn were more clustered than
neutral variants from the 1000 Genomes Project, with dominant variants in 9 proteins and
recessive variants in 5 proteins significantly more clustered than neutral variants (FDR<5%).
The functional impact of gain- and loss-of-function missense variants is derived from their effect
on protein structure. Thus, the spatial distributions derived from these effects are perhaps more
accurately evaluated within that context. Using the HGMD dataset curated by Turner et al., we
performed two bivariate analyses evaluating dominant and recessive missense variants relative to
ExAC missense variants (Figure S7). We identified 27 (of 69, 39%) and 16 (of 47, 34%)
structures in which dominant and recessive variants (respectively) were significantly more
clustered than variants from ExAC (FDR<10%). Additionally, we found that univariate scores
for both dominant and recessive variants were significantly higher (more clustered) than ExAC
variants (AD: P = 3.53 × 10^-30, AR: P = 6.97 × 10^-20, Mann–Whitney U test), but found no
significant difference between dominant and recessive variants (P = 0.274). However, within
proteins with significantly clustered variation, dominant variants (N_AD = 35) formed significantly
smaller clusters (median distance of peak significance: 10 Å) than recessive variants (N_AR = 16; median
distance of peak significance: 13.5 Å; P = 0.014, Mann–Whitney U test). These findings support previous
conclusions that both gain- and loss-of-function variants are more clustered than neutral variants.
The smaller clusters formed by dominant variants additionally support the hypothesis that gain-of-function mutations are localized to specific sites with functional potential, while loss-of-function mutations more generally disrupt regions of functional importance.
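The notion of a distance of peak significance can be illustrated with a simplified univariate K-style statistic. The sketch below is an assumption-laden toy, not the authors' implementation: it measures the fraction of variant pairs within a distance threshold, compares it with random draws of the same number of residues from the structure, and reports the threshold at which the resulting Z-score is largest. All names, the toy coordinates, and the permutation count are hypothetical.

```python
import math
import random

def k_statistic(coords, t):
    """Fraction of unordered point pairs separated by at most t."""
    n = len(coords)
    if n < 2:
        return 0.0
    close = sum(
        1
        for i in range(n)
        for j in range(i + 1, n)
        if math.dist(coords[i], coords[j]) <= t
    )
    return close / (n * (n - 1) / 2.0)

def k_z_scores(all_coords, variant_idx, thresholds, n_perm=200, seed=0):
    """Z-score of the observed K at each threshold against a permutation null."""
    rng = random.Random(seed)
    observed = [k_statistic([all_coords[i] for i in variant_idx], t) for t in thresholds]
    null = [[] for _ in thresholds]
    for _ in range(n_perm):
        # Null model: the same number of residues drawn at random from the structure.
        sample = rng.sample(range(len(all_coords)), len(variant_idx))
        points = [all_coords[i] for i in sample]
        for k, t in enumerate(thresholds):
            null[k].append(k_statistic(points, t))
    z_scores = []
    for obs, values in zip(observed, null):
        mean = sum(values) / len(values)
        sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))
        z_scores.append((obs - mean) / sd if sd > 0 else 0.0)
    return z_scores

def peak_distance(thresholds, z_scores):
    """Threshold at which clustering is most significant."""
    return thresholds[max(range(len(z_scores)), key=lambda k: z_scores[k])]

# Toy structure: 30 residues spaced 3.8 Å apart along a line, variants at one end.
structure = [(3.8 * i, 0.0, 0.0) for i in range(30)]
variants = [0, 1, 2, 3, 4]
thresholds = [4.0, 8.0, 12.0, 16.0, 20.0]
z_scores = k_z_scores(structure, variants, thresholds)
peak = peak_distance(thresholds, z_scores)
```

Under this reading, a tight cluster reaches its maximum Z-score at a short distance, while a broader cluster peaks at a longer one.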
Quantitative spatial analysis of variants in the RTEL1 C-terminal model
The C-terminal model of RTEL1 (residues 881-1151) contained only five known pathogenic
variants, for which we observed no spatial clustering. However, a variant located near even
one known pathogenic variant could have a higher likelihood of being pathogenic. As described
for the N-terminal model, we used leave-one-out cross-validation to evaluate the predictive
performance of pathogenic proximity in the C-terminal model. We obtained a ROC AUC of
0.47 (Figure S12), indicating predictive performance no better than chance in the absence of pathogenic
clustering. Neither VUS in the C-terminal model (F1110L and P1107L) segregated with disease,
and both were predicted to be neutral by pathogenic proximity.
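The evaluation procedure can be sketched as follows. This is a minimal illustration of leave-one-out scoring and a rank-based ROC AUC; the `proximity_score` function is a hypothetical stand-in (nearest neutral distance minus nearest pathogenic distance) and is not the pathogenic proximity score used in the analysis. The coordinates are toy values.

```python
import math

def proximity_score(query, pathogenic, neutral):
    """Hypothetical stand-in score: positive when the query is closer to
    pathogenic variants than to neutral variants."""
    nearest_pathogenic = min(math.dist(query, p) for p in pathogenic)
    nearest_neutral = min(math.dist(query, q) for q in neutral)
    return nearest_neutral - nearest_pathogenic

def leave_one_out_scores(pathogenic, neutral):
    """Score every labeled variant with itself removed from its own class."""
    scores, labels = [], []
    for i, variant in enumerate(pathogenic):
        rest = pathogenic[:i] + pathogenic[i + 1:]
        scores.append(proximity_score(variant, rest, neutral))
        labels.append(1)
    for i, variant in enumerate(neutral):
        rest = neutral[:i] + neutral[i + 1:]
        scores.append(proximity_score(variant, pathogenic, rest))
        labels.append(0)
    return scores, labels

def roc_auc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count half)."""
    positives = [s for s, label in zip(scores, labels) if label == 1]
    negatives = [s for s, label in zip(scores, labels) if label == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in positives for q in negatives)
    return wins / (len(positives) * len(negatives))

# Toy coordinates: pathogenic variants cluster together, far from neutral ones.
pathogenic = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
neutral = [(10.0, 0.0, 0.0), (11.0, 0.0, 0.0), (10.0, 1.0, 0.0)]
scores, labels = leave_one_out_scores(pathogenic, neutral)
auc = roc_auc(scores, labels)
```

An AUC of 0.5 corresponds to random ranking, which is why a value of 0.47 indicates that proximity carries no usable signal when the known pathogenic variants are not clustered.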
Identification of rare missense variants of unknown significance in RTEL1 in patients with Familial Interstitial Pneumonia
The use of next-generation sequencing to study families with pulmonary diseases has led to the
identification of novel genes and mechanisms associated with the inherited forms of pulmonary
fibrosis [9–11] (familial interstitial pneumonia, FIP). Genetic variation in telomere-related genes is
the predominant cause of disease in families with known genetic etiology, and telomere
shortening in peripheral blood mononuclear cells (PBMC) [12–14] and type II alveolar epithelial
cells [9,14] is commonly observed in patients and families with idiopathic pulmonary fibrosis. The
most commonly mutated genes in FIP patients are TERT (10–15% of cases) [15,16], RTEL1, and
PARN (3–4% of cases each) [9,10]. Most newly identified missense variants in FIP-associated genes
are considered variants of unknown significance (VUS).
We screened 346 subjects from 189 families by whole-exome sequencing (WES),
and probands from an additional 184 FIP kindreds underwent RTEL1 targeted Sanger
sequencing. After quality control, WES identified 6 loss-of-function (LoF) rare variants (RVs;
minor allele frequency, MAF<0.001) in RTEL1 that co-segregated with FIP and 8 missense RVs, 4 of which fully
segregated with FIP [9]. Sanger sequencing identified 4 additional LoF variants that fully
segregated with FIP, and 10 additional missense RVs (which were used as variants of unknown
significance in our spatial analysis of RTEL1), 5 of which fully segregated with disease. In total,
we identified 19 RTEL1 RVs (9 missense) that fully segregated with FIP in 373 families (5.6%
of families); 14 RTEL1 RVs (10 missense) were not previously associated with FIP.
Structural interpretations for FIP-segregating RTEL1 variant pathogenicity
We constructed a homology model of the N-terminal structure of RTEL1 and analyzed 9 novel
missense VUS relative to the spatial distribution of known pathogenic and neutral variation. The
five VUS that segregated with FIP were all predicted to be pathogenic by our method. Below, we
outline potential mechanisms of action, ranging from disruption of protein-protein or protein-DNA
interactions to destabilization of the tertiary structure of the protein, for each segregating
VUS.
W512C: W512 is a bulky aromatic residue found on the surface of the structural model
(Figure S11a). Surface-exposed aromatic side-chains are uncommon, and are often found to be
important anchors for protein-protein binding surfaces. Replacing the tryptophan sidechain with
the smaller, less hydrophobic cysteine may alter the shape and physicochemical character of a
critical protein-binding surface of RTEL1, compromising its ability to perform its normal
physiological function. This hypothesis is bolstered by the observation that this variant is ranked
highest by our proximity score, indicating that other mutations found in close proximity to
W512C (i.e., on or adjacent to the surface and likely to act through a common mechanism) are
disease-linked. The importance of protein-protein interactions to RTEL1 function is underscored
by the 46 unique interactions reported for RTEL1 by the BioGRID database [17].
F559I: F559 is a bulky aromatic residue found on the interior of the protein model,
within 9 Å of the predicted DNA-binding interface (Figure S11b). Replacement of the large
volume of the phenylalanine side chain with the smaller volume of isoleucine could alter the
geometry of the DNA-binding cavity sufficiently to disrupt that interaction. Notably, while F559
is in the second shell of residues responsible for DNA contact, it is predicted to be directly
adjacent to two first-shell residues, E591 and A621, which have been previously reported as
disease-associated [18].
S688C: S688 is located on a buried helix one turn (5.9 Å) away from disease-associated
residue R684. The mutation of serine to cysteine does not result in major changes in bulk,
branching, charge, or hydrophobicity. However, the presence of the sulfhydryl group in the
cysteine could potentially promote misfolding and aggregation upon incorrect formation of
disulfide bonds, if exposed to oxidation.
D719G: D719 is located on a surface-exposed helix near the pathogenic cluster (Figure
S11c). Replacing the large charged aspartate sidechain with the single hydrogen of a glycine
removes a bulky charge from the protein surface and likely disrupts the helix in that region.
T55S: T55 is a polar residue predicted to lie at the interface between alpha helices 1 and
2 (Figure S11d). Relative to the other segregating variants, T55S is distal to the pathogenic
cluster and is relatively equidistant to pathogenic and neutral variation. Both threonine and serine
are unusual residues to find in a helix-helix interface, which suggests that this position may be
functionally important. Replacement of a threonine sidechain with that of serine does not alter
the hydroxyl character of the residue, though it reduces the steric bulk by one methyl group. This
is not a major volumetric change, but the removal of a beta-branching amino acid could affect
inter-helical packing. This steric change could result in a relative repacking of the helix-helix
interface, or could change the strength of interaction between the helices. Although T55 is
evolutionarily conserved, SIFT, PolyPhen2, and CADD all confidently predict the serine
substitution to be benign. Ultimately, there is no obvious structural basis for the pathogenicity of
T55S, and its distance from the pathogenic cluster suggests that any functional effect likely
acts through an alternative mechanism.
RTEL1 pathogenic proximity correlates with XPD ATPase activity
Many FIP-causing variants in RTEL1 occupy domains that are homologous with the protein
XPD (ERCC2). Mutations in XPD are associated with xeroderma pigmentosum (XP), Cockayne
syndrome (CS), and trichothiodystrophy. We hypothesized that despite differences in phenotypic
presentation, pathogenic variants in both proteins likely disrupt shared mechanisms, leading to
similar biochemical effects. To investigate this, we mapped XPD mutagenesis and biochemical
activity data from Fan et al. [19] (N=15) and Kuper et al. [20] (N=9) onto RTEL1 (Figure S13). For
each XPD mutation, we calculated the pathogenic proximity score with respect to RTEL1
pathogenic and neutral variants and measured the correlation with change (% wild-type) in XPD
helicase and ATPase activity (Figure S10). Proximity to pathogenic variants in RTEL1 was
significantly correlated with reduced ATPase activity relative to wild-type (Spearman rho=–0.62,
P = 0.001), but not with helicase activity (Spearman rho=–0.09, P = 0.7).
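For reference, the Spearman correlation used here is the Pearson correlation of rank-transformed values. The sketch below shows that calculation in pure Python; the function names and the toy proximity and activity values are assumptions for illustration, not the mapped XPD data.

```python
import math

def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical values: higher proximity paired with lower activity (% wild-type).
proximity = [0.1, 0.4, 0.7, 0.9]
atpase_activity = [95.0, 80.0, 40.0, 10.0]
rho = spearman_rho(proximity, atpase_activity)
```

A negative rho, as reported for ATPase activity, means that mutations mapping closer to RTEL1 pathogenic variants tend to retain less activity.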
Supplementary Tables
All of the tables below are provided as separate files.
Table S1: Results of the univariate spatial analysis of ExAC synonymous variants.
Table S2: Results of the univariate spatial analysis of ExAC missense variants.
Table S3: Results of the univariate and bivariate spatial analyses of ClinVar pathogenic
missense variants.
Table S4: Results of the univariate and bivariate spatial analyses of COSMIC recurrent somatic
missense variants.
Supplementary Figures
Figure S1: Synonymous variants display similar spatial patterns across proteins in different CATH (Class Architecture Topology
Homology) domains. The number of proteins analyzed from each class is provided to the right of the class label. Despite the
significant difference in the size of proteins from each class (ANOVA P = 1.3 × 10^-101), the difference in Z-scores was less
significant (ANOVA P = 0.034).
Figure S2: Comparison of relative solvent accessibility for residue subsets across all protein structures. Missense variants are
more solvent accessible than all residues (median RSA_missense = 0.22, median RSA_all = 0.18, P ≈ 0, Mann–Whitney U). Dispersed
missense variants are significantly more solvent accessible than all missense variants (median RSA_dispersed = 0.37, P = 5 × 10^-54). This is
consistent with constraint against missense mutations in the protein core. Missense variants in proteins that exhibit focal
clustering have similar solvent accessibility patterns to all residues (median RSA_clustered = 0.19, P = 0.39), suggesting that missense
variant clusters commonly occur in both the protein core and surface. Solvent accessibility was calculated with DSSP [2] and normalized
by the total surface area of each amino acid.
Figure S3: Distribution of protein Z-scores for the weighted univariate analysis of relative solvent accessibility (RSA). The
significant dispersion of RSA in 92% of proteins suggests that spatial dispersion correlates with surface residues.
Figure S4: Distribution of protein Z-scores for the weighted univariate analysis of evolutionary conservation as measured by
Jensen-Shannon divergence. Evolutionary conservation is significantly clustered in 88% of protein structures.
Figure S5: Autosomal dominant and recessive missense variants from the Human Gene Mutation Database (HGMD) are both
spatially clustered in protein structure, consistent with ClinVar pathogenic variation. No significant difference in the degree of
clustering was identified between the two groups, but dominant mutations did on average form smaller clusters (AD=11Å,
AR=14Å).
Figure S6: Comparison of our findings with previous studies of somatic mutation in cancer, showing the overlap between related
studies in genes found to harbor significantly clustered somatic missense variation. CBL, DICER1, and TET2 were uniquely identified
by our univariate analysis of recurrent somatic missense variation.
Figure S7: Distribution of Z-scores for each pathogenic dataset relative to ExAC missense.
Figure S8: PathProx performance (ROC AUC) stratified by CATH domain. There were no significant differences in prediction
performance between CATH domains (ANOVA P = 0.28).
Figure S9: Comparing the performance (ROC AUC) of pathogenic proximity with SIFT (a) and PolyPhen2 (b) across proteins
demonstrates that different methods perform better for different proteins.
Figure S10: Pathogenic proximity scores were calculated for each missense mutation using position relative to pathogenic and
neutral missense variants in RTEL1. (A) Pathogenic proximity was significantly correlated with a decrease in ATPase activity
(Spearman rho=–0.62, P = 0.001), but (B) not significantly correlated with changes in helicase activity (Spearman rho=–0.09, P
= 0.7).
Figure S11: Structural hypotheses about the effects of four novel segregating RTEL1 mutations. (A) Tryptophan 512 is predicted
to lie on the surface of the protein. A mutation to cysteine has the potential to interfere with functionally important protein-protein
interactions. (B) Phenylalanine 559 is buried in the core of the protein, in close proximity to residues predicted to form
part of the DNA-binding cavity, including Alanine 621 and Glutamic Acid 591. Mutation to isoleucine removes steric bulk and is
likely to leave a void in the hydrophobic core of the protein, disrupting structure and reducing stability. (C) Aspartic Acid 719 is
predicted to fall in a surface-exposed helix. Mutation to glycine drastically reduces both the bulk and charge of the protein’s
surface, and likely disrupts the helix at that point. (D) Threonine 55 is predicted to form part of the interface between helices 1
and 2 in RTEL1. Mutation to a serine would reduce the steric bulk and alter the packing between the two helices.
Figure S12: Receiver operating characteristic (ROC) curves for variants in the C-terminal model of RTEL1. The three
pathogenic variants in the C-terminal model were not spatially clustered and predictive performance was notably worse than the
N-terminal model. This result illustrates the relevance of this approach only in situations where nominal evidence for pathogenic
clustering is present.
Figure S13: ATPase and Helicase reported activity (as percentage of wild type) for missense mutations in saXPD and taXPD.
References
1. de Beer, T. A. P. et al. Amino acid changes in disease-associated variants differ radically
from variants observed in the 1000 genomes project dataset. PLoS Comput. Biol. 9,
e1003382 (2013).
2. Touw, W. G. et al. A series of PDB-related databanks for everyday needs. Nucleic Acids
Res. 43, D364–D368 (2015).
3. Schueler-Furman, O. & Baker, D. Conserved residue clustering and protein structure
prediction. 235, 225–235 (2003).
4. Madabushi, S. et al. Structural clusters of evolutionary trace residues are statistically
significant and common in proteins. J. Mol. Biol. 316, 139–154 (2002).
5. Capra, J. A. & Singh, M. Predicting functionally important residues from sequence
conservation. Bioinformatics 23, 1875–1882 (2007).
6. Capra, J. A., Laskowski, R. A., Thornton, J. M., Singh, M. & Funkhouser, T. A.
Predicting protein ligand binding sites by combining evolutionary sequence conservation
and 3D structure. PLoS Comput. Biol. 5 (2009).
7. Turner, T. N. et al. Proteins linked to autosomal dominant and autosomal recessive
disorders harbor characteristic rare missense mutation distribution patterns. Hum. Mol.
Genet. 24, 5995–6002 (2015).
8. Stenson, P. D. et al. Human Gene Mutation Database (HGMD): 2003 update. Hum.
Mutat. 21, 577–581 (2003).
9. Cogan, J. D. et al. Rare variants in RTEL1 are associated with familial interstitial
pneumonia. Am. J. Respir. Crit. Care Med. 191, 646–655 (2015).
10. Stuart, B. D. et al. Exome sequencing links mutations in PARN and RTEL1 with familial
pulmonary fibrosis and telomere shortening. Nat. Genet. 47, 512–517 (2015).
11. Kannengiesser, C. et al. Heterozygous RTEL1 mutations is a major cause of familial
pulmonary fibrosis. Eur. Respir. J. (2015).
12. Diaz de Leon, A. et al. Telomere lengths, pulmonary fibrosis and telomerase (TERT)
mutations. PLoS One 5 (2010).
13. Cronkhite, J. T. et al. Telomere shortening in familial and sporadic pulmonary fibrosis.
Am. J. Respir. Crit. Care Med. 178, 729–737 (2008).
14. Armanios, M. et al. Short telomeres are a risk factor for idiopathic pulmonary fibrosis.
Proc. Natl. Acad. Sci. U. S. A. 105, 13051–13056 (2008).
15. Armanios, M. et al. Telomerase mutations in families with idiopathic pulmonary fibrosis.
N. Engl. J. Med. 356, 1317–1326 (2007).
16. Tsakiri, K. D. et al. Adult-onset pulmonary fibrosis caused by mutations in telomerase.
Proc. Natl. Acad. Sci. U. S. A. 104, 7552–7557 (2007).
17. Stark, C. et al. BioGRID: a general repository for interaction datasets. Nucleic Acids Res.
34, D535–D539 (2006).
18. Ballew, B. J. et al. Germline mutations of regulator of telomere elongation helicase 1,
RTEL1, in dyskeratosis congenita. Hum. Genet. 132, 473–480 (2013).
19. Fan, L., Fuss, J., Cheng, Q., Arvai, A. & Hammel, M. XPD helicase structures and
activities: insights into the cancer and aging phenotypes from XPD mutations. Cell (2008).
20. Kuper, J., Wolski, S., Michels, G. & Kisker, C. Functional and structural studies of the
nucleotide excision repair helicase XPD suggest a polarity for DNA translocation. EMBO
J. (2012).