Patterns and processes of somatic mutations in nine major cancers
Peilin Jia¹,² and Zhongming Zhao¹,²,³,⁴*
¹ Department of Biomedical Informatics, Vanderbilt University School of Medicine, Nashville, TN 37203, USA.
² Center for Quantitative Sciences, Vanderbilt University Medical Center, Nashville, TN 37232, USA.
³ Department of Cancer Biology, Vanderbilt University School of Medicine, Nashville, TN 37232, USA.
⁴ Vanderbilt-Ingram Cancer Center, Vanderbilt University Medical Center, Nashville, TN 37232, USA.
* Corresponding author
Email addresses:
PJ: [email protected]
ZZ: [email protected]
Supplementary Text
Non-negative Matrix Factorization (NMF)
Given a mutation matrix M, NMF factorizes M into two matrices, W and H, i.e.,
$M_{96 \times N} = W_{96 \times r} \times H_{r \times N} + \varepsilon$,
where r is the factorization rank, corresponding to the number of mutational signatures to
be detected, and N is the total number of samples. The rank r is determined by evaluating the cophenetic correlation
and sparseness. In our work, we organized somatic SNVs for each cancer into a 96×N mutation matrix.
For the 6 possible substitution types (without distinguishing DNA strands), sequence context was
examined using the base immediately 5' and the base immediately 3' of the mutation site, i.e., a trinucleotide sequence.
With 4 possible nucleotides at each of the two flanking positions, there are a total of 4×6×4 = 96
trinucleotide-based mutation types.
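The construction above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in this work: it enumerates the 96 trinucleotide mutation types and factorizes a toy 96×N count matrix with Lee and Seung's multiplicative updates for the squared-error objective. The label format (e.g. T[C>T]A), the random toy counts, the fixed rank r = 2, and the choice of update rule are all assumptions made for illustration. Rank selection by cophenetic correlation and sparseness is not shown.

```python
import random
from itertools import product

BASES = "ACGT"
# Six substitution classes, written from the pyrimidine (C or T) of the base pair.
SUBSTITUTIONS = ["C>A", "C>G", "C>T", "T>A", "T>C", "T>G"]


def mutation_types():
    """Return the 4 x 6 x 4 = 96 labels, e.g. 'T[C>T]A' (hypothetical label format)."""
    return [f"{five}[{sub}]{three}"
            for sub in SUBSTITUTIONS
            for five, three in product(BASES, repeat=2)]


def matmul(A, B):
    """Plain-Python matrix product of two lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def frobenius_error(M, W, H):
    """Squared Frobenius norm of the residual M - W x H."""
    WH = matmul(W, H)
    return sum((m - wh) ** 2 for rm, rwh in zip(M, WH) for m, wh in zip(rm, rwh))


def nmf(M, r, n_iter=200, seed=0, eps=1e-12):
    """Factorize non-negative M (rows x N) into W (rows x r) and H (r x N)."""
    rng = random.Random(seed)
    n_rows, n_cols = len(M), len(M[0])
    # Strictly positive random start, so every entry stays non-negative.
    W = [[rng.random() + 0.1 for _ in range(r)] for _ in range(n_rows)]
    H = [[rng.random() + 0.1 for _ in range(n_cols)] for _ in range(r)]
    for _ in range(n_iter):
        Wt = transpose(W)
        num, den = matmul(Wt, M), matmul(matmul(Wt, W), H)
        H = [[H[k][j] * num[k][j] / (den[k][j] + eps) for j in range(n_cols)]
             for k in range(r)]
        Ht = transpose(H)
        num, den = matmul(M, Ht), matmul(W, matmul(H, Ht))
        W = [[W[i][k] * num[i][k] / (den[i][k] + eps) for k in range(r)]
             for i in range(n_rows)]
    return W, H


TYPES = mutation_types()

# Toy 96 x 5 count matrix standing in for real per-sample SNV counts.
toy_rng = random.Random(1)
M = [[toy_rng.randint(0, 20) for _ in range(5)] for _ in TYPES]

# Columns of W play the role of signatures; rows of H are per-sample weights.
W, H = nmf(M, r=2)
```

In practice a numerical library would replace the list-based matrix products; they are written out here only to keep the sketch dependency-free.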
The K-signature and its correlation with increased expression of APOBEC family genes
To further understand the biological significance of the observed K-signature, we systematically
examined its relevant mutation burden versus the expression change of the APOBEC family genes.
Previous studies [12] suggested that the C→T mutations in the TpC dinucleotide context related to the
kataegis signature might be associated with the AID/APOBEC mediated DNA repair system. In humans,
the APOBEC family has 11 members [29]. A positive correlation was established in TCGA_BRCA samples
between the APOBEC3B expression level and the C→T transition burden [12], but not for APOBEC3G,
another APOBEC family member gene. We defined the K-signature mutation burden per exome as the sum of T(C→T)X and T(C→G)X mutations, which together comprise 8 trinucleotide types. The
overall mutation burden per exome is defined as the sum of all mutations detected in sequences covered
by WES. Among the nine cancers we examined, six had gene expression data, all of which were
generated by TCGA using RNA sequencing (RNA-seq) (https://tcga-data.nci.nih.gov/tcga/). Gene
expression was measured using the normalized count values in the RNA-seq data. For each APOBEC
family gene, samples were separated into three groups according to its expression level: low (rank
between 1-33% of the samples), intermediate (rank between 34-66%), and high (rank between 67-100%).
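As a rough illustration of the two definitions above, the sketch below computes the K-signature burden of one sample from its per-type mutation counts and assigns samples to expression tertiles by rank. The label format (T[C>T]A), the sample IDs, and the handling of ties and group boundaries are assumptions for illustration; the text does not specify them.

```python
# The 8 trinucleotide types T(C>T)X and T(C>G)X, X in {A, C, G, T}.
K_SIGNATURE_TYPES = [f"T[C>{alt}]{three}" for alt in "TG" for three in "ACGT"]


def k_signature_burden(counts):
    """Sum the counts of the 8 K-signature types for one sample.

    counts: dict mapping a mutation-type label to its count in that sample.
    """
    return sum(counts.get(label, 0) for label in K_SIGNATURE_TYPES)


def expression_tertiles(expression):
    """Split samples into low / intermediate / high groups by expression rank.

    expression: dict mapping sample ID to a normalized RNA-seq count.
    Ties are broken by sort order, which is an arbitrary choice here.
    """
    ranked = sorted(expression, key=expression.get)
    n = len(ranked)
    groups = {}
    for rank, sample in enumerate(ranked, start=1):
        if 3 * rank <= n:            # bottom third of the ranks
            groups[sample] = "low"
        elif 3 * rank <= 2 * n:      # middle third
            groups[sample] = "intermediate"
        else:                        # top third
            groups[sample] = "high"
    return groups


# Hypothetical example data.
sample_counts = {"T[C>T]A": 3, "T[C>G]G": 2, "A[C>T]A": 10}
burden = k_signature_burden(sample_counts)

apobec3b_expression = {"s1": 5, "s2": 1, "s3": 9, "s4": 3, "s5": 7, "s6": 11}
groups = expression_tertiles(apobec3b_expression)
```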
We first examined APOBEC3B. Fig. 3 shows the K-signature related mutation burden versus
APOBEC3B gene expression in three TCGA cancers in which the K-signature was observed: BRCA, EC,
and SQCC. A positive correlation was observed in Fig. 3 between APOBEC3B expression and both
the K-signature related mutation burden and the overall somatic mutation burden in BRCA and EC, but
not in SQCC. Furthermore, a comparison of the APOBEC3B expression levels in all six TCGA cancers
revealed that on average, APOBEC3B has a generally high expression level in BRCA, EC, and SQCC, a
moderate level in OvCa, and a low level in CRC and GBM (Fig. 3). Notably, the K-signature was
observed in all three cancers with high APOBEC3B gene expression (BRCA, EC, and SQCC) but not in
two cancers with a low expression (CRC and GBM). In OvCa, the APOBEC3B gene expression is
intermediate among the six cancers. Signature #2 of OvCa (Fig. 1) showed high coefficients for
C→T and C→G mutations in TCX trinucleotides, though these did not form a recognizable K-signature.
Taken together, these results strongly support the notion of a positive association between the K-signature
related mutation burden and increased APOBEC3B gene expression.
In addition to APOBEC3B, we found that APOBEC3A also had a positive correlation with the
K-signature burden. No other gene in the APOBEC family showed a consistent correlation with the K-signature. To reduce potential biases caused by other mutation processes (e.g., mutagen-driven processes or
deficiency in DNA repair genes), we conducted the analysis both in all samples and in the subset of samples with
≤ 200 mutations per exome. As shown in Supplementary Table S1 (all samples) and Supplementary Table
S2 (samples with ≤ 200 mutations per exome), APOBEC3A and APOBEC3B consistently showed a
positive correlation between their gene expression level and the T(C→T)X and T(C→G)X burden in three
cancers: BRCA (with the K-signature), EC (with the K-signature), and OvCa (no recognizable K-signature). However, they did not show the same correlation in SQCC (with the K-signature), CRC
(without the K-signature), or GBM (without the K-signature).
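The restriction to samples with ≤ 200 mutations per exome amounts to a simple filter on the overall mutation burden. A minimal sketch, with hypothetical sample IDs and counts:

```python
def low_burden_subset(total_burden, max_mutations=200):
    """Return IDs of samples whose total mutations per exome are <= max_mutations.

    total_burden: dict mapping sample ID to the total number of somatic
    mutations detected in that exome.
    """
    return sorted(s for s, n in total_burden.items() if n <= max_mutations)


# Hypothetical burdens; the last two samples would be excluded.
total_burden = {"S1": 45, "S2": 200, "S3": 201, "S4": 5300}
kept = low_burden_subset(total_burden)
```

The same burden and expression-group analysis can then be repeated on the retained samples only.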
Our statistical tests showed that, among these cancers, breast tumors had the most significant
correlation: the comparison of K-signature mutation burden between samples with high APOBEC3A
expression and samples with low APOBEC3A expression gave p = 9.63×10⁻¹², while for APOBEC3B,
p = 8.16×10⁻¹⁰ (two-sided Wilcoxon rank sum test). In EC, although the correlation was only marginally
significant in all samples (p = 0.075 for APOBEC3A and p = 0.057 for APOBEC3B), it was significant in
samples with ≤ 200 mutations per exome (p = 0.037 for APOBEC3A and p = 0.023 for APOBEC3B). This
is likely because, in some EC samples, the mutations were influenced by mutant POLE and/or
aberrant MLH1 expression levels (see below). The correlation was also significant in OvCa (p = 6.90×10⁻⁴
for APOBEC3A and p = 6.17×10⁻⁴ for APOBEC3B). In SQCC, however, we did not observe a significant
difference in K-signature mutation burden between expression groups for either APOBEC3A (p = 0.525) or APOBEC3B
(p = 0.602). A potential reason for this observation is that tobacco exposure
inflated the mutations at C nucleotides in lung cancer patients. While we could distinguish the tobacco-related S-signature from the K-signature, it is difficult to determine what proportion of the C→T mutations
was induced by increased APOBEC3A or APOBEC3B expression and what proportion was shifted by tobacco
exposure.
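For readers unfamiliar with the test used above, the following sketch implements a two-sided Wilcoxon rank sum test with the normal approximation, comparing K-signature burdens between a high-expression and a low-expression group. It is an illustration only: the text does not state which software or variant (exact, tie-corrected, continuity-corrected) was used, so the plain normal approximation without corrections is an assumption, and the burdens below are made-up numbers.

```python
from math import sqrt
from statistics import NormalDist


def average_ranks(values):
    """Return 1-based ranks of values, giving tied values their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank sum test (normal approximation, no corrections).

    Returns (z, p), where z is the standardized rank sum of group x.
    """
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    r1 = sum(ranks[:n1])                      # rank sum of the first group
    mean = n1 * (n1 + n2 + 1) / 2             # expected rank sum under H0
    sd = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)   # standard deviation under H0
    z = (r1 - mean) / sd
    p = 2 * NormalDist().cdf(-abs(z))
    return z, p


# Hypothetical K-signature burdens in low- and high-expression groups.
low_group = [2, 3, 3, 4, 5, 5, 6, 7, 8, 9]
high_group = [10, 12, 13, 15, 18, 20, 22, 25, 30, 41]
z, p = rank_sum_test(low_group, high_group)
```

For small groups or heavily tied data, an exact or tie-corrected variant from a statistics package would be more appropriate than this approximation.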
Supplementary Table S1. Mutation burdens (C→T and C→G in the TCX context) versus
expression changes of the APOBEC family genes.
Gene                        TCGA_BRCA     TCGA_CRC      TCGA_EC       TCGA_GBM      TCGA_OvCa     TCGA_SQCC
# samples                   500           221           241           150           163           177

Absolute gene expression
APOBEC1                     NA            0.0139*       NA            NA            NA            NA
APOBEC2                     0.1565        0.2422        0.8470        0.3630        0.8460        0.7345
APOBEC3A                    9.63×10⁻¹²*   0.0296*       0.0749        0.2682        6.897×10⁻⁴*   0.5253
APOBEC3B                    8.16×10⁻¹⁰*   0.3770        0.0573        0.9448        6.170×10⁻⁴*   0.6015
APOBEC3C                    0.4060        0.4461        0.9035        0.6207        0.3197        0.1775
APOBEC3D                    0.8651        0.3400        0.2314        0.7686        0.8727        0.6376
APOBEC3F                    0.8025        0.0046*       0.3236        0.8114        0.4230        0.4010
APOBEC3G                    0.1132        0.0225*       0.1379        0.0920        0.3830        0.6015
APOBEC3H                    0.1622        0.0971        0.1697        0.1164        0.0635        0.7244
APOBEC4                     NA            NA            0.0114*       NA            0.3299        0.6147
Both APOBEC3A and APOBEC3B  1.83×10⁻¹⁰*   0.1375        0.0466*       0.1378        1.655×10⁻⁴*   0.2494

Gene expression relative to TBP
APOBEC1                     NA            0.0034*       NA            NA            NA            NA
APOBEC2                     0.0687        0.1810        0.8967        0.4146        0.7532        0.3038
APOBEC3A                    1.27×10⁻¹¹*   0.0552        0.0760        0.2864        0.0357*       0.7528
APOBEC3B                    1.45×10⁻⁹*    0.2478        0.1351        0.9201        0.0391*       0.9635
APOBEC3C                    0.5478        0.4662        0.8590        0.8572        0.1545        0.0365*
APOBEC3D                    0.7176        0.1061        0.3065        0.6157        0.8485        0.5394
APOBEC3F                    0.8479        0.0017*       0.0566        0.4676        0.5501        0.0362*
APOBEC3G                    0.1364        0.0064*       0.4484        0.2364        0.7370        0.9571
APOBEC3H                    0.3513        0.0617        0.0646        0.3893        0.0683        0.1836
APOBEC4                     NA            NA            0.0334*       NA            0.9165        0.6319

Sample counts include only samples with both somatic mutation and gene expression data. TBP: a
housekeeping gene. p-values < 0.05 are marked with an asterisk (*).
Supplementary Table S2. Mutation burdens (C→T and C→G in the TCX context) versus
expression changes of the APOBEC family genes in samples with ≤ 200 mutations per exome.
Gene                        TCGA_BRCA     TCGA_CRC      TCGA_EC       TCGA_GBM      TCGA_OvCa     TCGA_SQCC
# samples                   481           185           155           149           163           32

Absolute gene expression
APOBEC1                     NA            0.6462        NA            NA            NA            NA
APOBEC2                     0.1114        0.2351        0.5482        0.3766        0.8460        0.2635
APOBEC3A                    1.20×10⁻¹⁰*   0.6741        0.0366*       0.4161        6.897×10⁻⁴*   0.0937
APOBEC3B                    7.65×10⁻¹⁰*   0.2281        0.0227*       0.7835        6.170×10⁻⁴*   0.6454
APOBEC3C                    0.8998        0.5033        0.3351        0.7536        0.3197        0.8180
APOBEC3D                    0.1872        0.5081        0.0805        0.5827        0.8727        0.7174
APOBEC3F                    0.6689        0.1823        0.7883        0.8726        0.4230        0.5324
APOBEC3G                    0.3864        0.9428        0.5459        0.0729        0.3830        0.5106
APOBEC3H                    0.4661        0.7447        0.8959        0.1945        0.0635        0.4495
APOBEC4                     NA            NA            0.0079*       NA            0.3299        0.0564
Both APOBEC3A and APOBEC3B  7.51×10⁻¹⁰*   0.4236        0.0084*       0.1378        1.655×10⁻⁴*   0.6095

Gene expression relative to TBP
APOBEC1                     NA            0.3465        NA            NA            NA            NA
APOBEC2                     0.1161        0.3075        0.4463        0.3082        0.7532        0.6692
APOBEC3A                    3.90×10⁻¹¹*   0.9836        0.0665        0.1990        0.0357*       0.2239
APOBEC3B                    3.91×10⁻¹⁰*   0.0660        0.0217*       0.8362        0.0391*       0.1481
APOBEC3C                    0.7890        0.4633        0.8667        0.9205        0.1545        0.0525
APOBEC3D                    0.2683        0.7005        0.1775        0.6763        0.8485        0.3929
APOBEC3F                    0.4546        0.3062        0.3405        0.3747        0.5501        1.0000
APOBEC3G                    0.3567        0.7917        0.8510        0.3159        0.7370        0.1883
APOBEC3H                    0.7245        0.8616        0.6749        0.5212        0.0683        0.8694
APOBEC4                     NA            NA            0.0073*       NA            0.9165        0.0383*

Sample counts include only samples with both somatic mutation and gene expression data. TBP: a
housekeeping gene. p-values < 0.05 are marked with an asterisk (*).
Supplementary Figure S1. APOBEC3B gene activity in TCGA cancers.
Mutation burdens versus APOBEC3B gene expression in TCGA_BRCA (A), TCGA_EC (B), and
TCGA_SQCC (C) samples are displayed respectively. The mutation burden measured by the K-signature
(the sum of C→T and C→G mutations in TCX) and the total mutation burden were plotted using three
groups of samples with low (rank 1-33%), intermediate (rank 34-66%), and high (rank 67-100%)
expression of the APOBEC3B gene. The overall APOBEC3B gene expression in all 6 TCGA cancers is
plotted in (D). In (D), APOBEC3B expression is given relative to the expression of the
housekeeping gene TBP to provide a fair comparison across cancers.
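One simple way to express a gene relative to TBP is a per-sample ratio of normalized counts, sketched below. Whether a plain ratio, a log ratio, or another form was used is not stated in the text, so the ratio here is an assumption, as are the example values.

```python
def relative_to_reference(expression, gene="APOBEC3B", reference="TBP"):
    """Return the expression of `gene` divided by that of `reference` in one sample.

    expression: dict mapping gene symbol to a normalized RNA-seq count.
    """
    ref = expression[reference]
    if ref <= 0:
        raise ValueError(f"{reference} expression must be positive, got {ref}")
    return expression[gene] / ref


# Hypothetical normalized counts for one sample.
sample = {"APOBEC3B": 150.0, "TBP": 600.0}
ratio = relative_to_reference(sample)
```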
Supplementary Figure S2. Comparison of mutational signatures in TCGA_CRC and TCGA_EC
using all samples with those using a subset of samples with no mutant POLE.
Mutation burden measured by the K-signature (the sum of C→T and C→G mutations in TCX) and the
total mutation burden were plotted by three groups of samples with low (rank 1-33%), intermediate (rank
34-66%), and high (rank 67-100%) APOBEC3B gene expression.
Supplementary Figure S3. Comparison of signatures obtained using all SNVs and those obtained
excluding non-CpG island (CGI) C→T mutations.
In all 7 cancers, the non-CGI signature disappeared when C→T mutations in non-CGI regions were excluded.
Supplementary Figure S4. Mutation burden on each sample.
Somatic SNVs in all datasets were detected from whole exome sequencing, i.e., mutation burden per
exome. Note that the scale on Y-axis is different.