Patterns and processes of somatic mutations in nine major cancers

Peilin Jia1,2 and Zhongming Zhao1,2,3,4*

1 Department of Biomedical Informatics, Vanderbilt University School of Medicine, Nashville, TN 37203, USA.
2 Center for Quantitative Sciences, Vanderbilt University Medical Center, Nashville, TN 37232, USA.
3 Department of Cancer Biology, Vanderbilt University School of Medicine, Nashville, TN 37232, USA.
4 Vanderbilt-Ingram Cancer Center, Vanderbilt University Medical Center, Nashville, TN 37232, USA.

* Corresponding author

Email addresses:
PJ: [email protected]
ZZ: [email protected]

Supplementary Text

Non-negative Matrix Factorization (NMF)

Given a mutation matrix M, NMF factorizes M into two matrices, W and H, i.e., $M_{96 \times N} = W_{96 \times r} \times H_{r \times N} + \varepsilon$, where r is the factorization rank, corresponding to the number of mutational signatures to be detected, and N is the total number of samples. The rank r is determined by evaluating the cophenetic correlation and sparseness. In our work, we organized the somatic SNVs of each cancer into a 96×N mutation matrix. For the 6 known mutation types (without considering DNA strands), the sequence context was examined using one base pair before and one base pair after the mutation site, i.e., a trinucleotide sequence. With 4 possible nucleotides at each of the two flanking positions, there are a total of 4×6×4 = 96 trinucleotide-based mutation types.

The K-signature and its correlation with increased expression of APOBEC family genes

To further understand the biological significance of the observed K-signature, we systematically examined its relevant mutation burden versus the expression change of the APOBEC family genes. Previous studies [12] suggested that the C→T mutations in the TpC dinucleotide context related to the kataegis signature might be associated with the AID/APOBEC-mediated DNA repair system. In humans, the APOBEC family has 11 members [29].
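As an illustration of the 96-type layout described in the NMF section above, the following minimal Python sketch enumerates the 4×6×4 trinucleotide mutation types and tallies a toy 96×N count matrix. The function names, the input tuple format, and the toy calls are assumptions made for this example and are not taken from the authors' pipeline. Folding mutations with a purine reference base onto the opposite strand is assumed here as the way to ignore DNA strand.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")
# The six substitution classes, written with a pyrimidine (C or T) reference base.
SUBSTITUTIONS = [("C", "A"), ("C", "G"), ("C", "T"), ("T", "A"), ("T", "C"), ("T", "G")]

# 4 upstream bases x 6 substitutions x 4 downstream bases = 96 mutation types.
CHANNELS = [
    (five + ref + three, alt)
    for (ref, alt), five, three in product(SUBSTITUTIONS, BASES, BASES)
]
CHANNEL_INDEX = {channel: i for i, channel in enumerate(CHANNELS)}


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def to_channel(trinucleotide, alt):
    """Fold a mutation onto the pyrimidine strand and return its (context, alt) key."""
    if trinucleotide[1] in "AG":  # purine reference: describe it on the opposite strand
        trinucleotide = reverse_complement(trinucleotide)
        alt = alt.translate(COMPLEMENT)
    return (trinucleotide, alt)


def build_mutation_matrix(snvs, samples):
    """Return a 96 x N list of lists of counts.

    snvs: iterable of (sample, reference trinucleotide, alternate base) tuples
          (a hypothetical input format chosen for this sketch).
    """
    counts = Counter((CHANNEL_INDEX[to_channel(tri, alt)], s) for s, tri, alt in snvs)
    return [[counts[(i, s)] for s in samples] for i in range(len(CHANNELS))]


# Toy, made-up calls: the second one (TGA, G>A) folds onto the same type as the first.
toy_snvs = [("s1", "TCA", "T"), ("s1", "TGA", "A"), ("s2", "ACG", "T")]
M = build_mutation_matrix(toy_snvs, ["s1", "s2"])
```

A matrix built this way has one row per mutation type and one column per sample, which is the shape the factorization into W (96×r) and H (r×N) expects.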
A positive correlation was established in TCGA_BRCA samples between the APOBEC3B expression level and the C→T transition burden [12], but not for APOBEC3G, another APOBEC family member gene. We defined the K-signature mutation burden per exome as the sum of T(C→T)X and T(C→G)X mutations, which together comprise 8 trinucleotide mutation types. The overall mutation burden per exome is defined as the sum of all mutations detected in sequences covered by WES.

Among the nine cancers we examined, six had gene expression data, all of which were generated by TCGA using RNA sequencing (RNA-seq) (https://tcga-data.nci.nih.gov/tcga/). Gene expression was measured using the normalized count values in the RNA-seq data. For each APOBEC family gene, samples were separated into three groups according to its expression level: low (ranked within 1-33% of the samples), intermediate (34-66%), and high (67-100%).

We first examined APOBEC3B. Fig. 3 shows the K-signature related mutation burden versus APOBEC3B gene expression in the three TCGA cancers in which the K-signature was observed: BRCA, EC, and SQCC. In Fig. 3, APOBEC3B expression was positively correlated with both the K-signature related mutation burden and the overall somatic mutation burden in BRCA and EC, but not in SQCC. Furthermore, a comparison of the APOBEC3B expression levels in all six TCGA cancers revealed that, on average, APOBEC3B has a generally high expression level in BRCA, EC, and SQCC, a moderate level in OvCa, and a low level in CRC and GBM (Fig. 3). Notably, the K-signature was observed in all three cancers with high APOBEC3B gene expression (BRCA, EC, and SQCC) but not in the two cancers with low expression (CRC and GBM). In OvCa, the APOBEC3B gene expression is intermediate among the six cancers. Signature #2 of OvCa (Fig. 1) presented high coefficients for C→T and C→G mutations in TCX trinucleotides, though they did not form a recognizable K-signature.
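The burden definition and the three-way expression grouping described above can be sketched as follows. This is a minimal illustration only: the dictionary-based inputs, the function names, and the toy values are assumptions. The handling of group boundaries and ties is also assumed, since the text gives only the 1-33%, 34-66%, and 67-100% rank ranges.

```python
# The 8 K-signature mutation types: T(C>T)X and T(C>G)X for X in A, C, G, T.
K_CHANNELS = {("TC" + three, alt) for three in "ACGT" for alt in ("T", "G")}


def k_signature_burden(channel_counts):
    """Sum the counts of the 8 TCX C>T / C>G types for one sample.

    channel_counts: dict mapping (trinucleotide, alt) -> count
                    (a hypothetical per-sample input format).
    """
    return sum(n for channel, n in channel_counts.items() if channel in K_CHANNELS)


def expression_tertiles(expression):
    """Assign each sample to 'low', 'intermediate', or 'high' by expression rank.

    expression: dict mapping sample id -> normalized expression value.
    Ties keep their input order (an arbitrary choice made for this sketch).
    """
    ranked = sorted(expression, key=expression.get)
    n = len(ranked)
    groups = {}
    for position, sample in enumerate(ranked, start=1):
        if 3 * position <= n:            # lowest third of the ranks
            groups[sample] = "low"
        elif 3 * position <= 2 * n:      # middle third
            groups[sample] = "intermediate"
        else:                            # highest third
            groups[sample] = "high"
    return groups


# Toy, made-up data for one sample and for six samples.
toy_counts = {("TCA", "T"): 3, ("TCG", "G"): 2, ("ACG", "T"): 5, ("TCA", "A"): 1}
toy_expression = {"s1": 5.0, "s2": 1.0, "s3": 9.0, "s4": 3.0, "s5": 7.0, "s6": 2.0}

burden = k_signature_burden(toy_counts)
groups = expression_tertiles(toy_expression)
```

Integer comparisons are used for the group boundaries so that the assignment does not depend on floating-point rounding.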
Put together, these results strongly support the notion of a positive association between the K-signature related mutation burden and increased APOBEC3B gene expression.

In addition to APOBEC3B, we found that APOBEC3A also had a positive correlation with the K-signature burden. No other genes in the APOBEC family showed a consistent correlation with the K-signature. To reduce potential biases caused by other mutation processes (e.g., mutagen-driven processes or deficiency in DNA repair genes), we conducted the analysis in all samples and in a subset of samples with ≤ 200 mutations per exome. As shown in Supplementary Table S1 (all samples) and Supplementary Table S2 (samples with ≤ 200 mutations per exome), APOBEC3A and APOBEC3B consistently showed a positive correlation between their gene expression level and the T(C→T)X and T(C→G)X burden in three cancers: BRCA (with the K-signature), EC (with the K-signature), and OvCa (no recognizable K-signature). However, they did not show the same correlation in SQCC (with the K-signature), CRC (without the K-signature), or GBM (without the K-signature).

Our statistical tests showed that, among these cancers, breast tumors had the strongest correlation: comparing the K-signature mutation burden between the samples with high APOBEC3A expression and the samples with low APOBEC3A expression gave p = 9.63×10^-12, while for APOBEC3B, p = 8.16×10^-10 (two-sided Wilcoxon rank sum test). In EC, although the correlation was only marginally significant in all samples (p = 0.075 for APOBEC3A and p = 0.057 for APOBEC3B), it was significant for samples with ≤ 200 mutations per exome (p = 0.037 for APOBEC3A and p = 0.023 for APOBEC3B). This result is likely because, in some EC samples, the mutations were influenced by mutant POLE and/or aberrant MLH1 expression levels (see below). The correlation remained significant in OvCa (p = 6.90×10^-4 for APOBEC3A and p = 6.17×10^-4 for APOBEC3B).
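To make the high-versus-low comparison above concrete, here is a minimal sketch of a two-sided Wilcoxon rank sum test using only the Python standard library. It relies on the normal approximation with a tie correction. The text does not state which software or which variant (exact or approximate) was used, so this choice, the function name, and the toy burdens are all assumptions.

```python
import math


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank sum test (normal approximation, tie-corrected).

    Returns (z, p). Assumes both groups are non-empty and that not all
    pooled values are identical (otherwise the variance is zero).
    """
    # Pool the values, remembering which group (0 = x, 1 = y) each came from.
    pooled = sorted((v, g) for g, values in enumerate((x, y)) for v in values)
    n1, n2 = len(x), len(y)
    n = n1 + n2

    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1                      # pooled[i:j] is one block of tied values
        avg_rank = (i + 1 + j) / 2      # mean of the ranks i+1 .. j
        t = j - i
        tie_term += t**3 - t
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j

    mean = n1 * (n + 1) / 2
    var = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    z = (rank_sum_x - mean) / math.sqrt(var)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability of N(0, 1)
    return z, p


# Toy, made-up K-signature burdens for the low- and high-expression groups.
low_expr_burden = [1, 2, 3, 4, 5]
high_expr_burden = [6, 7, 8, 9, 10]
z, p = rank_sum_test(high_expr_burden, low_expr_burden)
```

For small groups or heavy ties, an exact test from a statistics package would be the safer choice. The approximation is shown only because it fits in a few lines.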
In SQCC, however, we did not observe a significant difference in the K-signature mutation burden between either the APOBEC3A (p = 0.525) or the APOBEC3B (p = 0.602) gene expression groups. A potential reason for this observation is that tobacco exposure inflated the mutations at C nucleotides in lung cancer patients. While we could distinguish the tobacco-related S-signature from the K-signature, it is difficult to determine the proportion of the C→T mutations that is either induced by the increased APOBEC3A or APOBEC3B expression or shifted by tobacco exposure.

Supplementary Table S1. Mutation burdens (C→T and C→G in the TCX context) versus expression changes of the APOBEC family genes.

                            TCGA_BRCA    TCGA_CRC     TCGA_EC      TCGA_GBM     TCGA_OvCa    TCGA_SQCC
# samples                   500          221          241          150          163          177
Absolute gene expression
APOBEC1                     NA           0.0139       NA           NA           NA           NA
APOBEC2                     0.1565       0.2422       0.8470       0.3630       0.8460       0.7345
APOBEC3A                    9.63×10^-12  0.0296       0.0749       0.2682       6.897×10^-4  0.5253
APOBEC3B                    8.16×10^-10  0.3770       0.0573       0.9448       6.170×10^-4  0.6015
APOBEC3C                    0.4060       0.4461       0.9035       0.6207       0.3197       0.1775
APOBEC3D                    0.8651       0.3400       0.2314       0.7686       0.8727       0.6376
APOBEC3F                    0.8025       0.0046       0.3236       0.8114       0.4230       0.4010
APOBEC3G                    0.1132       0.0225       0.1379       0.0920       0.3830       0.6015
APOBEC3H                    0.1622       0.0971       0.1697       0.1164       0.0635       0.7244
APOBEC4                     NA           NA           0.0114       NA           0.3299       0.6147
Both APOBEC3A and APOBEC3B  1.83×10^-10  0.1375       0.0466       0.1378       1.655×10^-4  0.2494
Gene expression relative to TBP
APOBEC1                     NA           0.0034       NA           NA           NA           NA
APOBEC2                     0.0687       0.1810       0.8967       0.4146       0.7532       0.3038
APOBEC3A                    1.27×10^-11  0.0552       0.0760       0.2864       0.0357       0.7528
APOBEC3B                    1.45×10^-9   0.2478       0.1351       0.9201       0.0391       0.9635
APOBEC3C                    0.5478       0.4662       0.8590       0.8572       0.1545       0.0365
APOBEC3D                    0.7176       0.1061       0.3065       0.6157       0.8485       0.5394
APOBEC3F                    0.8479       0.0017       0.0566       0.4676       0.5501       0.0362
APOBEC3G                    0.1364       0.0064       0.4484       0.2364       0.7370       0.9571
APOBEC3H                    0.3513       0.0617       0.0646       0.3893       0.0683       0.1836
APOBEC4                     NA           NA           0.0334       NA           0.9165       0.6319

The numbers of samples are those with both somatic mutations and gene expression data. TBP: a housekeeping gene.
p-values < 0.05 are shown in bold.

Supplementary Table S2. Mutation burdens (C→T and C→G in the TCX context) versus expression changes of the APOBEC family genes in samples with ≤ 200 mutations per exome.

                            TCGA_BRCA    TCGA_CRC     TCGA_EC      TCGA_GBM     TCGA_OvCa    TCGA_SQCC
# samples                   481          185          155          149          163          32
Absolute gene expression
APOBEC1                     NA           0.6462       NA           NA           NA           NA
APOBEC2                     0.1114       0.2351       0.5482       0.3766       0.8460       0.2635
APOBEC3A                    1.20×10^-10  0.6741       0.0366       0.4161       6.897×10^-4  0.0937
APOBEC3B                    7.65×10^-10  0.2281       0.0227       0.7835       6.170×10^-4  0.6454
APOBEC3C                    0.8998       0.5033       0.3351       0.7536       0.3197       0.8180
APOBEC3D                    0.1872       0.5081       0.0805       0.5827       0.8727       0.7174
APOBEC3F                    0.6689       0.1823       0.7883       0.8726       0.4230       0.5324
APOBEC3G                    0.3864       0.9428       0.5459       0.0729       0.3830       0.5106
APOBEC3H                    0.4661       0.7447       0.8959       0.1945       0.0635       0.4495
APOBEC4                     NA           NA           0.0079       NA           0.3299       0.0564
Both APOBEC3A and APOBEC3B  7.51×10^-10  0.4236       0.0084       0.1378       1.655×10^-4  0.6095
Gene expression relative to TBP
APOBEC1                     NA           0.3465       NA           NA           NA           NA
APOBEC2                     0.1161       0.3075       0.4463       0.3082       0.7532       0.6692
APOBEC3A                    3.90×10^-11  0.9836       0.0665       0.1990       0.0357       0.2239
APOBEC3B                    3.91×10^-10  0.0660       0.0217       0.8362       0.0391       0.1481
APOBEC3C                    0.7890       0.4633       0.8667       0.9205       0.1545       0.0525
APOBEC3D                    0.2683       0.7005       0.1775       0.6763       0.8485       0.3929
APOBEC3F                    0.4546       0.3062       0.3405       0.3747       0.5501       1.0000
APOBEC3G                    0.3567       0.7917       0.8510       0.3159       0.7370       0.1883
APOBEC3H                    0.7245       0.8616       0.6749       0.5212       0.0683       0.8694
APOBEC4                     NA           NA           0.0073       NA           0.9165       0.0383

The numbers of samples are those with both somatic mutations and gene expression data. TBP: a housekeeping gene. p-values < 0.05 are shown in bold.

Supplementary Figure S1. APOBEC3B gene activity in TCGA cancers. Mutation burdens versus APOBEC3B gene expression in TCGA_BRCA (A), TCGA_EC (B), and TCGA_SQCC (C) samples are displayed, respectively.
The mutation burden measured by the K-signature (the sum of C→T and C→G mutations in TCX) and the total mutation burden were plotted for three groups of samples with low (rank 1-33%), intermediate (rank 34-66%), and high (rank 67-100%) expression of the APOBEC3B gene. The overall APOBEC3B gene expression in all 6 TCGA cancers is plotted in (D). Note that in (D) we use the expression of the APOBEC3B gene relative to that of a housekeeping gene, TBP, to provide a fair comparison across multiple cancers.

Supplementary Figure S2. Comparison of mutational signatures in TCGA_CRC and TCGA_EC using all samples with those using a subset of samples with no mutant POLE. Mutation burden measured by the K-signature (the sum of C→T and C→G mutations in TCX) and the total mutation burden were plotted for three groups of samples with low (rank 1-33%), intermediate (rank 34-66%), and high (rank 67-100%) APOBEC3B gene expression.

Supplementary Figure S3. Comparison of signatures obtained using all SNVs and those obtained excluding non-CpG island (CGI) C→T mutations. In all 7 cancers, the non-CGI signature disappeared when the C→T mutations in non-CGI regions were excluded.

Supplementary Figure S4. Mutation burden of each sample. Somatic SNVs in all datasets were detected from whole exome sequencing, i.e., the mutation burden is per exome. Note that the Y-axis scales differ.