brief communications

© 2017 Nature America, Inc., part of Springer Nature. All rights reserved.

Accurate identification of single-nucleotide variants in whole-genome-amplified single cells

Xiao Dong1,4, Lei Zhang1,4, Brandon Milholland1,4, Moonsook Lee1, Alexander Y Maslov1, Tao Wang2 & Jan Vijg1,3

Mutation analysis in single-cell genomes is prone to artifacts associated with cell lysis and whole-genome amplification. Here we addressed these issues by developing single-cell multiple displacement amplification (SCMDA) and a general-purpose single-cell-variant caller, SCcaller (https://github.com/biosinodx/SCcaller/). By comparing SCMDA-amplified single cells with unamplified clones from the same population, we validated the procedure as a firm foundation for standardized somatic-mutation analysis in single-cell genomics.

Genetic-variant analysis in single cells suffers from artifacts associated with whole-genome amplification (WGA). For example, cytosine deamination due to single-cell lysis and DNA denaturation at elevated temperatures results in a common CG→TA transition artifact1. This pathway may explain the observed excess of such mutations in single neurons2 compared with unamplified neuronal clones3. Amplification errors may be enriched through allelic amplification bias, a characteristic of multiple displacement amplification (MDA), the most frequently used method for single-cell WGA4, thus resulting in artifactual mutation calls (Supplementary Fig. 1a,b). Here we report SCMDA and a single-cell-variant caller (SCcaller; http://github.com/biosinodx/SCcaller/; Supplementary Software), which together offer a validated protocol to accurately identify single-nucleotide variants (SNVs) across the genome from a single cell after WGA.

To address cytosine-deamination artifacts, SCMDA involves performing cell lysis and DNA denaturation on ice through alkaline lysis.
To compensate for the much lower effectiveness of cell lysis and DNA denaturation at low temperatures, we reconfigured the MDA procedure to improve the annealing of hexamer primers (Online Methods). We then developed SCcaller, which corrects for local allelic amplification bias in SNV calling.

We validated SCMDA and SCcaller by directly comparing SNVs between amplified single-cell genomes and unamplified DNA from clones, both derived from the same population of early-passage human primary fibroblasts (Fig. 1). We also sequenced SCMDA-amplified single cells and nonamplified clones derived from the same early-growing clone (approximately five divisions and 20–30 cells), reasoning that there should be substantial overlap between the single cells and their kindred clone (Fig. 1). Finally, we also included single-cell libraries prepared with high-temperature lysis and DNA denaturation by using a commercially available kit (Online Methods) to confirm artifactual mutations induced by cytosine deamination.

[Figure 1 graphic. Labels: single cells IL2–5, IL11 and IL12 (SCMDA); kindred clone IL1C (no amplification, WGS); clones C1–3 (no amplification, WGS); single cells HL1-2 (high-temperature MDA). Panel b categories: real mutations (TPs), real mutations (FNs), artifacts (FPs), de novo during cloning.]

Figure 1 | Experimental design for validating SNV identification in SCMDA-amplified single cells. (a) To assess single-cell amplification and variant-calling accuracy, whole-genome sequencing (WGS) was performed on (i) four SCMDA-amplified single cells (red); (ii) two SCMDA-amplified cells and their nonamplified kindred clone (green); (iii) three additional unamplified clones (blue); and (iv) two single cells amplified after high-temperature lysis (gray). Cell and clone IDs are shown. (b) The kindred cells and clone are expected to have identical genotypes, including both germline and somatic SNVs. Candidate SNVs identified in both the clone and single cells are true positives (TPs); those identified in the clone but in neither cell are false negatives (FNs); and those identified in only one cell are false positives (FPs) (Supplementary Note). These conservative assumptions do not consider de novo mutations in the kindred clone or single cells.

1Department of Genetics, Albert Einstein College of Medicine, Bronx, New York, USA. 2Department of Epidemiology & Population Health, Albert Einstein College of Medicine, Bronx, New York, USA. 3Department of Ophthalmology & Visual Sciences, Albert Einstein College of Medicine, Bronx, New York, USA. 4These authors contributed equally to this work. Correspondence should be addressed to J.V. ([email protected]).

Received 17 June 2016; accepted 22 February 2017; published online 20 March 2017; doi:10.1038/nmeth.4227

nature methods | ADVANCE ONLINE PUBLICATION

[Figure 2 graphic. Panels a–f: cells IL11 and IL12 (SCMDA); ref. 2 (MDA, n = 10); ref. 7 (MALBAC, n = 3). Axes: sensitivity versus FP per Mb (all SNPs, heterozygous SNPs, default cutoff); no. germline SNPs called per cell (×10^6) versus no. artifacts; no. somatic SNVs called per cell (TPs, FPs) with FDR per caller; percentage of overlapping somatic SNVs per cell. Callers shown: SCcaller, Monovar, Haplotypecaller, MuTect, VarScan.]

Single cells isolated with the CellRaft system (Supplementary Figs. 2 and 3) were subjected to SCMDA, library preparation and sequencing5 (Online Methods, Supplementary Note and Supplementary Tables 1 and 2).
As a prescreen to test for the relative uniformity of amplification, we used real-time PCR at eight loci; 29 of 44 cells (66%) passed our quality-control criteria and were eligible for sequencing (Supplementary Note and Supplementary Table 1). In sequenced single cells, 85% of the genome on average was sequenced to a depth of at least five reads, as compared with approximately 90% in the clones and bulk cell population (Supplementary Table 3). Genome-wide coverage uniformity was evaluated with Lorenz plots, which indicated that, as expected, the unamplified bulk DNA showed the least bias (Supplementary Fig. 4). In addition, amplicon samples produced by SCMDA exhibited less bias than that produced by the commercial high-temperature lysis system (Supplementary Fig. 4) or by other amplification protocols6,7. To call SNVs from the sequencing data, we first predicted the degree of local allelic amplification bias by using heterozygous germline single-nucleotide polymorphisms (hSNPs) (Supplementary Fig. 5a–c). Because MDA starts at random positions and elongates to several kilobases, it is possible to predict the degree of allelic bias at a particular locus by considering the degree of bias in neighboring hSNPs by using kernel smoothing (Online Methods, Supplementary Fig. 6a–d and Supplementary Table 4). We designed SCcaller to adjust allelic amplification bias when estimating the likelihoods of three possibilities—artifact, heterozygous SNV and homozygous SNV—for every candidate SNV locus (Online Methods, Supplementary Fig. 7 and Supplementary Fig. 8a,b). A likelihood-ratio test was used to distinguish real SNVs from artifacts under a certain significance level (α). Its null distribution, corresponding to the amplification errors, was sampled from the input data. This input-specific null accounts for sample-specific amplification bias, sequencing depth and quality. 
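The likelihood-ratio test described above can be sketched in a few lines. The Python below is a minimal illustration, assuming the per-read error model and the artifact fraction θ/8 given in the Online Methods. The function names, the handling of ties between reference and variant read counts, and the quantile-based choice of cutoff are assumptions made for this sketch; they are not taken from the SCcaller source code.

```python
import math
from typing import Sequence, Tuple

# One read at a candidate site: (called base, error probability from the Phred score).
# The sketch assumes 0 < error probability < 1 so that every log is defined.
Read = Tuple[str, float]


def read_probability(base: str, err: float, ref: str, alt: str, f: float) -> float:
    """Probability of one called base given variant-allele fraction f."""
    if base == ref:
        return f * err / 3.0 + (1.0 - f) * (1.0 - err)
    if base == alt:
        return f * (1.0 - err) + (1.0 - f) * err / 3.0
    return err / 3.0


def log_likelihood(reads: Sequence[Read], ref: str, alt: str, f: float) -> float:
    """Sum of per-read log probabilities, treating reads as independent."""
    return sum(math.log(read_probability(b, e, ref, alt, f)) for b, e in reads)


def log_lambda(reads: Sequence[Read], ref: str, alt: str, theta: float) -> float:
    """log(L(H0) / L(H1)) for one candidate site.

    theta is the locally estimated major-allele fraction. Under H0 (artifact)
    the variant fraction is theta / 8. Under H1 (heterozygous SNV) it is theta
    when variant reads outnumber reference reads and 1 - theta otherwise.
    Treating a tie like "more reference reads" is an assumption of this sketch.
    """
    n_alt = sum(1 for b, _ in reads if b == alt)
    n_ref = sum(1 for b, _ in reads if b == ref)
    f_h1 = theta if n_alt > n_ref else 1.0 - theta
    f_h0 = theta / 8.0
    return log_likelihood(reads, ref, alt, f_h0) - log_likelihood(reads, ref, alt, f_h1)


def cutoff_from_null(null_log_lambdas: Sequence[float], alpha: float) -> float:
    """Empirical alpha-quantile of log-lambda values obtained under H0.

    How the null values are sampled from the input data is outside this sketch;
    the list is assumed to be non-empty.
    """
    ordered = sorted(null_log_lambdas)
    k = max(int(math.floor(alpha * len(ordered))) - 1, 0)
    return ordered[k]


def is_real_snv(log_lam: float, log_eta: float) -> bool:
    """Reject H0 (call the SNV real) when log-lambda is at or below the cutoff."""
    return log_lam <= log_eta
```

In this sketch a site with balanced reference and variant reads gives a strongly negative log ratio, whereas a site with a single variant read among many reference reads gives a positive one, so only the former would fall below a cutoff taken from the lower tail of the null.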
Because the single cells and clone that were derived from kindred cells should share real mutations but not artifacts, we used this system to evaluate the accuracy of SCMDA and SCcaller (Fig. 1b). For calling germline SNPs (i.e., SNPs present in unamplified bulk WGS), SCcaller reached 90.1% sensitivity at a cost of 0.12 false positives (FPs) per million base pairs (Fig. 2a,b). In comparison, Haplotypecaller from GATK8, a commonly used SNP caller for bulk sequencing data, had more than seven times as many FPs (0.87 FPs per million bp). SCcaller performed similarly on a published MDA-amplified single-cell data set from another laboratory2 and substantially outperformed Haplotypecaller and Monovar9 (Fig. 2c and Supplementary Note). We then tested whether SCcaller displays the same high specificity for identifying somatic SNVs (SNVs absent from bulk sequencing) by using our kindred single cells and clone. We found that SCcaller had a much higher specificity (false discovery rate (FDR) of 0.308–0.393) than did MuTect10, VarScan11 and Monovar9 (FDR of 0.745–0.860), most probably because it accounts for amplification errors (Fig. 2d,e). Its sensitivity was either higher than (Monovar) or similar to (MuTect and VarScan) that of the other callers. Sixteen randomly picked SNVs called by SCcaller were confirmed with Sanger sequencing of the kindred group versus bulk (Supplementary Table 5). On a published single-cell data set obtained by multiple annealing and looping–based amplification cycles (MALBAC)7, SCcaller reported more than twice as many overlapping somatic-SNV calls in all kindred cells as did Monovar, MuTect or VarScan (Fig. 2f and Supplementary Note). Notably, the fraction of overlapping calls in this data set was very low (0.3%) as compared with that obtained with SCMDA (65% on average), probably as a result of our highly optimized experimental protocol for somatic-SNV detection. Hence, SCcaller, when applied to different independently published data sets, proved superior to other variant callers, including those specifically designed for single-cell applications. Importantly, SCcaller is limited to genomic regions of SNP heterozygosity (Online Methods).

Figure 2 | Accuracy of SCcaller in single-cell SNV calling. (a,b) Sensitivity and FP rate of germline SNP calling in cell IL11 (a) and IL12 (b). Sensitivity is the ratio of TPs to FNs plus TPs. FP per Mb is the number of FPs per million base pairs. Default cutoff refers to α = 0.01 by SCcaller. (c) SCcaller generates the fewest FPs for germline SNP calling in a data set from ref. 2. The number of artifacts was approximated as the number of SNVs unique to one cell (Supplementary Note). Because unique SNVs also include real somatic SNVs, this approximation is the upper bound of the number of artifacts. (d,e) SCcaller generates few FP somatic-SNV calls in cell IL11 (d) and IL12 (e). FDR is the ratio of FPs to TPs plus FPs. TPs and FPs were derived from the kindred-clone experiment (Fig. 1b and Supplementary Note). (f) Fraction of overlapping somatic SNVs called from MALBAC kindred single cells7 (Supplementary Note). SNPs and SNVs were called in regions with sequencing depth ≥20. Error bars, s.d.

[Figure 3 graphic. (a) log10 (no. somatic SNVs per cell) for clones, single cells (SCMDA), single cells (SCMDA, FDR adjusted) and single cells (high-temperature MDA). (b) Percentage of somatic SNVs by substitution type for clones, single cells (SCMDA) and single cells (high temperature). (c) Somatic SNVs along chromosomes 1–22 for clones C1–C3 and IL1C and single cells IL11, IL12 and IL2–IL5.]

Figure 3 | Frequency, spectrum and distribution of somatic SNVs. (a) Number of somatic SNVs per cell after correction for sequence depth and coverage (Supplementary Note). (b) Spectrum of somatic SNVs. In a and b, horizontal lines indicate the average, and n = 4, 6 and 2 for the clones, SCMDA and high-temperature MDA, respectively. (c) Genome distribution of identified somatic SNVs. Each row indicates one single cell or clone, and each dot represents a somatic SNV. Smaller numbers of somatic SNVs were identified in IL4 and IL5, owing to lower sequencing depth (Supplementary Table 3).

We then extended somatic-SNV detection by SCMDA and SCcaller from the kindred group to all our single cells and clones. As expected, many more candidate somatic SNVs were detected in cells subjected to the commercial elevated-temperature procedure (average of 22,929 per cell) than in those amplified with SCMDA (average of 927 per cell), after adjustment for sequencing depth and coverage (Fig. 3a). Almost all the somatic SNVs detected after high-temperature denaturation were CG→TA transitions (Fig. 3b), thus suggesting that they resulted from cytosine deamination. The number of somatic SNVs identified with SCMDA, on average 927 ± 371 (mean ± s.d.) per cell, was in the same range as that for the unamplified clones, on average 856 ± 306 (mean ± s.d.) per cell. Notably, these results were also in the same range as that previously reported for nonamplified single-cell clones obtained through organoid formation (609 per organoid genome)12. The somatic SNVs identified with SCMDA had a spectrum very similar to those of the clones, with CG→TA mutations comprising only 21.3% of all somatic SNVs (Fig. 3b), and the same spectra were obtained with the other mutation callers (Supplementary Fig. 9). Hence, other callers detected artifacts rejected by SCcaller but did not specifically detect artifacts resulting from cytosine deamination.
These combined results underscore the validity of SCMDA and SCcaller in providing reliable mutation identification and reliable mutation spectra for whole-genome-amplified single cells. The somatic SNVs identified in the human fibroblasts were distributed uniformly across the genome (Fig. 3c), and there was a moderate depletion of mutations from functional sequence features, such as exons, DNase I–hypersensitive sites and 3′ untranslated regions, results similar to those found for germline polymorphisms in the 1000 Genomes Project (Supplementary Fig. 10a). This finding is consistent with the presence, in actively proliferating fibroblasts, of considerable selection against deleterious mutations. Somatic SNVs were also depleted from genes expressed in human fibroblasts (P = 0.016 by one-tailed permutation test; Supplementary Fig. 10b), owing to either selection or transcription-coupled repair13. We developed and validated a single-cell WGS procedure and variant caller that allow for the full complement of unique somatic mutations to be accurately determined for a cell. This methodology should greatly expand the capability to explore the landscapes of somatic mutagenesis in humans and other organisms, given that such exploration had previously been possible only through clonal amplification, such as organoid technology. Clonal amplification requires extensive cell culture and essentially limits analysis to stem or progenitor cells. Single-cell technology allows for direct analysis of all cell types, including postmitotic cells such as neurons and muscle fibers. Single-cell clones are also likely to contain de novo mutations. Indeed, each of the single cells in our kindred groups was found to bear a unique set of mutations that were not necessarily artifacts. 
Hence, in contrast to organoid technology, SCMDA is a general tool for understanding somatic mutagenesis in metazoa, including subclonal heterogeneity in both normal and tumor tissue, and may provide increased insight into the pathogenic roles of somatic mutations in human aging and disease14.

Methods
Methods, including statements of data availability and any associated accession codes and references, are available in the online version of the paper.

Note: Any Supplementary Information and Source Data files are available in the online version of the paper.

Acknowledgments
This research was supported by the NIH (grants AG017242, AG047200 and AG038072 to J.V.) and the Glenn Foundation for Medical Research (J.V.). We thank H. Choi (Seoul National University) for providing materials.

AUTHOR CONTRIBUTIONS
J.V., L.Z. and X.D. conceived this study and designed the experiments. L.Z., B.M. and M.L. performed the experiments. X.D. developed the software. X.D. and T.W. analyzed the data. J.V., X.D., L.Z., B.M., T.W. and A.Y.M. wrote the manuscript.

COMPETING FINANCIAL INTERESTS
The authors declare competing financial interests: details are available in the online version of the paper.

Reprints and permissions information is available online at http://www.nature.com/reprints/index.html.

1. Fryxell, K.J. & Zuckerkandl, E. Mol. Biol. Evol. 17, 1371–1383 (2000).
2. Lodato, M.A. et al. Science 350, 94–98 (2015).
3. Hazen, J.L. et al. Neuron 89, 1223–1236 (2016).
4. Lasken, R.S. Biochem. Soc. Trans. 37, 450–453 (2009).
5. Gundry, M., Li, W., Maqbool, S.B. & Vijg, J. Nucleic Acids Res. 40, 2032–2040 (2012).
6. Fu, Y. et al. Proc. Natl. Acad. Sci. USA 112, 11923–11928 (2015).
7. Zong, C., Lu, S., Chapman, A.R. & Xie, X.S. Science 338, 1622–1626 (2012).
8. McKenna, A. et al. Genome Res. 20, 1297–1303 (2010).
9. Zafar, H., Wang, Y., Nakhleh, L., Navin, N. & Chen, K. Nat. Methods 13, 505–507 (2016).
10. Cibulskis, K. et al. Nat. Biotechnol. 31, 213–219 (2013).
11. Koboldt, D.C. et al. Genome Res. 22, 568–576 (2012).
12. Behjati, S. et al. Nature 513, 422–425 (2014).
13. Hanawalt, P.C. & Spivak, G. Nat. Rev. Mol. Cell Biol. 9, 958–970 (2008).
14. Gundry, M. & Vijg, J. Mutat. Res. 729, 1–15 (2012).

ONLINE METHODS

SCMDA protocol. The step-by-step SCMDA protocol (description below) can be found in ref. 15 and the Supplementary Protocol.

Cell culture. Primary human dermal fibroblasts from a 6-year-old male were provided by H. Choi (Seoul National University). The fibroblasts were collected and protocols were approved as described in ref. 16. Human dermal fibroblasts were grown at 37 °C, 3% O2 and 10% CO2, in low-glucose DMEM containing 10% FBS, 100 IU ml−1 penicillin, 100 µg ml−1 streptomycin, 2 mM l-glutamine and 1% MEM nonessential amino acids (Gibco).

Preparing single cells. We prepared single cells and clones by using the CellRaft array (Cell Microsystems). According to the manufacturer’s instructions, we wetted the CellRaft array with medium before plating the cells. The CellRaft array was covered with a proprietary water-soluble biocompatible coating to prevent air-bubble formation during addition of liquid to the array. To wet the array, we added 2 ml of the cell culture medium to the array and waited 3 min, then removed the medium; this process was repeated a total of three times. When cells grew to confluence in a 10-cm plate, we removed the medium, washed the cells with PBS, added 1 ml trypsin (Gibco) and incubated the cells at 37 °C for 5 min. After confirming that all cells were detached, we added 10 ml complete medium to stop trypsinization and transferred the cell solution to a 10-ml tube. Cells were centrifuged for 5 min at 1,500 r.p.m. The supernatant was removed by aspiration, and the cell pellet was resuspended in 10 ml fresh medium.
This process was performed a total of two times to ensure the removal of cell debris before single-cell isolation. The CellRaft array contains 12,000 individual sites, called ‘rafts’. To pick single cells, we counted cells with an Auto T4 Cellometer (Nexcelom Bioscience) and prepared approximately 5,000 cells in 3 ml of medium. The cell solution was plated onto the CellRaft array and incubated at 37 °C, 3% O2 and 10% CO2. After 2–4 h, the fibroblasts were elongated and firmly attached to the array. To remove floating cells (which were probably dead) and to avoid contamination from potential cell debris and cell-free DNA in the medium, we removed the medium from the CellRaft array, washed the cells on the CellRaft array twice with 1 ml PBS and added 3 ml fresh complete medium. Single rafts containing one cell (as observed by microscopy with a 10× objective, Supplementary Fig. 2) were transferred with the magnetic wand supplied with the CellRaft system into 0.2-ml PCR tubes containing 2.5 µl PBS. By observing the PCR tubes under a magnifying glass and the CellRaft array under a microscope before and after the transfer, we validated that there was only one cell per tube (Supplementary Fig. 3). Tubes containing single rafts were frozen immediately on crushed dry ice and kept at –80 °C until use. To validate that there was no contamination from cell-free DNA, we also picked empty rafts as negative controls in WGA (described in the section on SCMDA).

Single-cell clone and kindred cells. To generate single-cell clones, we used a modified version of a cloning protocol described previously17. Cells were first plated on the CellRaft array as described above. We kept track of rafts containing single cells by using the coordinate markers on the CellRaft. The culturing conditions were the same as those described in the cell culture section above, except that the FBS content of the medium was increased from 10% to 20%.
Cells were checked for growth daily, and medium was changed every 3 d. When the single-cell clones reached confluence on the raft, i.e., contained approximately 10–16 cells per raft, each single raft containing a clone was transferred to a well in a 96-well plate. After confluence was reached, the clones were transferred to 24-well plates, then 12-well plates, 6-well plates and 10-cm plates; medium containing 20% FBS was used until the cells were transferred to 12-well plates, at which point FBS supplementation was decreased to 10%. To generate kindred single cells and clones, we first grew cells to small clones consisting of approximately 20–30 cells, after transfer to a 96-well plate. We then selected a clone and transferred all of its cells to another CellRaft array. Fifteen kindred cells were isolated from the small clone with the CellRaft system. The remaining cells from the clone were transferred into a well of a 96-well plate and grown into a large clone, as described above. We sequenced the unamplified DNA from the large clone and the amplified DNA from two of the cells isolated from the 20- to 30-cell-stage clone.

Single-cell whole-genome amplification. Frozen single cells in PCR tubes were removed from storage at –80 °C, quickly spun down in a nanocentrifuge and placed on ice. One microliter of Exo-Resistant Random Primer (final concentration 12 µM, Thermo Scientific) was then added, and this was followed by addition of 3 µl lysis buffer containing 400 mM KOH, 100 mM DTT and 10 mM EDTA solution. Cell lysis and DNA denaturation were performed on ice for 10 min. Then 3 µl of stop buffer, consisting of 400 mM HCl and 600 mM Tris-HCl solution (1 M, pH 7.5), was added to neutralize the lysis buffer, and this was followed by the addition of 32 µl of master mix containing 30 µl MDA reaction buffer and 2 µl Phi29 polymerase (REPLI-g UltraFast Mini Kit, Qiagen).
MDA was performed in a total volume of 41.5 µl for 1.5 h at 30 °C, and the temperature was then increased to 65 °C for 3 min to inactivate DNA polymerase. After amplification, samples were held at 4 °C until purification. Amplified DNA was purified with AMPure XP beads (Beckman Coulter), and the concentration was measured with a Qubit High Sensitivity dsDNA Kit (Invitrogen Life Science). As a positive control for the amplification, 1 ng of human genomic DNA in 2.5 µl PBS was also amplified, and 2.5 µl of PBS without any template was used as a negative control. Meanwhile, empty rafts isolated from the CellRaft array were amplified and compared with the controls and single cells to exclude the possibility of contamination by cell-free DNA.

High-temperature single-cell MDA was done with a REPLI-g Single Cell Kit (Qiagen). The key difference in the single-cell lysis between this procedure and SCMDA is the temperature, i.e., 65 °C for the high-temperature method and on ice for SCMDA. The low temperature used by the SCMDA method avoids heat-induced errors in DNA denaturation, mainly cytosine deamination (CG→TA) (Fig. 3b).

Unamplified DNA was extracted from clones with a Quick-gDNA Blood Miniprep Kit (ZYMO Research) and from bulk cell cultures with a DNeasy Blood and Tissue Kit (Qiagen).

Library construction and whole-genome sequencing. PCR-free libraries were prepared with an Accel-NGS 2S DNA Library Kit (Swift Biosciences). Briefly, 2 µg input DNA was fragmented with a Covaris sonicator, size-selected and purified with a MinElute Gel Extraction Kit (Qiagen). PCR-free libraries of the DNA from bulk cell populations were prepared with a KAPA LTP Library Preparation Kit (KAPA Biosystems) with TruSeq adapters (Illumina) by the Epigenomics Core at the Albert Einstein College of Medicine.
For three of the four single-cell clones (C1, C2 and C3), libraries were prepared by the New York Genome Center. The libraries were sequenced by the Epigenomics Core at the Albert Einstein College of Medicine, on an Illumina HiSeq 2500 platform, with 2 × 100-bp or 2 × 250-bp paired-end reads, or by the New York Genome Center on an Illumina HiSeq X Ten platform, with 2 × 150-bp paired-end reads (Supplementary Note).

Sequence alignment. Raw sequence reads were adaptor-trimmed and quality-trimmed with Trim Galore (version 0.3.7), and aligned to human reference genome build 37 with BWA MEM (version 0.7.10)18. After deduplication, the initial mapped reads were indel-realigned on the basis of known indels from the 1000 Genomes Project (phase I)19, and the base quality score was recalibrated on the basis of known indels from the 1000 Genomes Project (phase I) and SNVs from dbSNP (build 144) with GATK (version 3.4.46)8.

SCcaller: estimating allelic bias. We used the Nadaraya–Watson kernel-weighted average to estimate the fraction of reads carrying the genotype with more supporting reads (i.e., the major allele fraction, denoted θ) as a result of allelic bias at genome position x0,

$$\hat{\theta}(x_0) = \frac{\sum_i K_\lambda(x_0, x_i)\,\theta_i}{\sum_i K_\lambda(x_0, x_i)} \tag{1}$$

where i denotes heterozygous germline SNPs (hSNPs) from the single cell. We implemented a module in SCcaller to query these hSNPs from a single cell, comparing them against either a database (for example, dbSNP) or, preferably, hSNPs called from the bulk cell population. Notably, this list of hSNPs does not need to be precise, because bias estimation is a robust procedure, and hSNPs are recalled and confirmed together with all the other SNVs in the variant-calling step. λ denotes half of the window width for smoothing. We recommend λ = 10,000 for MDA-based protocols, to balance prediction accuracy and coverage.
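The kernel-weighted average in equation (1) can be illustrated with a short Python sketch. It uses the Epanechnikov quadratic kernel, 3/4 (1 − t²) for |t| ≤ 1 and 0 otherwise, with t the distance from x0 divided by λ. The function and parameter names are hypothetical and chosen for this example only; the sketch is not the SCcaller implementation, and it assumes that positions are given in base pairs on a single chromosome.

```python
from typing import Optional, Sequence


def epanechnikov(t: float) -> float:
    """Epanechnikov quadratic kernel: 3/4 * (1 - t^2) for |t| <= 1, else 0."""
    return 0.75 * (1.0 - t * t) if abs(t) <= 1.0 else 0.0


def estimate_major_allele_fraction(
    x0: int,
    hsnp_positions: Sequence[int],
    hsnp_fractions: Sequence[float],
    half_window: float = 10_000.0,
) -> Optional[float]:
    """Nadaraya-Watson estimate of the major-allele fraction at position x0.

    hsnp_positions and hsnp_fractions hold, for each hSNP, its position and
    its observed major-allele fraction. half_window plays the role of lambda.
    Returns None when no hSNP falls inside the window (an assumption of this
    sketch; a caller could substitute a default value instead).
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for x_i, theta_i in zip(hsnp_positions, hsnp_fractions):
        w = epanechnikov(abs(x_i - x0) / half_window)
        weighted_sum += w * theta_i
        weight_total += w
    if weight_total == 0.0:
        return None
    return weighted_sum / weight_total
```

Two hSNPs at equal distance from x0 contribute equally, so fractions of 0.6 and 0.8 average to 0.7, whereas an hSNP at the queried position outweighs one that lies 5 kb away. Returning None for positions without nearby hSNPs leaves the choice of fallback to the caller.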
The kernel K is defined by using the Epanechnikov quadratic kernel

$$K_\lambda(x_0, x) = D\left(\frac{|x - x_0|}{\lambda}\right) \tag{2}$$

in which function D is defined as

$$D(t) = \begin{cases} \frac{3}{4}\,(1 - t^2) & \text{if } |t| \le 1 \\ 0 & \text{otherwise} \end{cases} \tag{3}$$

and in D(t), t refers to the argument of the function D.

To test the accuracy of the prediction, we performed leave-100-out cross-validations and tenfold cross-validations on hSNPs from a randomly selected region (chromosome 1, 100,000,000–110,000,000). In the leave-100-out cross-validations, we randomly partitioned the hSNPs into subgroups, each with 100 hSNPs. For each subgroup, we predicted the bias of the 100 hSNPs by using hSNPs from the other subgroups. In the tenfold cross-validations, we randomly partitioned the hSNPs into ten subgroups with equal numbers of hSNPs. For each subgroup, we predicted the bias of the hSNPs by using hSNPs from the other nine subgroups. The strong correlation between prediction and observation (Supplementary Fig. 6 and Supplementary Table 4) indicates high accuracy of the bias estimation.

SCcaller: variant calling. Let ‘ref’ denote the reference allele and ‘alt’ denote the alternative allele. To call the genotype for a single cell, we evaluated the data for a given candidate SNV by using three models: (i) model H0, genotype ref/ref, in which no variant is present at the site, and all alternative bases are a result of amplification errors; (ii) model H1, genotype ref/alt, a heterozygous SNV; and (iii) model H2, genotype alt/alt, a homozygous SNV. We also took sequencing errors (Phred quality score) into account. Assuming that sequencing errors are independent across reads, we calculated the likelihood of each model by using

$$L(H_k) = P\left(\{b_j\} \mid \{e_j\}, r, m, F_k(\theta)\right) = \prod_j P\left(b_j \mid e_j, r, m, F_k(\theta)\right) \tag{4}$$

where j denotes a read covering the site, b denotes the called bases, e denotes the error rate transformed from the Phred quality score, r ∈ {A,C,G,T} denotes the reference genotype, m ∈ {A,C,G,T} denotes the alternative genotype in the data, and Fk is defined as

$$F_k(\theta) = \begin{cases} \theta & \text{if } H_1 \\ 1 & \text{if } H_2 \end{cases} \tag{5}$$

For H1 and H2, let the estimated variant-allele fraction, f, be:

$$f = \begin{cases} F_k(\theta) & \text{if observing more variant reads} \\ 1 - F_k(\theta) & \text{if observing more reference reads} \end{cases} \tag{6}$$

Following Cibulskis et al.10, we assumed that each type of substitution error in sequencing has an equal probability of occurring, e/3:

$$P\left(b_j \mid e_j, r, m, F_k(\theta)\right) = \begin{cases} f\,e_j/3 + (1 - f)(1 - e_j) & \text{if } b_j = r \\ f\,(1 - e_j) + (1 - f)\,e_j/3 & \text{if } b_j = m \\ e_j/3 & \text{otherwise} \end{cases} \tag{7}$$

In H0, to account for amplification errors, we set

$$f = \frac{1}{8} \times \theta \tag{8}$$

This is because with a diploid genome, there are a total of four DNA strands in a cell. Errors that occur at the first round of amplification, which are the artifacts most difficult to filter out, will ideally be found in 1/8 of the reads (Supplementary Fig. 1). We further estimated a null distribution of artifacts by using depth and quality from single-cell sequencing data, and selected a cutoff η corresponding to a chosen α value (0.01 and 0.05 are both reported). This procedure was based on a likelihood-ratio test that rejects H0 in favor of H1 when

$$\Lambda = L(H_0)/L(H_1) \le \eta \tag{9}$$

where

$$P(\Lambda \le \eta \mid H_0) = \alpha \tag{10}$$

In addition, we also designed SCcaller to calculate the likelihoods of sequencing errors according to Cibulskis et al.10.

SCcaller: application. SCcaller (v1.0) was used for calling SNVs as follows. The bam file of a single cell was subjected to mpileup for each autosome separately in samtools (v1.3, options -C50 -r chr -Osf referencegenome.fa).
To estimate allelic bias in the amplification, hSNPs in the single cell were identified by querying dbSNP (build 144) with option --a hsnp of SCcaller. Next, with option --a varcall in SCcaller, the likelihoods of the three models described in the previous section were calculated for all positions in the genome covered by ≥1 reads that differ from the reference genome. Then, using option --a cutoff in SCcaller, η corresponding to α = 0.01 was estimated. This cutoff was used to separate real SNVs from amplification artifacts. In addition, SNVs overlapping with indel reads were filtered out. Finally, we queried all SNVs identified in the whole-genome sequence of the bulk cell population. SNVs that lack variant-supporting reads, do not overlap with indel reads and have ≥20 wild-type reads in the bulk, without overlap with dbSNP, were considered somatic SNVs.

Potential caveats in using SCcaller. Notably, there are areas of the genome in which SCcaller cannot overcome amplification bias. Those regions include all areas that are haploid, such as the Y chromosome and the X chromosome in males, and other haploid regions, as well as amplified regions. Of note, the amplification products of MDA are generally long, typically over 10 kb, and not many artifacts will be missed. In SCcaller, we implemented parameters for lengths of neighboring regions, which can also be specified by the user. With a default setting of 10 kb on both sides of the SNV, 90.5% of the total candidate variants (approximately 90% of genomic regions) are covered. Bias of uncovered regions is by default set to the genome-wide average (θ = 0.75) and can also be specified by the user.

Statistical analysis. Statistical tests, such as permutations and t tests, were performed in R (version 3.3.2) and Microsoft Office Excel (2013), respectively. Statistics of SCcaller are described above and were coded in Python (version 2.7).

Public single-cell data.
Single-cell sequencing data from MDA-amplified (previous MDA method) neurons (brain A)2 and MALBAC-amplified cancer cells7 were downloaded from the Sequence Read Archive under accession numbers SRR2141560–SRR2141570 (previous MDA, ten single neurons and one bulk), SRX204116, SRX202787, SRX202978 and SRX204744 (MALBAC, three kindred single cells and one bulk). Analysis and variant calling were as described for our SCMDA and high-temperature MDA data sets.

Code availability. SCcaller is freely available at https://github.com/biosinodx/SCcaller/.

Data availability. All sequencing data have been deposited in the NCBI Sequence Read Archive under accession number SRP067062.

15. Dong, X. et al. Protocol Exchange http://dx.doi.org/10.1038/protex.2017.061 (2017).
16. Park, C.H. et al. J. Invest. Dermatol. 123, 1012–1019 (2004).
17. Falanga, V. et al. J. Invest. Dermatol. 105, 27–31 (1995).
18. Li, H. & Durbin, R. Bioinformatics 25, 1754–1760 (2009).
19. Abecasis, G.R. et al. Nature 491, 56–65 (2012).
