Download Accurate identification of single-nucleotide

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Electrophysiology wikipedia , lookup

Enzymatic biofuel cell wikipedia , lookup

Channelrhodopsin wikipedia , lookup

Transcript
brief communications
© 2017 Nature America, Inc., part of Springer Nature. All rights reserved.
Accurate identification of
single-nucleotide variants
in whole-genome-amplified
single cells
Xiao Dong1,4, Lei Zhang1,4, Brandon Milholland1,4,
Moonsook Lee1, Alexander Y Maslov1, Tao Wang2 &
Jan Vijg1,3
Mutation analysis in single-cell genomes is prone to artifacts
associated with cell lysis and whole-genome amplification.
Here we addressed these issues by developing single-cell
multiple displacement amplification (SCMDA) and a generalpurpose single-cell-variant caller, SCcaller (https://github.com/
biosinodx/SCcaller/). By comparing SCMDA-amplified single
cells with unamplified clones from the same population, we
validated the procedure as a firm foundation for standardized
somatic-mutation analysis in single-cell genomics.
Genetic-variant analysis in single cells suffers from artifacts associated with whole-genome amplification (WGA). For example, cytosine deamination due to single-cell lysis and DNA denaturation
a
at elevated temperatures results in a common CG→TA transition
artifact1. This pathway may explain the observed excess of such
mutations in single neurons2 compared with unamplified neuronal clones3. Amplification errors may be enriched through allelic
amplification bias, a characteristic of multiple displacement amplification (MDA)—the most frequently used method for single-cell
WGA4—thus resulting in artifactual mutation calls (Supplementary
Fig. 1a,b). Here we report SCMDA and a single-cell-variant caller
(SCcaller; http://github.com/biosinodx/SCcaller/ Supplementary
Software), which together offer a validated protocol to accurately
identify single-nucleotide variants (SNVs) across the genome from
a single cell after WGA.
To address cytosine-deamination artifacts, SCMDA involves
performing cell lysis and DNA denaturation on ice through alkaline lysis. To compensate for the much lower effectiveness of cell
lysis and DNA denaturation at low temperatures, we reconfigured
the MDA procedure to improve the annealing of hexamer primers
(Online Methods). We then developed SCcaller, which corrects
for local allelic amplification bias in SNV calling.
We validated SCMDA and SCcaller by directly comparing
SNVs between amplified single-cell genomes and unamplified
DNA from clones, both derived from the same population of earlypassage human primary fibroblasts (Fig. 1). We also sequenced
SCMDA-amplified single cells and nonamplified clones derived
from the same early-growing clone (approximately five divisions
and 20–30 cells), reasoning that there should be substantial
b
IL2–5
IL11
Kindred clone (IL1C)
Real
mutations
(FNs)
SCMDA
or
De novo
during
cloning
IL12
IL1C
No
amplification
Real mutations
(TPs)
WGS
C1–3
No
amplification
Artifacts
(FPs)
Artifacts
(FPs)
High-temperature MDA
WGS
HL1-2
Kindred cell (IL11)
Kindred cell (IL12)
Figure 1 | Experimental design for validating SNV identification in SCMDA-amplified single cells. (a) To assess single-cell amplification and variantcalling accuracy, whole-genome sequencing (WGS) was performed on (i) four SCMDA-amplified single cells (red); (ii) two SCMDA-amplified cells and their
nonamplified kindred clone (green); (iii) three additional unamplified clones (blue); and (iv) two single cells amplified after high-temperature lysis
(gray). Cell and clone IDs are shown. (b) The kindred cells and clone are expected to have identical genotypes, including both germline and somatic
SNVs. Candidate SNVs identified in both the clone and single cells are true positives (TPs); those identified in the clone but in neither cell are false
negatives (FNs); and those identified in only one cell are false positives (FPs) (Supplementary Note). These conservative assumptions do not consider
de novo mutations in the kindred clone or single cells.
1Department of Genetics, Albert Einstein College of Medicine, Bronx, New York, USA. 2Department of Epidemiology & Population Health, Albert Einstein College
of Medicine, Bronx, New York, USA. 3Department of Ophthalmology & Visual Sciences, Albert Einstein College of Medicine, Bronx, New York, USA. 4These authors
contributed equally to this work. Correspondence should be addressed to J.V. ([email protected]).
Received 17 June 2016; accepted 22 February 2017; published online 20 march 2017; doi:10.1038/nmeth.4227
nature methods | ADVANCE ONLINE PUBLICATION | | ADVANCE ONLINE PUBLICATION | nature methods
b
Cell IL11, SCMDA
1.0
0.8
Sensitivity
0.6
0.4
All SNPs
Heterozygous SNPs
Default cutoff
0.2
00
M
u
0. Tec
83 t
Va
r
0. Sca
85 n
M
on
0. ov
75 ar
SC
c
0. all
31 er
FDR:
0.
30
0.
25
0.
20
0.
15
0.1
0.0
t
0
0.2
ec
1,000
0.3
n
2,000
0.4
uT
3,000
0.5
M
TPs
FPs
Ref. 7, MALBAC, n = 3
0.6
ca
5,000
ar
f
Cell IL12, SCMDA
rS
,0
No. artifacts
0
FDR:
40
00
00
,0
30
,0
00
20
10
,0
0
0
1,000
ov
Haplotypecaller
Va
5.0
2,000
r
SCcaller
Monovar
3,000
Percentage of overlapping
somatic SNVs per cell
1.0
TPs
FPs
4,000
on
1.5
Cell IL11, SCMDA
5,000
M
2.0
6,000
M
u
0. Tec
78 t
Va
r
0. Sca
86 n
M
on
0. ov
81 ar
SC
c
0. all
39 er
2.5
FP per Mb
lle
3.0
d
No. somatic SNVs called per cell
Ref. 2, MDA, n = 10
3.5
0.
00
0.
30
0.
25
0.
20
0.
15
0.
10
0.
00
0.
05
FP per Mb
0.
05
0.
10
0.0
c
4,000
0.4
0.2
0.0
6,000
0.6
ca
Sensitivity
0.8
e
Cell IL12, SCMDA
1.0
SC
a
No. germline SNPs called
6
per cell (×10 )
overlap between the single cells and their kindred clone (Fig. 1).
Finally, we also included single-cell libraries prepared with hightemperature lysis and DNA denaturation by using a commercially
available kit (Online Methods) to confirm artifactual mutations
induced by cytosine deamination.
Single cells isolated with the CellRaft system (Supplementary
Figs. 2 and 3) were subjected to SCMDA, library preparation
and sequencing5 (Online Methods, Supplementary Note and
Supplementary Tables 1 and 2). As a prescreen to test for the relative
uniformity of amplification, we used real-time PCR at eight loci; 29 of
44 cells (66%) passed our quality-control criteria and were eligible for
sequencing (Supplementary Note and Supplementary Table 1). In
sequenced single cells, 85% of the genome on average was sequenced
to a depth of at least five reads, as compared with approximately 90%
in the clones and bulk cell population (Supplementary Table 3).
Genome-wide coverage uniformity was evaluated with Lorenz
plots, which indicated that, as expected, the unamplified bulk DNA
showed the least bias (Supplementary Fig. 4). In addition, amplicon
samples produced by SCMDA exhibited less bias than that produced
by the commercial high-temperature lysis system (Supplementary
Fig. 4) or by other amplification protocols6,7.
To call SNVs from the sequencing data, we first predicted
the degree of local allelic amplification bias by using heterozygous germline single-nucleotide polymorphisms (hSNPs)
(Supplementary Fig. 5a–c). Because MDA starts at random positions and elongates to several kilobases, it is possible to predict the
degree of allelic bias at a particular locus by considering the degree
of bias in neighboring hSNPs by using kernel smoothing (Online
Methods, Supplementary Fig. 6a–d and Supplementary Table 4).
We designed SCcaller to adjust allelic amplification bias when estimating the likelihoods of three possibilities—artifact, heterozygous
SNV and homozygous SNV—for every candidate SNV locus (Online
Methods, Supplementary Fig. 7 and Supplementary Fig. 8a,b).
A likelihood-ratio test was used to distinguish real SNVs from
artifacts under a certain significance level (α). Its null distribution,
corresponding to the amplification errors, was sampled from the
input data. This input-specific null accounts for sample-specific
amplification bias, sequencing depth and quality.
Because the single cells and clone that were derived from kindred
cells should share real mutations but not artifacts, we used this
system to evaluate the accuracy of SCMDA and SCcaller (Fig. 1b).
For calling germline SNPs (i.e., SNPs present in unamplified bulk
WGS), SCcaller reached 90.1% sensitivity at a cost of 0.12 false
positives (FPs) per million base pairs (Fig. 2a,b). In comparison, Haplotypecaller from GATK8, a commonly used SNP caller
for bulk sequencing data, had more than seven times as many
FPs (0.87 FPs per million bp). SCcaller performed similarly on
a published MDA-amplified single-cell data set from another
laboratory2 and substantially outperformed Haplotypecaller and
Monovar9 (Fig. 2c and Supplementary Note).
We then tested whether SCcaller displays the same high specificity for identifying somatic SNVs (SNVs absent from bulk sequencing) by using our kindred single cells and clone. We found that
SCcaller had a much higher specificity (false discovery rate (FDR)
of 0.308–0.393) than did MuTect10, VarScan11 and Monovar9 (FDR
of 0.745–0.860), most probably because it accounts for amplification errors (Fig. 2d,e). Its sensitivity was either higher than
(Monovar) or similar to (MuTect and VarScan) that of the other
callers. Sixteen randomly picked SNVs called by SCcaller were
No. somatic SNVs called per cell
© 2017 Nature America, Inc., part of Springer Nature. All rights reserved.
brief communications
Figure 2 | Accuracy of SCcaller in single-cell SNV calling. (a,b) Sensitivity
and FP rate of germline SNP calling in cell IL11 (a) and IL12 (b).
Sensitivity is the ratio of TPs to FNs plus TPs. FP per Mb is the number of
FPs per million base pairs. Default cutoff refers to α = 0.01 by SCcaller.
(c) SCcaller generates the fewest FPs for germline SNP calling in a data
set from ref. 2. The number of artifacts was approximated as the number
of SNVs unique to one cell (Supplementary Note). Because unique SNVs
also include real somatic SNVs, this approximation is the upper bound
of the number of artifacts. (d–f) SCcaller generates few FP somatic-SNV
calls in cell IL11 (d) and IL12 (e). FDR is the ratio of FPs to TPs plus FPs.
TPs and FPs were derived from the kindred-clone experiment (Fig. 1b and
Supplementary Note). (f) Fraction of overlapping somatic SNVs called
from MALBAC kindred single cells7 (Supplementary Note). SNPs and SNVs
were called in regions with ≥20 sequencing depth. Error bars, s.d.
confirmed with Sanger sequencing of the kindred group versus
bulk (Supplementary Table 5). On a published single-cell data
set obtained by multiple annealing and looping–based amplification cycles (MALBAC)7, SCcaller reported more than two times
higher overlapping somatic-SNV calls in all kindred cells than did
Monovar, MuTect or VarScan (Fig. 2f and Supplementary Note).
Notably, the fraction of overlapping calls in this data set was very
low (0.3%) as compared with that obtained with SCMDA (65% on
average), probably as a result of our highly optimized experimental
protocol for somatic-SNV detection. Hence, SCaller, when applied
to different independently published data sets, proved superior to
other variant callers, including those specifically designed for single-cell applications. Importantly, SCcaller is limited to genomic
regions of SNP heterozygosity (Online Methods).
brief communications
Percentage of somatic SNVs
4
3
2
1
100
Clone
Single cells (SCMDA)
80
Single cells (high temperature)
60
40
20
0
C
G
AT
→
G
C
AT
→
TA
C
G
→
AT
C
G
→
G
C
C
G
→
TA
0
AT
→
15
16
17
18
19
20
21
22
13
14
12
11
10
9
8
7
6
5
4
3
2
IL5
IL4
IL3
IL2
IL12
IL11
IL1C
C3
C2
C1
1
Clone
© 2017 Nature America, Inc., part of Springer Nature. All rights reserved.
Single cells
c
b
5
C
lo
ne
Si
s
n
g
(S l e
C
M ce
D ll
Si A) s
n
FD (S gle
R CM ce
ad D lls
ju A
(h S ste
ig in d)
h- gl
te e
M mp cell
D er s
A) a
tu
re
log10 (no. somatic SNVs per cell)
a
Chromosomes
Figure 3 | Frequency, spectrum and distribution of somatic SNVs.
(a) Number of somatic SNVs per cell after correction for sequence depth
and coverage (Supplementary Note). (b) Spectrum of somatic SNVs.
In a and b, horizontal lines indicate the average, and n = 4, 6 and 2 for
the clones, SCMDA and high-temperature MDA, respectively. (c) Genome
distribution of identified somatic SNVs. Each row indicates one single
cell or clone, and each dot represents a somatic SNV. Smaller numbers
of somatic SNVs were identified in IL4 and IL5, owing to lower
sequencing depth (Supplementary Table 3).
We then extended somatic-SNV detection by SCMDA and
SCcaller from the kindred group to all our single cells and clones.
As expected, many more candidate somatic SNVs were detected in
cells subjected to the commercial elevated-temperature procedure
(average 22,929 per cell) than in those amplified with SCMDA
(average of 927 per cell), after adjustment for sequencing depth
and coverage (Fig. 3a). Almost all the somatic SNVs detected after
high-temperature denaturation were CG→TA transitions (Fig. 3b),
thus suggesting that they resulted from cytosine deamination. The
number of somatic SNVs identified with SCMDA, on average 927
± 371 (mean ± s.d.) per cell, was in the same range as that for the
unamplified clones, on average 856 ± 306 (mean ± s.d.) per cell.
Notably, these results were also in the same range as that previously
reported for nonamplified single-cell clones obtained through
organoid formation (609 per organoid genome)12. The somatic
SNVs identified with SCMDA had a spectrum very similar to those
of the clones, with CG→TA mutations comprising only 21.3% of all
somatic SNVs (Fig. 3b), and the same spectra were obtained with
the other mutation callers (Supplementary Fig. 9). Hence, other
callers detected artifacts rejected by SCcaller but did not specifically detect artifacts resulting from cytosine deamination. These
combined results underscore the validity of SCMDA and SCcaller
in providing reliable mutation identification and reliable mutation
spectra for whole-genome-amplified single cells.
The somatic SNVs identified in the human fibroblasts were
distributed uniformly across the genome (Fig. 3c), and there was
a moderate depletion of mutations from functional sequence
features, such as exons, DNase I–hypersensitive sites and 3′
untranslated regions, results similar to those found for germline
polymorphisms in the 1000 Genomes Project (Supplementary
Fig. 10a). This finding is consistent with the presence, in actively
proliferating fibroblasts, of considerable selection against deleterious mutations. Somatic SNVs were also depleted from genes
expressed in human fibroblasts (P = 0.016 by one-tailed permutation test; Supplementary Fig. 10b), owing to either selection or
transcription-coupled repair13.
We developed and validated a single-cell WGS procedure and
variant caller that allow for the full complement of unique somatic
mutations to be accurately determined for a cell. This methodology should greatly expand the capability to explore the landscapes of somatic mutagenesis in humans and other organisms,
given that such exploration had previously been possible only
through clonal amplification, such as organoid technology. Clonal
amplification requires extensive cell culture and essentially limits
analysis to stem or progenitor cells. Single-cell technology allows
for direct analysis of all cell types, including postmitotic cells such
as neurons and muscle fibers. Single-cell clones are also likely
to contain de novo mutations. Indeed, each of the single cells in
our kindred groups was found to bear a unique set of mutations
that were not necessarily artifacts. Hence, in contrast to organoid
technology, SCMDA is a general tool for understanding somatic
mutagenesis in metazoa, including subclonal heterogeneity in
both normal and tumor tissue, and may provide increased insight
into the pathogenic roles of somatic mutations in human aging
and disease14.
Methods
Methods, including statements of data availability and any associated
accession codes and references, are available in the online version
of the paper.
Note: Any Supplementary Information and Source Data files are available in the
online version of the paper.
Acknowledgments
This research was supported by the NIH (grants AG017242, AG047200 and
AG038072 to J.V.) and the Glenn Foundation for Medical Research (J.V.).
We thank H. Choi (Seoul National University) for providing materials.
AUTHOR CONTRIBUTIONS
J.V., L.Z. and X.D. conceived this study and designed the experiments. L.Z., B.M.
and M.L. performed the experiments. X.D. developed the software. X.D. and T.W.
analyzed the data. J.V., X.D., L.Z., B.M. T.W. and A.Y.M. wrote the manuscript.
COMPETING FINANCIAL INTERESTS
The authors declare competing financial interests: details are available in the
online version of the paper.
Reprints and permissions information is available online at http://www.nature.
com/reprints/index.html.
Fryxell, K.J. & Zuckerkandl, E. Mol. Biol. Evol. 17, 1371–1383 (2000).
Lodato, M.A. et al. Science 350, 94–98 (2015).
Hazen, J.L. et al. Neuron 89, 1223–1236 (2016).
Lasken, R.S. Biochem. Soc. Trans. 37, 450–453 (2009).
Gundry, M., Li, W., Maqbool, S.B. & Vijg, J. Nucleic Acids Res. 40,
2032–2040 (2012).
6. Fu, Y. et al. Proc. Natl. Acad. Sci. USA 112, 11923–11928 (2015).
7. Zong, C., Lu, S., Chapman, A.R. & Xie, X.S. Science 338, 1622–1626 (2012).
8. McKenna, A. et al. Genome Res. 20, 1297–1303 (2010).
9. Zafar, H., Wang, Y., Nakhleh, L., Navin, N. & Chen, K. Nat. Methods 13,
505–507 (2016).
10. Cibulskis, K. et al. Nat. Biotechnol. 31, 213–219 (2013).
11. Koboldt, D.C. et al. Genome Res. 22, 568–576 (2012).
12. Behjati, S. et al. Nature 513, 422–425 (2014).
13. Hanawalt, P.C. & Spivak, G. Nat. Rev. Mol. Cell Biol. 9, 958–970 (2008).
14. Gundry, M. & Vijg, J. Mutat. Res. 729, 1–15 (2012).
1.
2.
3.
4.
5.
nature methods | ADVANCE ONLINE PUBLICATION | ONLINE METHODS
SCMDA protocol. The step-by-step SCMDA protocol (description
below) can be found in ref. 15 and the Supplementary Protocol.
© 2017 Nature America, Inc., part of Springer Nature. All rights reserved.
Cell culture. Primary human dermal fibroblasts from a 6-yearold male were provided by H. Choi (Seoul National University).
The fibroblasts were collected and protocols were approved as
described in ref. 16 . Human dermal fibroblasts were grown at 37
°C, 3% O2 and 10% CO2, in low-glucose DMEM containing 10%
FBS, 100 IU ml−1 penicillin, 100 µg ml−1 streptomycin, 2 mM lglutamine and 1% MEM nonessential amino acids (Gibco).
Preparing single cells. We prepared single cells and clones by
using the CellRaft array (Cell Microsystems). According to the
manufacturer’s instructions, we wetted the CellRaft array with
medium before plating the cells. The CellRaft array was covered
with a proprietary water-soluble biocompatible coating to prevent
air-bubble formation during addition of liquid to the array. To wet
the array, we added 2 ml of the cell culture medium to the array
and waited 3 min, then removed the medium; this process was
repeated a total of three times.
When cells grew to confluence in a 10-cm plate, we removed the
medium, washed the cells with PBS, added 1 ml trypsin (Gibco) and
incubated the cells at 37 °C for 5 min. After confirming that all cells
were detached, we added 10 ml complete medium to stop trypsinization and transferred the cell solution to a 10-ml tube. Cells were
centrifuged for 5 min at 1,500 r.p.m. The supernatant was removed
by aspiration, and the cell pellet was resuspended in 10 ml fresh
medium. This process was performed a total of two times to ensure
the removal of cell debris before single-cell isolation.
The CellRaft array contains 12,000 individual sites, called ‘rafts’.
To pick single cells, we counted cells with an Auto T4 Cellometer
(Nexcelom Bioscience) and prepared approximately 5,000 cells in
3 ml of medium. The cell solution was plated onto the CellRaft
array and incubated at 37 °C, 3% O2 and 10% CO2. After 2–4 h,
the fibroblasts were elongated and firmly attached to the array.
To remove floating cells (which were probably dead) and to avoid
contamination from potential cell debris and cell-free DNA in the
medium, we removed the medium from the CellRaft array, washed
the cells on the CellRaft array twice with 1 ml PBS and added
3 ml fresh complete medium. Single rafts containing one cell (as
observed by microscopy with a 10× objective, Supplementary
Fig. 2) were transferred with the magnetic wand supplied with the
CellRaft system into 0.2-ml PCR tubes containing 2.5 µl PBS. By
observing the PCR tubes under a magnifying glass and the CellRaft
array under a microscope before and after the transfer, we validated
that there was only one cell per tube (Supplementary Fig. 3). Tubes
containing single rafts were frozen immediately on crushed dry ice
and kept at –80 °C until use. To validate that there was no contamination from cell-free DNA, we also picked empty rafts as negative
controls in WGA (described in the section on SCMDA).
Single-cell clone and kindred cells. To generate single-cell clones,
we used a modified version of a cloning protocol described previously17. Cells were first plated on the CellRaft array as described
above. We kept track of rafts containing single cells by using the
coordinate markers on the CellRaft. The culturing conditions
were the same as those described in the cell culture section above,
except that the FBS content of the medium was increased from
doi:10.1038/nmeth.4227
10% to 20%. Cells were checked for growth daily, and medium was
changed every3 d. When the single-cell clones reached confluence
on the raft, i.e. contained approximately 10–16 cells per raft, each
single raft containing a clone was transferred to a well in a 96-well
plate. After confluence was reached, the clones were transferred
to 24-well plates, then 12-well plates, 6-well plates and 10-cm
plates; medium containing 20% FBS was used until the cells were
transferred to 12-well plates, at which point FBS supplementation
was decreased to 10%.
To generate kindred single cells and clones, we first grew cells
to small clones consisting of approximately 20–30 cells, after
transfer to a 96-well plate. We then selected a clone and transferred all of its cells to another CellRaft array. Fifteen kindred
cells were isolated from the small clone with the CellRaft system.
The remaining cells from the clone were transferred into a well
of 96-well plate and grown into a large clone, as described above.
We sequenced the unamplified DNA from the large clone, and
the amplified DNA from two of the cells was isolated from the
20- to 30-cell-stage clone.
Single-cell whole-genome amplification. Frozen single cells in
PCR tubes were removed from storage at –80 °C, quickly spun
down in a nanocentrifuge and placed on ice. One microliter
of Exo-Resistant Random Primer (final concentration 12 µM,
Thermo Scientific) was then added, and this was followed by
addition of 3 µl lysis buffer containing 400 mM KOH, 100 mM
DTT and 10 mM EDTA solution. Cell lysis and DNA denaturation were performed on ice for 10 min. Then 3 µl of stop buffer,
consisting of 400 mM HCl and 600 mM Tris-HCl solution (1 M,
pH 7.5) was added to neutralize the lysis buffer, and this was
followed by the addition of 32 µl of master mix containing 30 µl
MDA reaction buffer and 2 µl Phi29 polymerase (REPLI-g
UltraFast Mini Kit, Qiagen). MDA was performed in a total
volume of 41.5 µl for 1.5 h at 30 °C, and the temperature was
then increased to 65 °C for 3 min to inactivate DNA polymerase.
After amplification, samples were held at 4 °C until purification.
Amplified DNA was purified with AMPureXP-beads (Beckman
Coulter), and the concentration was measured with a Qubit High
Sensitivity dsDNA Kit (Invitrogen Life Science).
As a positive control for the amplification, 1 ng of human
genomic DNA in 2.5 µl PBS was also amplified, and 2.5 µl of PBS
without any template was used as a negative control. Meanwhile,
empty rafts isolated from the CellRaft array were amplified
and compared with the controls and single cells to exclude the
possibility of contamination by cell-free DNA.
High-temperature single cell MDA was done with a REPLI-g
Single Cell Kit (Qiagen). The key difference in the single-cell
lysis between this procedure and SCMDA is the temperature, i.e.,
65 °C for the high-temperature method and on ice for SCMDA.
The low temperature used by the SCMDA method avoids heatinduced errors in DNA denaturation, mainly cytosine deamination (CG→TA) (Fig. 3b).
Unamplified DNA was extracted from clones with a QuickgDNA Blood Miniprep Kit (ZYMO Research) and from bulk cell
cultures with a DNeasy Blood and Tissue Kit (Qiagen).
Library construction and whole-genome sequencing. PCR-free
libraries were prepared with an Accel-NGS 2S DNA Library
Kit (Swift Biosciences). Briefly, 2 µg input DNA was fragmented
nature methods
© 2017 Nature America, Inc., part of Springer Nature. All rights reserved.
with a Covaris sonicator, size-selected and purified with a
MinElute Gel Extraction Kit (Qiagen). PCR-free libraries
of the DNA from bulk cell populations were prepared with
a KAPA LTP Library Preparation Kit (KAPA Biosystems)
with TruSeq adapters (Illumina) by the Epigenomics Core at
the Albert Einstein College of Medicine. For three of the four
single-cell clones (C1, C2 and C3), libraries were prepared by
the New York Genome Center. The libraries were sequenced by
the Epigenomics Core at the Albert Einstein College of Medicine,
on an Illumina HiSeq 2500 platform, with 2 × 100-bp or 2 × 250-bp
paired-end reads, or by the New York Genome Center on an
Illumina HiSeq X Ten platform, with 2 × 150-bp paired-end reads
(Supplementary Note).
Sequence alignment. Raw sequence reads were adaptor-trimmed
and quality-trimmed with Trim Galore (version 0.3.7), and
aligned to human reference genome build 37 with BWA MEM
(version 0.7.10)18. After deduplication, the initial mapped reads
were indel-realigned on the basis of known indels from the 1000
Genomes Project (phase I)19, and the base quality score was recalibrated on the basis of known indels from the 1000 Genomes
Project (phase I) and SNVs from dbSNP (build 144) with GATK
(version 3.4.46)8.
SCcaller: estimating allelic bias. We used the Nadaraya–Watson
kernel-weighted average to estimate the fraction of reads carrying the genotype with more supporting reads (i.e., the major
allele fraction, denoted θ) as a result of allelic bias at genome
position x0,
∑ K ( x , x ) × qi
qˆ(x0 ) = i l 0 i
∑i K l (x0 , xi )
(1)
(1)
where i denotes heterozygote-germline SNPs (hSNPs) from the
single cell. We implemented a module in SCcaller to query these
hSNPs from a single cell, comparing them against either a database
(for example, dbSNP) or, preferably, hSNPs called from the bulk
cell population. Notably, this list of hSNPs does not need to be
precise, because bias estimation is a robust procedure, and hSNPs
are recalled and confirmed together with all the other SNVs in
the variant-calling step. λ denotes half of the window width for
smoothing. We recommend λ = 10,000 for MDA-based protocols, to balance prediction accuracy and coverage. The kernel
K is defined by using the Epanechnikov quadratic kernel
 | x − x 0 |
K l ( x0 , x ) = D 
 l 
(2)
(2)
in which function D is defined as
 3 (1 − t 2 ) if |t| ≤ 1

D(t ) =  4
0
otherwise
SCcaller: variant calling. Let ‘ref ’ denote the reference allele and
‘alt’ denote the alternative allele. To call the genotype for a single
cell, we evaluated the data for a given candidate SNV by using
three models: (i) model H0, genotype ref/ref, in which no variant
is present at the site, and all alternative bases are a result of amplification errors; (ii) model H1, genotype ref/alt, a heterozygous
SNV; and (iii) model H2, genotype alt/alt, a homozygous SNV. We
also took sequencing errors (Phred quality score) into account.
Assuming that sequencing errors are independent across reads,
we calculated the likelihood of each model by using
({ } { }
)
(
L(H k ) = P b j | e j , r , m, Fk (q ) = ∏ j P b j | e j , r , m, Fk (q )
)
(4) (4)
where j denotes a read covering the site, b denotes the called
bases, e denotes the error rate transformed from the Phred
quality score, r ∈ {A,C,G,T} denotes the reference genotype,
m ∈ {A,C,G,T} denotes the alternative genotype in the data,
and Fk is defined as
q
Fk (q ) = 
1
if H1
if H 2
(5)
(5)
For H1 and H2, let the estimated variant-allele fraction, f, be:
if observing more variant reads
 Fk (q )
f =
1
−
F
(
q
)
if
observing more reference reads
k

(6)(6)
Following Cibulskis et al.10, we assumed that each type of
substitution error in sequencing has an equal probability of
occurring, e/3:
 f e j / 3 + (1 − f )(1 − e ) if b = r
j
j

ej /3

P (b j | e j , r , m, Fk (q )) =  f (1 − e j )(1 − f )
if b j = m
e / 3
otherwise
 j

(7)
(7)
In H0, to account for amplification errors, we set
(3)
(3)
and in D(t), t refers to the argument of the function D.
To test the accuracy of the prediction, we performed leave100-out cross-validations and tenfold cross-validations on hSNPs
from a randomly selected region (chromosome 1, 100000000–
110000000). In the leave-100-out cross-validations, we randomly
nature methods
partitioned the hSNPs into subgroups, each with 100 hSNPs.
For each subgroup, we predicted the bias of the 100 hSNPs by
using hSNPs from the other subgroups. In the tenfold cross-validations, we randomly partitioned the hSNPs into ten subgroups
with equal numbers of hSNPs. For each subgroup, we predicted
the bias of the hSNPs by using hSNPs from the other nine subgroups. The strong correlation between prediction and observation (Supplementary Fig. 6 and Supplementary Table 4)
indicates high accuracy of the bias estimation.
f = 1/ 8 × q
(8)
(8)
This is because with a diploid genome, there are a total of four
DNA strands in a cell. Errors that occur at the first round of amplification, which are the artifacts most difficult to filter out, will
ideally be found in 1/8 of the reads (Supplementary Fig. 1).
We further estimated a null distribution of artifacts by using
depth and quality from single-cell sequencing data, and selected
criteria η, corresponding to certain α values (0.01 and 0.05 are
doi:10.1038/nmeth.4227
both reported). This procedure was based on a likelihood ratiotest that rejects H0 in favor of H1 when
Λ = L(H 0 )/ L(H1) ≤ h
(9)
(9)
where
( Λ) ≤ h | H 0 ) = a
(10)
(10)
© 2017 Nature America, Inc., part of Springer Nature. All rights reserved.
In addition, we also designed SCcaller to calculate the likelihoods
of sequencing errors according to Cibulskis et al.10.
SCcaller: application. SCcaller (v1.0) was used for calling SNVs
as follows. The bam file of a single cell was subjected to mpileup
for each autosome separately in samtools (v1.3, options --C50
--r chr --Osf referencegenome.fa). To estimate allelic bias in the
amplification, hSNPs in the single cell were identified by querying
dbSNP (build 144) with option --a hsnp of SCcaller. Next, with
option --a varcall in SCcaller, the likelihoods of the three models
described in the previous section were calculated for all positions
in the genome covered by ≥1 reads that differ from the reference
genome. Then, using option --a cutoff in SCcaller, η corresponding to α = 0.01 was estimated. This cutoff was used to separate real
SNVs from amplification artifacts. In addition, SNVs overlapping
with indel reads were filtered out. Finally, we queried all SNVs
identified in the whole-genome sequence of the bulk cell population. SNVs that lack variant-supporting reads, do not overlap
with indel reads and have ≥20 wild-type reads in the bulk, without
overlap with dbSNP, were considered somatic SNVs.
Potential caveats in using SCcaller. Notably, there are areas of
the genome in which SCcaller cannot overcome amplification
bias. Those regions include all areas that are haploid, such as the Y
chromosome and the X chromosome in males, and other haploid
regions, as well as amplified regions. Of note, the amplification
products of MDA are generally long, typically over 10 kb, and
not many artifacts will be missed. In SCcaller, we implemented
doi:10.1038/nmeth.4227
parameters for lengths of neighboring regions, which can also be
specified by the user. With a default setting of 10 kb on both sides
of the SNV, 90.5% of the total candidate variants (approximately
90% of genomic regions) are covered. Bias of uncovered regions
is by default set to the genome-wide average (θ = 0.75) and can
also be specified by the user.
Statistical analysis. Statistical tests, such as permutations and
t tests, were performed in R (version 3.3.2) and Microsoft
Office Excel (2013), respectively. Statistics of SCcaller are
described above and were coded in Python (version 2.7).
Public single-cell data. Single-cell sequencing data from MDAamplified (previous MDA method) neurons (brain A) 2 and
MALBAC-amplified cancer cells7 were downloaded from the
Sequence Read Archive under accession numbers SRR2141560–
SRR2141570 (previous MDA, ten single neurons and one bulk),
SRX204116, SRX202787, SRX202978 and SRX204744 (MALBAC,
three kindred single cells and one bulk). Analysis and variant
calling were as described for our SCMDA and high-temperature
MDA data sets.
Code availability. SCcaller is freely available at https://github.
com/biosinodx/SCcaller/.
Data availability. All sequencing data have been deposited
in the NCBI Sequence Read Archive under accession number
SRP067062.
15. Dong, X. et al. Protocol Exchange http://dx.doi.org/10.1038/
protex.2017.061 (2017).
16. Park, C.H. et al. J. Invest. Dermatol. 123, 1012–1019 (2004).
17. Falanga, V. et al. J. Invest. Dermatol. 105, 27–31 (1995).
18. Li, H. & Durbin, R. Bioinformatics 25, 1754–1760 (2009).
19. Abecasis, G.R. et al. Nature 491, 56–65 (2012).
nature methods