Download Separating endogenous ancient DNA from modern day

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Zinc finger nuclease wikipedia , lookup

DNA nanotechnology wikipedia , lookup

Exome sequencing wikipedia , lookup

United Kingdom National DNA Database wikipedia , lookup

DNA sequencing wikipedia , lookup

Microsatellite wikipedia , lookup

Helitron (biology) wikipedia , lookup

Transcript
Separating endogenous ancient DNA from modern day
contamination in a Siberian Neandertal
Pontus Skoglunda,1, Bernd H. Northoffb,2, Michael V. Shunkovc, Anatoli P. Dereviankoc, Svante Pääbob,
Johannes Krauseb,d, and Mattias Jakobssona,e
a
Department of Evolutionary Biology and eScience for Life Laboratory, Uppsala University, 75236 Uppsala, Sweden; bDepartment of Evolutionary Genetics,
Max Planck Institute for Evolutionary Anthropology, 04103 Leipzig, Germany; cPalaeolithic Department, Institute of Archaeology and Ethnography, Russian
Academy of Sciences Siberian Branch, Novosibirsk 630090, Russia; and dInstitute for Archaeological Sciences, University of Tuebingen, 72070 Tuebingen, Germany
Edited by Richard G. Klein, Stanford University, Stanford, CA, and approved December 27, 2013 (received for review October 9, 2013)
| human evolution
R
etrieval and sequencing of DNA from fossil material has
been revolutionized by the rapid surge of high-throughput
sequencing technology (1). One of the advantages of the new
generation of sequencing technologies over previous PCR-based
approaches is the routine sequencing of molecules across their
entire length in contrast to specific priming sites. Although this
approach has been extended to ancient material from a number
of organisms, there is particular interest in genomic analyses of
ancient human populations (2–9). However, despite extensive
precautions against contamination during laboratory sample
preparation, many fossil samples show evidence of contamination
from not only the microenvironment of the fossil, but also present
day humans (2, 10–14). Although the former type of contamination can introduce statistical noise in comparisons of ancient
DNA with modern populations, the latter type of contamination
(from present day humans) can impose major biases on population genetic and phylogenetic analyses (15, 16).
Ancient DNA sequences show several features indicative of
postmortem degradation (PMD) (17). The main diagnostic feature
documented so far is a pattern of Cytosine (C) →Thymine (T)
substitutions (18–20) that increases toward the 5′ end of the sequence reads (21), which in most applications, results in a complementary Guanine (G) →Adenine (A) pattern in the 3′ end
caused by enzymatic repair (5). This pattern has been attributed to
Cytosine deamination at single-stranded ends of the molecules (21)
and shows a clear tendency of increase over time in contrast to
other potential diagnostic features, such as fragment length and
preferential fragmentation at purines (22, 23). However, although
the observation of such a pattern suggests the presence of degraded
DNA in a sequence dataset, it does not prove that modern contamination is absent or even at low frequency. For high-coverage
data from extinct close relatives of modern humans, an efficient
approach to estimate the contamination proportion is to leverage
SNPs that are known to be fixed or nearly fixed in present populations (2, 24). If it can be assumed that contamination is minor
www.pnas.org/cgi/doi/10.1073/pnas.1318934111
Significance
Strict laboratory precautions against present day human DNA
contamination are standard in ancient DNA studies, but contamination is already present inside many ancient human fossils from
previous handling without specific precautions. We designed a
statistical framework to isolate endogenous ancient DNA sequences from contaminating sequences using postmortem degradation patterns and were able to reduce high-contamination
fractions to negligible levels. We captured DNA sequences from
a contaminated Neandertal bone from Okladnikov Cave in Siberia
and used our method to assemble its mitochondrial genome sequence, which we find to be from a lineage basal to five of six
previously published complete Neandertal mitochondrial genomes. Our method paves the way for the large-scale genetic
analysis of contaminated human remains.
Author contributions: P.S., S.P., J.K., and M.J. designed research; P.S., B.H.N., S.P., and J.K.
performed research; M.V.S. and A.P.D. contributed new reagents/analytic tools; P.S. analyzed data; and P.S., S.P., J.K., and M.J. wrote the paper.
The authors declare no conflict of interest.
This article is a PNAS Direct Submission.
Data deposition: The sequence reported in this paper has been deposited in the GenBank
database (accession no. KF982693).
1
To whom correspondence should be addressed. E-mail: [email protected].
2
Present address: Institute of Laboratory Medicine, Ludwig Maximilians University, 81377
Munich, Germany.
This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.
1073/pnas.1318934111/-/DCSupplemental.
PNAS | February 11, 2014 | vol. 111 | no. 6 | 2229–2234
GENETICS
paleogenomics
relative to endogenous DNA, this approach can also be extended to
high-coverage data from ancient modern humans by investigating
whether more alleles are present than expected from the ploidy
level of the chromosome (4, 24, 25) or making assumptions about
the ancestry of both ancient and contaminating individuals (4). For
low-coverage data, stratifying sequences based on whether a mismatch consistent with PMD degradation have occurred, and investigating the consistency of population genetic patterns across
categories can be used to investigate whether contamination is
present at such levels as to affect the conclusions of the analysis (8).
However, these approaches only allow investigation of whether
contamination is present. To our knowledge, no approach has been
developed to overcome significant contamination levels, except for
the case of targeted PCR-based sequencing of haploid regions, for
which most of the informative termini of the fragments are lost (26).
This situation has limited large-scale ancient DNA analyses of human specimens to those specimens with very low levels of present
day DNA contamination, leaving the genetic information in potentially crucial fossil material often unstudied.
Here, we propose a likelihood approach that incorporates
sequence errors and alignment to the reference genome to
provide a score that is informative on whether a given sequence
is likely to have arisen from a degraded template molecule. We
show that this method allows reduction of high contamination
rates down to levels that are negligible for most evolutionary
purposes and give examples using both mitochondrial (mt) and
ANTHROPOLOGY
One of the main impediments for obtaining DNA sequences from
ancient human skeletons is the presence of contaminating modern
human DNA molecules in many fossil samples and laboratory
reagents. However, DNA fragments isolated from ancient specimens
show a characteristic DNA damage pattern caused by miscoding
lesions that differs from present day DNA sequences. Here, we
develop a framework for evaluating the likelihood of a sequence
originating from a model with postmortem degradation—summarized in a postmortem degradation score—which allows the identification of DNA fragments that are unlikely to originate from present
day sources. We apply this approach to a contaminated Neandertal
specimen from Okladnikov Cave in Siberia to isolate its endogenous
DNA from modern human contaminants and show that the reconstructed mitochondrial genome sequence is more closely related to
the variation of Western Neandertals than what was discernible
from previous analyses. Our method opens up the potential for genomic analysis of contaminated fossil material.
genome-wide data from the literature. Finally, we produced a
high-coverage mtDNA dataset from a Siberian Neandertal fossil
excavated in Okladnikov Cave that is contaminated with a substantial amount of modern human DNA and use our method for
separating endogenous sequences from contaminating sequences
to reconstruct its near-complete mitochondrial genome sequence.
A
Results
B
Model Outline. The diagnostic nucleotide misincorporation pattern
that arises from ancient DNA damage can be detected only by
observing matches or mismatches in an alignment with one or more
reference sequences. In the 5′ end of a sequence fragment, as many
as 30–40% of C residues can appear as T in DNA fragments from
samples that are several thousand years old under typical preservation conditions (21, 23). This fraction reflects both Cs that are
deaminated to Uracil, which are read as T by DNA polymerases,
and methylated Cs that are deaminated to Ts. Whether such mismatches are observed, thus, provides information on the authenticity
of a given DNA fragment. However, C→T and G→A mismatches
appear not solely because of ancient DNA damage but also, because
of biological polymorphisms and sequencing errors. We developed
a likelihood framework that explicitly models these three processes
(Material and Methods). To investigate authenticity of a given DNA
fragment, we use this framework to evaluate two competing models,
of which one assumes ancient DNA degradation and the other does
not, arriving at a final log-likelihood ratio of the two models that we
term a PMD score (PMDS). A positive PMDS for a DNA fragment
indicates support for the ancient DNA model relative to the alternative model, and the greater the PMDS, then the greater the support
for the DNA fragment having been subject to DNA degradation.
Empirical Distributions of PMD Scores in Ancient and Present Day
Samples. We computed PMD scores for 1 million randomly sam-
pled sequences from the 100-y-old remains of an Australian individual (27), four ∼5,000-y-old Neolithic Scandinavian individuals
(8), four 38,000- to 70,000-y-old Neandertal individuals (24), and
four present day individuals (24). These individuals showed evidence of PMD roughly proportional to their age (Figs. S1 and S2),
and four PMDS distributions from representative individuals are
shown in Fig. 1. Most importantly, we find that sequences from
ancient individuals have an excess of high PMD scores, implying
that a set of ancient DNA sequences shows enough PMD to be
unlikely to be mistaken for contaminants. For example, looking at
the cumulative distribution of PMDS, we find that, for a threshold
of PMDS > 5, we would retain ∼15–20% of sequences from
Neandertals and Neolithic Scandinavians but only ∼0.01–0.02%
from any present day individual and 0.27% from the 100-y-old
remains of the Australian individual (note that this sample is
derived from hair and thus, may have marginally decreased levels
of DNA damage) (28) (Fig. S3). Thus, an empirical probability
of a sequence being from a present day (contaminating) source
could, in principle, be extrapolated and propagated in downstream
genotyping inference, much like the common practice for sequence
error estimates. However, this procedure is not straightforward,
because the (usually unknown) initial level of contamination in
the library affects the final contamination fraction associated
with each given PMDS threshold. Thus, we will focus on imposing
hard PMDS thresholds to contaminated datasets to be able to
estimate the contamination fraction among the sequences downstream of genotyping.
Artificial Contamination Experiments. To investigate the efficiency
of the PMDS approach for genotyping high-coverage data, we
created mosaic datasets of mitochondrial sequences from a French
present day individual and the Vindija 33.16 Neandertal individual.
These contaminated data were filtered using PMDS thresholds of
three or five and compared with the data with no filtering. We
estimated contamination as the fraction of modern human-specific
alleles using seven diagnostic transversion SNPs and found that
the unfiltered data show increasing rates of modern human alleles
with increasing contamination input (as expected). In contrast,
2230 | www.pnas.org/cgi/doi/10.1073/pnas.1318934111
C
D
Fig. 1. PMDS distribution in ancient and present day human sequences. (A)
A present day human. (B) Hundred-year-old remains of an Australian individual. (C) Five thousand-year-old remains of a Scandinavian individual. (D)
A Pleistocene Neandertal. Note that the x axes in C and D are truncated, and
PMD scores reach ∼26.
a PMDS threshold of five results in no detectable contamination
postfiltering (0.0%) for prefiltering contamination levels (per
base pair) from 1% to ∼93% (Fig. 2A).
To illustrate the use of the approach for autosomal low-coverage data, we performed a similar in silico contamination experiment using genome-wide sequence data from the Neolithic
hunter-gatherers, successively adding French sequences to final
contamination fractions of 20%, 50%, and 90%. We extracted
SNPs typed in 504 individuals from Europe and the Levant (SI
Materials and Methods) from these contaminated datasets as well
as the uncontaminated dataset, performed principal component
analysis for each dataset separately, and merged the obtained
principal component 1 (PC1) –PC2 configurations using Procrustes
transformation (8). As in the original study (8), we find that the
unmodified hunter-gatherer data are outside the distribution of
modern day populations but most similar to Northern European
populations. As we artificially add French contamination, the ancient data are inferred to be successively more similar to the
French population. When we apply a PMDS threshold of three to
the sequences, no effect of contamination can be seen, with the
PC1–PC2 configuration of all filtered datasets showing very similar
projections as the uncontaminated (and unfiltered) data (Fig. 2B).
Contamination Reduction Using PMD Scores in Genuinely Contaminated
Datasets. Whereas the above examples are designed to test the ef-
ficiency of the method in common genetic analyses, the controlled
nature of our artificial contamination experiments allows us to examine the contamination rate as a function of PMDS threshold
directly. We confirm that present day French contamination in the
Vindija 33.16 Neandertal will be negligible for PMDS ≥ 4 (Fig. 3A).
Similarly, for data with a less extreme temporal difference (the 100y-old remains of an Australian individuals as the contaminant and
the 5,000-y-old remains of a Scandinavian hunter-gatherer as endogenous DNA), contamination levels as high as 95% can be
Skoglund et al.
B
A
Mitochondrial Genome Sequence of a Contaminated Neandertal
Specimen from Okladnikov Cave, Siberia. Because the PMDS ap-
proach provides a tool to substantially reduce contamination
levels in ancient DNA, we used primer extension capture (29) to
obtain 6,906 unique mtDNA sequences from a Neandertal sample
from Okladnikov Cave (Okladnikov 2), in which contamination
A
B
limited previous DNA analysis to a small mt region that could be
amplified with Neandertal-specific primers sets (30). We found
clear evidence of PMD consistent with endogenous DNA (Fig.
S5), and 10.2% (95% confidence interval = 8.7–11.7%) of all
fragments overlapping diagnostic positions displayed the modern
day human allele, indicative of contamination. However, when we
restricted the analysis to 3,908 sequences that had a PMDS ≥ 0,
the point estimate of contamination was reduced to 1.3% (95%
confidence interval = 0.7–1.9%) (Fig. 3C). We, thus, discarded sequences with PMDS < 0 and assembled the mitochondrion of
Okladnikov 2 using an ancient DNA-aware iterative mapping assembler (31). Average sequence depth for this assembly was 15fold compared with 25-fold in the unfiltered data, with a total of
96.0% of all positions covered by at least five fragments and 99.4%
covered by at least two fragments (Fig. 4 and Fig. S6). There were
88 positions that were not reliably called for the PMDS ≥ 0
data (0.53%) compared with 24 positions in the unfiltered data
(0.14%). However, we find that the seven sites that were gained
when restricting to PMDS ≥ 0 were all polymorphic and thus,
evolutionary informative (Table S1). In contrast, only 4 of 68 sites
that were lost were polymorphic. This observation shows that
PMDS filtering can resolve positions that are difficult to call
C
Fig. 3. Reduction of artificial and genuine contamination using PMDSs. (A) Expected contamination fraction in the Vindija 33.16 Neandertal as a function of the
PMDS threshold using sequence data from a present day French individual as the artificial contaminant. (B) Expected contamination fraction in the Ajv70 Scandinavian hunter-gatherer as a function of the PMDS threshold using sequence data from the 100-y-old remains of an Australian individual as the artificial contaminant.
The different colored lines in A and B correspond to different contamination levels (before PMDS filtering, the initial contamination fraction is given at PMDS = −2).
(C) Estimated modern human contamination in three genuinely contaminated Neandertal datasets from Feldhofer 2, Mezmaiskaya 2, and Okladnikov 2.
Skoglund et al.
PNAS | February 11, 2014 | vol. 111 | no. 6 | 2231
ANTHROPOLOGY
reduced to negligible levels by increasing the threshold on the PMDS
statistic (Fig. 3B).
However, because the data used for our artificial contamination experiment may not perfectly represent genuine cases of
contamination, we also investigated two genuinely contaminated
Neandertal mtDNA datasets (10): Feldhofer 2 and Mezmaiskaya
2. In line with the artificial contamination experiment above, we
find strong reductions from the initial contamination level of
∼79% (Feldhofer 2) and ∼45% (Mezmaiskaya 2) for both these
datasets (Fig. 3C) (10). Moreover, we find that consensus sequences
from the original (and contaminated) Mezmaiskaya 2 dataset falls
within the variation of modern humans, whereas the consensus of
sequences post-PMDS filtering falls within Neandertal variation
(Fig. S4), similar to results from a less contaminated sample from
the same individual (29).
GENETICS
Fig. 2. Artificial mtDNA and autosomal contamination experiments. (A) Artificial contamination represented by sequences from a present day human (French)
was added to a sequence dataset from a Croatian Neandertal (Vindija 33.16). Final per base pair contamination was estimated using diagnostic mtDNA positions
that differentiate Neandertals from modern humans. Up to 90% per base pair contamination levels can be reduced to negligible levels using the PMDS approach.
(B) Artificial contamination represented by sequences from a present day human (French) was added to a sequence dataset from Neolithic Scandinavian huntergatherers. SNPs were extracted with and without filtering for PMDS, and a principal component analysis together with European and Levantine populations was
performed for the different levels of contamination. The PCs were Procrustes transformed for each separate analysis (8).
A
B
C
Fig. 4. A mt genome sequence from a contaminated Neandertal sample from Okladnikov Cave, Siberia. (A) Coverage distribution and sequence features of
the Okladnikov 2 mt genome. (B) Gene genealogy of 67 published complete hominid mitochondria and Okladnikov 2. Node support is only indicated for
posterior probabilities ≥ 0.75. (C) Geographical sampling location of Neandertal and Denisovan individuals from which mtDNA genomes have been sequenced. Okladnikov Cave is marked by a red square. GC, Guanine-Cytosine content.
because of different alleles in the contaminant and endogenous
sequences (SI Results).
We aligned the Okladnikov 2 mt genome sequence to complete
mt genomes from other archaic and modern humans and reconstructed a gene tree using a Bayesian approach (Material and
Methods). We find that the Okladnikov 2 sequence falls outside all
published complete Neandertal mt genomes from Central and
Western Europe with high posterior probability but forms a cluster
with these mt genomes to the exclusion of the mtDNA of Mezmaiskaya 1 from the Caucasus (Fig. 4). This topology is also obtained when a PMD filter is also applied to Mezmaiskaya 1, Vindija
33.16, Vindija 33.25, and Vindija 33.26 sequences (Fig. S7). The
mtDNA genome sequence reconstructed here, thus, resolves the
relationship between Okladnikov 2 and other Neandertal mt
genomes, because analyses based on the control region have suffered from statistical uncertainty but indicated that the Okladnikov
2 mtDNA is basal to Mezmaiskaya 1 (30, 32), a result that we replicated by extracting a 304-bp region of the Okladnikov 2 mtDNA
sequence assembly that overlapped with previous data (Fig. S8).
2232 | www.pnas.org/cgi/doi/10.1073/pnas.1318934111
Discussion
Despite that modern human contamination has been shown to
be present in a large proportion of archaeological human samples (2, 11, 33), large-scale DNA sequence retrieval has so far
been limited to hominin specimens with very little contamination. Here, we developed a likelihood framework for large-scale
ancient DNA studies that tries to gauge how likely a single sequence is to have originated from an ancient template molecule,
incorporating base quality scores and biological polymorphism.
We have shown that this method allows reduction of contamination in degraded ancient human sequence data to negligible
levels. In contrast to a previously suggested computational approach for PCR-based amplification of targeted regions (26), our
approach is particularly geared for the new generation of highthroughput sequencing approaches (1).
This framework offers many advantages over previous approaches, which aimed at detecting contamination by dividing
sequences based on fragment length or observed mismatches (8,
15). Most notably, information is provided by not only mismatches
Skoglund et al.
Skoglund et al.
still be possible to investigate how many source individuals are
represented using mtDNA, but because mtDNA/nuclear DNA
ratios can vary greatly between samples (16), the most robust approach will likely be to stratify sequences based on the PMDS
approach described here and assess whether population genetic
inferences are consistent across different thresholds (cf. ref. 8 with
Fig. 2). If the fraction of contamination can be estimated, we
suggest that a PMDS threshold should be chosen such that contamination is negligible. If data are too sparse for such estimates,
we suggest that main conclusions from the data should be robust
for a range of PMDS thresholds.
Whereas we have focused on human population genetic examples, contamination from present day sources also affects nonhuman
ancient DNA studies of pathogens, ancient microbiomes, animals,
and plants, and thus, computational removal of contamination in
ancient DNA data is likely to increase in importance as these fields
move to high-throughput sequencing approaches.
PðMatchjzÞ = ð1 − πÞ × ð1 − «Þ × ð1 − Dz Þ + ð1 − πÞ × « × Dz + π × « × ð1 − Dz Þ [1]
under a model of PMD (MPMD), which under a model of no PMD (MNULL,
where Dz = 0 for all z), becomes P(Matchjz) = (1 − π) × (1 − «) + « × π. The
probability of a C→T mismatch (or a G→A mismatch) for a particular site is
then any other event or combination of events, such that
PðMismatchjzÞ = 1 − PðMatchjzÞ:
[2]
In all applications here, we will assume that polymorphisms occur at a rate of
π = 0.001 to approximate autosomal genetic variation between a pair of
human chromosomes. We obtain « from the phred-scaled base qualities Q,
but because we restrict to a specific type of mismatch at each type of site
(C→T or G→A), we divide the total probability of a sequence error given by
the base qualities by three, such that « = 1/3 × 10−Q/10. For the probability of
postmortem nucleotide misincorporation, we use empirically observed patterns as the basis (21), and we approximate the probability by
Dz = ð1 − pÞz−1 × p + C,
[3]
where z is the distance from the relevant sequence terminus (the 5′ end for
C→T mismatches and the 3′ end for G→A mismatches), and the first position
has z = 1. In this study, we assumed P = 0.3 and C = 0.01, which are consistent
with most sequence datasets that are several thousand years old (Fig. S1),
PNAS | February 11, 2014 | vol. 111 | no. 6 | 2233
GENETICS
Materials and Methods
Likelihood-Based Framework for Separating Ancient DNA from Present Day
Contamination. C deaminations in ancient DNA sequences can be identified
by mismatches to the reference sequence of two specific types (18–20). We,
therefore, restrict our model to two categories of sites in a pairwise alignment between a DNA fragment and a reference sequence: (i) where the
reference has a C and the aligned fragment has either a C or a T and (ii)
where the reference base is G and the aligned base is either a G or an A.
Other positions in the alignment are not considered, except for providing
the distance z to the 5′ (in the case of a C in the reference) and 3′ (in the case
of a G in the reference) termini. For obtaining z, positions in the alignment
where there is a gap in the sequenced DNA fragment are not considered. In
SI Materials and Methods, we outline a modification of the model for libraries prepared using single-stranded methods (e.g., ref. 5).
There are three nonmutually exclusive events that can cause an observation of a C→T or a G→A mismatch at a given position in a sequence read
where the reference sequence displays a C or a G: (i) a true biological
polymorphism (occurring at rate-π), (ii) a sequence error (rate-«), or (iii) in
the case of degraded DNA, PMD (rate Dz), which we will assume varies
according to a modified geometric distribution (plus a small constant) across
the sequenced molecule with decreasing probability with distance z (measured in base pairs) from the relevant terminus (21) (Fig. S1). Alternatively,
observing a C-C match could be caused by a sequence error having reverted
a postmortem nucleotide misincorporation or a sequence error having
reverted a biological polymorphism. The third combination of events that
could be imagined—a nucleotide misincorporation having reverted a biological polymorphism—does not enter here, because nucleotide misincorporations only result in substitutions in one direction (i.e., C→T or G→A but not
T→C or A→G).
Considering all three mutually exclusive possibilities together, we have
ANTHROPOLOGY
typical of PMD (8) but also, matches to the reference. For example, observing several C-C matches in the terminus of a fragment provides support for the model without PMD, which could,
for example, allow the exclusion of contaminant fragments that
have a single C→T mismatch because of sequencing error or biological polymorphism. Furthermore, probabilistic weighting of
each observed match or mismatch according to its position along
the sequenced fragment allows balanced consideration of the evidence for PMD across the entire fragment. For example, observing a single C→T mismatch at the third 5′ position of the
fragment may provide as much evidence for the sequence fragment being of ancient origin as observing several such mismatches
at a more central position of the DNA fragment. Finally, explicit
consideration of the probability of sequencing error at each informative position removes the need for strict base quality cutoffs.
Accordingly, we found that simply conditioning on observing
a C→T mismatch in the sequence (8) does not achieve as high of
a degree of reduction in contamination rate as applying a PMDS
threshold (SI Results and Fig. S9). We also note that we, for
consistency, assumed a C deamination model corresponding to
∼30% C to T substitutions in the first 5′ positions throughout
this study (Fig. S1), but in cases where endogenous DNA is more
degraded, additional efficiency might be gained by modifying this
model (optional in our software implementation PMDtools).
One cause of potential concern is if either contaminant or endogenous DNA is closer to the reference sequence than the other,
in which case the number of mismatches caused by biological
polymorphisms is not the same. In this paper, we assume equal
divergence from the reference for the two models, because the
observed biological polymorphism is, in most cases, much smaller
than the rate of PMD; however, the model easily accommodates
different assumptions on biological polymorphisms. As the catalog
of genetic variation from, for example, humans becomes more
complete, additional improvements to our approach might include
masking positions with known polymorphisms, although it would
also increase the computational expense. Another complication
from enriching for postmortem damage is the errors that it
contributes to downstream analyses, but we found that this could
be alleviated by error correction (SI Results and Fig. S10).
We used the methodology described in this paper to retrieve
a complete mtDNA sequence from a contaminated Siberian Neandertal sample. The Okladnikov 2 sequence is basal to all published complete Neandertal mtDNA sequences from Central and
Western Europe, which may relate to it originating from the
easternmost Neandertal sample available (30). In addition to the
complete mitochondrion of the Mezmaiskaya 1 Neandertal from
the Caucasus (29, 34), analyses of smaller fragments of the mt
control region suggest that older Neandertals might contain
mtDNA lineages basal to the Okladnikov 2 sequence (30, 32), but
the discrepancy between the genealogy obtained using a smaller
hypervariable section (Fig. S8) and the resolved phylogeny using
the mtDNA genome sequence prompts additional large-scale
studies of these remains.
Although chronological age seems to be the most important
factor (23), PMD patterns in contaminant and endogenous sequences will likely be influenced by a complex range of factors, such
as temperature, depositional conditions, postexcavation handling,
and source tissue (17). Furthermore, it is possible that PMD patterns are also present in contaminant DNA. However, a recent
study (23) showed that samples 50–100 y old all showed limited
evidence of PMD (less than 10% nucleotide misincorporations in
the 5′ end). Thus, while the relative levels of PMD in contaminant
vs. endogenous sequences in a sample that contains a mixture of
both will usually not be known, as long as substantial patterns of
PMD are found in a sequence library (e.g., >10%), it seems reasonable to assume that PMD will be less extensive in the contaminating sequences under most circumstances. We suggest that, after
identifying a set of sequences with high PMD scores that are likely
to be endogenous, high- to medium-coverage data should be authenticated by analyzing whether the retained sequence reads are
from a single individual (4, 10, 24). For low-coverage data, it may
but note that, for older specimens with pervasive damage, greater efficiency
could be gained by increasing p. We set up a likelihood function of
ðLðMPMD jSi Þ = X × PðSi = MatchjMPMD Þ + ð1 − XÞ × PðSi = MismatchjMPMD Þ, [4]
where X is an indicator variable. Therefore, X = 1 if Si = Match, and X = 0 if Si =
Mismatch; the probabilities are given by Eqs. 1 and 2. The likelihood function
for model MNULL is set up in the same way, but in this case, Dz = 0 for all z. To
investigate whether an alignment is more likely to have originated from
a model with PMD (MPMD) or a model with no PMD (MNULL), we take the
natural logarithm of the ratio of the likelihood for each model multiplied over
all positions I in the sequence read S that satisfy our criteria. This log-likelihood
ratio provides a PMDS,
PMDS = log
!
∏Ii L MPMD jSi
,
I ∏i L MNULL jSi
[5]
which is informative on how likely a sequence is to have originated from
a degraded source. If I = 0, our test statistic is undefined for that particular
sequence read, and in practice, such (extremely rare) reads are discarded.
Capture and Sequencing of Okladnikov 2 mtDNA. DNA was extracted (35) from
about 200 mg bone from a humerus shaft found in Okladnikov Cave in the
Altai Mountains in the 1980s. Before mtDNA enrichment, 25 μL DNA extract
were turned into a sequencing library using a modified 454 library preparation protocol as described (29, 36), with the exception that Illumina p5 and
p7 adapters were used (37). The p7 adapter was modified to avoid contamination from other libraries carrying a unique 7-bp barcode. The primer
extension capture mtDNA enrichment protocol was performed as described
previously (29) (SI Materials and Methods) using the exact same four 144plex and 143-plex mixes used to retrieve complete Neandertal, Denisovan,
and Pleistocene modern human mtDNAs previously (10, 29, 38). The sequencing run was analyzed starting from raw images using the Illumina
Genome Analyzer pipeline 1.3.2 and the base caller Ibis (39). Raw sequences
1. Stoneking M, Krause J (2011) Learning about human population history from ancient
and modern genomes. Nat Rev Genet 12(9):603–614.
2. Green RE, et al. (2006) Analysis of one million base pairs of Neanderthal DNA. Nature
444(7117):330–336.
3. Noonan JP, et al. (2006) Sequencing and analysis of Neanderthal genomic DNA. Science 314(5802):1113–1118.
4. Rasmussen M, et al. (2010) Ancient human genome sequence of an extinct PalaeoEskimo. Nature 463(7282):757–762.
5. Meyer M, et al. (2012) A high-coverage genome sequence from an archaic Denisovan
individual. Science 338(6104):222–226.
6. Fu Q, et al. (2013) DNA analysis of an early modern human from Tianyuan Cave,
China. Proc Natl Acad Sci USA 110(6):2223–2227.
7. Keller A, et al. (2012) New insights into the Tyrolean Iceman’s origin and phenotype as
inferred by whole-genome sequencing. Nat Commun 3:698.
8. Skoglund P, et al. (2012) Origins and genetic legacy of Neolithic farmers and huntergatherers in Europe. Science 336(6080):466–469.
9. Sánchez-Quinto F, et al. (2012) Genomic affinities of two 7,000-year-old Iberian
hunter-gatherers. Curr Biol 22(16):1494–1499.
10. Krause J, et al. (2010) A complete mtDNA genome of an early modern human from
Kostenki, Russia. Curr Biol 20(3):231–236.
11. Richards MB, Sykes BC, Hedges RE (1995) Authenticating DNA extracted from ancient
skeletal remains. J Archaeol Sci 22(2):291–299.
12. Handt O, Krings M, Ward RH, Pääbo S (1996) The retrieval of ancient human DNA
sequences. Am J Hum Genet 59(2):368–376.
13. Kolman CJ, Tuross N (2000) Ancient DNA analysis of human populations. Am J Phys
Anthropol 111(1):5–23.
14. Sampietro ML, et al. (2006) Tracking down human contamination in ancient human
teeth. Mol Biol Evol 23(9):1801–1807.
15. Wall JD, Kim SK (2007) Inconsistencies in Neanderthal genomic DNA sequences. PLoS
Genet 3(10):1862–1866.
16. Green RE, et al. (2009) The Neandertal genome and ancient DNA authenticity. EMBO
J 28(17):2494–2502.
17. Pääbo S (1989) Ancient DNA: Extraction, characterization, molecular cloning, and
enzymatic amplification. Proc Natl Acad Sci USA 86(6):1939–1943.
18. Hofreiter M, Jaenicke V, Serre D, von Haeseler A, Pääbo S (2001) DNA sequences from
multiple amplifications reveal artifacts induced by cytosine deamination in ancient
DNA. Nucleic Acids Res 29(23):4793–4799.
19. Hansen AJ, Willerslev E, Wiuf C, Mourier T, Arctander P (2001) Statistical evidence for
miscoding lesions in ancient DNA templates. Mol Biol Evol 18(2):262–265.
20. Brotherton P, et al. (2007) Novel high-resolution characterization of ancient DNA
reveals C > U-type base modification events as the sole cause of post mortem miscoding lesions. Nucleic Acids Res 35(17):5717–5728.
21. Briggs AW, et al. (2007) Patterns of damage in genomic DNA sequences from a Neandertal. Proc Natl Acad Sci USA 104(37):14616–14621.
2234 | www.pnas.org/cgi/doi/10.1073/pnas.1318934111
obtained from Ibis for the two paired end reads of each sequencing cluster
were merged (including adapter removal) as previously described (24, 40).
Assembly and Analysis of the Okladnikov 2 mtDNA Genome Sequence. In total,
4,102,487 sequences were obtained after merging read pairs (unmerged
reads were excluded). After collapsing PCR duplicates to consensus sequences
(40), we obtained 6,906 sequences of 50,822 originally aligned sequences.
We assembled the mtDNA sequence using an iterative mapping assembler
(31), which takes position-specific PMD patterns into consideration in its
scoring matrix. Convergence was observed in two rounds of assembly. For
estimating contamination in the Okladnikov 2 sequence, we increased the
number of informative sites by identifying positions (with minimum depth
of 10) where the consensus base differed from 95% of 311 modern human
mtDNAs and where the Okladnikov 2 consensus was not an A or T in
a transition polymorphism. The seven fixed transversions between Neandertals and modern humans yielded a similar estimate of ∼10% contamination in Okladnikov 2. We inferred a gene tree relating 68 hominin mtDNA
sequences (25, 29, 31, 38, 41) using mrBayes 3.2.0 (42). The Markov chain
Monte Carlo was run for 5 million generations, with sampling every 1,000
generations and a burn-in of 1 million generations. A consensus tree was
constructed implementing a 50% majority rule.
ACKNOWLEDGMENTS. We thank Wieland Binzcik, Susanna Sawyer, Adrian
Briggs, Philip Johnson, Allissa Mittnik, Ludovic Orlando, Amhed Velazquez, Per
Sjödin, Lucie Gattepaille, Robert Ekblom, Ayça Omrak, Anders Götherström,
and Love Dalén for discussions and technical assistance. PMDtools, a python
program implementing PMDS computations, is available at www.ebc.uu.se/
Research/IEG/evbiol/research/Jakobsson/software/pmdtools/. Computations were
performed on resources provided by the Swedish National Infrastructure for Computing and Uppsala Multidisciplinary Center for Advanced Computational Science under Project b2010050. The laboratory
work was financed by the Max Planck Society. Financial support was also
provided by the Royal Swedish Academy of Sciences (P.S.) and the European
Research Council (M.J.).
22. García-Garcerà M, et al. (2011) Fragmentation of contaminant and endogenous DNA
in ancient samples determined by shotgun sequencing; prospects for human palaeogenomics. PLoS One 6(8):e24161.
23. Sawyer S, Krause J, Guschanski K, Savolainen V, Pääbo S (2012) Temporal patterns of nucleotide misincorporations and DNA fragmentation in ancient DNA. PLoS One 7(3):e34131.
24. Green RE, et al. (2010) A draft sequence of the Neandertal genome. Science
328(5979):710–722.
25. Reich D, et al. (2010) Genetic history of an archaic hominin group from Denisova Cave
in Siberia. Nature 468(7327):1053–1060.
26. Helgason A, et al. (2007) A statistical approach to identify ancient template DNA.
J Mol Evol 65(1):92–102.
27. Rasmussen M, et al. (2011) An Aboriginal Australian genome reveals separate human
dispersals into Asia. Science 334(6052):94–98.
28. Gilbert MTP, et al. (2007) Whole-genome shotgun sequencing of mitochondria from
ancient hair shafts. Science 317(5846):1927–1930.
29. Briggs AW, et al. (2009) Targeted retrieval and analysis of five Neandertal mtDNA
genomes. Science 325(5938):318–321.
30. Krause J, et al. (2007) Neanderthals in central Asia and Siberia. Nature 449(7164):
902–904.
31. Green RE, et al. (2008) A complete Neandertal mitochondrial genome sequence determined by high-throughput sequencing. Cell 134(3):416–426.
32. Dalén L, et al. (2012) Partial genetic turnover in neandertals: Continuity in the East
and population replacement in the West. Mol Biol Evol 29(8):1893–1897.
33. Malmström H, Storå J, Dalén L, Holmlund G, Götherström A (2005) Extensive human
DNA contamination in extracts from ancient dog bones and teeth. Mol Biol Evol
22(10):2040–2047.
34. Ovchinnikov IV, et al. (2000) Molecular analysis of Neanderthal DNA from the
northern Caucasus. Nature 404(6777):490–493.
35. Rohland N, Hofreiter M (2007) Comparison and optimization of ancient DNA extraction. Biotechniques 42(3):343–352.
36. Margulies M, et al. (2005) Genome sequencing in microfabricated high-density picolitre reactors. Nature 437(7057):376–380.
37. Meyer M, Kircher M (2010) Illumina sequencing library preparation for highly multiplexed target capture and sequencing. Cold Spring Harb Protoc 2010(6):t5448.
38. Krause J, et al. (2010) The complete mitochondrial DNA genome of an unknown
hominin from southern Siberia. Nature 464(7290):894–897.
39. Kircher M, Stenzel U, Kelso J (2009) Improved base calling for the Illumina Genome
Analyzer using machine learning strategies. Genome Biol 10(8):R83.
40. Kircher M (2012) Analysis of High-Throughput Ancient DNA Sequencing Data.
Ancient DNA (Springer, Berlin), pp 197–228.
41. Ingman M, Kaessmann H, Pääbo S, Gyllensten U (2000) Mitochondrial genome variation and the origin of modern humans. Nature 408(6813):708–713.
42. Ronquist F, et al. (2012) MrBayes 3.2: Efficient Bayesian phylogenetic inference and
model choice across a large model space. Syst Biol 61(3):539–542.
Skoglund et al.
Supporting Information
Skoglund et al. 10.1073/pnas.1318934111
SI Materials and Methods
Postmortem Degradation Score Distributions in Published Datasets.
We obtained sequence data from three previous studies (1–3). We
remapped the data which originated from HiFi polymerase amplification by Skoglund et al. (3) using Burrows-Wheeler Aligner
(BWA) 0.5.9 (4) with the seed region disabled (parameters: −1
16500, −n 0.01, −o 2) and formed consensus sequences of reads
with identical outer mapping coordinates (5). We used the calmd
program in the samtools (6) suite to recompute the MD field
(containing alignment information, such as mismatches) for all
datasets. To obtain postmortem degradation score (PMDS)
distributions, we excluded positions with base quality < 20 and
clipped alignments and alignments with gaps (these filters were
used regardless of whether PMDS thresholds were applied). For
Neandertal data mapped with ANFO (1), we required a mapping
quality of 90, whereas we required a mapping quality of 30 for all
other datasets.
We mapped two previously obtained contaminated Neandertal
sets (7) to the rCRS using BWA with the same parameters as above.
Because the data were obtained using a 454 technology that did not
produce base quality scores, we assumed a base quality score of 20
at all positions. To estimate contamination, we used 7 fixed transversions as well as 70 fixed transitions between Neandertal and
modern human mitochondria [excluding positions where Neandertals had Thymine (T) and modern humans had Cytosine (C) and
where Neandertals had Adenine (A) and modern humans had
Guanine (G)]. This procedure was to increase the number of
informative positions for these low-coverage datasets; for highercoverage datasets, we restricted the analysis to only seven transversions (see below).
In Silico Contamination Experiments. In the mtDNA contamination
experiment, we did not exclude any reads because of mapping
quality but required a base quality of at least 20 for PMDS computation and a depth at each site of at least 20 sequence reads. We
estimated modern contamination in the Vindija 33.16 mtDNA by
identifying seven transversion polymorphisms that are fixed between 6 complete Neandertal mtDNAs (8, 9) and the mtDNAs of
311 modern humans (10, 11).
For the autosomal analysis, we pooled sequence data from
three Neolithic hunter-gatherers, randomly choosing a single
sequence read at positions where more than one aligned sequence
was present, and we successively added more contaminating
sequences from a present day French individual (20%, 50%, and
90%). We extracted SNPs using sequences with mapping quality
of at least 30 and bases with base quality of at least 30 both with
and without filtering for PMDS ≥ 3 (positions with base quality <
30 were excluded from PMDS computation). The number of
SNPs ranged from 46,492 (no contamination or PMDS threshold) to 464,920 (90% contamination and no PMDS threshold).
We used all SNPs to perform principal component analysis using
EIGENSOFT 4.0 (12) with a reference dataset of 504 individuals
from Europe and the Levant (13–15), which was haploidized (a
single allele was randomly chosen from each diploid individual)
like in ref. 16 before analysis. The French individual who was the
source of the sequence data used for artificial contamination was
excluded from the principal component analysis. One SNP from
each pair with r2 value ≥ 0.2 was excluded from the principal
component analysis. We performed Procrustes transformation to
the principal component 1–principal component 2 configuration
obtained using only the reference dataset like in ref. 3.
Skoglund et al. www.pnas.org/cgi/content/short/1318934111
Capture and Sequencing of Okladnikov 2 mtDNA. DNA was extracted
from about 200 mg bone from a humerus shaft found in Okladnikov
Cave in the Altai Mountains in the 1980s as described (17), with
one modification: the DNA elution from the silica particles was
done using TE (Tris-EDTA) buffer with 0.05% Tween 20. Siliconized tubes were used for long-term storage of the DNA extract.
For the sampling of the bone, dentist drill bits were used together
with an NSK E-Max Micromotor System. Before mtDNA enrichment, 25 μL DNA extract were turned into a sequencing library using a modified 454 library preparation protocol as
described (9, 18), with the exception that Illumina p5 and p7
adapters were used as described elsewhere (19). The p7 adapter
was modified to avoid contamination from other libraries carrying
a unique 7-bp barcode.
The primer extension capture (PEC) mtDNA enrichment
protocol was performed as described previously (9) using the exact
same four 144-plex and 143-plex mixes used to retrieve complete
Neandertal, Denisovan, and Pleistocene modern human mtDNAs
previously (7, 9, 20). The PEC primers are 5′-biotinylated and
consist of a universal 12-base 5′ spacer sequence (CAAGGACATCCG) followed by a specific primer sequence. To design the
specific primer sequences, Primer3 (21) was used to find all possible primer sites on the light strand of the Vindija 33.16 Neandertal mtDNA sequence (8).
After the final spin column purification into 50 μL TE buffer, the
PEC products were sequenced on the Illumina GAII platform. For
this purpose, a PCR primer pair was constructed that is complementary to the p5 and p7 adapters on the 3′ ends. The Illumina/
Solexa library was then amplified in a 100-μL reaction containing
50 μL Phusion High-Fidelity Master Mix, 500 nM each Solexa
primer, and 10 μL PEC product template. Annealing temperature
was 60 °C, and a total of 10 cycles of PCR was performed. The
amplified products were spin column-purified and quantified on an
Agilent 2100 Bioanalyzer DNA 1000 Chip. The PEC product was
diluted and sequenced according to Illumina GAII protocols on
a paired end run with a total of 76 cycles.
The sequencing run was analyzed starting from raw images using
the Illumina Genome Analyzer pipeline 1.3.2. The first five sequencing cycles were used for cluster identification. After standard
base calling, reads of the PhiX 174 control lane were aligned to the
corresponding reference sequence to obtain a training dataset for
the base caller Ibis (22). Raw sequences obtained from Ibis for the
two paired end reads of each sequencing cluster were merged
(including adapter removal), requiring at least 11-nt overlap between the two reads. For bases in the overlap, quality scores were
summed up. In cases where different bases were called, the base
with the higher-quality score was chosen.
SI Results
Comparison of PMDS Analysis with Simple Filtering Based on Mismatches.
We compared the fraction of retained authentic sequences as
a function of their enrichment relative to contamination for the
PMDS approach with a simpler previously suggested approach,
where a sequence is retained if it displays a C→T mismatch within,
for example, 1–15 bp of the 5′ terminus of the sequence (3). We
found that, compared with the basic method, the PMDS approach
offers better tradeoff between the enrichment and the fractionretained endogenous sequences as well as being able to enrich for
authentic DNA far beyond what is possible with the simple
method of observing a mismatch (Fig. S3). Whereas the basic
method is able to achieve its maximum 20-fold enrichment by
restricting to sequences with a mismatch in the first position
1 of 10
(assuming contamination similar to the 100-y-old Australian
sample and 40,000-y-old Neandertal endogenous sequences), the
PMDS method is able to achieve, for instance, a 5,000-fold enrichment while retaining ∼10% of the authentic sequence reads in
the same data (Fig. S3).
Model for Single-Stranded Libraries. In single-stranded library protocols, the only empirically observed mismatches caused by C deamination events are C → T mismatches at both ends of sequences
(23). However, these mismatches could be affected by singlestranded overhangs in the original template molecule from either
end. Unlike double-stranded library amplifications, it is not possible
to differentiate between deamination caused by 5′ and 3′ singlestranded overhangs in data from single-stranded library preparation.
We, thus, consider two possible processes giving rise to postmortem
nucleotide polymorphisms Dz and Dy, where z and y denote distance
to the 5′ and 3′ termini, respectively. Here, the probability of match
is then the sum of the following exclusive events: no damage (from
either end), no error, and no polymorphism; a misincorporation has
been reverted because of sequence error (two terms: one for Dz and
one for Dy); and finally, a polymorphism is reverted because of error
conditional on no misincorporation:
PðMatchjz; yÞ = ð1 − πÞ × ð1 − «Þ × ð1 − Dz Þ × 1 − Dy + ð1 − πÞ
× « × Dz × 1 − Dy + ð1 − πÞ × « × Dy × ð1 − Dz Þ
+ π × « × ð1 − Dz Þ × 1 − Dy :
[S1]
Adjusting Base Qualities for PMD. Selecting for nucleotide misincorporations amplifies a problem that already plagues ancient
genomics: genotyping short reads with nucleotide misincorporations. PMD-aware consensus sequence calling has been developed
for mtDNA sequencing studies (9) but has not been widely
adopted for genome-wide sequence data (24), and no software for
large-scale genomic data is currently available. One possibility is to
adjust the base qualities for the probability of PMD (25), which
allows their incorporation into all downstream analyses.
We incorporated the PMD model used here into genotyping by
probabilistic adjustment of the base quality scores (25) (SI Materials
and Methods). Given the parameter Dz of the rate of nucleotide
misincorporations, we can adjust the base quality scores according
to the position-specific probability of PMD (25). Specifically, C/G
reference sites, where a T/A is observed in the sequence read, could
be erroneously inferred as a true polymorphism because of either
a sequencing error or a nucleotide misincorporation. In the same
notation as shown in Materials and Methods, an adjusted phredscaled base quality score Qadjusted is then (25)
−Q=10
: [S2]
Qadjusted = − 10 × log10 1 − ð1−Dz Þ × 1 − 10
We tested the ability of this approach to alleviate problems with
nucleotide misincorporations using the Vindija 33.16 mitochondrial data (1). First, the sequence data were remapped to the revised Cambridge human reference sequence (rCRS) using BWA
with the same parameters as above, because ANFO alignments
contained a portion of soft clipped regions that would be ignored
during base quality adjustment. Second, we called genotypes for
the entire mitochondrion using samtools (v0.1.16) pileup (6) in
haploid mode, requiring a phred-scaled consensus quality of at
least 30. Third, we counted the fraction of genotypes where there
was C/T or G/A ambiguity, stratifying by different thresholds on
the PMDS statistic, and found that this approach substantially
increases the number of sites that can be reliably called (Fig. S9).
Skoglund et al. www.pnas.org/cgi/content/short/1318934111
Analysis of the Okladnikov 2 mtDNA Assembly. Using positions that
separate the Okladnikov 2 majority allele from 311 modern human
mtDNAs (Material and Methods), we found that, of 1,318 informative
fragments, 149 fragments were putative contaminants (10.2%;
95% confidence interval = 8.7–11.7%), whereas for 3,908 sequences that had a positive PMDS, only 17 of 1,301 informative
fragments were putative contaminants (1.3%; 95% confidence
interval = 0.7–1.9%).
Average consensus support in the final assembly was 97.0%,
with 92% of all positions achieving support of 90% or higher. We
observed only one difference between the 15-fold coverage
consensus inferred using only sequences with PMDS ≥ 0 and the
25-fold unfiltered data: a T residue at the position corresponding
to rCRS position 1,845 in the PMDS ≥ 0 assembly, where the
unfiltered (contaminated) assembly displayed a C residue. We
aligned the consensus sequences of Okladnikov 2 to complete
mitochondrial sequences from 53 modern humans (10), 6 Neandertals (8, 9), 2 Denisovans (20, 26), 1 gorilla, 2 Bonobos, and 3
chimpanzees using MUSCLE (27) with default parameters. In this
alignment, we found that all other 67 mitochondria also displayed
the C residue, indicating that this position could be either private
to Okladnikov 2 or caused by C deamination. However, when
increasing the PMDS threshold to three, this position was not
reliably called. Hence, we took the conservative approach of excluding this position from the final consensus. We also observed
one position that differed between the Okladnikov 2 sequence
obtained in this study and the sequence previously obtained by
targeting the mtDNA control region detailed in the work by Krause
et al. (28). At this position (corresponding to 16,148 in rCRS), the
consensus called was a C, whereas the allele determined previously
was a T.
To investigate the effect on the consensus calls of the 10%
contamination rate in the unfiltered data, we repeated the procedure with the full contaminated data and investigated at which
positions we had either gained or lost a reliable call by restricting
to sequences with PMDS ≥ 0. There were 88 of 16,566 positions
that were not reliably called for the PMDS ≥ 0 data (0.53%)
compared with 24 of 16,563 positions in the unfiltered data
(0.14%). In total, seven called sites were gained by restricting to
PMDS ≥ 0, all of which were polymorphic or fixed differences in
a set of mtDNA sequences from 6 Neandertal and 53 modern
humans (Table S1). In contrast, among the calls that were lost by
PMDS ≥ 0, only 4 of 68 sites (5.9%) were polymorphic or fixed
differences. These observations are consistent with the prediction that the removal of contamination by the PMDS approach
enables base calling at evolutionary informative sites that are
otherwise ambiguous because of both contaminant and endogenous alleles being present. In contrast, the reduction from 25to 15-fold coverage in the final PMDS ≥ 0 assembly resulted in
random loss of sites without bias for those sites that are polymorphic or fixed differences in the set of 60 hominin mtDNAs.
To investigate the effect of reference sequence choice for
assembly, we also aligned Okladnikov 2 sequences to the rCRS
and observed two differences. The first difference is a CA dinucleotide repeat starting at position 514 in rCRS that displays
from three (in Gorilla and the six previously analyzed Neandertals) to seven repeats (in an Aboriginal Australian; AF346965)
in our alignment of 67 hominin mitochondria. Here, the Vindija
33.16 assembly had three repeats, consistent with the state in the
six previously analyzed Neandertals (but not Denisovans, who
have four repeats), whereas the rCRS assembly had five repeats,
a state found in several modern human mtDNAs. The second
difference is at position 16,263 in rCRS: the Vindija 33.16-based
assembly is undetermined at two positions, where the rCRS-based
assembly displays CT. After these two position, the six Neandertals
previously analyzed also display an A insertion not seen in other
sequences. The positions are further complicated by the PCRamplified region discussed in the work by Krause et al. (28), which
2 of 10
shows that the sequence CTA differs from the rCRS-based assembly. We, therefore, chose the conservative approach of masking
these two positions in the final assembly based on Vi33.16 and
excluding the possible A insertion.
Because the relationship between Okladnikov 2, Mezmaiskaya
1, and the five Western Neandertal mtDNAs differed from previous studies, we replicated the phylogenetic analysis using a 304bp alignment sequenced in several additional Neandertals (28–36)
and added three modern human sequences: Denisova 1, chimpanzee, and Bonobo sequences. In this analysis of 304 bp (Fig.
S6), the deeper divergences between Neandertal mtDNAs remain
unresolved, similar to the result found in ref. 28. However, the
Okladnikov 2 sequence obtained here is nearly identical with the
fragment sequenced by a previous study (28) for 304 bp, showing
that the uncalled positions in the new sequence are not responsible for the well-supported topology of the Mezmaiskaya 1
being basal to the Okladnikov 2 sequence and the other complete
Neandertal mtDNA genomes.
Validation of the Neandertal mtDNA Topology. To confirm that the
topology of Neandertal mtDNA genomes, where Mezmaiskaya 1
is basal to Western Neandertals and Okladnikov 2, was not affected by the PMDS filtering of Okladnikov 2, we obtained
shotgun sequences from Mezmaiskaya 1, Vindija 33.16, Vindija
33.25, and Vindija 33.36 from the study by Green et al. (1). We
realigned mtDNA sequences from these four individuals using
BWA with identical parameters as for Okladnikov 2 and downsampled these four datasets and the Okladnikov 2 data to exactly
2,500 randomly chosen sequences. We then excluded sequences
with a PMDS below zero and called consensus sequences using
the mpileup and vcf2fq tools in the samtools suite with default
parameters. We reconstructed an mtDNA sequence phylogeny
exactly as in the main analysis and found that the topology of
Neandertal mtDNA sequences was exactly the same as in the main
analysis, with the Mezmaiskaya 1 sequence basal to the Okladnikov
2 sequence and the Vindija Neandertals (Fig. S7).
1. Green RE, et al. (2010) A draft sequence of the Neandertal genome. Science 328(5979):
710–722.
2. Rasmussen M, et al. (2011) An Aboriginal Australian genome reveals separate human
dispersals into Asia. Science 334(6052):94–98.
3. Skoglund P, et al. (2012) Origins and genetic legacy of Neolithic farmers and huntergatherers in Europe. Science 336(6080):466–469.
4. Li H, Durbin R (2009) Fast and accurate short read alignment with Burrows-Wheeler
transform. Bioinformatics 25(14):1754–1760.
5. Kircher M (2012) Analysis of High-Throughput Ancient DNA Sequencing Data.
Ancient DNA (Springer, Berlin), pp 197–228.
6. Li H, et al. (2009) The sequence alignment/map format and SAMtools. Bioinformatics
25(16):2078–2079.
7. Krause J, et al. (2010) A complete mtDNA genome of an early modern human from
Kostenki, Russia. Curr Biol 20(3):231–236.
8. Green RE, et al. (2008) A complete Neandertal mitochondrial genome sequence
determined by high-throughput sequencing. Cell 134(3):416–426.
9. Briggs AW, et al. (2009) Targeted retrieval and analysis of five Neandertal mtDNA
genomes. Science 325(5938):318–321.
10. Ingman M, Kaessmann H, Pääbo S, Gyllensten U (2000) Mitochondrial genome variation
and the origin of modern humans. Nature 408(6813):708–713.
11. Green RE, et al. (2006) Analysis of one million base pairs of Neanderthal DNA. Nature
444(7117):330–336.
12. Patterson N, Price AL, Reich D (2006) Population structure and eigenanalysis. PLoS
Genet 2(12):e190.
13. Li JZ, et al. (2008) Worldwide human relationships inferred from genome-wide
patterns of variation. Science 319(5866):1100–1104.
14. Surakka I, et al. (2010) Founder population-specific HapMap panel increases power in
GWA studies through improved imputation accuracy and CNV tagging. Genome Res
20(10):1344–1351.
15. Altshuler DM, et al. (2010) Integrating common and rare genetic variation in diverse
human populations. Nature 467(7311):52–58.
16. Skoglund P, Jakobsson M (2011) Archaic human ancestry in East Asia. Proc Natl Acad
Sci USA 108(45):18301–18306.
17. Rohland N, Hofreiter M (2007) Comparison and optimization of ancient DNA
extraction. Biotechniques 42(3):343–352.
18. Margulies M, et al. (2005) Genome sequencing in microfabricated high-density
picolitre reactors. Nature 437(7057):376–380.
19. Meyer M, Kircher M (2010) Illumina sequencing library preparation for highly
multiplexed target capture and sequencing. Cold Spring Harb Protoc 2010(6):t5448.
20. Krause J, et al. (2010) The complete mitochondrial DNA genome of an unknown
hominin from southern Siberia. Nature 464(7290):894–897.
21. Rozen S, Skaletsky H (1999) Primer3 on the WWW for General Users and for Biologist
Programmers. Bioinformatics Methods and Protocols (Springer, Berlin), pp 365–386.
22. Kircher M, Stenzel U, Kelso J (2009) Improved base calling for the Illumina Genome
Analyzer using machine learning strategies. Genome Biol 10(8):R83.
23. Meyer M, et al. (2012) A high-coverage genome sequence from an archaic Denisovan
individual. Science 338(6104):222–226.
24. Rasmussen M, et al. (2010) Ancient human genome sequence of an extinct PalaeoEskimo. Nature 463(7282):757–762.
25. Jónsson H, Ginolhac A, Schubert M, Johnson PL, Orlando L (2013) mapDamage2.0: Fast
approximate Bayesian estimates of ancient DNA damage parameters. Bioinformatics
29(13):1682–1684.
26. Reich D, et al. (2010) Genetic history of an archaic hominin group from Denisova Cave
in Siberia. Nature 468(7327):1053–1060.
27. Edgar RC (2004) MUSCLE: Multiple sequence alignment with high accuracy and high
throughput. Nucleic Acids Res 32(5):1792–1797.
28. Krause J, et al. (2007) Neanderthals in central Asia and Siberia. Nature 449(7164):
902–904.
29. Krings M, et al. (1997) Neandertal DNA sequences and the origin of modern humans.
Cell 90(1):19–30.
30. Krings M, et al. (2000) A view of Neandertal genetic diversity. Nat Genet 26(2):144–146.
31. Ovchinnikov IV, et al. (2000) Molecular analysis of Neanderthal DNA from the
northern Caucasus. Nature 404(6777):490–493.
32. Lalueza-Fox C, et al. (2006) Mitochondrial DNA of an Iberian Neandertal suggests
a population affinity with other European Neandertals. Curr Biol 16(16):R629–R630.
33. Lalueza-Fox C, et al. (2005) Neandertal evolutionary genetics: Mitochondrial DNA
data from the iberian peninsula. Mol Biol Evol 22(4):1077–1081.
34. Orlando L, et al. (2006) Revisiting Neandertal diversity with a 100,000 year old mtDNA
sequence. Curr Biol 16(11):R400–R402.
35. Caramelli D, et al. (2006) A highly divergent mtDNA sequence in a Neandertal
individual from Italy. Curr Biol 16(16):R630–R632.
36. Dalén L, et al. (2012) Partial genetic turnover in neandertals: Continuity in the East
and population replacement in the West. Mol Biol Evol 29(8):1893–1897.
Skoglund et al. www.pnas.org/cgi/content/short/1318934111
3 of 10
Fig. S1. Nucleotide misincorporation patterns arising from C deamination in ancient DNA sequences. (A and B) Empirical patterns in datasets analyzed in this
study. (C and D) Rate of deamination assumed in the PMDS model in this study.
Skoglund et al. www.pnas.org/cgi/content/short/1318934111
4 of 10
0.200 0.500
0.020 0.050
0.005
0.001
C to T fraction at first position
Late Pleistocene Neandertals
Neolithic Scandinavians
100 year old Aborigine
HGDP cell lines
1
10
100
1000
10000
Age of sample
Fig. S2. Relationship between age and observed deamination fraction (C to T mismatches) in the 5′ end of sequence data analyzed in this study. Blue,
Neolithic Scandinavians; gray, HGDP cell lines from present day individuals; green, 100-y-old Australian Aborigine hair sample; red, Pleistocene Neandertals.
Note the log scale for both the x and y axes.
Fig. S3. (A) The amount of retained data as a function of different thresholds on PMDS. (B) The enrichment of ancient over more recent sequence reads as
a function of chosen PMDS threshold. In our example data, a PMDS threshold of ∼2 would achieve a 100-fold enrichment of Neandertal and Neolithic
Scandinavian sequences over present day (contaminating) sequences, and a PMDS threshold of ∼6 would achieve the same 100-fold enrichment when contaminating the Neandertal or Neolithic Scandinavian data with similar sequences from the 100-y-old remains of the Australian individual.
Skoglund et al. www.pnas.org/cgi/content/short/1318934111
5 of 10
0
5
10
15
20
25
Distance from 5' end of sequence read
30
0.5
0.1
0.2
0.3
0.4
A
others
0.0
Mismatch frequency in read given G in ref.
0.4
0.1
0.2
0.3
T
others
0.0
Mismatch frequency in read given C in ref.
Fig. S4. Inferred gene genealogy of contaminated Mezmaiskaya 2 mtDNA sequence data (A) before and (B) after PMDS filtering. A consensus mtDNA sequence for Mezmaiskaya 2 was called using the haploid model of samtools pileup either using all sequences or restricting to sequences with a PMDS ≥ 4. The
consensus sequences were subsequently aligned to mtDNA genome sequences from seven modern humans, six Neandertals, two Denisovans, one Bonobo, one
chimpanzee, and one gorilla. A neighbor-joining tree was reconstructed for both datasets.
−30
−25
−20
−15
−10
−5
0
Distance from 3' end of sequence read
Fig. S5. Nucleotide misincorporation patterns in Okladnikov 2 mtDNA sequences. We show the fraction of given nucleotides in the Okladnikov 2 sequence
data at positions where the rCRS reference sequence had (Left) a C residue or (Right) an A residue.
Skoglund et al. www.pnas.org/cgi/content/short/1318934111
6 of 10
1000
800
600
400
0
200
Number of positions
0
10
20
30
40
50
Depth
Fig. S6. Sequencing depth distribution for the Okladnikov 2 mitochondrial genome. Only sequences with PMDS ≥ 0 were used.
Skoglund et al. www.pnas.org/cgi/content/short/1318934111
7 of 10
Fig. S7. The position of Okladnikov 2 in the Neandertal mtDNA tree topology is not biased by the PMDS filtering. Mitochondrial sequences from Okladnikov
2, Mezmaiskaya 1, Vindija 33.16, Vindija 33.25, and Vindija 33.26 were down-sampled to 2,500 randomly sampled sequences, of which sequences with negative
PMDSs were excluded. A Bayesian tree was reconstructed, and the Neandertal mtDNA sequence tree topology was identical to the one in the main analysis,
where only Okladnikov 2 was subjected to PMDS filtering.
Skoglund et al. www.pnas.org/cgi/content/short/1318934111
8 of 10
Fig. S8. Phylogenetic analysis of the hypervariable region. The Okladnikov 2 sequence obtained here falls unambiguously with the previously sequenced short
fragment. However, the relationship with Mezmaiskaya 1 is not resolved based on this restricted region. A is restricted to sequences covering 304 bp, whereas
B also includes the partial Scladina and Teshik Tash sequences (for which the flanking missing sequence was marked as missing data).
Skoglund et al. www.pnas.org/cgi/content/short/1318934111
9 of 10
5
Fig. S9. Comparison of PMDS filtering with a basic method of extracting reads with a C→T/G→A mismatch in the terminus. The PMDS results are based on
PMDS thresholds of 0 (least enrichment) to 10 (greater enrichment). The basic method results are based on extracting reads with a C→T mismatch at most n bp
from the 5′ terminus, where n ranged from 15 (less enrichment) to 1 (greatest enrichment). Higher values of n result in less-efficient enrichment. For both
methods, we excluded bases with quality < 20.
Fig. S10. Genotyping ancient DNA using base qualities adjusted for PMD. Fraction C/T and G/A genotypes obtained using standard samtools pileup haploid
genotyping (gray) and base qualities adjusted using a model of nucleotide misincorporation (black) are displayed.
Table S1. Characterization of base calls gained and lost in the PMDS ≥ 0 assembly of Okladnikov 2
State in 6 Neandertal and 53 modern
human mtDNA sequences
Calls gained in the
PMDS ≥ 0 assembly
Calls lost in the
PMDS ≥ 0 assembly
Not polymorphic
Fixed differences
Polymorphic in Neandertals only
Polymorphic in modern humans only
Shared polymorphisms
Total polymorphisms
0
2
1
3
1
7/7 (100%)
64
2
0
1
1
4/68 (5.9%)
We present the number of positions for each category that is polymorphic in a set of 6 previously published Neandertal mtDNAs and
53 modern human mtDNAs.
Skoglund et al. www.pnas.org/cgi/content/short/1318934111
10 of 10