Download Separating endogenous ancient DNA from modern day

Separating endogenous ancient DNA from modern day contamination in a Siberian Neandertal Pontus Skoglunda,1, Bernd H. Northoffb,2, Michael V. Shunkovc, Anatoli P. Dereviankoc, Svante Pääbob, Johannes Krauseb,d, and Mattias Jakobssona,e a Department of Evolutionary Biology and eScience for Life Laboratory, Uppsala University, 75236 Uppsala, Sweden; bDepartment of Evolutionary Genetics, Max Planck Institute for Evolutionary Anthropology, 04103 Leipzig, Germany; cPalaeolithic Department, Institute of Archaeology and Ethnography, Russian Academy of Sciences Siberian Branch, Novosibirsk 630090, Russia; and dInstitute for Archaeological Sciences, University of Tuebingen, 72070 Tuebingen, Germany Edited by Richard G. Klein, Stanford University, Stanford, CA, and approved December 27, 2013 (received for review October 9, 2013) | human evolution R etrieval and sequencing of DNA from fossil material has been revolutionized by the rapid surge of high-throughput sequencing technology (1). One of the advantages of the new generation of sequencing technologies over previous PCR-based approaches is the routine sequencing of molecules across their entire length in contrast to specific priming sites. Although this approach has been extended to ancient material from a number of organisms, there is particular interest in genomic analyses of ancient human populations (2–9). However, despite extensive precautions against contamination during laboratory sample preparation, many fossil samples show evidence of contamination from not only the microenvironment of the fossil, but also present day humans (2, 10–14). Although the former type of contamination can introduce statistical noise in comparisons of ancient DNA with modern populations, the latter type of contamination (from present day humans) can impose major biases on population genetic and phylogenetic analyses (15, 16). Ancient DNA sequences show several features indicative of postmortem degradation (PMD) (17). The main diagnostic feature documented so far is a pattern of Cytosine (C) →Thymine (T) substitutions (18–20) that increases toward the 5′ end of the sequence reads (21), which in most applications, results in a complementary Guanine (G) →Adenine (A) pattern in the 3′ end caused by enzymatic repair (5). This pattern has been attributed to Cytosine deamination at single-stranded ends of the molecules (21) and shows a clear tendency of increase over time in contrast to other potential diagnostic features, such as fragment length and preferential fragmentation at purines (22, 23). However, although the observation of such a pattern suggests the presence of degraded DNA in a sequence dataset, it does not prove that modern contamination is absent or even at low frequency. For high-coverage data from extinct close relatives of modern humans, an efficient approach to estimate the contamination proportion is to leverage SNPs that are known to be fixed or nearly fixed in present populations (2, 24). If it can be assumed that contamination is minor www.pnas.org/cgi/doi/10.1073/pnas.1318934111 Significance Strict laboratory precautions against present day human DNA contamination are standard in ancient DNA studies, but contamination is already present inside many ancient human fossils from previous handling without specific precautions. We designed a statistical framework to isolate endogenous ancient DNA sequences from contaminating sequences using postmortem degradation patterns and were able to reduce high-contamination fractions to negligible levels. We captured DNA sequences from a contaminated Neandertal bone from Okladnikov Cave in Siberia and used our method to assemble its mitochondrial genome sequence, which we find to be from a lineage basal to five of six previously published complete Neandertal mitochondrial genomes. Our method paves the way for the large-scale genetic analysis of contaminated human remains. Author contributions: P.S., S.P., J.K., and M.J. designed research; P.S., B.H.N., S.P., and J.K. performed research; M.V.S. and A.P.D. contributed new reagents/analytic tools; P.S. analyzed data; and P.S., S.P., J.K., and M.J. wrote the paper. The authors declare no conflict of interest. This article is a PNAS Direct Submission. Data deposition: The sequence reported in this paper has been deposited in the GenBank database (accession no. KF982693). 1 To whom correspondence should be addressed. E-mail: [email protected]. 2 Present address: Institute of Laboratory Medicine, Ludwig Maximilians University, 81377 Munich, Germany. This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10. 1073/pnas.1318934111/-/DCSupplemental. PNAS | February 11, 2014 | vol. 111 | no. 6 | 2229–2234 GENETICS paleogenomics relative to endogenous DNA, this approach can also be extended to high-coverage data from ancient modern humans by investigating whether more alleles are present than expected from the ploidy level of the chromosome (4, 24, 25) or making assumptions about the ancestry of both ancient and contaminating individuals (4). For low-coverage data, stratifying sequences based on whether a mismatch consistent with PMD degradation have occurred, and investigating the consistency of population genetic patterns across categories can be used to investigate whether contamination is present at such levels as to affect the conclusions of the analysis (8). However, these approaches only allow investigation of whether contamination is present. To our knowledge, no approach has been developed to overcome significant contamination levels, except for the case of targeted PCR-based sequencing of haploid regions, for which most of the informative termini of the fragments are lost (26). This situation has limited large-scale ancient DNA analyses of human specimens to those specimens with very low levels of present day DNA contamination, leaving the genetic information in potentially crucial fossil material often unstudied. Here, we propose a likelihood approach that incorporates sequence errors and alignment to the reference genome to provide a score that is informative on whether a given sequence is likely to have arisen from a degraded template molecule. We show that this method allows reduction of high contamination rates down to levels that are negligible for most evolutionary purposes and give examples using both mitochondrial (mt) and ANTHROPOLOGY One of the main impediments for obtaining DNA sequences from ancient human skeletons is the presence of contaminating modern human DNA molecules in many fossil samples and laboratory reagents. However, DNA fragments isolated from ancient specimens show a characteristic DNA damage pattern caused by miscoding lesions that differs from present day DNA sequences. Here, we develop a framework for evaluating the likelihood of a sequence originating from a model with postmortem degradation—summarized in a postmortem degradation score—which allows the identification of DNA fragments that are unlikely to originate from present day sources. We apply this approach to a contaminated Neandertal specimen from Okladnikov Cave in Siberia to isolate its endogenous DNA from modern human contaminants and show that the reconstructed mitochondrial genome sequence is more closely related to the variation of Western Neandertals than what was discernible from previous analyses. Our method opens up the potential for genomic analysis of contaminated fossil material. genome-wide data from the literature. Finally, we produced a high-coverage mtDNA dataset from a Siberian Neandertal fossil excavated in Okladnikov Cave that is contaminated with a substantial amount of modern human DNA and use our method for separating endogenous sequences from contaminating sequences to reconstruct its near-complete mitochondrial genome sequence. A Results B Model Outline. The diagnostic nucleotide misincorporation pattern that arises from ancient DNA damage can be detected only by observing matches or mismatches in an alignment with one or more reference sequences. In the 5′ end of a sequence fragment, as many as 30–40% of C residues can appear as T in DNA fragments from samples that are several thousand years old under typical preservation conditions (21, 23). This fraction reflects both Cs that are deaminated to Uracil, which are read as T by DNA polymerases, and methylated Cs that are deaminated to Ts. Whether such mismatches are observed, thus, provides information on the authenticity of a given DNA fragment. However, C→T and G→A mismatches appear not solely because of ancient DNA damage but also, because of biological polymorphisms and sequencing errors. We developed a likelihood framework that explicitly models these three processes (Material and Methods). To investigate authenticity of a given DNA fragment, we use this framework to evaluate two competing models, of which one assumes ancient DNA degradation and the other does not, arriving at a final log-likelihood ratio of the two models that we term a PMD score (PMDS). A positive PMDS for a DNA fragment indicates support for the ancient DNA model relative to the alternative model, and the greater the PMDS, then the greater the support for the DNA fragment having been subject to DNA degradation. Empirical Distributions of PMD Scores in Ancient and Present Day Samples. We computed PMD scores for 1 million randomly sampled sequences from the 100-y-old remains of an Australian individual (27), four ∼5,000-y-old Neolithic Scandinavian individuals (8), four 38,000- to 70,000-y-old Neandertal individuals (24), and four present day individuals (24). These individuals showed evidence of PMD roughly proportional to their age (Figs. S1 and S2), and four PMDS distributions from representative individuals are shown in Fig. 1. Most importantly, we find that sequences from ancient individuals have an excess of high PMD scores, implying that a set of ancient DNA sequences shows enough PMD to be unlikely to be mistaken for contaminants. For example, looking at the cumulative distribution of PMDS, we find that, for a threshold of PMDS > 5, we would retain ∼15–20% of sequences from Neandertals and Neolithic Scandinavians but only ∼0.01–0.02% from any present day individual and 0.27% from the 100-y-old remains of the Australian individual (note that this sample is derived from hair and thus, may have marginally decreased levels of DNA damage) (28) (Fig. S3). Thus, an empirical probability of a sequence being from a present day (contaminating) source could, in principle, be extrapolated and propagated in downstream genotyping inference, much like the common practice for sequence error estimates. However, this procedure is not straightforward, because the (usually unknown) initial level of contamination in the library affects the final contamination fraction associated with each given PMDS threshold. Thus, we will focus on imposing hard PMDS thresholds to contaminated datasets to be able to estimate the contamination fraction among the sequences downstream of genotyping. Artificial Contamination Experiments. To investigate the efficiency of the PMDS approach for genotyping high-coverage data, we created mosaic datasets of mitochondrial sequences from a French present day individual and the Vindija 33.16 Neandertal individual. These contaminated data were filtered using PMDS thresholds of three or five and compared with the data with no filtering. We estimated contamination as the fraction of modern human-specific alleles using seven diagnostic transversion SNPs and found that the unfiltered data show increasing rates of modern human alleles with increasing contamination input (as expected). In contrast, 2230 | www.pnas.org/cgi/doi/10.1073/pnas.1318934111 C D Fig. 1. PMDS distribution in ancient and present day human sequences. (A) A present day human. (B) Hundred-year-old remains of an Australian individual. (C) Five thousand-year-old remains of a Scandinavian individual. (D) A Pleistocene Neandertal. Note that the x axes in C and D are truncated, and PMD scores reach ∼26. a PMDS threshold of five results in no detectable contamination postfiltering (0.0%) for prefiltering contamination levels (per base pair) from 1% to ∼93% (Fig. 2A). To illustrate the use of the approach for autosomal low-coverage data, we performed a similar in silico contamination experiment using genome-wide sequence data from the Neolithic hunter-gatherers, successively adding French sequences to final contamination fractions of 20%, 50%, and 90%. We extracted SNPs typed in 504 individuals from Europe and the Levant (SI Materials and Methods) from these contaminated datasets as well as the uncontaminated dataset, performed principal component analysis for each dataset separately, and merged the obtained principal component 1 (PC1) –PC2 configurations using Procrustes transformation (8). As in the original study (8), we find that the unmodified hunter-gatherer data are outside the distribution of modern day populations but most similar to Northern European populations. As we artificially add French contamination, the ancient data are inferred to be successively more similar to the French population. When we apply a PMDS threshold of three to the sequences, no effect of contamination can be seen, with the PC1–PC2 configuration of all filtered datasets showing very similar projections as the uncontaminated (and unfiltered) data (Fig. 2B). Contamination Reduction Using PMD Scores in Genuinely Contaminated Datasets. Whereas the above examples are designed to test the efficiency of the method in common genetic analyses, the controlled nature of our artificial contamination experiments allows us to examine the contamination rate as a function of PMDS threshold directly. We confirm that present day French contamination in the Vindija 33.16 Neandertal will be negligible for PMDS ≥ 4 (Fig. 3A). Similarly, for data with a less extreme temporal difference (the 100y-old remains of an Australian individuals as the contaminant and the 5,000-y-old remains of a Scandinavian hunter-gatherer as endogenous DNA), contamination levels as high as 95% can be Skoglund et al. B A Mitochondrial Genome Sequence of a Contaminated Neandertal Specimen from Okladnikov Cave, Siberia. Because the PMDS approach provides a tool to substantially reduce contamination levels in ancient DNA, we used primer extension capture (29) to obtain 6,906 unique mtDNA sequences from a Neandertal sample from Okladnikov Cave (Okladnikov 2), in which contamination A B limited previous DNA analysis to a small mt region that could be amplified with Neandertal-specific primers sets (30). We found clear evidence of PMD consistent with endogenous DNA (Fig. S5), and 10.2% (95% confidence interval = 8.7–11.7%) of all fragments overlapping diagnostic positions displayed the modern day human allele, indicative of contamination. However, when we restricted the analysis to 3,908 sequences that had a PMDS ≥ 0, the point estimate of contamination was reduced to 1.3% (95% confidence interval = 0.7–1.9%) (Fig. 3C). We, thus, discarded sequences with PMDS < 0 and assembled the mitochondrion of Okladnikov 2 using an ancient DNA-aware iterative mapping assembler (31). Average sequence depth for this assembly was 15fold compared with 25-fold in the unfiltered data, with a total of 96.0% of all positions covered by at least five fragments and 99.4% covered by at least two fragments (Fig. 4 and Fig. S6). There were 88 positions that were not reliably called for the PMDS ≥ 0 data (0.53%) compared with 24 positions in the unfiltered data (0.14%). However, we find that the seven sites that were gained when restricting to PMDS ≥ 0 were all polymorphic and thus, evolutionary informative (Table S1). In contrast, only 4 of 68 sites that were lost were polymorphic. This observation shows that PMDS filtering can resolve positions that are difficult to call C Fig. 3. Reduction of artificial and genuine contamination using PMDSs. (A) Expected contamination fraction in the Vindija 33.16 Neandertal as a function of the PMDS threshold using sequence data from a present day French individual as the artificial contaminant. (B) Expected contamination fraction in the Ajv70 Scandinavian hunter-gatherer as a function of the PMDS threshold using sequence data from the 100-y-old remains of an Australian individual as the artificial contaminant. The different colored lines in A and B correspond to different contamination levels (before PMDS filtering, the initial contamination fraction is given at PMDS = −2). (C) Estimated modern human contamination in three genuinely contaminated Neandertal datasets from Feldhofer 2, Mezmaiskaya 2, and Okladnikov 2. Skoglund et al. PNAS | February 11, 2014 | vol. 111 | no. 6 | 2231 ANTHROPOLOGY reduced to negligible levels by increasing the threshold on the PMDS statistic (Fig. 3B). However, because the data used for our artificial contamination experiment may not perfectly represent genuine cases of contamination, we also investigated two genuinely contaminated Neandertal mtDNA datasets (10): Feldhofer 2 and Mezmaiskaya 2. In line with the artificial contamination experiment above, we find strong reductions from the initial contamination level of ∼79% (Feldhofer 2) and ∼45% (Mezmaiskaya 2) for both these datasets (Fig. 3C) (10). Moreover, we find that consensus sequences from the original (and contaminated) Mezmaiskaya 2 dataset falls within the variation of modern humans, whereas the consensus of sequences post-PMDS filtering falls within Neandertal variation (Fig. S4), similar to results from a less contaminated sample from the same individual (29). GENETICS Fig. 2. Artificial mtDNA and autosomal contamination experiments. (A) Artificial contamination represented by sequences from a present day human (French) was added to a sequence dataset from a Croatian Neandertal (Vindija 33.16). Final per base pair contamination was estimated using diagnostic mtDNA positions that differentiate Neandertals from modern humans. Up to 90% per base pair contamination levels can be reduced to negligible levels using the PMDS approach. (B) Artificial contamination represented by sequences from a present day human (French) was added to a sequence dataset from Neolithic Scandinavian huntergatherers. SNPs were extracted with and without filtering for PMDS, and a principal component analysis together with European and Levantine populations was performed for the different levels of contamination. The PCs were Procrustes transformed for each separate analysis (8). A B C Fig. 4. A mt genome sequence from a contaminated Neandertal sample from Okladnikov Cave, Siberia. (A) Coverage distribution and sequence features of the Okladnikov 2 mt genome. (B) Gene genealogy of 67 published complete hominid mitochondria and Okladnikov 2. Node support is only indicated for posterior probabilities ≥ 0.75. (C) Geographical sampling location of Neandertal and Denisovan individuals from which mtDNA genomes have been sequenced. Okladnikov Cave is marked by a red square. GC, Guanine-Cytosine content. because of different alleles in the contaminant and endogenous sequences (SI Results). We aligned the Okladnikov 2 mt genome sequence to complete mt genomes from other archaic and modern humans and reconstructed a gene tree using a Bayesian approach (Material and Methods). We find that the Okladnikov 2 sequence falls outside all published complete Neandertal mt genomes from Central and Western Europe with high posterior probability but forms a cluster with these mt genomes to the exclusion of the mtDNA of Mezmaiskaya 1 from the Caucasus (Fig. 4). This topology is also obtained when a PMD filter is also applied to Mezmaiskaya 1, Vindija 33.16, Vindija 33.25, and Vindija 33.26 sequences (Fig. S7). The mtDNA genome sequence reconstructed here, thus, resolves the relationship between Okladnikov 2 and other Neandertal mt genomes, because analyses based on the control region have suffered from statistical uncertainty but indicated that the Okladnikov 2 mtDNA is basal to Mezmaiskaya 1 (30, 32), a result that we replicated by extracting a 304-bp region of the Okladnikov 2 mtDNA sequence assembly that overlapped with previous data (Fig. S8). 2232 | www.pnas.org/cgi/doi/10.1073/pnas.1318934111 Discussion Despite that modern human contamination has been shown to be present in a large proportion of archaeological human samples (2, 11, 33), large-scale DNA sequence retrieval has so far been limited to hominin specimens with very little contamination. Here, we developed a likelihood framework for large-scale ancient DNA studies that tries to gauge how likely a single sequence is to have originated from an ancient template molecule, incorporating base quality scores and biological polymorphism. We have shown that this method allows reduction of contamination in degraded ancient human sequence data to negligible levels. In contrast to a previously suggested computational approach for PCR-based amplification of targeted regions (26), our approach is particularly geared for the new generation of highthroughput sequencing approaches (1). This framework offers many advantages over previous approaches, which aimed at detecting contamination by dividing sequences based on fragment length or observed mismatches (8, 15). Most notably, information is provided by not only mismatches Skoglund et al. Skoglund et al. still be possible to investigate how many source individuals are represented using mtDNA, but because mtDNA/nuclear DNA ratios can vary greatly between samples (16), the most robust approach will likely be to stratify sequences based on the PMDS approach described here and assess whether population genetic inferences are consistent across different thresholds (cf. ref. 8 with Fig. 2). If the fraction of contamination can be estimated, we suggest that a PMDS threshold should be chosen such that contamination is negligible. If data are too sparse for such estimates, we suggest that main conclusions from the data should be robust for a range of PMDS thresholds. Whereas we have focused on human population genetic examples, contamination from present day sources also affects nonhuman ancient DNA studies of pathogens, ancient microbiomes, animals, and plants, and thus, computational removal of contamination in ancient DNA data is likely to increase in importance as these fields move to high-throughput sequencing approaches. PðMatchjzÞ = ð1 − πÞ × ð1 − «Þ × ð1 − Dz Þ + ð1 − πÞ × « × Dz + π × « × ð1 − Dz Þ [1] under a model of PMD (MPMD), which under a model of no PMD (MNULL, where Dz = 0 for all z), becomes P(Matchjz) = (1 − π) × (1 − «) + « × π. The probability of a C→T mismatch (or a G→A mismatch) for a particular site is then any other event or combination of events, such that PðMismatchjzÞ = 1 − PðMatchjzÞ: [2] In all applications here, we will assume that polymorphisms occur at a rate of π = 0.001 to approximate autosomal genetic variation between a pair of human chromosomes. We obtain « from the phred-scaled base qualities Q, but because we restrict to a specific type of mismatch at each type of site (C→T or G→A), we divide the total probability of a sequence error given by the base qualities by three, such that « = 1/3 × 10−Q/10. For the probability of postmortem nucleotide misincorporation, we use empirically observed patterns as the basis (21), and we approximate the probability by Dz = ð1 − pÞz−1 × p + C, [3] where z is the distance from the relevant sequence terminus (the 5′ end for C→T mismatches and the 3′ end for G→A mismatches), and the first position has z = 1. In this study, we assumed P = 0.3 and C = 0.01, which are consistent with most sequence datasets that are several thousand years old (Fig. S1), PNAS | February 11, 2014 | vol. 111 | no. 6 | 2233 GENETICS Materials and Methods Likelihood-Based Framework for Separating Ancient DNA from Present Day Contamination. C deaminations in ancient DNA sequences can be identified by mismatches to the reference sequence of two specific types (18–20). We, therefore, restrict our model to two categories of sites in a pairwise alignment between a DNA fragment and a reference sequence: (i) where the reference has a C and the aligned fragment has either a C or a T and (ii) where the reference base is G and the aligned base is either a G or an A. Other positions in the alignment are not considered, except for providing the distance z to the 5′ (in the case of a C in the reference) and 3′ (in the case of a G in the reference) termini. For obtaining z, positions in the alignment where there is a gap in the sequenced DNA fragment are not considered. In SI Materials and Methods, we outline a modification of the model for libraries prepared using single-stranded methods (e.g., ref. 5). There are three nonmutually exclusive events that can cause an observation of a C→T or a G→A mismatch at a given position in a sequence read where the reference sequence displays a C or a G: (i) a true biological polymorphism (occurring at rate-π), (ii) a sequence error (rate-«), or (iii) in the case of degraded DNA, PMD (rate Dz), which we will assume varies according to a modified geometric distribution (plus a small constant) across the sequenced molecule with decreasing probability with distance z (measured in base pairs) from the relevant terminus (21) (Fig. S1). Alternatively, observing a C-C match could be caused by a sequence error having reverted a postmortem nucleotide misincorporation or a sequence error having reverted a biological polymorphism. The third combination of events that could be imagined—a nucleotide misincorporation having reverted a biological polymorphism—does not enter here, because nucleotide misincorporations only result in substitutions in one direction (i.e., C→T or G→A but not T→C or A→G). Considering all three mutually exclusive possibilities together, we have ANTHROPOLOGY typical of PMD (8) but also, matches to the reference. For example, observing several C-C matches in the terminus of a fragment provides support for the model without PMD, which could, for example, allow the exclusion of contaminant fragments that have a single C→T mismatch because of sequencing error or biological polymorphism. Furthermore, probabilistic weighting of each observed match or mismatch according to its position along the sequenced fragment allows balanced consideration of the evidence for PMD across the entire fragment. For example, observing a single C→T mismatch at the third 5′ position of the fragment may provide as much evidence for the sequence fragment being of ancient origin as observing several such mismatches at a more central position of the DNA fragment. Finally, explicit consideration of the probability of sequencing error at each informative position removes the need for strict base quality cutoffs. Accordingly, we found that simply conditioning on observing a C→T mismatch in the sequence (8) does not achieve as high of a degree of reduction in contamination rate as applying a PMDS threshold (SI Results and Fig. S9). We also note that we, for consistency, assumed a C deamination model corresponding to ∼30% C to T substitutions in the first 5′ positions throughout this study (Fig. S1), but in cases where endogenous DNA is more degraded, additional efficiency might be gained by modifying this model (optional in our software implementation PMDtools). One cause of potential concern is if either contaminant or endogenous DNA is closer to the reference sequence than the other, in which case the number of mismatches caused by biological polymorphisms is not the same. In this paper, we assume equal divergence from the reference for the two models, because the observed biological polymorphism is, in most cases, much smaller than the rate of PMD; however, the model easily accommodates different assumptions on biological polymorphisms. As the catalog of genetic variation from, for example, humans becomes more complete, additional improvements to our approach might include masking positions with known polymorphisms, although it would also increase the computational expense. Another complication from enriching for postmortem damage is the errors that it contributes to downstream analyses, but we found that this could be alleviated by error correction (SI Results and Fig. S10). We used the methodology described in this paper to retrieve a complete mtDNA sequence from a contaminated Siberian Neandertal sample. The Okladnikov 2 sequence is basal to all published complete Neandertal mtDNA sequences from Central and Western Europe, which may relate to it originating from the easternmost Neandertal sample available (30). In addition to the complete mitochondrion of the Mezmaiskaya 1 Neandertal from the Caucasus (29, 34), analyses of smaller fragments of the mt control region suggest that older Neandertals might contain mtDNA lineages basal to the Okladnikov 2 sequence (30, 32), but the discrepancy between the genealogy obtained using a smaller hypervariable section (Fig. S8) and the resolved phylogeny using the mtDNA genome sequence prompts additional large-scale studies of these remains. Although chronological age seems to be the most important factor (23), PMD patterns in contaminant and endogenous sequences will likely be influenced by a complex range of factors, such as temperature, depositional conditions, postexcavation handling, and source tissue (17). Furthermore, it is possible that PMD patterns are also present in contaminant DNA. However, a recent study (23) showed that samples 50–100 y old all showed limited evidence of PMD (less than 10% nucleotide misincorporations in the 5′ end). Thus, while the relative levels of PMD in contaminant vs. endogenous sequences in a sample that contains a mixture of both will usually not be known, as long as substantial patterns of PMD are found in a sequence library (e.g., >10%), it seems reasonable to assume that PMD will be less extensive in the contaminating sequences under most circumstances. We suggest that, after identifying a set of sequences with high PMD scores that are likely to be endogenous, high- to medium-coverage data should be authenticated by analyzing whether the retained sequence reads are from a single individual (4, 10, 24). For low-coverage data, it may but note that, for older specimens with pervasive damage, greater efficiency could be gained by increasing p. We set up a likelihood function of ðLðMPMD jSi Þ = X × PðSi = MatchjMPMD Þ + ð1 − XÞ × PðSi = MismatchjMPMD Þ, [4] where X is an indicator variable. Therefore, X = 1 if Si = Match, and X = 0 if Si = Mismatch; the probabilities are given by Eqs. 1 and 2. The likelihood function for model MNULL is set up in the same way, but in this case, Dz = 0 for all z. To investigate whether an alignment is more likely to have originated from a model with PMD (MPMD) or a model with no PMD (MNULL), we take the natural logarithm of the ratio of the likelihood for each model multiplied over all positions I in the sequence read S that satisfy our criteria. This log-likelihood ratio provides a PMDS, PMDS = log ! ∏Ii L MPMD jSi , I ∏i L MNULL jSi [5] which is informative on how likely a sequence is to have originated from a degraded source. If I = 0, our test statistic is undefined for that particular sequence read, and in practice, such (extremely rare) reads are discarded. Capture and Sequencing of Okladnikov 2 mtDNA. DNA was extracted (35) from about 200 mg bone from a humerus shaft found in Okladnikov Cave in the Altai Mountains in the 1980s. Before mtDNA enrichment, 25 μL DNA extract were turned into a sequencing library using a modified 454 library preparation protocol as described (29, 36), with the exception that Illumina p5 and p7 adapters were used (37). The p7 adapter was modified to avoid contamination from other libraries carrying a unique 7-bp barcode. The primer extension capture mtDNA enrichment protocol was performed as described previously (29) (SI Materials and Methods) using the exact same four 144plex and 143-plex mixes used to retrieve complete Neandertal, Denisovan, and Pleistocene modern human mtDNAs previously (10, 29, 38). The sequencing run was analyzed starting from raw images using the Illumina Genome Analyzer pipeline 1.3.2 and the base caller Ibis (39). Raw sequences 1. Stoneking M, Krause J (2011) Learning about human population history from ancient and modern genomes. Nat Rev Genet 12(9):603–614. 2. Green RE, et al. (2006) Analysis of one million base pairs of Neanderthal DNA. Nature 444(7117):330–336. 3. Noonan JP, et al. (2006) Sequencing and analysis of Neanderthal genomic DNA. Science 314(5802):1113–1118. 4. Rasmussen M, et al. (2010) Ancient human genome sequence of an extinct PalaeoEskimo. Nature 463(7282):757–762. 5. Meyer M, et al. (2012) A high-coverage genome sequence from an archaic Denisovan individual. Science 338(6104):222–226. 6. Fu Q, et al. (2013) DNA analysis of an early modern human from Tianyuan Cave, China. Proc Natl Acad Sci USA 110(6):2223–2227. 7. Keller A, et al. (2012) New insights into the Tyrolean Iceman’s origin and phenotype as inferred by whole-genome sequencing. Nat Commun 3:698. 8. Skoglund P, et al. (2012) Origins and genetic legacy of Neolithic farmers and huntergatherers in Europe. Science 336(6080):466–469. 9. Sánchez-Quinto F, et al. (2012) Genomic affinities of two 7,000-year-old Iberian hunter-gatherers. Curr Biol 22(16):1494–1499. 10. Krause J, et al. (2010) A complete mtDNA genome of an early modern human from Kostenki, Russia. Curr Biol 20(3):231–236. 11. Richards MB, Sykes BC, Hedges RE (1995) Authenticating DNA extracted from ancient skeletal remains. J Archaeol Sci 22(2):291–299. 12. Handt O, Krings M, Ward RH, Pääbo S (1996) The retrieval of ancient human DNA sequences. Am J Hum Genet 59(2):368–376. 13. Kolman CJ, Tuross N (2000) Ancient DNA analysis of human populations. Am J Phys Anthropol 111(1):5–23. 14. Sampietro ML, et al. (2006) Tracking down human contamination in ancient human teeth. Mol Biol Evol 23(9):1801–1807. 15. Wall JD, Kim SK (2007) Inconsistencies in Neanderthal genomic DNA sequences. PLoS Genet 3(10):1862–1866. 16. Green RE, et al. (2009) The Neandertal genome and ancient DNA authenticity. EMBO J 28(17):2494–2502. 17. Pääbo S (1989) Ancient DNA: Extraction, characterization, molecular cloning, and enzymatic amplification. Proc Natl Acad Sci USA 86(6):1939–1943. 18. Hofreiter M, Jaenicke V, Serre D, von Haeseler A, Pääbo S (2001) DNA sequences from multiple amplifications reveal artifacts induced by cytosine deamination in ancient DNA. Nucleic Acids Res 29(23):4793–4799. 19. Hansen AJ, Willerslev E, Wiuf C, Mourier T, Arctander P (2001) Statistical evidence for miscoding lesions in ancient DNA templates. Mol Biol Evol 18(2):262–265. 20. Brotherton P, et al. (2007) Novel high-resolution characterization of ancient DNA reveals C > U-type base modification events as the sole cause of post mortem miscoding lesions. Nucleic Acids Res 35(17):5717–5728. 21. Briggs AW, et al. (2007) Patterns of damage in genomic DNA sequences from a Neandertal. Proc Natl Acad Sci USA 104(37):14616–14621. 2234 | www.pnas.org/cgi/doi/10.1073/pnas.1318934111 obtained from Ibis for the two paired end reads of each sequencing cluster were merged (including adapter removal) as previously described (24, 40). Assembly and Analysis of the Okladnikov 2 mtDNA Genome Sequence. In total, 4,102,487 sequences were obtained after merging read pairs (unmerged reads were excluded). After collapsing PCR duplicates to consensus sequences (40), we obtained 6,906 sequences of 50,822 originally aligned sequences. We assembled the mtDNA sequence using an iterative mapping assembler (31), which takes position-specific PMD patterns into consideration in its scoring matrix. Convergence was observed in two rounds of assembly. For estimating contamination in the Okladnikov 2 sequence, we increased the number of informative sites by identifying positions (with minimum depth of 10) where the consensus base differed from 95% of 311 modern human mtDNAs and where the Okladnikov 2 consensus was not an A or T in a transition polymorphism. The seven fixed transversions between Neandertals and modern humans yielded a similar estimate of ∼10% contamination in Okladnikov 2. We inferred a gene tree relating 68 hominin mtDNA sequences (25, 29, 31, 38, 41) using mrBayes 3.2.0 (42). The Markov chain Monte Carlo was run for 5 million generations, with sampling every 1,000 generations and a burn-in of 1 million generations. A consensus tree was constructed implementing a 50% majority rule. ACKNOWLEDGMENTS. We thank Wieland Binzcik, Susanna Sawyer, Adrian Briggs, Philip Johnson, Allissa Mittnik, Ludovic Orlando, Amhed Velazquez, Per Sjödin, Lucie Gattepaille, Robert Ekblom, Ayça Omrak, Anders Götherström, and Love Dalén for discussions and technical assistance. PMDtools, a python program implementing PMDS computations, is available at www.ebc.uu.se/ Research/IEG/evbiol/research/Jakobsson/software/pmdtools/. Computations were performed on resources provided by the Swedish National Infrastructure for Computing and Uppsala Multidisciplinary Center for Advanced Computational Science under Project b2010050. The laboratory work was financed by the Max Planck Society. Financial support was also provided by the Royal Swedish Academy of Sciences (P.S.) and the European Research Council (M.J.). 22. García-Garcerà M, et al. (2011) Fragmentation of contaminant and endogenous DNA in ancient samples determined by shotgun sequencing; prospects for human palaeogenomics. PLoS One 6(8):e24161. 23. Sawyer S, Krause J, Guschanski K, Savolainen V, Pääbo S (2012) Temporal patterns of nucleotide misincorporations and DNA fragmentation in ancient DNA. PLoS One 7(3):e34131. 24. Green RE, et al. (2010) A draft sequence of the Neandertal genome. Science 328(5979):710–722. 25. Reich D, et al. (2010) Genetic history of an archaic hominin group from Denisova Cave in Siberia. Nature 468(7327):1053–1060. 26. Helgason A, et al. (2007) A statistical approach to identify ancient template DNA. J Mol Evol 65(1):92–102. 27. Rasmussen M, et al. (2011) An Aboriginal Australian genome reveals separate human dispersals into Asia. Science 334(6052):94–98. 28. Gilbert MTP, et al. (2007) Whole-genome shotgun sequencing of mitochondria from ancient hair shafts. Science 317(5846):1927–1930. 29. Briggs AW, et al. (2009) Targeted retrieval and analysis of five Neandertal mtDNA genomes. Science 325(5938):318–321. 30. Krause J, et al. (2007) Neanderthals in central Asia and Siberia. Nature 449(7164): 902–904. 31. Green RE, et al. (2008) A complete Neandertal mitochondrial genome sequence determined by high-throughput sequencing. Cell 134(3):416–426. 32. Dalén L, et al. (2012) Partial genetic turnover in neandertals: Continuity in the East and population replacement in the West. Mol Biol Evol 29(8):1893–1897. 33. Malmström H, Storå J, Dalén L, Holmlund G, Götherström A (2005) Extensive human DNA contamination in extracts from ancient dog bones and teeth. Mol Biol Evol 22(10):2040–2047. 34. Ovchinnikov IV, et al. (2000) Molecular analysis of Neanderthal DNA from the northern Caucasus. Nature 404(6777):490–493. 35. Rohland N, Hofreiter M (2007) Comparison and optimization of ancient DNA extraction. Biotechniques 42(3):343–352. 36. Margulies M, et al. (2005) Genome sequencing in microfabricated high-density picolitre reactors. Nature 437(7057):376–380. 37. Meyer M, Kircher M (2010) Illumina sequencing library preparation for highly multiplexed target capture and sequencing. Cold Spring Harb Protoc 2010(6):t5448. 38. Krause J, et al. (2010) The complete mitochondrial DNA genome of an unknown hominin from southern Siberia. Nature 464(7290):894–897. 39. Kircher M, Stenzel U, Kelso J (2009) Improved base calling for the Illumina Genome Analyzer using machine learning strategies. Genome Biol 10(8):R83. 40. Kircher M (2012) Analysis of High-Throughput Ancient DNA Sequencing Data. Ancient DNA (Springer, Berlin), pp 197–228. 41. Ingman M, Kaessmann H, Pääbo S, Gyllensten U (2000) Mitochondrial genome variation and the origin of modern humans. Nature 408(6813):708–713. 42. Ronquist F, et al. (2012) MrBayes 3.2: Efficient Bayesian phylogenetic inference and model choice across a large model space. Syst Biol 61(3):539–542. Skoglund et al. Supporting Information Skoglund et al. 10.1073/pnas.1318934111 SI Materials and Methods Postmortem Degradation Score Distributions in Published Datasets. We obtained sequence data from three previous studies (1–3). We remapped the data which originated from HiFi polymerase amplification by Skoglund et al. (3) using Burrows-Wheeler Aligner (BWA) 0.5.9 (4) with the seed region disabled (parameters: −1 16500, −n 0.01, −o 2) and formed consensus sequences of reads with identical outer mapping coordinates (5). We used the calmd program in the samtools (6) suite to recompute the MD field (containing alignment information, such as mismatches) for all datasets. To obtain postmortem degradation score (PMDS) distributions, we excluded positions with base quality < 20 and clipped alignments and alignments with gaps (these filters were used regardless of whether PMDS thresholds were applied). For Neandertal data mapped with ANFO (1), we required a mapping quality of 90, whereas we required a mapping quality of 30 for all other datasets. We mapped two previously obtained contaminated Neandertal sets (7) to the rCRS using BWA with the same parameters as above. Because the data were obtained using a 454 technology that did not produce base quality scores, we assumed a base quality score of 20 at all positions. To estimate contamination, we used 7 fixed transversions as well as 70 fixed transitions between Neandertal and modern human mitochondria [excluding positions where Neandertals had Thymine (T) and modern humans had Cytosine (C) and where Neandertals had Adenine (A) and modern humans had Guanine (G)]. This procedure was to increase the number of informative positions for these low-coverage datasets; for highercoverage datasets, we restricted the analysis to only seven transversions (see below). In Silico Contamination Experiments. In the mtDNA contamination experiment, we did not exclude any reads because of mapping quality but required a base quality of at least 20 for PMDS computation and a depth at each site of at least 20 sequence reads. We estimated modern contamination in the Vindija 33.16 mtDNA by identifying seven transversion polymorphisms that are fixed between 6 complete Neandertal mtDNAs (8, 9) and the mtDNAs of 311 modern humans (10, 11). For the autosomal analysis, we pooled sequence data from three Neolithic hunter-gatherers, randomly choosing a single sequence read at positions where more than one aligned sequence was present, and we successively added more contaminating sequences from a present day French individual (20%, 50%, and 90%). We extracted SNPs using sequences with mapping quality of at least 30 and bases with base quality of at least 30 both with and without filtering for PMDS ≥ 3 (positions with base quality < 30 were excluded from PMDS computation). The number of SNPs ranged from 46,492 (no contamination or PMDS threshold) to 464,920 (90% contamination and no PMDS threshold). We used all SNPs to perform principal component analysis using EIGENSOFT 4.0 (12) with a reference dataset of 504 individuals from Europe and the Levant (13–15), which was haploidized (a single allele was randomly chosen from each diploid individual) like in ref. 16 before analysis. The French individual who was the source of the sequence data used for artificial contamination was excluded from the principal component analysis. One SNP from each pair with r2 value ≥ 0.2 was excluded from the principal component analysis. We performed Procrustes transformation to the principal component 1–principal component 2 configuration obtained using only the reference dataset like in ref. 3. Skoglund et al. www.pnas.org/cgi/content/short/1318934111 Capture and Sequencing of Okladnikov 2 mtDNA. DNA was extracted from about 200 mg bone from a humerus shaft found in Okladnikov Cave in the Altai Mountains in the 1980s as described (17), with one modification: the DNA elution from the silica particles was done using TE (Tris-EDTA) buffer with 0.05% Tween 20. Siliconized tubes were used for long-term storage of the DNA extract. For the sampling of the bone, dentist drill bits were used together with an NSK E-Max Micromotor System. Before mtDNA enrichment, 25 μL DNA extract were turned into a sequencing library using a modified 454 library preparation protocol as described (9, 18), with the exception that Illumina p5 and p7 adapters were used as described elsewhere (19). The p7 adapter was modified to avoid contamination from other libraries carrying a unique 7-bp barcode. The primer extension capture (PEC) mtDNA enrichment protocol was performed as described previously (9) using the exact same four 144-plex and 143-plex mixes used to retrieve complete Neandertal, Denisovan, and Pleistocene modern human mtDNAs previously (7, 9, 20). The PEC primers are 5′-biotinylated and consist of a universal 12-base 5′ spacer sequence (CAAGGACATCCG) followed by a specific primer sequence. To design the specific primer sequences, Primer3 (21) was used to find all possible primer sites on the light strand of the Vindija 33.16 Neandertal mtDNA sequence (8). After the final spin column purification into 50 μL TE buffer, the PEC products were sequenced on the Illumina GAII platform. For this purpose, a PCR primer pair was constructed that is complementary to the p5 and p7 adapters on the 3′ ends. The Illumina/ Solexa library was then amplified in a 100-μL reaction containing 50 μL Phusion High-Fidelity Master Mix, 500 nM each Solexa primer, and 10 μL PEC product template. Annealing temperature was 60 °C, and a total of 10 cycles of PCR was performed. The amplified products were spin column-purified and quantified on an Agilent 2100 Bioanalyzer DNA 1000 Chip. The PEC product was diluted and sequenced according to Illumina GAII protocols on a paired end run with a total of 76 cycles. The sequencing run was analyzed starting from raw images using the Illumina Genome Analyzer pipeline 1.3.2. The first five sequencing cycles were used for cluster identification. After standard base calling, reads of the PhiX 174 control lane were aligned to the corresponding reference sequence to obtain a training dataset for the base caller Ibis (22). Raw sequences obtained from Ibis for the two paired end reads of each sequencing cluster were merged (including adapter removal), requiring at least 11-nt overlap between the two reads. For bases in the overlap, quality scores were summed up. In cases where different bases were called, the base with the higher-quality score was chosen. SI Results Comparison of PMDS Analysis with Simple Filtering Based on Mismatches. We compared the fraction of retained authentic sequences as a function of their enrichment relative to contamination for the PMDS approach with a simpler previously suggested approach, where a sequence is retained if it displays a C→T mismatch within, for example, 1–15 bp of the 5′ terminus of the sequence (3). We found that, compared with the basic method, the PMDS approach offers better tradeoff between the enrichment and the fractionretained endogenous sequences as well as being able to enrich for authentic DNA far beyond what is possible with the simple method of observing a mismatch (Fig. S3). Whereas the basic method is able to achieve its maximum 20-fold enrichment by restricting to sequences with a mismatch in the first position 1 of 10 (assuming contamination similar to the 100-y-old Australian sample and 40,000-y-old Neandertal endogenous sequences), the PMDS method is able to achieve, for instance, a 5,000-fold enrichment while retaining ∼10% of the authentic sequence reads in the same data (Fig. S3). Model for Single-Stranded Libraries. In single-stranded library protocols, the only empirically observed mismatches caused by C deamination events are C → T mismatches at both ends of sequences (23). However, these mismatches could be affected by singlestranded overhangs in the original template molecule from either end. Unlike double-stranded library amplifications, it is not possible to differentiate between deamination caused by 5′ and 3′ singlestranded overhangs in data from single-stranded library preparation. We, thus, consider two possible processes giving rise to postmortem nucleotide polymorphisms Dz and Dy, where z and y denote distance to the 5′ and 3′ termini, respectively. Here, the probability of match is then the sum of the following exclusive events: no damage (from either end), no error, and no polymorphism; a misincorporation has been reverted because of sequence error (two terms: one for Dz and one for Dy); and finally, a polymorphism is reverted because of error conditional on no misincorporation: PðMatchjz; yÞ = ð1 − πÞ × ð1 − «Þ × ð1 − Dz Þ × 1 − Dy + ð1 − πÞ × « × Dz × 1 − Dy + ð1 − πÞ × « × Dy × ð1 − Dz Þ + π × « × ð1 − Dz Þ × 1 − Dy : [S1] Adjusting Base Qualities for PMD. Selecting for nucleotide misincorporations amplifies a problem that already plagues ancient genomics: genotyping short reads with nucleotide misincorporations. PMD-aware consensus sequence calling has been developed for mtDNA sequencing studies (9) but has not been widely adopted for genome-wide sequence data (24), and no software for large-scale genomic data is currently available. One possibility is to adjust the base qualities for the probability of PMD (25), which allows their incorporation into all downstream analyses. We incorporated the PMD model used here into genotyping by probabilistic adjustment of the base quality scores (25) (SI Materials and Methods). Given the parameter Dz of the rate of nucleotide misincorporations, we can adjust the base quality scores according to the position-specific probability of PMD (25). Specifically, C/G reference sites, where a T/A is observed in the sequence read, could be erroneously inferred as a true polymorphism because of either a sequencing error or a nucleotide misincorporation. In the same notation as shown in Materials and Methods, an adjusted phredscaled base quality score Qadjusted is then (25) −Q=10 : [S2] Qadjusted = − 10 × log10 1 − ð1−Dz Þ × 1 − 10 We tested the ability of this approach to alleviate problems with nucleotide misincorporations using the Vindija 33.16 mitochondrial data (1). First, the sequence data were remapped to the revised Cambridge human reference sequence (rCRS) using BWA with the same parameters as above, because ANFO alignments contained a portion of soft clipped regions that would be ignored during base quality adjustment. Second, we called genotypes for the entire mitochondrion using samtools (v0.1.16) pileup (6) in haploid mode, requiring a phred-scaled consensus quality of at least 30. Third, we counted the fraction of genotypes where there was C/T or G/A ambiguity, stratifying by different thresholds on the PMDS statistic, and found that this approach substantially increases the number of sites that can be reliably called (Fig. S9). Skoglund et al. www.pnas.org/cgi/content/short/1318934111 Analysis of the Okladnikov 2 mtDNA Assembly. Using positions that separate the Okladnikov 2 majority allele from 311 modern human mtDNAs (Material and Methods), we found that, of 1,318 informative fragments, 149 fragments were putative contaminants (10.2%; 95% confidence interval = 8.7–11.7%), whereas for 3,908 sequences that had a positive PMDS, only 17 of 1,301 informative fragments were putative contaminants (1.3%; 95% confidence interval = 0.7–1.9%). Average consensus support in the final assembly was 97.0%, with 92% of all positions achieving support of 90% or higher. We observed only one difference between the 15-fold coverage consensus inferred using only sequences with PMDS ≥ 0 and the 25-fold unfiltered data: a T residue at the position corresponding to rCRS position 1,845 in the PMDS ≥ 0 assembly, where the unfiltered (contaminated) assembly displayed a C residue. We aligned the consensus sequences of Okladnikov 2 to complete mitochondrial sequences from 53 modern humans (10), 6 Neandertals (8, 9), 2 Denisovans (20, 26), 1 gorilla, 2 Bonobos, and 3 chimpanzees using MUSCLE (27) with default parameters. In this alignment, we found that all other 67 mitochondria also displayed the C residue, indicating that this position could be either private to Okladnikov 2 or caused by C deamination. However, when increasing the PMDS threshold to three, this position was not reliably called. Hence, we took the conservative approach of excluding this position from the final consensus. We also observed one position that differed between the Okladnikov 2 sequence obtained in this study and the sequence previously obtained by targeting the mtDNA control region detailed in the work by Krause et al. (28). At this position (corresponding to 16,148 in rCRS), the consensus called was a C, whereas the allele determined previously was a T. To investigate the effect on the consensus calls of the 10% contamination rate in the unfiltered data, we repeated the procedure with the full contaminated data and investigated at which positions we had either gained or lost a reliable call by restricting to sequences with PMDS ≥ 0. There were 88 of 16,566 positions that were not reliably called for the PMDS ≥ 0 data (0.53%) compared with 24 of 16,563 positions in the unfiltered data (0.14%). In total, seven called sites were gained by restricting to PMDS ≥ 0, all of which were polymorphic or fixed differences in a set of mtDNA sequences from 6 Neandertal and 53 modern humans (Table S1). In contrast, among the calls that were lost by PMDS ≥ 0, only 4 of 68 sites (5.9%) were polymorphic or fixed differences. These observations are consistent with the prediction that the removal of contamination by the PMDS approach enables base calling at evolutionary informative sites that are otherwise ambiguous because of both contaminant and endogenous alleles being present. In contrast, the reduction from 25to 15-fold coverage in the final PMDS ≥ 0 assembly resulted in random loss of sites without bias for those sites that are polymorphic or fixed differences in the set of 60 hominin mtDNAs. To investigate the effect of reference sequence choice for assembly, we also aligned Okladnikov 2 sequences to the rCRS and observed two differences. The first difference is a CA dinucleotide repeat starting at position 514 in rCRS that displays from three (in Gorilla and the six previously analyzed Neandertals) to seven repeats (in an Aboriginal Australian; AF346965) in our alignment of 67 hominin mitochondria. Here, the Vindija 33.16 assembly had three repeats, consistent with the state in the six previously analyzed Neandertals (but not Denisovans, who have four repeats), whereas the rCRS assembly had five repeats, a state found in several modern human mtDNAs. The second difference is at position 16,263 in rCRS: the Vindija 33.16-based assembly is undetermined at two positions, where the rCRS-based assembly displays CT. After these two position, the six Neandertals previously analyzed also display an A insertion not seen in other sequences. The positions are further complicated by the PCRamplified region discussed in the work by Krause et al. (28), which 2 of 10 shows that the sequence CTA differs from the rCRS-based assembly. We, therefore, chose the conservative approach of masking these two positions in the final assembly based on Vi33.16 and excluding the possible A insertion. Because the relationship between Okladnikov 2, Mezmaiskaya 1, and the five Western Neandertal mtDNAs differed from previous studies, we replicated the phylogenetic analysis using a 304bp alignment sequenced in several additional Neandertals (28–36) and added three modern human sequences: Denisova 1, chimpanzee, and Bonobo sequences. In this analysis of 304 bp (Fig. S6), the deeper divergences between Neandertal mtDNAs remain unresolved, similar to the result found in ref. 28. However, the Okladnikov 2 sequence obtained here is nearly identical with the fragment sequenced by a previous study (28) for 304 bp, showing that the uncalled positions in the new sequence are not responsible for the well-supported topology of the Mezmaiskaya 1 being basal to the Okladnikov 2 sequence and the other complete Neandertal mtDNA genomes. Validation of the Neandertal mtDNA Topology. To confirm that the topology of Neandertal mtDNA genomes, where Mezmaiskaya 1 is basal to Western Neandertals and Okladnikov 2, was not affected by the PMDS filtering of Okladnikov 2, we obtained shotgun sequences from Mezmaiskaya 1, Vindija 33.16, Vindija 33.25, and Vindija 33.36 from the study by Green et al. (1). We realigned mtDNA sequences from these four individuals using BWA with identical parameters as for Okladnikov 2 and downsampled these four datasets and the Okladnikov 2 data to exactly 2,500 randomly chosen sequences. We then excluded sequences with a PMDS below zero and called consensus sequences using the mpileup and vcf2fq tools in the samtools suite with default parameters. We reconstructed an mtDNA sequence phylogeny exactly as in the main analysis and found that the topology of Neandertal mtDNA sequences was exactly the same as in the main analysis, with the Mezmaiskaya 1 sequence basal to the Okladnikov 2 sequence and the Vindija Neandertals (Fig. S7). 1. Green RE, et al. (2010) A draft sequence of the Neandertal genome. Science 328(5979): 710–722. 2. Rasmussen M, et al. (2011) An Aboriginal Australian genome reveals separate human dispersals into Asia. Science 334(6052):94–98. 3. Skoglund P, et al. (2012) Origins and genetic legacy of Neolithic farmers and huntergatherers in Europe. Science 336(6080):466–469. 4. Li H, Durbin R (2009) Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25(14):1754–1760. 5. Kircher M (2012) Analysis of High-Throughput Ancient DNA Sequencing Data. Ancient DNA (Springer, Berlin), pp 197–228. 6. Li H, et al. (2009) The sequence alignment/map format and SAMtools. Bioinformatics 25(16):2078–2079. 7. Krause J, et al. (2010) A complete mtDNA genome of an early modern human from Kostenki, Russia. Curr Biol 20(3):231–236. 8. Green RE, et al. (2008) A complete Neandertal mitochondrial genome sequence determined by high-throughput sequencing. Cell 134(3):416–426. 9. Briggs AW, et al. (2009) Targeted retrieval and analysis of five Neandertal mtDNA genomes. Science 325(5938):318–321. 10. Ingman M, Kaessmann H, Pääbo S, Gyllensten U (2000) Mitochondrial genome variation and the origin of modern humans. Nature 408(6813):708–713. 11. Green RE, et al. (2006) Analysis of one million base pairs of Neanderthal DNA. Nature 444(7117):330–336. 12. Patterson N, Price AL, Reich D (2006) Population structure and eigenanalysis. PLoS Genet 2(12):e190. 13. Li JZ, et al. (2008) Worldwide human relationships inferred from genome-wide patterns of variation. Science 319(5866):1100–1104. 14. Surakka I, et al. (2010) Founder population-specific HapMap panel increases power in GWA studies through improved imputation accuracy and CNV tagging. Genome Res 20(10):1344–1351. 15. Altshuler DM, et al. (2010) Integrating common and rare genetic variation in diverse human populations. Nature 467(7311):52–58. 16. Skoglund P, Jakobsson M (2011) Archaic human ancestry in East Asia. Proc Natl Acad Sci USA 108(45):18301–18306. 17. Rohland N, Hofreiter M (2007) Comparison and optimization of ancient DNA extraction. Biotechniques 42(3):343–352. 18. Margulies M, et al. (2005) Genome sequencing in microfabricated high-density picolitre reactors. Nature 437(7057):376–380. 19. Meyer M, Kircher M (2010) Illumina sequencing library preparation for highly multiplexed target capture and sequencing. Cold Spring Harb Protoc 2010(6):t5448. 20. Krause J, et al. (2010) The complete mitochondrial DNA genome of an unknown hominin from southern Siberia. Nature 464(7290):894–897. 21. Rozen S, Skaletsky H (1999) Primer3 on the WWW for General Users and for Biologist Programmers. Bioinformatics Methods and Protocols (Springer, Berlin), pp 365–386. 22. Kircher M, Stenzel U, Kelso J (2009) Improved base calling for the Illumina Genome Analyzer using machine learning strategies. Genome Biol 10(8):R83. 23. Meyer M, et al. (2012) A high-coverage genome sequence from an archaic Denisovan individual. Science 338(6104):222–226. 24. Rasmussen M, et al. (2010) Ancient human genome sequence of an extinct PalaeoEskimo. Nature 463(7282):757–762. 25. Jónsson H, Ginolhac A, Schubert M, Johnson PL, Orlando L (2013) mapDamage2.0: Fast approximate Bayesian estimates of ancient DNA damage parameters. Bioinformatics 29(13):1682–1684. 26. Reich D, et al. (2010) Genetic history of an archaic hominin group from Denisova Cave in Siberia. Nature 468(7327):1053–1060. 27. Edgar RC (2004) MUSCLE: Multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 32(5):1792–1797. 28. Krause J, et al. (2007) Neanderthals in central Asia and Siberia. Nature 449(7164): 902–904. 29. Krings M, et al. (1997) Neandertal DNA sequences and the origin of modern humans. Cell 90(1):19–30. 30. Krings M, et al. (2000) A view of Neandertal genetic diversity. Nat Genet 26(2):144–146. 31. Ovchinnikov IV, et al. (2000) Molecular analysis of Neanderthal DNA from the northern Caucasus. Nature 404(6777):490–493. 32. Lalueza-Fox C, et al. (2006) Mitochondrial DNA of an Iberian Neandertal suggests a population affinity with other European Neandertals. Curr Biol 16(16):R629–R630. 33. Lalueza-Fox C, et al. (2005) Neandertal evolutionary genetics: Mitochondrial DNA data from the iberian peninsula. Mol Biol Evol 22(4):1077–1081. 34. Orlando L, et al. (2006) Revisiting Neandertal diversity with a 100,000 year old mtDNA sequence. Curr Biol 16(11):R400–R402. 35. Caramelli D, et al. (2006) A highly divergent mtDNA sequence in a Neandertal individual from Italy. Curr Biol 16(16):R630–R632. 36. Dalén L, et al. (2012) Partial genetic turnover in neandertals: Continuity in the East and population replacement in the West. Mol Biol Evol 29(8):1893–1897. Skoglund et al. www.pnas.org/cgi/content/short/1318934111 3 of 10 Fig. S1. Nucleotide misincorporation patterns arising from C deamination in ancient DNA sequences. (A and B) Empirical patterns in datasets analyzed in this study. (C and D) Rate of deamination assumed in the PMDS model in this study. Skoglund et al. www.pnas.org/cgi/content/short/1318934111 4 of 10 0.200 0.500 0.020 0.050 0.005 0.001 C to T fraction at first position Late Pleistocene Neandertals Neolithic Scandinavians 100 year old Aborigine HGDP cell lines 1 10 100 1000 10000 Age of sample Fig. S2. Relationship between age and observed deamination fraction (C to T mismatches) in the 5′ end of sequence data analyzed in this study. Blue, Neolithic Scandinavians; gray, HGDP cell lines from present day individuals; green, 100-y-old Australian Aborigine hair sample; red, Pleistocene Neandertals. Note the log scale for both the x and y axes. Fig. S3. (A) The amount of retained data as a function of different thresholds on PMDS. (B) The enrichment of ancient over more recent sequence reads as a function of chosen PMDS threshold. In our example data, a PMDS threshold of ∼2 would achieve a 100-fold enrichment of Neandertal and Neolithic Scandinavian sequences over present day (contaminating) sequences, and a PMDS threshold of ∼6 would achieve the same 100-fold enrichment when contaminating the Neandertal or Neolithic Scandinavian data with similar sequences from the 100-y-old remains of the Australian individual. Skoglund et al. www.pnas.org/cgi/content/short/1318934111 5 of 10 0 5 10 15 20 25 Distance from 5' end of sequence read 30 0.5 0.1 0.2 0.3 0.4 A others 0.0 Mismatch frequency in read given G in ref. 0.4 0.1 0.2 0.3 T others 0.0 Mismatch frequency in read given C in ref. Fig. S4. Inferred gene genealogy of contaminated Mezmaiskaya 2 mtDNA sequence data (A) before and (B) after PMDS filtering. A consensus mtDNA sequence for Mezmaiskaya 2 was called using the haploid model of samtools pileup either using all sequences or restricting to sequences with a PMDS ≥ 4. The consensus sequences were subsequently aligned to mtDNA genome sequences from seven modern humans, six Neandertals, two Denisovans, one Bonobo, one chimpanzee, and one gorilla. A neighbor-joining tree was reconstructed for both datasets. −30 −25 −20 −15 −10 −5 0 Distance from 3' end of sequence read Fig. S5. Nucleotide misincorporation patterns in Okladnikov 2 mtDNA sequences. We show the fraction of given nucleotides in the Okladnikov 2 sequence data at positions where the rCRS reference sequence had (Left) a C residue or (Right) an A residue. Skoglund et al. www.pnas.org/cgi/content/short/1318934111 6 of 10 1000 800 600 400 0 200 Number of positions 0 10 20 30 40 50 Depth Fig. S6. Sequencing depth distribution for the Okladnikov 2 mitochondrial genome. Only sequences with PMDS ≥ 0 were used. Skoglund et al. www.pnas.org/cgi/content/short/1318934111 7 of 10 Fig. S7. The position of Okladnikov 2 in the Neandertal mtDNA tree topology is not biased by the PMDS filtering. Mitochondrial sequences from Okladnikov 2, Mezmaiskaya 1, Vindija 33.16, Vindija 33.25, and Vindija 33.26 were down-sampled to 2,500 randomly sampled sequences, of which sequences with negative PMDSs were excluded. A Bayesian tree was reconstructed, and the Neandertal mtDNA sequence tree topology was identical to the one in the main analysis, where only Okladnikov 2 was subjected to PMDS filtering. Skoglund et al. www.pnas.org/cgi/content/short/1318934111 8 of 10 Fig. S8. Phylogenetic analysis of the hypervariable region. The Okladnikov 2 sequence obtained here falls unambiguously with the previously sequenced short fragment. However, the relationship with Mezmaiskaya 1 is not resolved based on this restricted region. A is restricted to sequences covering 304 bp, whereas B also includes the partial Scladina and Teshik Tash sequences (for which the flanking missing sequence was marked as missing data). Skoglund et al. www.pnas.org/cgi/content/short/1318934111 9 of 10 5 Fig. S9. Comparison of PMDS filtering with a basic method of extracting reads with a C→T/G→A mismatch in the terminus. The PMDS results are based on PMDS thresholds of 0 (least enrichment) to 10 (greater enrichment). The basic method results are based on extracting reads with a C→T mismatch at most n bp from the 5′ terminus, where n ranged from 15 (less enrichment) to 1 (greatest enrichment). Higher values of n result in less-efficient enrichment. For both methods, we excluded bases with quality < 20. Fig. S10. Genotyping ancient DNA using base qualities adjusted for PMD. Fraction C/T and G/A genotypes obtained using standard samtools pileup haploid genotyping (gray) and base qualities adjusted using a model of nucleotide misincorporation (black) are displayed. Table S1. Characterization of base calls gained and lost in the PMDS ≥ 0 assembly of Okladnikov 2 State in 6 Neandertal and 53 modern human mtDNA sequences Calls gained in the PMDS ≥ 0 assembly Calls lost in the PMDS ≥ 0 assembly Not polymorphic Fixed differences Polymorphic in Neandertals only Polymorphic in modern humans only Shared polymorphisms Total polymorphisms 0 2 1 3 1 7/7 (100%) 64 2 0 1 1 4/68 (5.9%) We present the number of positions for each category that is polymorphic in a set of 6 previously published Neandertal mtDNAs and 53 modern human mtDNAs. Skoglund et al. www.pnas.org/cgi/content/short/1318934111 10 of 10

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download Separating endogenous ancient DNA from modern day