
Genetic variation in the Sorbs of eastern Germany in the context of broader European genetic diversity: Supplementary Text

A Brief History of the Sorbs

From the eighth century, western written sources record numerous Slavic groups living between the Elbe and Saale rivers in the west, the Odra in the east, the Baltic Sea in the north and the ridge of the Ore Mountains in the south. It is impossible to know whether these names, employed in Latin texts written outside of this region, designate individual, self-consciously ethnic or political communities, whether they are collective designations of historians and geographers, or some mix of the two. There is also no evidence of any self-conscious group identity uniting these Slavic groups, whose social and political organization was decentralized, consisting primarily of small communities of farmers only occasionally brought under some sort of central political control. Today scholars refer to these Slavic groups collectively as “Polabians” (in German: Elbslaven), although it must be emphasized that this is a modern collective designation that has no historical resonance.

The Sorbs are but one of the many groups that made up the Polabians. The first mention of the Sorbs (Surbi, Sorbi), which occurs in the Fourth Book of the Chronicle of Fredegar, written in the first half of the eighth century, states that some time around 630 one Dervan, “duke of the people of the Sorbs, who were of the Slavic people (dux gente Surbiorum, que ex genere Sclauinorum erant) long subject to the kingdom of the Franks, placed himself and his people under the rule of Samo,” a leader of another Slavic group also known as the Winidi (Wends). The term Wend was a Germanic term that designated Slavic-speaking neighbors and was used widely along the linguistic borderlands between Germanic and Slavic regions.
A ninth-century list of peoples compiled by the anonymous “Bavarian geographer” included numerous Slavic peoples north of the Danube, among them some large tribes or tribal unions including the Abodrites (Abodriti, Obotriti), the Veletians (Weleti, Wilzi) and the Sorbs (Surbi, Sorbi). The relationship between the groups identified by these sources as Sorbi and the Sorbs of today is unclear. The territories of these early Sorbs lay between the Saale and Elbe rivers, while the Slavic groups that lived in what is today Sorbian territory were called Lusici and Milzeni.

From the ninth century forward, German eastward expansion resulted in the linguistic and ethnic displacement or absorption of virtually all of the Polabians, with the exception of the modern Sorbs. This process was at times peaceful and collaborative, at other times aggressive, with an emphasis on conversion to Christianity and incorporation into the expanding east Frankish kingdom. It culminated in a general uprising of the Polabians, calling themselves by the collective name Luticians (probably meaning “the ferocious”), against the Saxon kings in 983, an uprising that took the better part of a century to defeat. However, even during this period the lines were not so clearly drawn between German and Slavic groups. The Emperor Otto II had Slavic soldiers in his bodyguard, and in the first half of the eleventh century Sorbian warriors served in the followings of the Margraves of Meissen.

The pre-eighth-century history of the Sorbs, or indeed of all of the Polabians, is impossible to establish.
Earlier generations of linguists and archaeologists, convinced that they could recognize linguistic and ethnic communities by distinctive archaeological materials, attempted to argue for the “original homeland” of the Slavs, placing it variously between Podolia and Volhynia (west and south central Ukraine), along the Pripet river in Polesie (northern Ukraine, southern Belarus, and parts of Russia and southern Poland), in the area of Prague, or possibly around Chernyakhov in Ukraine. Wherever this original homeland was located, it was assumed that Slavic speakers migrated throughout Eastern Europe sometime in the first centuries of the Common Era, although they only appear in written sources in the sixth century. German scholars, who accepted as literally true accounts of massive Germanic migrations, assumed that Slavic groups like the Sorbs spread into areas abandoned by Germanic peoples who had migrated into the Roman Empire.

More recently some scholars have argued that the Slavs emerged only in the sixth century along the Danubian frontier. While this theory is disputed, it is generally accepted that Slavs made up a large portion of the heterogeneous Avar empire that developed along the Danubian basin, between the Byzantines to the east and the Bavarians to the west, between the sixth and ninth centuries. The appearance of small Slavic groups such as the Sorbi mentioned in Fredegar is presumed to have some connection to the waxing and waning of the Avar confederation, with groups bearing separate names emerging in times of weakness only to be reconsolidated in times of strength, before being released entirely with the destruction of the Avar empire by the Franks around 800.

Bibliography:

Christian Lübke, Fremde im östlichen Europa. Von Gesellschaften ohne Staat zu verstaatlichten Gesellschaften (9.-11. Jahrhundert) (Ostmitteleuropa in Vergangenheit und Gegenwart, Bd. 23) (Böhlau Verlag Köln Weimar Wien, 2001).
Christian Lübke, “Christianity and Paganism as Elements of Gentile Identities to the East of the Elbe and Saale Rivers,” in Ildar Garipzanov, Patrick Geary, and Przemysław Urbańczyk, eds., Franks, Northmen, and Slavs: Identities and State Formation in Early Medieval Europe (Turnhout: Brepols, 2008), 189-203.

Florin Curta, The Making of the Slavs: History and Archaeology of the Lower Danube Region, c. 500-700 (Cambridge: Cambridge University Press, 2001).

Supplementary Methods

Generation of a homogenous, unrelated set of Sorbs

In order to generate a set of Sorbian individuals who were relatively homogenous but seemingly unrelated, we used the following protocol. Sorbs from the POPRES/LPZ merge (see below) were extracted, and the proportion of loci that are identical by descent (IBD) was calculated for each pair of individuals using plink (the PI_HAT statistic of the --genome option) for all 289 Sorbs. For any pairwise comparison with an IBD value greater than 0.05, we removed one sample of that pair from the dataset. The sample to be dropped was chosen randomly, unless the samples had additional pairwise comparisons with other individuals that resulted in IBD values greater than 0.05, in which case the sample with more pairwise IBD values > 0.05 was dropped. This procedure resulted in a homogenous dataset of 178 putatively unrelated samples. A PCA plot of these samples is shown in Supp Fig 8. The PCA outlier removal routine of smartpca [1] found no samples greater than 6 standard deviations from the mean on any of the first 20 eigenvectors.

Dataset Merging & Quality Control

Depending on the analysis, different merged datasets were used in our study. The POPRES and LPZ datasets were merged, with all non-autosomal loci removed from these merges (total number of SNPs after merging = 341,163). All merged datasets were filtered to remove genomic regions identified in Novembre et al. [2] as evidencing aberrant patterns of LD, such that they might have undue influence on subsequent analyses of genome-wide patterns of population structure (total number of SNPs after filtering = 299,875). Further filtering was conducted to remove SNPs in high linkage disequilibrium using a simple pairwise threshold, performed in plink with the following parameters: window size = 50, number of SNPs to shift the window at each step = 5, genotypic r2 threshold = 0.8 (total number of SNPs after filtering = 175,061).

When merging datasets, a major concern is the impact of batch effects associated with sample preparation [1]. For example, when merging the LPZ and POPRES datasets (known henceforth as the POPRES/LPZ merge), we were concerned with batch effects between the POPRES and LPZ samples and amongst LPZ batches. As a means to assess and control for batch effects, the LPZ sample was deliberately designed to include a sample of 10 German individuals. As both the POPRES and LPZ samples contain German individuals, one would expect that, in the absence of batch effects, the two samples of German individuals would appear highly similar (which we assessed using PCA). When we filtered the data to include only high-confidence SNPs (i.e. those for which there were only 5 missing genotypes across the entire merged dataset), the Germans from both samples overlapped each other in PC1 vs. PC2 plots (see Supp Fig 1A and 1B; total number of SNPs after filtering = 30,587). This process also removed internal LPZ batch effects (see Supp Fig 1C and 1D). After applying these filters, 30,587 high-confidence, approximately independent SNPs remained. This methodology was likely over-conservative, resulting in a loss of 144,474 SNPs (175,061 before the missingness filter minus 30,587 after). Examination of the MAF spectrum before and after filtering high-missingness SNPs (Supp Fig 9) showed a bias towards the loss of mostly very low frequency (<0.01) and monomorphic variants.
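The missingness filter described above can be illustrated with a minimal Python sketch. It is not the authors' script (the merges were carried out with plink); the function name, the toy SNP identifiers and the use of `None` for a missing call are assumptions made for illustration, and "only 5 missing genotypes" is read here as "at most 5".

```python
from typing import Dict, List, Optional

# Genotypes are coded 0/1/2 (copies of one allele); None marks a missing call.
Genotypes = Dict[str, List[Optional[int]]]


def filter_by_missingness(genotypes: Genotypes, max_missing: int = 5) -> Genotypes:
    """Keep SNPs with at most `max_missing` missing calls across all samples.

    Assumption: the threshold is inclusive (a SNP with exactly 5 missing
    calls is kept); the text does not say which way the boundary falls.
    """
    kept: Genotypes = {}
    for snp_id, calls in genotypes.items():
        n_missing = sum(1 for call in calls if call is None)
        if n_missing <= max_missing:
            kept[snp_id] = calls
    return kept


# Toy data: three hypothetical SNPs typed in eight samples.
toy = {
    "snp_a": [0, 1, 2, 1, 0, 0, 1, 2],           # no missing calls
    "snp_b": [0, None, 2, None, 0, None, 1, 2],  # three missing calls
    "snp_c": [None] * 7 + [1],                   # seven missing calls
}
high_confidence = filter_by_missingness(toy, max_missing=5)
print(sorted(high_confidence))
```

Because the filter counts missing calls across the entire merged dataset, a SNP that is poorly called in only one batch is removed for all samples, which is why it also reduces batch effects.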
The genotyping algorithms would likely have had greater difficulty calling such very low-frequency and monomorphic loci in Sorbs because of a lack of clear clustering into one of the three possible genotypes. Any ascertainment bias caused by the missingness filter would have been against SNPs that are generally monomorphic in European populations and thus are likely to be non-informative with regard to the genetic architecture of European populations.

A merge of the POPRES and HGDP datasets (denoted the POPRES/HGDP merge) was performed in the same manner described above, except that there was no filtering for batch effects using missingness levels, as these two datasets have previously been well validated. The merge resulted in 72,765 approximately independent SNPs. To match the SNP density of the POPRES/LPZ dataset, a random subset of 30,587 SNPs was used for subsequent analysis. Finally, a POPRES/LPZ/HGDP merge was performed as described above, including filtering for missingness, resulting in a dataset with 8,745 high-confidence, approximately independent SNPs. All merging was conducted using scripts that execute a series of plink [3, 4] functions.

Principal Components Analysis (PCA)

The purpose of the PCA of the merged POPRES/LPZ dataset was to examine where the Sorbs lie in relation to other European samples. One concern with PCA is that if a group shows excess relatedness (e.g. at the extreme, a family is included in a PCA), the PCA will separate out the group with excess relatedness from the rest of the sample, even if the group is from the same population as the rest of the sample (results not shown). To ensure that the PCA was not distorted by excess relatedness amongst the population isolates, we implemented a ‘drop one in’ procedure for incorporating them into the analysis. Specifically, PCA was performed for each individual from a population isolate along with all other European samples.
Each population isolate sample’s resultant PC coordinates for the first two components were then plotted together with the average PC coordinates for each European sample across all runs. This procedure is more effective than simply projecting data points onto the PCs, which tends to artificially push coordinates towards the mean value of the PCs (results not shown). To investigate the sampling variance in the observed median PC1-PC2 positions for each country (based on the ‘one-at-a-time’ PC coordinates), we conducted bootstrapping analyses by re-sampling individuals within each country with replacement over 10,000 iterations (as described by Novembre et al. [2], Supplementary Material). All PCA was performed using smartpca [1] under default parameters with 0 outlier detection iterations unless otherwise stated.

Ancestry Estimation

Maximum likelihood estimation of individual ancestries was performed using the admixture software [5] with default values (the block relaxation algorithm, a termination criterion set to stop when the log-likelihood increases by less than ε = 10^-4 between iterations, and the quasi-Newton convergence acceleration method with q = 3 secant conditions). We assumed an unsupervised admixture model (i.e. no fixed ancestry or parental groups are assumed). To ensure reliable convergence to the global maximum, we ran admixture 100 times at each value of K (K = 1-8) using different random seeds and examined the final log-likelihood scores (LLs) for consistent convergence across runs. For all K values, admixture converged to either exactly the same or very similar likelihood values over each set of 100 runs. Admixture coefficient plots for K = 2-6 are based on one random run from this set of 100 (within populations, individuals in the plots are sorted by the 3 most dominant admixture components for a given K, if applicable, to aid clarity of viewing).
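As a rough illustration of the convergence check across repeated runs, the sketch below summarizes the final log-likelihoods collected at each K. It assumes the values have already been parsed from the program output; the function name, the `tolerance` parameter and the example numbers are hypothetical and are not taken from the study.

```python
from typing import Dict, List


def summarize_convergence(log_likelihoods: List[float], tolerance: float = 1.0) -> Dict[str, float]:
    """Summarize how tightly repeated runs at one K agree on the final log-likelihood.

    `tolerance` (in log-likelihood units) is an illustrative threshold for
    counting a run as having reached the best mode; it is an assumption,
    not a value from the text.
    """
    best = max(log_likelihoods)
    spread = best - min(log_likelihoods)
    near_best = sum(1 for ll in log_likelihoods if best - ll <= tolerance)
    return {
        "best": best,
        "spread": spread,
        "fraction_near_best": near_best / len(log_likelihoods),
    }


# Made-up final log-likelihoods for illustration only (three runs per K).
runs_by_k = {
    2: [-1000.0, -1000.2, -1000.1],  # all runs agree closely
    3: [-950.0, -950.0, -990.0],     # one run stuck in a poorer local optimum
}
summaries = {k: summarize_convergence(lls) for k, lls in runs_by_k.items()}
for k, summary in sorted(summaries.items()):
    print(k, summary)
```

A small spread at every K corresponds to the "exactly the same or very similar likelihood values" reported above; a large spread would suggest that some seeds ended in a different local optimum.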
We assessed the cross-validation error for each value of K using admixture’s cross-validation procedure (hold-out fraction 0.1) and visualized these values across all runs in a boxplot (Supp Fig 3).

Measure of Inbreeding

Inbreeding Coefficient, F: The calculation of inbreeding coefficients (F, based on the observed versus expected number of homozygous genotypes for an individual) was performed using plink routines.

Runs of Homozygosity: Analysis of Runs of Homozygosity (ROH) was performed using plink routines. The exact parameters used for estimating ROH were determined by the SNP density of the POPRES/LPZ dataset, with parameters chosen in order to exclude short and common ROHs that are not informative about the level of IBD, as described by McQuillan et al. [6]. The plink parameters used were: homozyg-window-kb = 5000, homozyg-window-snp = 20, homozyg-window-het = 1, homozyg-window-missing = 2, homozyg-window-threshold = 0.05, homozyg-snp = 7, homozyg-kb = 1000, homozyg-density = 150, homozyg-gap = 500. The POPRES/LPZ and POPRES/HGDP merges were analyzed separately to maximize the SNP density with which to identify homozygous regions.

Linkage Disequilibrium Decay (LD) Analysis

LD was quantified using both the genotypic r2 statistic (using a modified version of the --r2 function in plink) and the haplotypic r2 statistic (based on phased haplotypes and computed with custom scripts) for both the POPRES/LPZ and POPRES/HGDP datasets (prior to any of the LD filtering described above). For each population with sample size n ≥ 10, we computed r2 for all SNP pairs with a physical distance of ≤ 70.5 kb, placed them into bins of size 500 bp ranging from 0-500 bp up to 70,000-70,500 bp, and calculated the mean within each bin. As described by Jakobsson et al. [7], in order to control for changes in r2 due to sample size effects, for each SNP pair we randomly re-sampled 10 individuals when computing r2.
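The distance-binning step can be sketched in Python as follows. This is an illustration, not the authors' custom scripts: it assumes the r2 value for each SNP pair has already been computed (the re-sampling of 10 individuals is omitted), that bins are closed on the right, and that each bin is reported at its central distance; the function and constant names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

BIN_SIZE_BP = 500        # bin width in base pairs
MAX_DISTANCE_BP = 70_500  # SNP pairs further apart than this are ignored


def mean_r2_by_bin(pairs: Iterable[Tuple[int, float]]) -> Dict[int, float]:
    """Average r2 within 500 bp distance bins.

    `pairs` holds (physical distance in bp, r2) for each SNP pair.
    Returns {central distance of bin in bp: mean r2}.
    Assumption: bins are (0, 500], (500, 1000], ..., (70000, 70500].
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for distance_bp, r2 in pairs:
        if not 0 < distance_bp <= MAX_DISTANCE_BP:
            continue
        bin_index = (distance_bp - 1) // BIN_SIZE_BP
        sums[bin_index] += r2
        counts[bin_index] += 1
    return {
        index * BIN_SIZE_BP + BIN_SIZE_BP // 2: sums[index] / counts[index]
        for index in sorted(sums)
    }


# Toy SNP pairs (made-up values); the last pair exceeds 70.5 kb and is dropped.
toy_pairs = [(100, 0.8), (400, 0.6), (500, 0.7), (501, 0.5), (70_500, 0.1), (80_000, 0.9)]
binned = mean_r2_by_bin(toy_pairs)
for centre_bp, mean_r2 in binned.items():
    print(centre_bp, round(mean_r2, 3))
```

Keying the result by the central distance of each bin matches how the mean r2 values are plotted against distance.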
We smoothed the resulting curves by applying a moving average (20-point for POPRES/LPZ and 30-point for POPRES/HGDP) to the mean r2 values and plotted these values against the central distance value (in base pairs) for each bin.

References for Supplementary Text

1. Patterson N, Price AL, Reich D: Population structure and eigenanalysis. PLoS Genet 2006; 2: e190.
2. Novembre J, Johnson T, Bryc K et al: Genes mirror geography within Europe. Nature 2008.
3. Purcell S, Neale B, Todd-Brown K et al: PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet 2007; 81: 559-575.
4. Shaun Purcell (2010) PLINK (v1.07).
5. Alexander DH, Novembre J, Lange K: Fast model-based estimation of ancestry in unrelated individuals. Genome Res 2009; 19: 1655-1664.
6. McQuillan R, Leutenegger AL, Abdel-Rahman R et al: Runs of homozygosity in European populations. Am J Hum Genet 2008; 83: 359-372.
7. Jakobsson M, Scholz SW, Scheet P et al: Genotype, haplotype and copy-number variation in worldwide human populations. Nature 2008; 451: 998-1003.
