Genetic variation in the Sorbs of eastern Germany in the
context of broader European genetic diversity:
Supplementary Text
A Brief History of the Sorbs
From the eighth century, western written sources record numerous Slavic groups living
between the Elbe and Saale rivers in the West, the Odra in the East, the Baltic Sea in the
North and the ridge of the Ore Mountains in the South. It is impossible to know whether
these names employed in Latin texts written outside of this region designate individual,
self-consciously ethnic or political communities, whether they are collective
designations of historians and geographers, or some mix of the two. There is also no
evidence of any self-conscious group identity uniting these Slavic groups, whose social
and political organization was decentralized, consisting primarily of small communities
of farmers only occasionally brought under some sort of central political control. Today
scholars refer to these Slavic groups collectively as “Polabians” (in German: Elbslaven)
although it must be emphasized that this is a modern collective designation that has no
historical resonance. The Sorbs are but one of the many groups that made up the
Polabians.
The first mention of the Sorbs (Surbi, Sorbi), which occurs in the Fourth Book of the
Chronicle of Fredegar, written in the first half of the eighth century, states that some
time around 630 one Dervan, “duke of the people of the Sorbs, who were of the Slavic
people (dux gente Surbiorum, que ex genere Sclauinorum erant) long subject to the
kingdom of the Franks, placed himself and his people under the rule of Samo,” a leader
of another Slavic group also known as the Winidi (Wends). The term Wend was a
Germanic term that designated Slavic speaking neighbors and was used widely along
the linguistic borderlands between Germanic and Slavic regions. A ninth-century list of
peoples compiled by the anonymous “Bavarian geographer” included numerous Slavic
peoples north of the Danube, among them some large tribes or tribal unions including
the Abodrites (Abodriti, Obotriti), the Veletians (Weleti, Wilzi) and the Sorbs (Surbi,
Sorbi).
The relationship between the groups identified by these sources as Sorbi and the Sorbs
of today is unclear. The territories of these early Sorbs lay between the Saale and Elbe
rivers while the Slavic groups that lived in what is today Sorbian territory were called
Lusici and Milzeni. From the ninth century forward, German eastward expansion
resulted in the linguistic and ethnic displacement or absorption of virtually all of the
Polabians, with the exception of the modern Sorbs. This process was at times peaceful
and collaborative, at other times aggressive, with an emphasis on conversion to
Christianity and incorporation into the expanding east Frankish kingdom. This
culminated in a general uprising of the Polabians, calling themselves by the collective
name Luticians (probably meaning the “ferocious”) against the Saxon kings in 983, an
uprising that took the better part of a century to defeat. However, even during this
period, the lines were not so clearly drawn between German and Slavic groups. The
Emperor Otto II had Slavic soldiers in his bodyguard, and in the first half of the eleventh
century Sorbian warriors served in the followings of the Margraves of Meissen.
The pre-eighth century history of the Sorbs or indeed of all of the Polabians is
impossible to establish. Earlier generations of linguists and archaeologists, convinced
that they could recognize linguistic and ethnic communities by distinctive
archaeological materials, attempted to argue for the “original homeland” of the Slavs,
placing it variously between Podolia and Volhynia (west and south central Ukraine),
along the Pripet river in Polesie (northern Ukraine, southern Belarus, and parts of
Russia and southern Poland), in the area of Prague, or possibly around Chernyakhov in
Ukraine. From wherever this original homeland was located, it was assumed that Slavic
speakers migrated throughout Eastern Europe sometime in the first centuries of the
Common Era, although they only appear in written sources in the sixth century. German
scholars, who accepted as literally true accounts of massive Germanic migrations,
assumed that Slavic groups like the Sorbs spread into areas abandoned by Germanic
peoples who had migrated into the Roman Empire. More recently some scholars have
argued that the Slavs emerged only in the sixth century along the Danubian frontier, and
while this theory is disputed, it is generally accepted that Slavs made up a large portion
of the heterogeneous Avar empire that developed along the Danubian basin between the
Byzantines to the East and the Bavarians to the west between the sixth and ninth
centuries. The appearance of small Slavic groups such as the Sorbi mentioned in
Fredegar is presumed to have some connection to the waxing and waning of the Avar
confederation, with groups bearing separate names emerging in times of weakness only
to be reconsolidated in times of strength before being released entirely with the
destruction of the Avar empire by the Franks around 800.
Bibliography:
Christian Lübke, Fremde im östlichen Europa. Von Gesellschaften ohne Staat zu
verstaatlichten Gesellschaften (9.-11. Jahrhundert) (Ostmitteleuropa in Vergangenheit und
Gegenwart, Bd. 23) (Böhlau Verlag Köln Weimar Wien, 2001).
Christian Lübke, “Christianity and Paganism as Elements of Gentile Identities to the
East of the Elbe and Saale Rivers,” in Ildar Garipzanov, Patrick Geary, and
Przemysław Urbańczyk (eds.), Franks, Northmen, and Slavs. Identities and State Formation
in Early Medieval Europe (Turnhout: Brepols, 2008), 189-203.
Florin Curta, The Making of the Slavs: History and Archaeology of the Lower Danube
Region, c. 500-700 (Cambridge: Cambridge University Press, 2001).
Supplementary Methods
Generation of a homogenous, unrelated set of Sorbs
In order to generate a set of Sorbian individuals who were relatively homogenous but
seemingly unrelated we used the following protocol:
Sorbs from the POPRES/LPZ merge (see below) were extracted and the proportion of
loci that are identical by descent (IBD) between each pair of individuals was calculated
using plink (the PI_HAT statistic of the --genome option) for all 289 Sorbs. For any
pairwise comparison with an IBD value greater than 0.05 we removed one sample of
that pair from the dataset. The sample from the pair to be dropped was chosen
randomly, unless the samples had additional pair-wise comparisons with other
individuals that resulted in IBD values greater than 0.05, in which case the sample with
more pairwise IBD > 0.05 was chosen to be dropped. The above procedure resulted in a
homogenous dataset of 178 putatively unrelated samples. A PCA plot of these samples
is shown in Supp Fig 8. The PCA outlier removal routine of smartpca 1 found no
samples greater than 6 standard deviations from the mean on any of the first 20
eigenvectors.
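The pruning rule described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the script used in the study: the function name `prune_related`, the input format (tuples of two sample IDs and their PI_HAT value, as might be parsed from plink output) and the seeded random tie-break are all hypothetical. The sketch repeatedly drops the sample that still has the most pairwise IBD values above the threshold, choosing at random among ties.

```python
import random
from collections import defaultdict

def prune_related(pairs, threshold=0.05, seed=0):
    """Return the set of sample IDs to drop so that no retained pair exceeds threshold.

    pairs: iterable of (id_a, id_b, pi_hat) tuples (hypothetical input format).
    """
    rng = random.Random(seed)
    neighbours = defaultdict(set)
    for a, b, pi_hat in pairs:
        if pi_hat > threshold:
            neighbours[a].add(b)
            neighbours[b].add(a)

    dropped = set()
    while True:
        # Samples that still have at least one related partner in the dataset.
        degrees = {s: len(n) for s, n in neighbours.items() if n}
        if not degrees:
            break
        top = max(degrees.values())
        # Drop the sample with the most related partners; ties are broken at random.
        candidates = sorted(s for s, d in degrees.items() if d == top)
        victim = rng.choice(candidates)
        dropped.add(victim)
        for other in neighbours.pop(victim):
            neighbours[other].discard(victim)
    return dropped

# B is related to both A and C, so dropping B alone resolves both pairs.
example = [("A", "B", 0.20), ("B", "C", 0.10), ("C", "D", 0.01)]
print(prune_related(example))  # {'B'}
```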
Dataset Merging & Quality control
Depending on the analysis, different merged datasets were used in our study. The
POPRES and LPZ datasets were merged, with all non-autosomal loci removed from
these merges (total number of SNPs after merging = 341,163). All merged datasets were
filtered to remove genomic regions identified in Novembre et al. 2 as evidencing
aberrant patterns of LD, such that they might have undue influence on subsequent
analyses of genome-wide patterns of population structure (total number of SNPs after
filtering = 299,875). Further filtering was conducted to remove SNPs with high linkage
disequilibrium using a simple pairwise threshold and was performed in plink under the
following parameters: window size=50, the number of SNPs to shift a window at each
step=5, genotypic r2 threshold=0.8 (total number of SNPs after filtering = 175,061).
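To make the windowed pairwise filter concrete, the following Python sketch applies the same three parameters (window size 50, step 5, genotypic r2 threshold 0.8) to a small list of genotype vectors. It is a simplified illustration and not a reimplementation of plink: the function names are hypothetical, genotypes are assumed to be complete 0/1/2 allele counts ordered along one chromosome, and the choice to drop the later SNP of a correlated pair is an assumption that may differ from plink's own rule.

```python
def r_squared(x, y):
    """Squared Pearson correlation of two equal-length genotype vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # monomorphic SNP: correlation undefined, treated as unlinked
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy * sxy / (sxx * syy)

def ld_prune(snps, window=50, step=5, r2_threshold=0.8):
    """Return indices of SNPs kept after sliding-window pairwise pruning.

    snps: list of per-SNP genotype vectors (0/1/2 counts, no missing data).
    """
    keep = [True] * len(snps)
    for start in range(0, len(snps), step):
        end = min(start + window, len(snps))
        idx = [i for i in range(start, end) if keep[i]]
        for pos, i in enumerate(idx):
            if not keep[i]:
                continue
            for j in idx[pos + 1:]:
                if keep[j] and r_squared(snps[i], snps[j]) > r2_threshold:
                    keep[j] = False  # assumption: drop the later SNP of the pair
    return [i for i, kept in enumerate(keep) if kept]

# SNP 1 duplicates SNP 0 (r2 = 1); SNP 2 is uncorrelated with SNP 0 (r2 = 0).
toy = [[0, 1, 2, 0, 1, 2], [0, 1, 2, 0, 1, 2], [0, 0, 1, 2, 2, 1]]
print(ld_prune(toy))  # [0, 2]
```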
When merging data sets, a major concern is the impact of batch effects associated with
sample preparation 1. For example, when merging the LPZ and POPRES datasets
(henceforth known as the POPRES/LPZ merge) we were concerned with batch effects
between the POPRES and LPZ samples and amongst LPZ batches. As a means to assess
and control for batch effects, the LPZ sample was deliberately designed to include a
sample of 10 German individuals. As both the POPRES and LPZ samples contain
German individuals, one would expect that in the absence of batch effects, the two
samples of German individuals would appear to be highly similar (which we assessed
using PCA analyses). When we filtered the data to include high-confidence SNPs (i.e.
those for which there were only 5 missing genotypes across the entire merged dataset)
the Germans from both samples overlapped each other in PC1 vs. PC2 plots (see Supp
Fig 1A and 1B; total number of SNPs after filtering = 30,587). This process also
removed internal LPZ batch effects (see Supp Fig 1C and 1D).
After applying these filters, we had remaining 30,587 high confidence, approximately
independent SNPs.
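The high-confidence filter amounts to counting no-calls per SNP across the merged dataset. The short Python sketch below illustrates this under explicit assumptions: the function name and the dictionary input format are hypothetical, no-calls are represented by `None`, and "only 5 missing genotypes" is read here as "at most 5", which is one possible interpretation of the criterion.

```python
def high_confidence_snps(genotypes, max_missing=5, missing=None):
    """Return IDs of SNPs with at most max_missing no-calls.

    genotypes: dict mapping SNP ID -> list of calls, one per individual
    (hypothetical format); `missing` is the marker used for a no-call.
    """
    return [snp for snp, calls in genotypes.items()
            if sum(call is missing for call in calls) <= max_missing]

toy = {
    "snp_a": [0, 1, None, 2],   # 1 no-call: retained
    "snp_b": [None] * 6 + [1],  # 6 no-calls: removed
}
print(high_confidence_snps(toy))  # ['snp_a']
```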
This methodology was likely over-conservative, resulting in a loss of 144,474 SNPs.
Examination of the MAF spectrum before and after filtering high missingness SNPs
(Supp Fig 9) showed a bias towards the loss of mostly very low frequency (<0.01) and
monomorphic variants. The genotyping algorithms would likely have had greater
difficulty calling these loci in Sorbs because of a lack of clear clustering into one of the
three possible genotypes. Any ascertainment bias caused by the missingness filter would
have been against SNPs that are generally monomorphic in European populations and
thus are likely to be non-informative with regard to the genetic architecture of European
populations.
A merge of the POPRES and HGDP (denoted as the POPRES/HGDP merge) was
performed in the same manner described above except that there was no filtering for
batch effects using missingness levels as these two datasets have previously been well
validated. The merge resulted in 72,765 approximately independent SNPs. In order to
match the SNP density of the POPRES/LPZ dataset a random 30,587 SNPs were used
for subsequent analysis.
Finally a POPRES/LPZ/HGDP merge was performed as described above, including
filtering for missingness, resulting in a data set with 8745 high confidence,
approximately independent SNPs.
All merging was conducted using scripts to execute a series of plink 3, 4 functions.
Principal Components Analysis (PCA)
The purpose of the PCA analysis of the merged POPRES/LPZ dataset was to examine
where the Sorbs may lie in relation to other European samples. One concern with PCA
is that if a group shows excess relatedness (e.g. at the extreme, a family is included in a
PCA), the PCA will separate out the group with excess relatedness from the rest of the
sample even if the group is from the same population as the rest of the sample (results
not shown). To ensure that the PCA is not distorted by excess relatedness amongst the
population isolates, we implemented a ‘drop one in’ procedure for incorporating them
into the analysis. Specifically, PCA analysis was performed for each individual from a
population isolate along with all other European samples. Each population isolate
sample’s resultant PC coordinates for the first two components were then plotted
together along with the average PC co-ordinates for each European sample across all
runs. This procedure is more effective than simply projecting data points onto the PCs,
which tends to artificially push coordinates towards the mean value of the PCs (results
not shown).
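The 'drop one in' procedure can be sketched with NumPy as below. This is an illustrative approximation and not the smartpca workflow used in the study: the function names are hypothetical, genotypes are only mean-centred (smartpca applies its own normalisation), and the sign alignment between runs is an added assumption needed because the sign of each principal component is arbitrary. The sketch does not handle components swapping order between runs.

```python
import numpy as np

def pca_scores(X, n_components=2):
    """Coordinates of the rows of X on the top principal components of X."""
    Xc = X - X.mean(axis=0)
    U, S, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :n_components] * S[:n_components]

def drop_one_in(reference, isolates, n_components=2):
    """Run one PCA per isolate individual together with all reference samples.

    reference: (n_ref, n_snps) array; isolates: (n_iso, n_snps) array.
    Returns (reference coordinates averaged over runs, isolate coordinates).
    """
    ref_runs, iso_coords = [], []
    anchor = None
    for row in isolates:
        scores = pca_scores(np.vstack([reference, row]), n_components)
        if anchor is None:
            anchor = scores[:-1]
        else:
            # Flip each PC if needed so this run agrees in sign with the first run.
            signs = np.sign(np.sum(scores[:-1] * anchor, axis=0))
            signs[signs == 0] = 1.0
            scores = scores * signs
        ref_runs.append(scores[:-1])
        iso_coords.append(scores[-1])
    return np.mean(ref_runs, axis=0), np.array(iso_coords)

rng = np.random.default_rng(0)
ref = rng.integers(0, 3, size=(20, 50)).astype(float)
iso = rng.integers(0, 3, size=(4, 50)).astype(float)
ref_mean, iso_xy = drop_one_in(ref, iso)
print(ref_mean.shape, iso_xy.shape)  # (20, 2) (4, 2)
```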
To investigate the sampling variance in the observed median PC1-PC2 positions for
each country (based on the ‘one at-a-time’ PC coordinates), we conducted bootstrapping
analyses by re-sampling individuals within each country with replacement over 10,000
iterations (as described by Novembre et al. 2, Supplementary Material).
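A minimal Python sketch of this bootstrap is given below, assuming the PC1-PC2 coordinates for one country are available as a list of tuples. The function name, input format and fixed seed are hypothetical and only serve the illustration.

```python
import random
from statistics import median

def bootstrap_median(coords, n_iter=10000, seed=0):
    """Bootstrap the median PC1-PC2 position of one country.

    coords: list of (pc1, pc2) tuples, one per individual.
    Returns a list of n_iter (median_pc1, median_pc2) tuples.
    """
    rng = random.Random(seed)
    n = len(coords)
    replicates = []
    for _ in range(n_iter):
        # Re-sample individuals with replacement, keeping the sample size fixed.
        sample = [coords[rng.randrange(n)] for _ in range(n)]
        replicates.append((median(p[0] for p in sample),
                           median(p[1] for p in sample)))
    return replicates

toy = [(0.0, 1.0), (1.0, 2.0), (2.0, 3.0)]
reps = bootstrap_median(toy, n_iter=1000)
print(len(reps))  # 1000
```

The spread of the replicate medians then gives an impression of the sampling variance of each country's position.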
All PCA was performed using smartpca 1 under default parameters with 0 outlier
detection iterations unless otherwise stated.
Ancestry Estimation
Maximum likelihood estimation of individual ancestries was performed using admixture
software 5 using default values (the block relaxation algorithm, a termination criterion
set to stop when the log-likelihood increases by less than ε = 10^-4 between iterations
and the quasi-Newton convergence acceleration method with q = 3 secant conditions).
We assumed an unsupervised (i.e. no fixed ancestry or parental groups are assumed)
admixture model. To ensure reliable convergence to the global maximum, we ran
admixture at each K (K = 1 - 8) value 100 times using different random seeds and
examined the final log-likelihood scores (LLs) for consistent convergence across runs.
For all K values admixture converged to either exactly the same or very similar
likelihood values over each set of 100 iterations. Admixture co-efficient plots for K=2 -
6 are based on one random iteration from this set of 100 (within populations, individuals
in the plots are sorted by the 3 most dominant admixture components for a given K, if
applicable, to aid clarity of viewing). We assessed the cross-validation error for each
value of K using Admixture’s Cross Validation procedure (hold-out fraction 0.1) and
visualized these values across all iterations in a boxplot (Supp Fig 3).
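Checking for consistent convergence across seeds can be illustrated with a small Python summary of the final log-likelihoods. The sketch assumes these values have already been collected into a dictionary keyed by K; the function name, the input format and the `tolerance` used to call a run "close to the best" are hypothetical and not taken from the study.

```python
def convergence_summary(loglik_by_k, tolerance=1.0):
    """Summarise replicate runs for each K.

    loglik_by_k: dict mapping K -> list of final log-likelihoods, one per seed.
    Returns dict K -> (best log-likelihood, spread between best and worst run,
    fraction of runs within `tolerance` of the best).
    """
    summary = {}
    for k, lls in sorted(loglik_by_k.items()):
        best = max(lls)
        spread = best - min(lls)
        near_best = sum(best - ll <= tolerance for ll in lls) / len(lls)
        summary[k] = (best, spread, near_best)
    return summary

toy = {2: [-100.0, -100.5, -103.0], 3: [-90.0, -90.0, -90.0]}
for k, stats in convergence_summary(toy).items():
    print(k, stats)
```

A small spread and a high fraction of runs near the best value at a given K would be consistent with convergence to the same maximum.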
Measure of Inbreeding
Inbreeding Coefficient, F: The calculation of inbreeding coefficients (F, based on the
observed versus expected number of homozygous genotypes for an individual) was
performed using plink routines.
Runs of Homozygosity: Analysis of Runs of Homozygosity (ROH) was performed using
plink routines. The exact parameters used for estimating ROH were determined by the
SNP density of the POPRES/LPZ dataset, with parameters chosen in order to exclude
short and common ROHs that are not informative about the level of IBD, as described
by McQuillan et al. 6. The plink parameters used were: homozyg-window-kb = 5000,
homozyg-window-snp = 20, homozyg-window-het = 1, homozyg-window-missing = 2,
homozyg-window-threshold = 0.05, homozyg-snp = 7, homozyg-kb = 1000,
homozyg-density = 150, homozyg-gap = 500. The POPRES/LPZ and POPRES/HGDP merges
were analyzed separately to maximize the SNP density with which to identify
homozygous regions.
Linkage Disequilibrium Decay (LD) Analysis: LD was quantified using both the genotypic r2
statistic (using a modified version of the --r2 function in plink) and the haplotypic r2
statistic (based on phased haplotypes and computed with custom scripts) for both the
POPRES/LPZ and POPRES/HGDP datasets (prior to any LD filtering described above).
For each population with sample size n>=10, we computed the r2 for all SNP pairs with
a physical distance of <=70.5kb, placed them into bins of size 500bp ranging from
0-500bp up to 70000-70500bp and calculated the mean within each bin. As described by
Jakobsson et al. 7, in order to control for changes in r2 due to sample size effects, for
each SNP pair we randomly re-sampled 10 individuals when computing r2. We
smoothed the subsequent curves by applying a moving average (20-point for
POPRES/LPZ and 30-point for POPRES/HGDP) on the mean r2 values and plotted
these resultant values against the central distance value (in basepairs) for each bin.
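The binning and smoothing steps of the LD decay analysis can be sketched in Python as follows. This is an illustration only: the function names and the input format (pairs of physical distance in basepairs and a precomputed r2 value) are hypothetical, the per-pair re-sampling of 10 individuals is not shown, and the exact treatment of bin boundaries and of the moving-average edges is an assumption.

```python
from collections import defaultdict

def ld_decay(pairs, bin_size=500, max_dist=70500):
    """Mean r2 per distance bin.

    pairs: iterable of (distance_bp, r2) tuples for SNP pairs.
    Returns a sorted list of (bin_centre_bp, mean_r2).
    """
    n_bins = max_dist // bin_size
    sums = defaultdict(float)
    counts = defaultdict(int)
    for dist, r2 in pairs:
        if 0 < dist <= max_dist:
            # A pair at exactly max_dist is placed in the last bin.
            b = min(dist // bin_size, n_bins - 1)
            sums[b] += r2
            counts[b] += 1
    return [(b * bin_size + bin_size / 2, sums[b] / counts[b])
            for b in sorted(sums)]

def moving_average(values, window):
    """Mean of each run of `window` consecutive values (no edge padding)."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]

toy = [(100, 0.8), (400, 0.6), (700, 0.2)]
for centre, mean_r2 in ld_decay(toy):
    print(centre, round(mean_r2, 3))
print(moving_average([1, 2, 3, 4], 2))  # [1.5, 2.5, 3.5]
```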
References for Supplementary Text
1. Patterson N, Price AL, Reich D: Population structure and eigenanalysis.
PLoS Genet 2006; 2: e190
2. Novembre J, Johnson T, Bryc K et al: Genes mirror geography within
Europe. Nature 2008;
3. Purcell S, Neale B, Todd-Brown K et al: PLINK: a tool set for whole-genome
association and population-based linkage analyses. Am J Hum
Genet 2007; 81: 559-575
4. Shaun Purcell (2010) PLINK (v1.07)
5. Alexander DH, Novembre J, Lange K: Fast model-based estimation of
ancestry in unrelated individuals. Genome Res 2009; 19: 1655-1664
6. McQuillan R, Leutenegger AL, Abdel-Rahman R et al: Runs of
homozygosity in European populations. Am J Hum Genet 2008; 83: 359-372
7. Jakobsson M, Scholz SW, Scheet P et al: Genotype, haplotype and copy-number
variation in worldwide human populations. Nature 2008; 451: 998-1003