Ann. Hum. Genet. (2000), 64, 171–186. Printed in Great Britain

Estimation of admixture and detection of linkage in admixed populations by a Bayesian approach: application to African-American populations

P. M. McKEIGUE (1), J. R. CARPENTER (1), E. J. PARRA (2), M. D. SHRIVER (2)

(1) Department of Epidemiology and Population Health, London School of Hygiene & Tropical Medicine, Keppel Street, London WC1E 7HT, UK
(2) Department of Anthropology, College of Liberal Arts, 409 Carpenter Building, The Pennsylvania State University, University Park, PA 16802, USA

(Received 12.11.99. Accepted 7.2.00)

We describe a novel method for analysis of marker genotype data from admixed populations, based on a hybrid of Bayesian and frequentist approaches in which the posterior distribution is generated by Markov chain simulation and score tests are obtained from the missing-data likelihood. We analysed data on unrelated individuals from eight African-American populations, genotyped at ten marker loci of which two (FY and AT3) are linked (22 cM apart). Linkage between these two loci was detected by testing for association of ancestry conditional on parental admixture. The strength of this association was consistent with European gene flow into the African-American population between five and nine generations ago. To mimic the mapping of an unknown gene in an ‘affecteds-only’ analysis, a binary trait was constructed from the genotype at the AT3 locus and a score test was shown to detect linkage of this ‘trait’ with the FY locus. Mis-specification of the ancestry-specific allele frequencies – the probabilities of each allelic state given the ancestry of the allele – was detected at three of the ten marker loci. The methods described here have wide application to the analysis of data from admixed populations, allowing the effects of linkage and population structure (variation of admixture between individuals) to be distinguished.
With more markers and a more complex statistical model, genes underlying ethnic differences in disease risk could be mapped by this approach.

Correspondence: Dr Paul McKeigue, Department of Epidemiology and Population Health, London School of Hygiene & Tropical Medicine, Keppel Street, London WC1E 7HT, UK. Tel: +44 (020) 7927 2312; Fax: +44 (020) 7580 6897. E-mail: paul.mckeigue@lshtm.ac.uk

In a population formed by admixture between two or more ancestral founding populations, marker genotype data can be used to estimate the ancestry (origin from one of the founding populations) of the two alleles at each marker locus in each individual studied, the admixture of the individual (proportion of that individual’s genome that has ancestry from each founding population), and the distribution of admixture in the population. As long as the ancestry-specific allele frequencies – the probability of each allelic state given the ancestry of the allele – in the admixed population under study are correctly specified for each marker locus, Bayes’s theorem can be applied to invert these conditional probabilities and obtain the posterior distribution of ancestry at each locus given the observed marker genotypes. With estimates of individual admixture, it is possible to study the relationship of admixture to disease risk, and to control for admixture as a confounder in studies of the associations of other genetic or environmental factors with disease risk. This can help to distinguish between genetic and environmental explanations for ethnic variation in disease risk (Chakraborty & Weiss, 1986). This approach has been used, for instance, to study the relation of blood pressure to admixture in African-Americans (MacLean et al. 1974) and the relation of diabetes to admixture in Mexican-Americans (Chakraborty et al. 1986).
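The inversion by Bayes’s theorem described above can be illustrated for a single allele. The following is a minimal sketch, not the authors’ program: the function name and the prior admixture of 0.2 are hypothetical, and the allele frequencies are those listed for allele 1 at AT3 and FY in Table 4.

```python
def posterior_european(prior_eur, p_allele_eur, p_allele_afr):
    """Posterior probability that an observed allele has European ancestry.

    prior_eur     -- prior probability of European ancestry (e.g. gamete admixture)
    p_allele_eur  -- probability of the observed allelic state given European ancestry
    p_allele_afr  -- probability of the observed allelic state given African ancestry
    """
    numerator = prior_eur * p_allele_eur
    denominator = numerator + (1.0 - prior_eur) * p_allele_afr
    return numerator / denominator


PRIOR_EUR = 0.2  # hypothetical prior admixture, chosen only for illustration

# AT3, allele 1: frequency 0.279 given European ancestry, 0.874 given African ancestry.
post_at3_allele1 = posterior_european(PRIOR_EUR, 0.279, 0.874)
# AT3, allele 2: the complementary frequencies.
post_at3_allele2 = posterior_european(PRIOR_EUR, 1 - 0.279, 1 - 0.874)
# FY, allele 1: frequency 1 given European ancestry, 0 given African ancestry,
# so observing it identifies the ancestry of the allele with certainty.
post_fy_allele1 = posterior_european(PRIOR_EUR, 1.0, 0.0)

print(round(post_at3_allele1, 3), round(post_at3_allele2, 3), post_fy_allele1)
```

Observing allele 2 at AT3 raises the probability of European ancestry well above the prior, whereas allele 1 lowers it; no single observation at AT3 is decisive, which is why information from several markers has to be combined.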
With information about the ancestry of alleles at marker loci in individuals of mixed descent, it is possible to map genes underlying ethnic differences in disease risk in a manner analogous to linkage analysis of an experimental cross (McKeigue, 1998). The basis of this approach is to test for association with states of ancestry on chromosomes of mixed descent, conditioning on parental admixture. By combining information from all marker loci in a multipoint analysis, it is possible to estimate accurately the ancestry of the alleles at each locus, even though no single marker is fully informative for ancestry. In principle this approach can extract all the information about linkage that is generated by admixture (McKeigue, 1998), in contrast to other approaches that rely on detecting the allelic association that is secondary to the correlation of states of ancestry at linked loci on chromosomes of mixed descent. As this approach has more in common with linkage analysis of a cross than with conventional linkage disequilibrium mapping, the term ‘admixture mapping’ (Zheng & Elston, 1999) is preferable to the term ‘mapping by admixture linkage disequilibrium’ coined by earlier writers (Stephens et al. 1994).

We describe a novel method for estimating admixture from marker data and exploiting the information generated by admixture to detect linkage, based on a hybrid of Bayesian and frequentist approaches. This is applied to a large dataset of marker genotypes from African-American populations.

Limitations of existing methods for estimating admixture

Classical methods for estimation of admixture from marker data estimate the mean proportion of alleles in the population under study that have ancestry from each founding population, but cannot estimate variation of admixture between individuals (Elston, 1971; Reed, 1971; Chakraborty, 1975; Long & Smouse, 1983).
This limits our ability to detect the substructure that occurs in admixed populations where there has been continuing gene flow from one or both founding populations.

Existing methods assume that for any given individual the states of ancestry at different marker loci are independent of each other. This assumption will not hold if the marker loci are linked, because the stochastic variation of ancestry on chromosomes of mixed descent generates correlation of ancestry between linked marker loci. The closer the loci, the more highly correlated will be the states of ancestry. This limits the number of autosomal markers that can be used in admixture studies to one per chromosome arm, which is not enough for accurate estimation of individual admixture (Reed, 1973). To estimate individual admixture accurately, it is necessary to model the stochastic variation of ancestry on chromosomes of mixed descent so that information from multiple markers on each chromosome can be combined. To map genes underlying ethnic differences in disease risk in a genome search even higher marker densities are required, so that ancestry of the two alleles at each locus can be estimated at all points on the genome (McKeigue, 1998).

Estimation of admixture from marker genotype data relies on specifying the ancestry-specific allele frequencies correctly for the admixed population under study. These frequencies are usually estimated in samples of modern descendants of the founding populations that contributed to the admixed population. Often, however, it is impossible to determine what subpopulations should be sampled to draw a representative sample of descendants of a founding population. For instance we cannot easily draw a sample that is exactly representative of the West African subpopulations that contributed to the African-American population.
Even where it is feasible to draw a representative sample from the founding populations, the ancestry-specific allele frequencies within an admixed population may vary from the allele frequencies in modern descendants of the founding populations as a result of drift, mutation or selection since admixture. To overcome this difficulty we require methods for detecting mis-specification of ancestry-specific allele frequencies and re-estimating the correct frequencies within the admixed population under study.

A simple test for mis-specified ancestry-specific allele frequencies is to compare the estimates for population admixture obtained from single loci (Cavalli-Sforza & Bodmer, 1971; Elston, 1971). If one of the ancestry-specific allele frequencies at a locus is mis-specified, the estimate of admixture obtained from that locus will vary from the estimate obtained by combining marker genotype data for all loci. This however does not fully exploit the information about ancestry-specific allele frequencies that is available in a dataset in which individuals of mixed descent have been typed at multiple marker loci.

Advantages of a Bayesian approach

To model admixture in the population adequately, a hierarchical model is required, estimating the admixture of each individual and the distribution of individual admixture in the population. In such a model the admixture of each parental gamete, the ancestry of the two alleles at each locus in each typed individual, and any missing marker genotypes can be specified as ‘missing data’. Classical likelihood-based methods are difficult to apply to such a hierarchical model where most of the data are missing. An alternative is to adopt a Bayesian approach (MacLean & Workman, 1973).
If we specify a full probability model in which all observed and missing data are treated as random variables and prior distributions are assigned where necessary, the posterior distribution of the missing data given the observed data can be generated by Markov chain simulation (Gelman et al. 1995).

Bayesian inference is based on combining the prior distribution and the likelihood to generate a posterior distribution. With non-informative prior distributions (Gelman et al. 1995) and large sample sizes, the posterior distribution is dominated by the likelihood, and Bayesian analyses yield results that are close to those obtained by a frequentist approach. Thus in large samples, the mode (or mean) of the posterior distribution is asymptotically equivalent to the maximum-likelihood estimate, and the 95% central posterior interval is asymptotically equivalent to a 95% confidence interval: that is, it has 95% probability of covering the true value under repeated sampling with any fixed true value of the parameter (Gelman et al. 1995).

Once we have generated the posterior distribution of the missing data, the expectation of any quantity of interest over this posterior distribution can be evaluated and we can use a frequentist approach to test whether any of the assumptions of the model should be rejected. For any null hypothesis of interest, it is straightforward to construct a score test from the missing-data likelihood, as described below. With this hybrid of Bayesian and frequentist approaches, it is not necessary to assign a prior distribution to the parameter that we are testing for departure from its null value. In large samples, score tests are asymptotically equivalent to the likelihood ratio tests (lod scores) conventionally used in genetic linkage studies.

Data source

The African-American population samples and the genotyping have been described in detail elsewhere (Parra et al. 1998).
From the ten African-American and Afro-Caribbean population samples studied previously, eight populations that represented the range of mean admixture values were selected for this analysis. The samples were obtained from a paternity testing lab (Houston), blood donation centres (Pittsburgh and New Orleans) and participants in epidemiological studies (New York, Philadelphia, Baltimore, Charleston and Jamaica). The New York sample comprised cases and controls in a study of obesity in African-Americans. The Philadelphia and Jamaica samples were collected as controls in studies of hypertension. The Baltimore sample was collected in an epidemiological study of HIV infection among intravenous drug users. The Charleston sample comprised pregnant women who participated in a study of blood lead levels.

Ten marker loci, chosen to have large differentials in allele frequencies between West Africans and Europeans, were typed. Nine of these loci – RB2300, LPL, FY, AT3, APO, SB19.3, ICAM, OCA2, GC – have been described previously (Parra et al. 1998, ID: 4444). RB2300 is a polymorphism in the retinoblastoma gene located at 13q14.3. APO is an alu insertion polymorphism in the APO4 gene at 11q23. ICAM is a polymorphism in the ICAM1 gene at 19p13.3–p13.2. SB19.3 is an alu insertion located on chromosome 19. For this analysis genotypes at an additional locus – L19.2, a SacI polymorphism located on chromosome 11 – were available. Most of these markers are restriction site polymorphisms which were detected by digestion with the appropriate restriction enzyme after PCR. Markers were genotyped by standard PCR and electrophoretic separation of DNA fragments as described previously (Parra et al. 1998). Two of the marker loci – FY and AT3 – are located in the same chromosomal band, 22 cM apart. All markers except GC are biallelic.
For simplicity in programming the statistical analysis, the three allelic states at the GC locus were grouped into two categories so that all markers could be treated as biallelic. Allele 1F was coded as allelic state 1, and the two alleles that have higher frequency in Europeans than in West Africans – 1S and 2 – were combined as allelic state 2.

Ancestry-specific allele frequencies were estimated from samples of West African (Nigerian and Central African) and European (English, Irish and German) populations as described previously (Parra et al. 1998). The Wahlund variance (f), a function of the ancestry-specific allele frequencies, measures the proportion of information about ancestry extracted by a biallelic marker where the prior probabilities of each state of ancestry are equal (McKeigue, 1998). For the 10 markers used, the average f-value was 0.38, with a range from 0.14 (ICAM) to 1 (FY).

Statistical model

The analysis was based on the directed graphical model shown in Figure 1. As the distribution of admixture in the population cannot be assumed unimodal or even continuous, parental admixture was modelled as a discrete random variable with 33 possible values ranging from 0 to 1 as integer fractions of 32. This random variable was assigned a multinomial distribution with 33 possible outcomes. The conjugate prior for this multinomial distribution is a Dirichlet distribution with a parameter vector that has 33 coordinates. Each of these 33 coordinates was assigned a value of 1/33. This prior distribution can be interpreted as contributing information equivalent to observing the admixture of a single gamete, with equal weight given to each possible outcome. With the sample sizes of more than 150 gametes that were available for all the subpopulations in this study, the contribution of prior information to the posterior distribution is thus small in comparison to the contribution of the observed marker genotype data.
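The f-values quoted above can be checked against the ancestry-specific allele frequencies listed in Table 4. The sketch below assumes the usual form of the Wahlund variance for two equally weighted populations, f = (pAfr − pEur)² / (4 p̄ (1 − p̄)) with p̄ the mean of the two frequencies. That formula is not written out in this text and is an assumption here, although it does reproduce the reported range and average. The function and variable names are illustrative.

```python
def wahlund_f(p_afr, p_eur):
    """Wahlund variance f for a biallelic marker with equal prior weight on each ancestry.

    Assumed form: squared frequency difference divided by 4 * p_bar * (1 - p_bar),
    where p_bar is the unweighted mean of the two ancestry-specific frequencies.
    """
    delta = p_afr - p_eur
    p_bar = 0.5 * (p_afr + p_eur)
    return delta ** 2 / (4.0 * p_bar * (1.0 - p_bar))


# Ancestry-specific frequencies of allele 1 (pAfr, pEur), copied from Table 4.
FREQS = {
    "RB2300": (0.92, 0.333),
    "LPL": (0.973, 0.486),
    "FY": (0.0, 1.0),
    "AT3": (0.874, 0.279),
    "APO": (0.441, 0.927),
    "SB19.3": (0.425, 0.91),
    "ICAM": (0.756, 1.0),
    "OCA2": (0.098, 0.769),
    "L19.2": (0.089, 0.541),
    "GC": (0.824, 0.156),
}

f_values = {locus: wahlund_f(p_afr, p_eur) for locus, (p_afr, p_eur) in FREQS.items()}
mean_f = sum(f_values.values()) / len(f_values)

for locus, f in f_values.items():
    print(f"{locus:7s} f = {f:.2f}")
print(f"mean f = {mean_f:.3f}")
```

Under this assumed formula FY gives f = 1, ICAM gives f of about 0.14, and the mean over the ten markers is close to the 0.38 reported.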
The admixture of each parental gamete is modelled as an independent draw from the multinomial distribution of parental admixture. This assumes that mating in the population from which the parental gametes were drawn is not assortative for admixture. The ancestry of the allele transmitted on each parental gamete at each locus is then a Bernoulli variable (0 = European, 1 = African) with probability parameter specified by the admixture of the parental gamete. The probabilities of observing each of the three possible marker genotypes (0, 1 or 2 copies of allele 1) are then simple functions of the ancestry of the two alleles at the marker locus and the ancestry-specific allele frequencies.

The score tests for linkage are based on the model in Figure 1, in which there is no association between the ancestry of alleles at different loci, conditional on parental admixture. We have shown previously that for unlinked loci this assumption follows directly from Mendel’s law of independent assortment (McKeigue, 1998).

Fig. 1. Directed graphical model for dependence of marker genotypes on admixture at population level and individual level. Constants are shown as single-edged rectangles, observed data as double-edged rectangles, and stochastic nodes as ellipses. Stochastic dependence is indicated by continuous arrows. Single-edged rectangles represent strata: individuals and loci.

To estimate the association between the ancestry of alleles at the two linked marker loci – FY and AT3 – on each gamete, a stochastic node representing the odds ratio for this association was added to the model. The likelihood as a function of the odds ratio, conditional on parental admixture, is given in the Appendix. With a flat prior for the log odds ratio, the program was unable to sample adequately from the posterior distribution.
To overcome this problem, the natural logarithm of the odds ratio was assigned a normal prior with mean zero and precision 0.5 (equivalent to variance of 2). This incorporates our prior knowledge that extreme values of the log odds ratio are implausible. For two loci 22 cM apart, the highest possible values for the log odds ratio are 2.8 (three grandparents African, one European) in gametes from a population of individuals with 25% European admixture, and 2.9 (seven grandparents African, one European) in gametes from a population of individuals with 12.5% European admixture.

Using Markov chain simulation to generate the posterior distribution of the missing data

The BUGS program (WinBUGS 1.2, beta test version) (Spiegelhalter et al. 1999) was used to generate the posterior distribution by Gibbs sampling. The dataset of marker genotypes for each study population was analysed separately. Two chains with overdispersed starting values were run in parallel, with an initial run of 1000 iterations to ensure convergence, as assessed by the Gelman–Rubin method (Gelman & Rubin, 1992). After convergence at least 5000 further iterations were run to monitor parameters of interest, continuing if necessary to reduce the Monte Carlo error of the estimates.

Construction of score tests from the missing-data likelihood

Once we have generated the posterior distribution of the missing data given the observed data, the following method can be used to construct a significance test for any null hypothesis of interest. The gradient of the log-likelihood of the observed data (expressed as a function of a parameter θ) is the efficient score. If the null hypothesis ($\theta = \theta_0$) is correct, the score at $\theta_0$ has expectation zero and variance given by the observed information (defined as minus the curvature of the log-likelihood function) at $\theta_0$.
We can write down an expression for the log-likelihood of any realization of the complete data as a function of θ, and differentiate this expression twice with respect to θ to obtain the score and information. The score and information (evaluated at $\theta_0$) are averaged over the posterior distribution of the missing data to yield the score for the observed data and the complete information. The variance of the score over this posterior distribution is the missing information (Louis, 1982). The observed information is calculated by subtracting the missing information from the complete information (Little & Rubin, 1987). The ratio of observed information to complete information is the proportion of information about θ that is extracted by the analysis.

For a simple null hypothesis, the score $U$ and observed information $V$ are scalars, and $UV^{-1/2}$ has a standard normal distribution under the null hypothesis. For a composite null hypothesis, where $U$ is a row vector and $V$ is a matrix, a $\chi^2$ test statistic is calculated as $UV^{-1}U'$. Derivations of the score tests used in these analyses are given in the Appendix.

Distribution of individual admixture in each population

Table 1 shows the estimates for the distribution of individual admixture (in the parental generation) in each population. The estimated mean admixture (mean of the posterior distribution of the population mean) ranged from 0.07 in Jamaica to 0.22 in New Orleans. These estimates are similar to those obtained previously by conventional likelihood-based approaches (Parra et al. 1998). With a Bayesian approach, however, we can also estimate the distribution of individual admixture in the population. The estimated proportion of parents with more than 50% European admixture was low in most of the populations studied: only in New Orleans was this proportion estimated to be greater than 10%.
The posterior intervals for the proportion of parents with less than 12.5% (one-eighth) European admixture were wide in comparison with the posterior intervals for the mean admixture of each population. The low information content of these estimates is because only 10 markers have been used in each individual and because the individual’s parents themselves have not been genotyped.

Table 1. Distribution of parental admixture in each African-American subpopulation

                       Proportion of European admixture (posterior mean, 95% PI*)
Population      n      Population mean      Percent > 50% European    Percent < 12.5% European
Jamaica         93     0.07 (0.04–0.10)     1 (1–4)                   79 (33–100)
Charleston      94     0.12 (0.08–0.16)     2 (0–8)                   59 (0–73)
Philadelphia    303    0.15 (0.13–0.16)     0 (1–2)                   12 (0–45)
Baltimore       100    0.15 (0.12–0.19)     5 (0–15)                  36 (0–88)
Houston         100    0.16 (0.15–0.19)     3 (0–12)                  24 (0–75)
New York        236    0.21 (0.18–0.24)     7 (0–20)                  9 (0–32)
Pittsburgh      84     0.22 (0.18–0.26)     3 (0–15)                  11 (0–59)
New Orleans     105    0.22 (0.18–0.26)     13 (0–26)                 36 (0–81)

* PI, posterior interval.

Testing for linkage by testing for association of ancestry conditional on parental admixture

To examine the ability to detect linkage by testing for association of ancestry conditional on parental admixture, we examined the association between the two linked loci – FY and AT3 – using a score test based on the odds ratio, as described in the Appendix. If this test performs as theory predicts, it should detect association of ancestry between linked loci in populations where admixture has been recent, but should not detect association of ancestry between unlinked loci more often than expected by chance. To examine this, we applied the test to a pair of unlinked loci: FY and OCA2. Of the ten markers used, these two have the highest information content for ancestry, so that if variation of admixture between individuals generates association in the absence of linkage, it is between these two loci that we would expect to detect it most easily. As there is no plausible biological mechanism for an inverse association of ancestry between two loci, one-tailed p-values are given in the Table.

For each pair of loci a summary test was obtained by summing the score and the observed information across all the populations studied (Table 2). For loci FY and AT3, the summary score test was significant at a p-value of $2 \times 10^{-5}$, whereas for loci FY and OCA2 there was no evidence of association (p = 0.9). When the score tests in each population were examined separately, the association of ancestry between FY and AT3 was statistically significant at the 5% level in three of the eight African-American populations studied: Charleston, New York and New Orleans. The overall proportion of information about the odds ratio that was extracted by the analysis was low: 14% for the FY–AT3 odds ratio, and 3% for the FY–OCA2 odds ratio.

Table 2. Association of ancestry between loci FY and AT3, and between FY and OCA2, conditional on parental admixture

                           Score at                   Percent                               Log odds ratio
              Loci         odds ratio   Observed      information   Z test      One-tailed  (posterior   95% posterior
Population    tested       of 1         information   extracted     statistic   p-value     mean)        interval
Jamaica       FY-AT3        1.22        0.67          12             1.49       0.07         1.0         −2.0, 3.5
              FY-OCA2      −0.15        0.04           1            −0.71       0.76
Charleston    FY-AT3        2.17        1.14          11             2.04       0.02         1.3         −1.4, 3.5
              FY-OCA2       0.78        0.20           2             1.74       0.04
Philadelphia  FY-AT3        2.13        5.82          13             0.88       0.19         0.2         −1.6, 1.6
              FY-OCA2      −1.94        0.62           2            −2.47       0.99
Baltimore     FY-AT3        1.71        1.84          15             1.26       0.10         1.0         −1.3, 3.0
              FY-OCA2      −0.24        0.20           2            −0.53       0.70
Houston       FY-AT3        2.50        2.41          13             1.61       0.05         1.1         −1.3, 3.0
              FY-OCA2       0.99        0.68           4             1.19       0.12
New York      FY-AT3        7.69        7.68          16             2.77       0.003        1.8          0.5, 3.0
              FY-OCA2      −1.59        1.46           4            −1.32       0.91
Pittsburgh    FY-AT3       −0.95        3.17          19            −0.53       0.70        −0.5         −2.8, 1.3
              FY-OCA2      −0.01        0.84           5            −0.02       0.51
New Orleans   FY-AT3        4.08        2.83          16             2.42       0.008        1.9          0, 3.6
              FY-OCA2      −0.53        0.59           4            −0.69       0.76
All           FY-AT3       20.55        25.57         15             4.06       2 × 10^−5
              FY-OCA2      −2.70        4.63           3            −1.25       0.90

Bayesian estimates (posterior means) for the log odds ratio for association of ancestry at loci FY and AT3 ranged from −0.5 in Pittsburgh to 1.9 in New Orleans. As the posterior intervals for the log odds ratio were wide, it was not possible to establish whether there was heterogeneity of the odds ratio between populations.

Mapping the AT3 gene as if the phenotype were a binary trait determined by an unknown locus

To mimic the mapping of an unknown disease gene in an affecteds-only analysis, we defined a binary trait in which those with at least one copy of allele 2 (the allele more common in Europeans) at the AT3 locus were coded as ‘affected’. The prevalence of this ‘trait’ is 0.24 in unadmixed West Africans and 0.92 in unadmixed Europeans: a population risk ratio of 3.9. If we sample individuals ‘affected’ with this trait from an admixed population, at any marker locus that is linked to AT3 the observed proportion of alleles that are of European ancestry will be higher than expected on the basis of the parental admixture of these individuals. This principle can be applied to construct an affecteds-only test for linkage, as outlined previously (McKeigue, 1998). If we assume a multiplicative genetic model, the effect of the locus can be represented by the population risk ratio r. As shown in the Appendix, the score at r = 1 is simply the observed minus the expected proportion of alleles that have European ancestry at the marker locus, summed over all individuals and averaged over the posterior distribution of the missing data. As for this ‘trait’ the correct genetic model is a dominant one, the score test based on a multiplicative model will be less powerful than one based on the correct model. However this does not affect the validity of the p-values obtained from the score test, as these are calculated on the assumption that the null hypothesis (r = 1) is correct. Where we have established in advance that the trait is more common in the high-risk population (Europeans in this example of a trait defined by the genotype at the AT3 locus) than in the low-risk population (West Africans in this example), it is appropriate to use a one-tailed test.

To construct the test dataset, ‘affected’ individuals were defined as above and the AT3 genotype was dropped from the data. As shown in Table 3, the affecteds-only score test detected significant evidence of linkage of the ‘trait’ with the FY locus in all populations combined (summary p-value 0.001), and separately in the New York sample (p = 0.003).

Table 3. ‘Affecteds-only’ score test for linkage with the FY locus, using a ‘trait’ constructed from the AT3 genotype

                      Proportion of European
                      ancestry at FY locus               Information
Population      n     Observed    Expected    Score      Observed   Percent      Z-test      One-tailed
                                                                    extracted    statistic   p-value
Jamaica         33    0.12        0.11         0.53       1.41      73            0.44       0.33
Charleston      37    0.18        0.15         1.08       2.15      72            0.74       0.23
Philadelphia    123   0.18        0.16         3.37       8.28      76            1.17       0.12
Baltimore       46    0.20        0.18         0.81       2.24      69            0.54       0.29
Houston         45    0.24        0.19         2.64       4.62      83            1.23       0.11
New York        120   0.33        0.24        10.83      15.85      85            2.72       0.003
Pittsburgh      34    0.21        0.24        −1.16       0.61      37           −1.48       0.93
New Orleans     47    0.33        0.31         0.76       3.40      71            0.41       0.34
All             485   0.24        0.20        18.85      38.56      78            3.04       0.001
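The arithmetic behind the score tests in Tables 2 and 3 can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors’ code: `score_test_from_draws` follows the verbal description of the construction (posterior mean of the complete-data score, complete information minus missing information), and its input in the example is a hypothetical toy set of posterior draws. `pooled_score_test` sums per-population scores and observed information as described for the summary tests, using the values printed in the tables. All function names are illustrative.

```python
import math
from statistics import NormalDist, mean, pvariance


def score_test_from_draws(scores, infos):
    """Combine complete-data scores and informations evaluated at the null value,
    one pair per posterior draw of the missing data.

    Returns (U, V, fraction) where U is the observed-data score, V the observed
    information (complete minus missing) and fraction = V / complete information.
    """
    u = mean(scores)              # score for the observed data
    complete = mean(infos)        # complete information
    missing = pvariance(scores)   # missing information: variance of the score
    v = complete - missing
    return u, v, v / complete


def pooled_score_test(scores, infos):
    """Summary test across populations: sum scores and observed informations,
    then Z = U / sqrt(V) with a one-tailed p-value from the normal distribution."""
    u, v = sum(scores), sum(infos)
    z = u / math.sqrt(v)
    return z, 1.0 - NormalDist().cdf(z)


# Toy posterior draws (hypothetical numbers, for illustration only).
u_toy, v_toy, frac_toy = score_test_from_draws([1.0, 2.0, 3.0], [4.0, 4.0, 4.0])

# Table 2, FY-AT3 rows: score at odds ratio of 1 and observed information.
t2_scores = [1.22, 2.17, 2.13, 1.71, 2.50, 7.69, -0.95, 4.08]
t2_infos = [0.67, 1.14, 5.82, 1.84, 2.41, 7.68, 3.17, 2.83]
z_fy_at3, p_fy_at3 = pooled_score_test(t2_scores, t2_infos)

# Table 3: affecteds-only score and observed information.
t3_scores = [0.53, 1.08, 3.37, 0.81, 2.64, 10.83, -1.16, 0.76]
t3_infos = [1.41, 2.15, 8.28, 2.24, 4.62, 15.85, 0.61, 3.40]
z_trait, p_trait = pooled_score_test(t3_scores, t3_infos)

print(f"FY-AT3 summary: Z = {z_fy_at3:.2f}, one-tailed p = {p_fy_at3:.1e}")
print(f"Affecteds-only summary: Z = {z_trait:.2f}, one-tailed p = {p_trait:.3f}")
```

Summing the tabulated per-population values gives summary Z statistics that agree with the ‘All’ rows of Tables 2 and 3 to within rounding of the printed entries.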
Testing for mis-specification of ancestry-specific allele frequencies

The score test for mis-specification of the ancestry-specific allele frequencies yields three test statistics: a Z statistic (standard normal deviate) for the African-specific allele frequency, a Z statistic for the European-specific allele frequency, and a summary chi-square statistic with 2 degrees of freedom for the joint null hypothesis that both ancestry-specific allele frequencies are correctly specified. A positive value of the Z statistic (positive gradient of the log-likelihood function at the null value) implies that the most likely value of the ancestry-specific allele frequency is higher than the value specified in the model, and a negative value that the most likely value is lower than the specified value.

Table 4 shows, for the four largest population samples, the results of these score tests. The percentage of information extracted by the analysis is typically around 80%. The summary chi-square statistic for the four African-American populations combined yielded significant evidence (p-values less than 0.01) for mis-specification of the ancestry-specific allele frequencies at three of the ten loci: AT3, OCA2 and GC.

Table 4. Score tests for mis-specified ancestry-specific allele frequencies

Ancestry-specific allele frequencies specified in the model

Locus     pAfr     pEur
RB2300    0.92     0.333
LPL       0.973    0.486
FY        0        1
AT3       0.874    0.279
APO       0.441    0.927
SB19.3    0.425    0.91
ICAM      0.756    1
OCA2      0.098    0.769
L19.2     0.089    0.541
GC        0.824    0.156

Philadelphia

Locus     Z African-specific   % info   Z European-specific   % info   Summary χ²
RB2300    −0.13                97       −0.11                 94       0.05
LPL       −0.55                94       −0.44                 97       1.05
FY         0.92                91        0.98                 92       1.38
AT3       −1.00                98       −1.03                 94       1.08
APO       −0.62                99       −0.42                 86       2.12
SB19.3     0.05                99        0.04                 85       0.34
ICAM      −1.82                100      −1.93                 77       4.03
OCA2      −2.43                97       −2.36                 91       6.02
L19.2      1.45                98        1.36                 95       2.43
GC         2.53                98        2.46                 87       6.49

Baltimore

Locus     Z African-specific   % info   Z European-specific   % info   Summary χ²
RB2300     0.51                90        0.81                 79       0.71
LPL       −1.00                83       −1.16                 91       1.34
FY        −0.80                69       −0.06                 58       3.03
AT3       −1.66                93       −1.38                 79       2.75
APO       −0.53                98       −0.75                 66       0.57
SB19.3     1.17                98        1.18                 66       1.61
ICAM       2.61                99        1.69                 60       7.14
OCA2      −1.86                91       −2.10                 69       4.51
L19.2      0.37                92        1.35                 84       0.14
GC         1.60                95        1.42                 64       2.61

New York

Locus     Z African-specific   % info   Z European-specific   % info   Summary χ²
RB2300     0.63                87        1.40                 82       3.00
LPL       −0.49                76       −1.43                 92       4.63
FY         1.32                63        1.31                 54       1.82
AT3       −2.60                90       −2.53                 84       7.05
APO       −0.55                97        0.41                 72       4.42
SB19.3     0.84                97        0.91                 70       0.85
ICAM      −1.86                99       −1.52                 52       3.56
OCA2      −1.60                88       −1.34                 77       2.58
L19.2      1.99                91       −2.41                 86       5.81
GC         1.71                93        1.21                 72       2.97

New Orleans

Locus     Z African-specific   % info   Z European-specific   % info   Summary χ²
RB2300     1.05                90        1.93                 76       5.87
LPL        0.28                85        1.29                 87       7.07
FY         0.60                67        0.89                 74       1.77
AT3       −1.07                90       −1.47                 88       2.97
APO        0.77                97       −0.08                 54       2.34
SB19.3    −1.60                98       −1.45                 65       2.56
ICAM       1.32                99        1.78                 74       4.31
OCA2       0.31                89        0.41                 83       0.22
L19.2     −0.44                91       −0.26                 88       0.29
GC         0.03                93        0.16                 77       0.10

All four populations combined

Locus     Z African-specific   % info   Z European-specific   % info   Summary χ²
RB2300     0.69                92        1.68                 85       5.58
LPL       −1.01                86       −1.38                 93       2.18
FY         1.21                76        1.69                 65       3.81
AT3       −3.09                94       −3.24                 86       10.70
APO       −0.74                98       −0.30                 72       0.98
SB19.3     0.57                98        0.69                 72       0.47
ICAM      −1.23                100      −1.24                 59       1.61
OCA2      −3.20                93       −2.88                 80       10.20
L19.2      0.11                95       −0.74                 89       3.42
GC         3.26                96        2.65                 74       10.70

Z scores with absolute value greater than 2.33, or chi-square values greater than 9.21 (shown in bold in the original), are significant at p < 0.01.
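The significance thresholds quoted in the note to Table 4 can be recovered with the Python standard library, and applying the chi-square threshold to the combined-population column picks out the same three loci. This is a small check rather than part of the original analysis: the variable names are illustrative, the closed form exp(−x/2) is the upper-tail probability of a chi-square variable with 2 degrees of freedom, and 2.33 is the 0.99 quantile of the standard normal distribution.

```python
import math
from statistics import NormalDist


def chi2_2df_pvalue(x):
    """Upper-tail probability of a chi-square statistic with 2 degrees of freedom."""
    return math.exp(-x / 2.0)


ALPHA = 0.01
z_crit = NormalDist().inv_cdf(1.0 - ALPHA)   # 0.99 quantile of the standard normal
chi2_crit = -2.0 * math.log(ALPHA)           # 2-df chi-square critical value

# Summary chi-square statistics, all four populations combined (Table 4).
combined_chi2 = {
    "RB2300": 5.58, "LPL": 2.18, "FY": 3.81, "AT3": 10.70, "APO": 0.98,
    "SB19.3": 0.47, "ICAM": 1.61, "OCA2": 10.20, "L19.2": 3.42, "GC": 10.70,
}

flagged = [locus for locus, x in combined_chi2.items() if chi2_2df_pvalue(x) < ALPHA]

print(f"Z threshold: {z_crit:.2f}, chi-square threshold: {chi2_crit:.2f}")
print("Loci with evidence of mis-specification:", flagged)
```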
The Z statistics for the African-specific and the European-specific allele frequencies are correlated with each other, for reasons that are explained in the Discussion. Examination of the Z statistics in Table 4 suggests that where the ancestry-specific allele frequencies are mis-specified, this mis-specification is fairly consistent across the African-American populations studied. Thus at the AT3 locus, the Z statistics are negative in all four populations, even though their deviation from zero is statistically significant only in the New York sample. Similarly, at the GC locus the Z statistics are positive in all four populations studied, even though their deviation from zero is statistically significant only in the Philadelphia sample.

In this analysis we are able to estimate the distribution of individual admixture within the African-American populations studied, even though we cannot estimate admixture accurately for any single individual. In most of these populations admixture varies only over a narrow range, and fewer than 10% of individuals who identify themselves as African-American have more than 50% European admixture. This limits the statistical power of epidemiological studies designed to detect relationships between disease risk and individual admixture in African-Americans. To overcome this limitation, one could identify other populations in which admixture varies over a wider range, or use more markers so that individual admixture can be estimated more accurately. We can estimate (using the large-sample variance of the maximum-likelihood estimator) that at least 40 unlinked markers with average f-value of 0.4 would be required to estimate individual admixture with a standard error of no more than 0.1. Even more markers would be required if there is association of ancestry (independent of parental admixture) between some of the markers, as there will be if the markers are linked and admixture has been recent.
Our approach to the detection of linkage in admixed populations differs fundamentally from previous approaches that rely on detection of allelic association (linkage disequilibrium) (Chakraborty & Weiss, 1988; Briscoe et al. 1994; Kaplan et al. 1998; Zheng & Elston, 1999). Instead of testing for allelic association, we use marker genotype data to extract information about states of ancestry on chromosomes of mixed descent, then test for association with ancestry at the marker locus. The association between states of ancestry at two linked loci in an admixed population depends only upon the history of admixture and the map distance between the loci; it does not, of course, depend upon what markers are typed.

The analysis demonstrates with real data a result derived previously (McKeigue, 1998): if we condition on parental admixture, a test for association with ancestry at a marker locus is a specific test for linkage. The score test detects association between the two linked loci – FY and AT3 – but not between unlinked loci. In a previous analysis of this dataset, a conventional test for linkage disequilibrium detected allelic association between two unlinked loci – FY and OCA2 – in the New Orleans sample at a p-value of less than 1% (Parra et al. 1998). This association between unlinked loci is eliminated by conditioning on parental admixture, as theory predicts (McKeigue, 1998).

The analysis extracts less than 20% of the information about the odds ratio for association of ancestry that we would have if the complete data (ancestry at both loci on each gamete) were available. This is because without typing information on other family members we have no information on phase, and because no other loci adjacent to the AT3 or the OCA2 locus have been typed. By typing other family members, and by typing more markers in the region under study, the proportion of information extracted could be increased as desired.
The strength of the association of ancestry (measured as the log odds ratio) between two linked loci separated by a given map distance, conditioned on parental admixture, is a measure of how recently admixture has occurred. For most of the African-American populations in this study, estimates of the log odds ratio for association of ancestry between loci FY and AT3 are in the range of 1 to 2, although the posterior intervals are wide. Using the equations given in the Appendix, we can calculate that log odds ratios between 1 and 2 in African-American populations where proportionate European admixture is three-sixteenths (18.75%) are consistent with European gene flow between five and nine generations ago. If most European gene flow into these African-American populations had occurred more recently than this, the log odds ratios would be in the range of 2 to 3.

If we assume exponential distributions for the lengths of chromosomal segments that are of African and European ancestry, we can calculate the density of transitions of ancestry on chromosomes of mixed descent – the ancestry crossover rate – from the log odds ratio for loci FY and AT3, the map distance between these loci and the mean admixture, as described in the Appendix. These calculations show that log odds ratios between 1 and 2 for loci FY and AT3 imply ancestry crossover rates between 3.9 and 2.2 per Morgan. These estimates should be considered to be only approximate, as the assumption of exponentially-distributed segment lengths does not in general hold in admixed populations (McKeigue, 1998).

The ancestry crossover rate has practical implications for studies that exploit admixture to map genes. For an initial genome search, the optimal strategy is to choose populations where admixture is recent (crossover rate less than 3 per Morgan), thus keeping down the number of markers required and the number of independent hypotheses to be tested.
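The link between admixture history, the log odds ratio and the ancestry crossover rate can be sketched numerically from the Appendix formulae. The following is a minimal sketch, not the authors' code: the function names are hypothetical, the use of Haldane's map function to convert 22 cM into a recombination fraction is an assumption, and the admixture history is simplified to a single pulse followed by random mating in which every parent is taken to have the population-average admixture.

```python
import math

def haldane_theta(x_morgans):
    # Recombination fraction from map distance; Haldane's map function is an
    # assumption of this sketch, not something stated in the text.
    return 0.5 * (1.0 - math.exp(-2.0 * x_morgans))

def log_odds_ratio(M, p11):
    # Equation (2) gives p01 from p11 and M; equation (3) gives psi.
    p01 = M * (1.0 - p11) / (1.0 - M)
    psi = p11 * (1.0 - p01) / (p01 * (1.0 - p11))
    return math.log(psi)

def log_psi_after_generations(M, theta, n):
    # Hypothetical simplified history: one pulse of admixture n >= 1
    # generations ago, then random mating with every parent at the population
    # average (M1 = M2 = M, P1 = P2 = p11), so the Appendix recursion reduces
    # to p11' = (1 - theta) * p11 + theta * M, starting from p11 = 1.
    p11 = 1.0
    for _ in range(n):
        p11 = (1.0 - theta) * p11 + theta * M
    return log_odds_ratio(M, p11)

def log_psi_from_crossover_rate(M, x_morgans, rate):
    # Exponential segment lengths: lambda + mu = 2 * rate and
    # p11 = M + (1 - M) * exp(-(lambda + mu) * x).
    p11 = M + (1.0 - M) * math.exp(-2.0 * rate * x_morgans)
    return log_odds_ratio(M, p11)

def crossover_rate_from_log_psi(M, x_morgans, log_psi, lo=0.01, hi=50.0):
    # Bisection: the log odds ratio falls monotonically as the rate rises.
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if log_psi_from_crossover_rate(M, x_morgans, mid) > log_psi:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

if __name__ == "__main__":
    M, x = 3.0 / 16.0, 0.22  # admixture and FY-AT3 map distance from the text
    theta = haldane_theta(x)
    for n in (5, 9):
        print(n, round(log_psi_after_generations(M, theta, n), 2))
    for target in (1.0, 2.0):
        print(target, round(crossover_rate_from_log_psi(M, x, target), 1))
```

Under these assumptions, five and nine generations give log odds ratios close to 2 and 1 respectively, and log odds ratios of 1 and 2 correspond to crossover rates close to 3.9 and 2.2 per Morgan, in line with the figures quoted above.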
For finer mapping of a trait locus identified in such a study, one would try to identify populations where admixture has occurred less recently (crossover rates greater than 4 per Morgan).

The ‘affecteds-only’ analysis mimics the approach that we have suggested for mapping genes underlying ethnic differences in disease risk (McKeigue, 1998). This analysis detects linkage in all populations combined, and in New York separately with a sample of 121 ‘affected’ individuals. We have estimated previously that in a population where proportionate admixture from the high-risk population is between 0.3 and 0.7, a sample of about 130 affected individuals is adequate to detect linkage at p < 0.001 with a locus that accounts for a population risk ratio of 3. As the FY locus is fully informative for ancestry, the main limitations on our ability to detect linkage in this analysis are that the proportion of European admixture in these populations is low and the marker locus is 22 cM from the trait locus. In practice, admixture mapping would use more closely-spaced markers (McKeigue, 1998).

Obtaining the posterior distribution of ancestry at marker loci given marker genotype data depends upon correctly specifying the ancestry-specific allele frequencies in the admixed population under study. As long as these frequencies are correctly specified, the analysis does not depend upon any assumptions about population history as we are simply applying Bayes’s theorem to invert conditional probabilities. The ability to test for mis-specification of the ancestry-specific allele frequencies, and to re-estimate the correct frequencies where necessary, is thus crucial to the application of admixture mapping in practice.
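The inversion of conditional probabilities by Bayes’s theorem can be illustrated for a single allele. This is a minimal sketch under stated assumptions: the function name is hypothetical, the prior probability of European ancestry is taken to be a known admixture proportion M (three-sixteenths is used purely as an illustration), and the tabulated pAfr and pEur values for FY and AT3 are assumed to be frequencies of allele 1.

```python
def posterior_european(M, p_afr, p_eur, allele_is_1):
    """Posterior probability that one allele has European ancestry.

    M is the prior probability of European ancestry; p_afr and p_eur are the
    assumed frequencies of allele 1 given African and European ancestry.
    """
    like_eur = p_eur if allele_is_1 else 1.0 - p_eur
    like_afr = p_afr if allele_is_1 else 1.0 - p_afr
    numerator = M * like_eur
    # Bayes's theorem: prior times likelihood, normalised over both ancestries.
    return numerator / (numerator + (1.0 - M) * like_afr)

if __name__ == "__main__":
    M = 3.0 / 16.0
    # FY (pAfr = 0, pEur = 1): the allelic state determines ancestry exactly.
    print(posterior_european(M, 0.0, 1.0, True), posterior_european(M, 0.0, 1.0, False))
    # AT3 (pAfr = 0.874, pEur = 0.279): the allelic state only shifts the odds.
    print(posterior_european(M, 0.874, 0.279, True), posterior_european(M, 0.874, 0.279, False))
```

The FY case shows why a fully informative marker leaves no uncertainty about ancestry, while at AT3 the posterior depends directly on the specified frequencies, so any error in them carries through to the inferred ancestry.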
The score test that we have derived detects mis-specification of ancestry-specific allele frequencies at three of the ten marker loci, but does not clearly distinguish whether it is the African-specific frequencies, the European-specific frequencies, or both, that are mis-specified. At each locus, the scores for the African-specific and the European-specific frequencies are highly correlated. This is because with only 10 markers we cannot accurately estimate the admixture of each individual, even though we can accurately estimate the average admixture of the population.

One way to convey this argument is to think of the score test conditional on parental admixture as comparing at each locus in each individual the observed allele frequency with the expected frequency given the admixture of the individual’s parents. The expected frequency is a weighted average of the African-specific and European-specific allele frequencies, with weights equal to the admixture of the individual’s parents. If we have sufficient information to estimate accurately the average admixture of the population, we can estimate accurately the expected frequency of each allele in the total sample. This allows us to detect a difference between observed and expected frequencies, but does not tell us whether it is the European-specific or the African-specific allele frequencies that are mis-specified. If individual admixture varies over a wide range in the sample, and enough markers have been typed for individual admixture to be estimated accurately, the score test will be able to distinguish which ancestry-specific allele frequencies are mis-specified. If, for instance, only the European-specific allele frequencies are mis-specified, the difference between observed and expected frequencies will be largest in those individuals who have the highest proportion of European admixture.
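The weighted-average argument above can be made concrete with a small numerical sketch. The specified frequencies are the AT3 values from the table; the two "true" states are invented purely for illustration, and the function names are hypothetical.

```python
def expected_allele_freq(m, p_afr, p_eur):
    # Weighted average of the ancestry-specific frequencies; m is the
    # European admixture proportion of the individual's parents.
    return (1.0 - m) * p_afr + m * p_eur

# Frequencies specified in the analysis (AT3 values from the table).
SPEC_AFR, SPEC_EUR = 0.874, 0.279

# Two hypothetical true states, invented for illustration only.
TRUE_A = (0.824, 0.279)  # only the African-specific frequency is off
TRUE_B = (0.874, 0.079)  # only the European-specific frequency is off

def discrepancy(m, true_freqs):
    # Observed-minus-expected allele frequency at admixture m.
    observed = expected_allele_freq(m, *true_freqs)
    expected = expected_allele_freq(m, SPEC_AFR, SPEC_EUR)
    return observed - expected

if __name__ == "__main__":
    for m in (0.2, 0.4, 0.6):
        print(m, round(discrepancy(m, TRUE_A), 3), round(discrepancy(m, TRUE_B), 3))
```

At m = 0.2 the two scenarios produce the same discrepancy, so a sample in which everyone has admixture near 0.2 cannot tell them apart. At m = 0.6 the discrepancy is much larger when the European-specific frequency is the one in error, which is the contrast that a sample with widely varying admixture would exploit.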
In a full multipoint analysis, additional information about the ancestry of the alleles at each locus would be available from modelling the stochastic variation of ancestry on chromosomes of mixed descent and combining information from all markers to estimate ancestry at each locus (McKeigue, 1998). We could then use a score test conditional on ancestry at each locus, rather than simply conditioning on parental admixture as in this study.

In principle it would be straightforward to re-estimate the ancestry-specific allele frequencies within an admixed population under study using a Bayesian approach. If we assign prior distributions to these parameters that reflect our uncertainty about their values, the Gibbs sampler will estimate their posterior means. We have not attempted to re-estimate the ancestry-specific allele frequencies in this dataset, because with only 10 markers we do not have enough information for these re-estimates to be accurate. From a practical point of view, we note that in a previous analysis of this dataset estimates of population admixture obtained from single loci did not differ significantly (Parra et al. 1998). The effect of any mis-specification of ancestry-specific allele frequencies upon estimates of overall admixture is therefore likely to be small. Mis-specification of allele frequencies at the OCA2 locus could affect the validity of the test for linkage between FY and OCA2, but there is no evidence in Table 2 of a false-positive result (positive association between ancestry at the OCA2 locus and the FY locus, conditional on parental admixture) in the summary score test.

These preliminary analyses suggest that the ancestry-specific allele frequencies are likely to be fairly consistent across different African-American subpopulations.
At the three loci where a summary chi-square test for all four populations combined indicates that ancestry-specific allele frequencies are mis-specified, the deviations of the Z statistics from zero are generally in the same direction in each subpopulation. This suggests that even though diverse west African subpopulations contributed genes to the modern African-American populations, within the pool of alleles that are of African ancestry the allele frequencies do not vary much between regions of the US. Studies with larger samples and more markers are required to resolve this.

This study demonstrates the application of some of the statistical techniques that we have proposed for mapping genes underlying ethnic differences in disease risk (McKeigue, 1998). We have shown that the posterior distribution of parental admixture and ancestry at each marker locus, conditional on the observed marker genotype data, can be generated by Markov chain simulation. We have shown that score tests based on the missing-data likelihood can be applied to detect linkage and to detect mis-specification of ancestry-specific allele frequencies within the admixed population under study. We have shown, using the genotype at the AT3 locus as if it were a binary trait determined by an unknown gene, that a simple ‘affecteds-only’ score test can detect linkage of a marker locus at a distance of 22 cM from a trait locus.

To extend the methods described in this paper to a search of the genome for genes underlying ethnic differences in disease risk would require more markers, a more complicated statistical model, and far more computing power, but would introduce no fundamentally new principles beyond those demonstrated in this paper. Simulations suggest that about 1000 biallelic markers with average f-values of 0.4 are required to extract 80% of information about ancestry in an initial genome search in a population where the ancestry crossover rate is 2 per Morgan (McKeigue, 1998).
At the current rate of progress in identification and typing of single nucleotide polymorphisms (SNPs), identification of such marker sets will soon be feasible even if it is necessary to screen a library of 50 000 or more SNPs to select those that have high f-values. For a full multipoint analysis that uses data on affected individuals and their relatives, it will be necessary to model the stochastic variation of inheritance (maternally-derived or paternally-derived allele transmitted) and ancestry on chromosomes of mixed descent. The approach can be extended to deal with admixture between three or more founding populations, or to handle multi-allelic markers.

In conclusion, these results provide further evidence that admixture mapping is a feasible approach to mapping genes for complex traits. As it is now clear that conventional allele-sharing designs lack adequate statistical power to map genes for complex traits with realistic sample sizes (Risch & Merikangas, 1996), such novel methodological approaches may offer the best chance for finding human genes that influence the risks of conditions such as diabetes, hypertension, obesity and autoimmune disease.

We thank David Clayton for suggesting the use of a score test based on the missing-data likelihood.

APPENDIX

Testing for linkage between two marker loci by testing for association between the ancestry of alleles conditional on parental admixture

At each locus on each gamete there are two possible states of ancestry: African (population X) or European (population Y). The null hypothesis is that, conditional on parental admixture, the odds ratio for the association between states of ancestry at any two loci on the same gamete is 1. For two loci A and B on a single gamete, we write $p_{ij}$ for the probability of ancestry state j at locus B given ancestry state i at locus A, where i and j have value 0 for African ancestry and 1 for European ancestry.
There are three possible outcomes – 0, 1 or 2 alleles with European ancestry – for which the likelihoods are

\[ \Pi_0 = (1-M)(1-p_{01}), \qquad \Pi_1 = 2(1-M)\,p_{01}, \qquad \Pi_2 = M\,p_{11}, \]

where $\Pi_i$ is the probability that the gamete has (at loci A and B combined) a total of i alleles that have European ancestry, and M is the proportion of the parent’s genome that is of European ancestry. We can write the conditional probabilities of each ancestry state at locus B given the state at locus A as a matrix of transition probabilities in which the rows and columns represent states of ancestry at loci A and B respectively:

\[ \begin{pmatrix} 1-p_{01} & p_{01} \\ 1-p_{11} & p_{11} \end{pmatrix} \tag{1} \]

As the probabilities of European ancestry at locus A and locus B are both equal to M, we have

\[ (1-M)\,p_{01} + M\,p_{11} = M. \tag{2} \]

We can write the odds ratio in terms of the transition probabilities as

\[ \psi = \frac{p_{11}\,(1-p_{01})}{p_{01}\,(1-p_{11})}. \tag{3} \]

From these equations we can obtain the likelihood of each of the three possible outcomes in terms of M and ψ:

\[ \Pi_0 = \begin{cases} 1 - M - \dfrac{1}{2}\,\dfrac{\sqrt{1+4M(1-M)(\psi-1)}-1}{\psi-1} & \text{for } \psi \neq 1, \\[2ex] (1-M)^2 & \text{for } \psi = 1, \end{cases} \]

\[ \Pi_1 = \begin{cases} \dfrac{\sqrt{1+4M(1-M)(\psi-1)}-1}{\psi-1} & \text{for } \psi \neq 1, \\[2ex] 2M(1-M) & \text{for } \psi = 1, \end{cases} \]

\[ \Pi_2 = \begin{cases} M - \dfrac{1}{2}\,\dfrac{\sqrt{1+4M(1-M)(\psi-1)}-1}{\psi-1} & \text{for } \psi \neq 1, \\[2ex] M^2 & \text{for } \psi = 1. \end{cases} \]

The first and second derivatives with respect to ψ of

\[ \frac{\sqrt{1+4M(1-M)(\psi-1)}-1}{\psi-1} \]

at ψ = 1 (evaluated as the limit from above) are $-4M^2(1-M)^2$ and $4M^3(1-M)^3$. At ψ = 1, the scores corresponding to these three possible realized states are therefore $2M^2$, $-2M(1-M)$ and $2(1-M)^2$. The corresponding values of the information are $2M^3(1+M)$, $2M^2(1-M)^2$, and $2(1-M)^3(2-M)$. For each realization of the posterior distribution, the score and information are summed over all gametes in the dataset.
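As a numerical check on the expressions for the three outcome probabilities in terms of M and ψ, the following minimal sketch evaluates them and confirms that they sum to 1 and reproduce the odds ratio they were built from. The function names are hypothetical, and only the probabilities themselves are computed, not the scores or information.

```python
import math

def gamete_probs(M, psi):
    """Return (Pi0, Pi1, Pi2): probabilities of 0, 1 or 2 European alleles."""
    if abs(psi - 1.0) < 1e-12:
        pi1 = 2.0 * M * (1.0 - M)  # limiting value at psi = 1
    else:
        root = math.sqrt(1.0 + 4.0 * M * (1.0 - M) * (psi - 1.0))
        pi1 = (root - 1.0) / (psi - 1.0)
    return (1.0 - M - 0.5 * pi1, pi1, M - 0.5 * pi1)

def implied_odds_ratio(M, psi):
    # Recover psi from the joint probabilities: each discordant gamete type
    # (African at one locus, European at the other) has probability Pi1 / 2.
    pi0, pi1, pi2 = gamete_probs(M, psi)
    discordant = 0.5 * pi1
    return pi0 * pi2 / (discordant * discordant)

if __name__ == "__main__":
    M = 3.0 / 16.0  # illustrative admixture proportion
    for psi in (0.5, 1.0, 3.0, 7.4):
        probs = gamete_probs(M, psi)
        print(psi, [round(p, 4) for p in probs], round(implied_odds_ratio(M, psi), 4))
```

Setting ψ = 1 returns the independence values $(1-M)^2$, $2M(1-M)$ and $M^2$, as the piecewise definitions require.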
Relationships of crossover rate and odds ratio for association of ancestry between linked loci to history of admixture

If parents 1 and 2 have admixture $M_1$ and $M_2$ and produce gametes on which the probabilities of European ancestry at locus B given European ancestry at locus A (equivalent to coordinate $p_{11}$ in the transition matrix (1) above) are $P_1$ and $P_2$, the expected admixture of their offspring is $\tfrac{1}{2}(M_1+M_2)$ and these offspring will produce gametes on which the probability of European ancestry at locus B given European ancestry at locus A is

\[ \frac{(1-\theta)(M_1 P_1 + M_2 P_2) + 2\theta M_1 M_2}{M_1 + M_2}, \]

where θ is the recombination fraction between A and B. Using this equation repeatedly, it is possible to calculate for any given history of admixture the transition probability $p_{11}$ in the transition matrix (1) above. From $p_{11}$, the other coordinates of the transition matrix and the odds ratio ψ can be calculated using equations (2) and (3) above.

If the lengths of chromosomal segments of African and European ancestry are distributed exponentially with parameters µ and λ respectively on a gamete with admixture M, the transition matrix for two loci separated by map distance x is

\[ \frac{1}{\lambda+\mu} \begin{pmatrix} \lambda + \mu e^{-(\lambda+\mu)x} & \mu - \mu e^{-(\lambda+\mu)x} \\ \lambda - \lambda e^{-(\lambda+\mu)x} & \mu + \lambda e^{-(\lambda+\mu)x} \end{pmatrix} \]

and $M = \mu/(\lambda+\mu)$. The parameters λ and µ can thus be calculated from x, ψ and M. The crossover rate is $\tfrac{1}{2}(\lambda+\mu)$.

Affecteds-only test for linkage of a disease or binary trait with a marker locus

The null hypothesis is that the population risk ratio r that the locus accounts for is 1. Suppose that we sample only affected individuals from the population under study.
We have shown previously (McKeigue, 1998) that if a multiplicative model for penetrance applies, the probability $\Pi_i$ that an affected individual has i alleles that have ancestry from the high-risk population (Europeans in this example where the trait is defined by the presence of allele 1 at the AT3 locus), conditional on parental admixture (which determines the probabilities $P_0$, $P_1$, $P_2$) is:

\[ \Pi_0 = \frac{P_0}{P_0 + P_1\sqrt{r} + P_2 r}, \qquad \Pi_1 = \frac{P_1\sqrt{r}}{P_0 + P_1\sqrt{r} + P_2 r}, \qquad \Pi_2 = \frac{P_2 r}{P_0 + P_1\sqrt{r} + P_2 r}. \]

We write d for the realized proportion of alleles at the marker locus that have European ancestry in the affected individual. The log likelihood for this realization, given that the individual is affected, is

\[ d \log r - \log\left(P_0 + P_1\sqrt{r} + P_2 r\right) \]

and the score function is

\[ \frac{d}{r} - \frac{\tfrac{1}{2} P_1 r^{-1/2} + P_2}{P_0 + P_1 r^{1/2} + P_2 r}. \]

At r = 1, the score is simply the realized minus the expected proportion of alleles that have European ancestry:

\[ U = d - \left(\tfrac{1}{2} P_1 + P_2\right) \]

and the information is

\[ d - \tfrac{1}{4} P_1 - \left(\tfrac{1}{2} P_1 + P_2\right)^2. \]

For each realization, these expressions are summed over all individuals in the dataset.

Testing for mis-specification of ancestry-specific allele frequencies

We write $p_X$ for the frequency of allele 1 given African ancestry, $p_Y$ for the frequency of the allele given European ancestry, and $P_0$, $P_1$, $P_2$ for the probabilities that the parents of the individual transmit 0, 1, 2 alleles of European ancestry. These probabilities are functions of parental admixture. The probabilities $\Pi_i$ of observing the genotype with i copies of allele 1 in an individual, conditional on parental admixture, are as below, where $q_X = 1 - p_X$ and $q_Y = 1 - p_Y$:

\[ \Pi_0 = q_X^2 P_0 + q_X q_Y P_1 + q_Y^2 P_2 \]
\[ \Pi_1 = 2 p_X q_X P_0 + (p_X q_Y + q_X p_Y) P_1 + 2 p_Y q_Y P_2 \]
\[ \Pi_2 = p_X^2 P_0 + p_X p_Y P_1 + p_Y^2 P_2. \]

Differentiating the logarithms of these three expressions with respect to $p_X$ and $p_Y$, we obtain expressions for the score vector for the three possible observed genotypes:

\[ U_0 = \begin{bmatrix} \dfrac{-2 q_X P_0 - q_Y P_1}{\Pi_0} & \dfrac{-q_X P_1 - 2 q_Y P_2}{\Pi_0} \end{bmatrix} = \begin{bmatrix} u_{0X} & u_{0Y} \end{bmatrix} \]

\[ U_1 = \begin{bmatrix} \dfrac{2(q_X - p_X) P_0 + (q_Y - p_Y) P_1}{\Pi_1} & \dfrac{(q_X - p_X) P_1 + 2(q_Y - p_Y) P_2}{\Pi_1} \end{bmatrix} = \begin{bmatrix} u_{1X} & u_{1Y} \end{bmatrix} \]

\[ U_2 = \begin{bmatrix} \dfrac{2 p_X P_0 + p_Y P_1}{\Pi_2} & \dfrac{p_X P_1 + 2 p_Y P_2}{\Pi_2} \end{bmatrix} = \begin{bmatrix} u_{2X} & u_{2Y} \end{bmatrix}. \]

Differentiating again with respect to $p_X$, $p_Y$ and multiplying by minus 1 yields the corresponding expressions for the (symmetric) information matrix:

\[ I_0 = \begin{pmatrix} u_{0X}^2 - 2\Pi_0^{-1} P_0 & u_{0X} u_{0Y} - \Pi_0^{-1} P_1 \\ u_{0X} u_{0Y} - \Pi_0^{-1} P_1 & u_{0Y}^2 - 2\Pi_0^{-1} P_2 \end{pmatrix} \]

\[ I_1 = \begin{pmatrix} u_{1X}^2 + 4\Pi_1^{-1} P_0 & u_{1X} u_{1Y} + 2\Pi_1^{-1} P_1 \\ u_{1X} u_{1Y} + 2\Pi_1^{-1} P_1 & u_{1Y}^2 + 4\Pi_1^{-1} P_2 \end{pmatrix} \]

\[ I_2 = \begin{pmatrix} u_{2X}^2 - 2\Pi_2^{-1} P_0 & u_{2X} u_{2Y} - \Pi_2^{-1} P_1 \\ u_{2X} u_{2Y} - \Pi_2^{-1} P_1 & u_{2Y}^2 - 2\Pi_2^{-1} P_2 \end{pmatrix} \]

For each realization, the score and information are summed over all individuals in the dataset (excluding those with missing genotypes at the locus under study).

REFERENCES

Briscoe, D., Stephens, J. C. & O’Brien, S. J. (1994). Linkage disequilibrium in admixed populations: applications in gene mapping. J. Hered. 85, 59–63.
Cavalli-Sforza, L. L. & Bodmer, W. F. (1971). The genetics of human populations, San Francisco: Freeman.
Chakraborty, R. (1975). Estimation of race admixture – a new method. Am. J. Phys. Anthropol. 42, 507–511.
Chakraborty, R., Ferrell, R. E., Stern, M. P., Haffner, S. M., Hazuda, H. P. & Rosenthal, M. (1986). Relationship of prevalence of non-insulin-dependent diabetes mellitus to Amerindian admixture in the Mexican Americans of San Antonio, Texas. Genet. Epidemiol. 3, 435–454.
Chakraborty, R. & Weiss, K. M. (1986). Frequencies of complex diseases in hybrid populations. Am. J. Phys. Anthropol. 70, 489–503.
Chakraborty, R. & Weiss, K. M. (1988). Admixture as a tool for finding linked genes and detecting that difference from allelic association between loci. Proc. Natl. Acad. Sci. USA 85, 9119–9123.
Elston, R. C. (1971). The estimation of admixture in racial hybrids. Ann. Hum. Genet. 35, 9–17.
Gelman, A., Carlin, J. B., Stern, H. S. & Rubin, D. B. (1995). Bayesian data analysis, London: Chapman & Hall.
Gelman, A. & Rubin, D. B. (1992). Inference from iterative simulation using multiple sequences. Statistical Science 7, 457–511.
Kaplan, N.
L., Martin, E. R., Morris, R. W. & Weir, B. S. (1998). Marker selection for the transmission/disequilibrium test in recently admixed populations. Am. J. Hum. Genet. 62, 703–712.
Little, R. J. A. & Rubin, D. B. (1987). Statistical analysis with missing data, New York: Wiley.
Long, J. C. & Smouse, P. E. (1983). Intertribal gene flow between the Yecuana and Yanomama: genetic analysis of an admixed village. Am. J. Phys. Anthropol. 61, 411–422.
Louis, T. A. (1982). Finding the observed information matrix when using the EM algorithm. J. R. Statistical Soc. Series B 44, 226–232.
Maclean, C. J., Adams, M. S., Leyshon, W. C., Workman, P. L., Reed, T. E., Gershowitz, H. & Weitkamp, L. R. (1974). Genetic studies on hybrid populations. III. Blood pressure in an American Black community. Am. J. Hum. Genet. 26, 614–626.
Maclean, C. J. & Workman, P. L. (1973). Genetic studies on hybrid populations. I. Individual estimates of ancestry and their relation to quantitative traits. Ann. Hum. Genet. 36, 341–351.
McKeigue, P. M. (1998). Mapping genes that underlie ethnic differences in disease risk: methods for detecting linkage in admixed populations by conditioning on parental admixture. Am. J. Hum. Genet. 63, 241–251.
Parra, E. J., Marcini, A., Akey, J., Martinson, J., Batzer, M. A., Cooper, R., Forrester, T., Allison, D. B., Deka, R., Ferrell, R. E. & Shriver, M. D. (1998). Estimating African-American admixture proportions by use of population-specific alleles. Am. J. Hum. Genet. 63, 1839–1851.
Reed, T. E. (1971). The population variance of the proportion of genetic admixture in human intergroup hybrids. Proc. Natl. Acad. Sci. USA 68, 3168–3169.
Reed, T. E. (1973). Number of gene loci required for accurate estimation of ancestral population proportions in individual human hybrids. Nature 244, 575–576.
Risch, N. & Merikangas, K. (1996). The future of genetic studies of complex human diseases. Science 273, 1516–1517.
Spiegelhalter, D. J., Thomas, A., Best, N. G. & Gilks, W. R. (1999). BUGS: Bayesian inference using Gibbs sampling. WinBUGS version 1.2, Cambridge: Medical Research Council Biostatistics Unit.
Stephens, J. C., Briscoe, D. & O’Brien, S. J. (1994). Mapping by admixture linkage disequilibrium in human populations: limits and guidelines. Am. J. Hum. Genet. 55, 809–824.
Zheng, C. & Elston, R. C. (1999). Multipoint linkage disequilibrium mapping with particular reference to the African-American population. Genet. Epidemiol. 17, 79–101.