Atlas of Genetics and Cytogenetics
in Oncology and Haematology
OPEN ACCESS JOURNAL AT INIST-CNRS
Educational Items Section
Genetic Linkage Analysis
Françoise Clerget-Darpoux
Unité de Recherche d'Epidémiologie Génétique, INSERM U535, Kremlin-Bicêtre, France (FCD)
Published in Atlas Database: May 2002
Online updated version: http://AtlasGeneticsOncology.org/Educ/LinkageLongID30031EL.html
DOI: 10.4267/2042/37914
This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 2.0 France Licence.
© 2002 Atlas of Genetics and Cytogenetics in Oncology and Haematology
I- Genetic linkage analysis
I-1. Recombination fraction
I- 2. Definition of the "lod score" of a family
I- 3. Test for linkage
I- 4. Estimation of the recombination fraction
I- 5. Recombination fraction for a disease locus and a marker locus
I- 6. Linkage analysis for three loci : the phenomenon of interference
I- 7. References
II- Genetic heterogeneity of localization
II- 1. The "Predivided sample test"
II- 2. The "Admixture Test"
II- 3. Generalization of the "admixture test"
II- 4. References
III- Statistical properties of the method of lod scores
III- 1. The test procedure
III- 1.1. Impact of non-sequentiality
III- 1.2. Maximization of the lod score over the [0, 1/2] interval
III- 1.3 References
III-2. Genotype information
III-2.1. Ambiguity in phenotype-genotype relationships at the disease locus
III-2.2. Ambiguity in the marker genotype
III-2.3. Gamete disequilibrium between alleles at the disease locus and at the marker locus
III-3. The problem of multiple tests
III-4. References
Atlas Genet Cytogenet Oncol Haematol. 2002; 6(4)
I- Genetic linkage analysis
Investigating the linked segregation of genes situated at different loci is a way of testing the independence of their transmission. This concept of independence is reflected in the recombination fraction, θ, which is the proportion of the gametes transmitted by the parents that are recombined. If the loci are transmitted independently, there will be as many recombined gametes as parental gametes, and so θ = 1/2. If they are not transmitted independently, then the parental gametes are transmitted preferentially to the recombined gametes, and 0 ≤ θ < 1/2. In this case, there is said to be "linkage" between the two loci.
I-1. Recombination fraction
Let us consider the case of two loci, A and B, with two codominant alleles at each of these loci, A1, A2 and B1, B2 respectively, and an individual who is heterozygous at both loci. Such an individual can produce four types of gamete:
A1B1
A2B1
A1B2
A2B2
Two situations are possible:
1- The loci A and B are on different chromosome pairs
Figure 1
In this case, the four gametes all have the same probability: 1/4.
2- The loci A and B are on the same chromosome pair
Here we have to distinguish between two possible situations: the alleles A1 and B1 may be on the same chromosome within the pair, in which case A1 and B1 are said to be "coupled"; or they may be on different chromosomes, in which case A1 and B1 are said to be in a state of "repulsion".
Figure 2
For instance, let us suppose that A1 and B1 are "coupled". Four types of gametes are still produced.
Figure 3
Gametes A1B1 and A2B2 are said to be "parental". In the offspring, as in the parents, A1 is "coupled" with B1 (and A2 is "coupled" with B2).
The gametes A1B2 and A2B1 are described as "recombined": an odd number of recombination or "crossing-over" events has occurred between the A and B loci.
The proportion of recombined gametes amongst the gametes transmitted is known as the "recombination fraction":
θ = number of recombined gametes / number of gametes transmitted
Assuming that the crossing-over events for a pair of chromosomes follow Poisson's law, and knowing that a parental gamete has zero or an even number of crossings-over, whereas a recombined gamete has an odd number, we can show that the frequency of recombined gametes is always equal to or lower than that of the parental gametes, and so 0 ≤ θ ≤ 1/2.
If θ = 1/2, then all the gamete types have the same probability and the alleles at the A and B loci are transmitted independently. Loci A and B are then said not to exhibit genetic linkage. This is the situation if A and B are on different pairs of chromosomes, and also if A and B are on the same pair, but at some distance from each other. However, if θ < 1/2, then the two loci are genetically linked. For a couple whose genotypes at A and B are known, the probability of observing the genotypes of the offspring depends on the value of θ. Let us assume the following crossing:
Figure 5
Therefore, such a couple can have 4 types of offspring:
Figure 6
Assuming that there is gametic equilibrium at the A and B loci, in parent (1) there is a probability of 1/2 that alleles A1 and B1 are coupled, and a probability of 1/2 that they are in repulsion.
(1) A1 and B1 are coupled: parent (1) provides each of the gametes A1B1 and A2B2 with probability (1-θ)/2, and each of the gametes A1B2 and A2B1 with probability θ/2. The probability that the couple will have a child of type (1), or of type (2), is (1-θ)/2, and that of their having a child of type (3), or of type (4), is θ/2.
The probability of finding n1 children of type (1), n2 of type (2), n3 of type (3) and n4 of type (4) is therefore
$[(1-\theta)/2]^{n_1+n_2} \times (\theta/2)^{n_3+n_4}$
(2) A1 and B1 are in repulsion: parent (1) provides each of the gametes A1B2 and A2B1 with probability (1-θ)/2, and each of the gametes A1B1 and A2B2 with probability θ/2.
The probability of the previous observation is therefore:
$(\theta/2)^{n_1+n_2} \times [(1-\theta)/2]^{n_3+n_4}$
So in the end, with no additional information about the phase of A1 and B1, and assuming that the alleles at the A and B loci are in gametic equilibrium, the probability of finding n1, n2, n3 and n4 children in categories (1), (2), (3), (4) is:
$p(n_1,n_2,n_3,n_4 \mid \theta) = \frac{1}{2}\{[(1-\theta)/2]^{n_1+n_2} (\theta/2)^{n_3+n_4} + (\theta/2)^{n_1+n_2} [(1-\theta)/2]^{n_3+n_4}\}$
So the likelihood of θ for an observation n1, n2, n3, n4 can be written:
$L(\theta \mid n_1,n_2,n_3,n_4) = \frac{1}{2}\{[(1-\theta)/2]^{n_1+n_2} (\theta/2)^{n_3+n_4} + (\theta/2)^{n_1+n_2} [(1-\theta)/2]^{n_3+n_4}\}$
Special case: number of children n = 1
Regardless of the category to which this child belongs:
L(θ) = 1/2 [(1-θ)/2] + 1/2 [θ/2] = 1/4
The likelihood of this observation for the family does not depend on θ. We can say that such a family is not informative for θ.
Informative families
An "informative family" is a family for which the likelihood is a non-constant function of θ. One essential condition for a family to be informative is, therefore, that it has more than one child. Furthermore, at least one of the parents must be heterozygous.
Definition: if one of the parents is doubly heterozygous and the other is:
- A double homozygote, we have a backcross
- A single homozygote, we have a simple backcross
- A double heterozygote, we have a double intercross
I- 2. Definition of the "lod score" of a family
Take a family in which we know the genotypes at the A and B loci of each of the members. Let L(θ) be the likelihood of a recombination fraction 0 ≤ θ < 1/2, and L(1/2) the likelihood of θ = 1/2, that is, of independent segregation at A and B.
The lod score of the family at θ is:
Z(θ) = log10 [L(θ)/L(1/2)]
Z can be taken to be a function of θ defined over the range [0, 1/2].
Lod score of a sample of families
The likelihood of a value of θ for a sample of independent families is the product of the likelihoods of each family, and so the lod score of the whole sample is the sum of the lod scores of each family.
Figure 7
I- 3. Test for linkage
Several methods have been proposed to detect linkage: "U scores", suggested by Bernstein in 1931; "the sib pair test", by Penrose in 1935; "likelihood ratios", by Haldane and Smith in 1947; and "the lod score method", proposed by Morton in 1955 (1). Morton's method is the one most commonly used at present.
The test procedure in the lod score method is sequential (Wald, 1947 (2)). Information, i.e. the number of families in the sample, is accumulated until it is possible to decide between the hypotheses H0 and H1:
H0: genetic independence, θ = 1/2
H1: linkage at θ1, 0 ≤ θ1 < 1/2
The lod score of the sample at θ1, Z(θ1) = log10 [L(θ1)/L(1/2)], indicates the relative probabilities of observing the sample under H1 and under H0. Thus, a lod score of 3 means that the probability of observing the sample is 1000 times greater under H1 than under H0 ("lod = logarithm of the odds").
The decision thresholds of the test are usually set at -2 and +3, so that if:
Z(θ1) ≥ 3, H0 is rejected, and linkage is accepted.
Z(θ1) ≤ -2, linkage at θ1 is rejected.
-2 < Z(θ1) < 3, it is impossible to decide between H0 and H1. It is necessary to go on accumulating information.
For the thresholds chosen, -2 and +3, we can show that:
The type I error, α < 10^-3
The type II error, β < 10^-2
The reliability, 1-ρ > 0.95 ∀ θ1
The power, P(θ) > 0.80 ∀ θ1 if the true value of θ < 0.10
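As an illustration of sections I-1 to I-3, the following minimal Python sketch evaluates the phase-unknown likelihood of a family with n1, n2, n3, n4 children, its lod score, the lod score of a sample (sum over families) and the -2 / +3 decision rule for a single value θ1. The function names and the example counts are hypothetical and chosen only for illustration.

```python
import math


def phase_unknown_likelihood(theta, n1, n2, n3, n4):
    """L(theta | n1, n2, n3, n4) for one family when the parental phase is unknown."""
    a = n1 + n2  # children of types (1) and (2)
    b = n3 + n4  # children of types (3) and (4)
    coupled = ((1 - theta) / 2) ** a * (theta / 2) ** b
    repulsion = (theta / 2) ** a * ((1 - theta) / 2) ** b
    return 0.5 * (coupled + repulsion)


def family_lod(theta, counts):
    """Z(theta) = log10[L(theta) / L(1/2)] for one family."""
    num = phase_unknown_likelihood(theta, *counts)
    den = phase_unknown_likelihood(0.5, *counts)
    return math.log10(num / den) if num > 0 else float("-inf")


def sample_lod(theta, families):
    """Lod score of independent families: the sum of the family lod scores."""
    return sum(family_lod(theta, counts) for counts in families)


def decide(z, lower=-2.0, upper=3.0):
    """Decision rule with the usual thresholds -2 and +3."""
    if z >= upper:
        return "linkage accepted"
    if z <= lower:
        return "linkage rejected at this theta"
    return "undecided"


# Hypothetical sample: three families, each with five children of types (1)/(2).
families = [(3, 2, 0, 0)] * 3
z = sample_lod(0.0, families)
print(round(z, 3), decide(z))
```

A family with a single child gives a lod score of 0 at every θ, which is the "non-informative" special case described in I-1.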
Details about the principle underlying the test are to be found in Wald (2), and the justification for the criteria -2 and +3 in Morton (1).
In fact, what is being tested is not a single value of θ1 relative to θ = 1/2, but a whole set of values between 0 and 1/2, with a step of variable size (0.01 or 0.05).
If there is a value of θ1 such that Z(θ1) ≥ 3, linkage is concluded to exist.
Figure 8
If there is a value of θ1 such that Z(θ1) ≤ -2, linkage is excluded for any θ ≤ θ1.
Figure 9
If ∀ θ, -2 < Z(θ) < 3, no conclusion can be drawn: the sample is not sufficiently informative.
Figure 10
The proposed test has the advantage of being very simple, and of providing protection against falsely concluding linkage. However, some criticisms can be levelled, not only against the criteria chosen (Chotai (3)), but also against the entire principle of using a sequential procedure (Smith (4)). The number of families typed is, indeed, rarely chosen in the light of the test results.
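The scan over a grid of θ values described above can be sketched as follows. To keep the example short, it assumes (this is not in the text) fully informative, phase-known meioses, for which k recombinants among n meioses give Z(θ) = k log10 θ + (n-k) log10(1-θ) + n log10 2. The function names and the data are hypothetical.

```python
import math


def lod_phase_known(theta, n, k):
    """Lod score for k recombinants among n phase-known meioses (illustrative assumption)."""
    if theta == 0.0:
        return n * math.log10(2) if k == 0 else float("-inf")
    return (k * math.log10(theta)
            + (n - k) * math.log10(1 - theta)
            + n * math.log10(2))


def scan_theta(n, k, step=0.01):
    """Evaluate Z on a grid over [0, 1/2] and apply the +3 / -2 criteria."""
    steps = int(round(0.5 / step))
    grid = [round(i * step, 10) for i in range(steps + 1)]
    curve = [(theta, lod_phase_known(theta, n, k)) for theta in grid]
    theta_best, z_max = max(curve, key=lambda point: point[1])
    excluded = [theta for theta, z in curve if z <= -2]
    return {
        "theta_at_max": theta_best,
        "z_max": z_max,
        "linkage": z_max >= 3,
        # Largest grid value with Z <= -2: linkage is excluded up to this theta.
        "excluded_up_to": max(excluded) if excluded else None,
    }


# Hypothetical data: 2 recombinants among 20 informative meioses.
print(scan_theta(20, 2))
```

With these counts the curve crosses +3 near θ = 0.1 while θ = 0 is excluded, since a single observed recombinant is incompatible with θ = 0.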
I- 4. Estimation of the recombination fraction
If the test, on a sample of families, has demonstrated linkage between the A and B loci, then one may want to estimate the recombination fraction for these loci. The estimated value of θ is the value which maximizes the lod score function Z, and this is equivalent to taking the value of θ for which the probability of observing the sample is greatest.
I- 5. Recombination fraction for a disease locus and a marker locus
Let us assume we are dealing with a single-gene disease, determined by an allele g0 located at a locus G (g0: harmful allele, G0: normal allele). We would like to be able to situate locus G relative to a marker locus T, which is known to occupy a given position on the genome. To do this, we can use families with one or several affected individuals and in which the genotype of each member of the family is known with regard to the marker T. In order to be able to use the lod score method described above, what is needed is to be able to extrapolate from the phenotype of the individuals (affected, not affected) to their genotype at locus G (or their genotype probabilities at locus G).
Figure 11
What we need to know is:
1. the frequency q of the allele g0
2. the penetrance vector f1, f2, f3
f1 = P(affected | g0g0)
f2 = P(affected | g0G0)
f3 = P(affected | G0G0)
It will often happen that the information available for the marker is also not genotypic, but phenotypic in nature. Once again, all possible genotypes must be envisaged.
As a general rule, the information available about a family concerns the phenotype. To calculate the likelihood of θ, we must envisage all the possible genotype configurations at each of the loci for this family, writing the likelihood of θ for each configuration and weighting it by the probability of this configuration, given the phenotypes of the individuals at A and B.
Knowledge of the genetic parameters at each of the loci (gene frequency, penetrance values) is therefore necessary before we can estimate θ (Clerget-Darpoux et al (5)). Calculating lod scores, despite being simple in theory, is in practice a lengthy and tedious business. In 1955, Morton provided a set of tables giving the lod scores for various values of θ for a disease locus and a marker locus for nuclear families with sibship sizes of 2 to 7. However, the situations envisaged were very restrictive. In particular, it was assumed that the disease was determined by a rare, fully penetrant dominant or recessive gene.
"LIPED", written by Ott in 1974 (6), was the pioneering software in linkage analysis. It is able to carry out this calculation in an extended pedigree for any values of q, f1, f2, f3, and for penetrance as a function of age. The "Linkage" program of Lathrop et al, 1984 (7,8) is the one most often used for gene mapping. It can be used to carry out multipoint analysis.
All the software described so far is based on the same recursive algorithm (Elston and Stewart), which means that it can be used to investigate pedigrees of any size, but that it envisages all the possible haplotype combinations of markers, and is therefore limited by the number of markers to be taken into account. In contrast, "Genehunter" (9), which is based on a Markov chain principle, is limited not by the number of markers taken into consideration in the analysis, but by the size of the family structure. The very recently developed software package "Allegro" (10) can apply information from a large number of markers and extended family structures.
Analysis of genetic linkage has made it possible to construct a gene map by locating new polymorphisms relative to one another on the genome. The measurement used on the gene map is not the recombination fraction, which is not additive, but the genetic distance, which we will define below.
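The step from phenotype to genotype probabilities at the disease locus (section I-5) can be sketched for a single individual taken in isolation. The sketch uses the frequency q of g0 and the penetrance vector (f1, f2, f3) from the text; the Hardy-Weinberg genotype proportions are an added assumption, and the function name is hypothetical. In a real pedigree the genotypes of relatives would also enter the calculation.

```python
def genotype_posteriors(q, f1, f2, f3, affected=True):
    """P(genotype at G | phenotype) for one unrelated individual.

    q          : frequency of the harmful allele g0
    f1, f2, f3 : penetrances of g0g0, g0G0, G0G0
    Assumption : Hardy-Weinberg proportions for the prior genotype probabilities.
    """
    priors = {
        "g0g0": q * q,
        "g0G0": 2 * q * (1 - q),
        "G0G0": (1 - q) ** 2,
    }
    penetrance = {"g0g0": f1, "g0G0": f2, "G0G0": f3}
    # Probability of the observed phenotype under each genotype.
    phen = {g: (penetrance[g] if affected else 1 - penetrance[g]) for g in priors}
    joint = {g: priors[g] * phen[g] for g in priors}
    total = sum(joint.values())  # for affected=True this is the disease prevalence
    if total == 0:
        raise ValueError("phenotype has probability zero under this model")
    return {g: joint[g] / total for g in joint}


# Hypothetical rare, fully penetrant dominant disease without phenocopies.
print(genotype_posteriors(q=0.01, f1=1.0, f2=1.0, f3=0.0, affected=True))
```

In this hypothetical case an affected individual is almost certainly heterozygous, which is why a rare, fully penetrant dominant gene makes the phenotype-genotype relationship nearly unambiguous.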
I- 6. Linkage analysis for three loci: the phenomenon of interference
(see Bailey, 1961)
Now let us consider three loci A, B and C. Let the recombination fraction between A and B be θ1, that between B and C be θ2 and that between A and C be θ3.
Figure 12
Let us consider the double recombinant event, firstly between A and B, and secondly between B and C. Let R12 be the probability of this event. If the crossings-over occur independently in segments AB and BC, then:
R12 = θ1θ2
If this is not the case, an interference phenomenon is occurring and R12 = C θ1θ2 where C ≠ 1.
If C < 1 the interference is said to be positive, and crossings-over in segment AB inhibit those in segment BC.
If C > 1 the interference is said to be negative, and crossings-over in segment AB promote those in segment BC.
Let us consider the case of a triply heterozygous individual.
Figure 13
Such an individual can provide 8 types of gametes.
Figure 14
Figure 15
We can write that
θ3 = θ1 + θ2 - 2 R12
θ3 = θ1 + θ2 - 2 C θ1θ2
If C = 1, θ3 = θ1 + θ2 - 2θ1θ2
The recombination fraction is a non-additive measurement. However, when C = 1 we can write
(1-2θ3) = (1-2θ1)(1-2θ2)
If x(θ) = k Log (1-2θ), where Log denotes the natural logarithm, then we have x(θ3) = x(θ1) + x(θ2), and for k = -1/2, x(θ) ∼ θ for small values of θ.
x(θ) = -1/2 Log (1-2θ) is an additive measurement. It is known as the genetic distance, and is measured in Morgans. It can be shown that x measures the mean number of crossings-over.
Test for the presence of interference
Let us consider a sample of families with known genotypes at A, B and C. Let Lc be the maximum likelihood over θ1, θ2, θ3, and L1 the maximum likelihood when we impose the constraint C = 1 (i.e. θ3 = θ1 + θ2 - 2θ1θ2).
Then -2 Log (L1/Lc) follows a χ2 distribution with one degree of freedom.
I- 7. References
1. Morton NE. Sequential tests for detection of linkage. Am J Hum Genet 1955; 7: 277-318.
2. Wald A. Sequential analysis. New York: Wiley, 1947.
3. Chotai J. On the lod score method in linkage analysis. Ann Hum Genet 1984; 48: 359-378.
4. Smith CAB. Some comments on the statistical methods used in linkage investigations. Am J Hum Genet 1959; 11: 289-304.
5. Clerget-Darpoux F, Bonaïti-Pellié C, Hochez J. Effects of misspecifying genetic parameters in lod score analysis. Biometrics 1986; 42: 393-399.
6. Ott J. Estimation of the recombination fraction in human pedigrees: Efficient computation of the likelihood for human linkage studies. Am J Hum Genet 1974; 36: 363-386.
7. Lathrop GM, Lalouel JM. Easy calculations of lod scores and genetic risks on small computers. Am J Hum Genet 1984; 36(2): 460-465.
8. Lathrop GM, Lalouel JM, Julier C, Ott J. Multilocus linkage analysis in humans. Detection of linkage and estimation of recombination. Am J Hum Genet 1985; 37: 482-498.
9. Kruglyak L, Daly MJ, Reeve-Daly MP, Lander ES. Parametric and Nonparametric Linkage Analysis: A Unified Multipoint Approach. Am J Hum Genet 1996; 58: 1347-1363.
10. Gudbjartsson DF, Jonasson K, Frigge M, Kong A. Allegro, a new computer program for multipoint linkage analysis. Nature Genet 2000; 25: 12-13.
11. Bailey N. Introduction to the mathematical theory of genetic linkage. London: Oxford University Press, Amen House, 1961.
12. Ott J. Analysis of human genetic linkage. Johns Hopkins University Press, 1985.
13. Morton NE. The detection and estimation of linkage between the genes for elliptocytosis and the Rh blood type. Am J Hum Genet 1956; 8: 80-96.
14. Smith CAB. Testing for heterogeneity of recombination fractions in human genetics. Ann Hum Genet 1963; 27: 175-182.
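The relations of section I-6 between recombination fractions, the coefficient C and the genetic distance x(θ) = -1/2 Log(1-2θ) can be checked numerically with a short sketch. The function x(θ) is commonly known as Haldane's map function; the function names and the numerical values below are hypothetical.

```python
import math


def genetic_distance(theta):
    """x(theta) = -1/2 ln(1 - 2 theta), in Morgans (valid for 0 <= theta < 1/2)."""
    return -0.5 * math.log(1 - 2 * theta)


def recombination_fraction(x):
    """Inverse relation: theta = (1 - exp(-2x)) / 2, always below 1/2."""
    return 0.5 * (1 - math.exp(-2 * x))


def theta_ac(theta1, theta2, c=1.0):
    """theta3 = theta1 + theta2 - 2 C theta1 theta2."""
    return theta1 + theta2 - 2 * c * theta1 * theta2


def coincidence(theta1, theta2, theta3):
    """C recovered from the three pairwise recombination fractions."""
    return (theta1 + theta2 - theta3) / (2 * theta1 * theta2)


theta1, theta2 = 0.10, 0.20
theta3 = theta_ac(theta1, theta2)  # no interference (C = 1)
# Recombination fractions do not add up, but genetic distances do when C = 1.
print(theta3, theta1 + theta2)
print(genetic_distance(theta3), genetic_distance(theta1) + genetic_distance(theta2))
```

With C < 1 (positive interference) the same θ1 and θ2 give a larger θ3, so the distances computed with this function are no longer exactly additive.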
II- Genetic heterogeneity of localization
The analysis of genetic linkage can be complicated by the fact that mutations of several genes, located at different places on the genome, can give rise to the same disorder. This is known as genetic heterogeneity of localization. One of the following two tests is used to identify heterogeneity of this type: the "Predivided sample test" or the "Admixture Test". The first test is usually only appropriate if there is a good family stratification criterion or if each family individually is highly informative.
II- 3. Generalization of the "admixture test"
In some single-gene diseases, several genes have been shown to exist at different locations. This is true, for example, of multiple exostosis disease, for which 3 genes have been identified successively on 3 different chromosomes. The "admixture test" is then extended to determine the proportion of families in which each of the three genes is implicated (Legeai-Mallet et al, 1997), and the possibility that there is a fourth gene.
The three locations on chromosomes 8, 19 and 11 were denoted E1, E2 and E3, and the proportions of families concerned α1, α2 and α3 respectively. α4 was used to represent the proportion of the families in which another location was involved.
For each family i of the sample, the likelihood was calculated using the observed segregation within the family of the markers available in each of the three regions, according to the clinical status of each of its members:
$L_i(E_1, E_2, E_3, \alpha_1, \alpha_2, \alpha_3 \mid F_i) = \alpha_1 \frac{L(E_1 \mid F_i)}{L(E_1 = 1/2 \mid F_i)} + \alpha_2 \frac{L(E_2 \mid F_i)}{L(E_2 = 1/2 \mid F_i)} + \alpha_3 \frac{L(E_3 \mid F_i)}{L(E_3 = 1/2 \mid F_i)} + \alpha_4$
For all the families:
$L(E_1, E_2, E_3, \alpha_1, \alpha_2, \alpha_3 \mid F_1, \dots, F_N) = \prod_i L_i(E_1, E_2, E_3, \alpha_1, \alpha_2, \alpha_3 \mid F_i)$
Each αi can be tested to see if it is equal to 0, and then the corresponding non-zero αi and Ei values are estimated.
It is also possible to calculate the probability that the gene implicated is at E1, E2 or E3 for each of the families in the sample. This post hoc probability makes use of the estimated αi proportions, but also of the specific observations in the family.
The sample investigated was shown to consist of three types of families: in 48% of families the gene is located on chromosome 8, in 24% on chromosome 19, and in 28% on chromosome 11. There was no evidence of a fourth location in this sample. The post hoc probabilities of belonging to one of these 3 sub-groups were then estimated: the probability that the gene implicated was on chromosome 8 was over 90% for 5 families, that it was on chromosome 19 for 3 families, and that it was on chromosome 11 for 4 families. For the other families, the situation was less clear-cut: the post hoc probabilities are similar to the prior (a priori) probabilities because of the paucity of information provided by the markers used.
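The post hoc probabilities mentioned in II-3 can be illustrated with a direct application of Bayes' rule to the terms of the family likelihood. This is a sketch under that assumption, not necessarily the exact computation used in the cited study; the function name and the likelihood ratios are hypothetical, while the proportions 0.48, 0.24 and 0.28 are those quoted above.

```python
def location_posteriors(alphas, likelihood_ratios):
    """Post hoc probabilities of each location for one family.

    alphas            : (alpha1, alpha2, alpha3, alpha4), proportions summing to 1
    likelihood_ratios : (LR1, LR2, LR3), where LRk = L(Ek | Fi) / L(Ek = 1/2 | Fi)
    Returns the probabilities for E1, E2, E3 and "another location".
    """
    a1, a2, a3, a4 = alphas
    terms = [a * lr for a, lr in zip((a1, a2, a3), likelihood_ratios)]
    terms.append(a4)  # a family linked elsewhere has likelihood ratio 1 at E1, E2, E3
    total = sum(terms)
    return [t / total for t in terms]


alphas = (0.48, 0.24, 0.28, 0.0)
# Hypothetical family with strong evidence for the chromosome 8 region only.
print(location_posteriors(alphas, (100.0, 0.1, 0.1)))
# Hypothetical uninformative family: the posteriors fall back on the proportions.
print(location_posteriors(alphas, (1.0, 1.0, 1.0)))
```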
II- 1. The "Predivided sample test"
This test is intended to demonstrate linkage heterogeneity between different sub-groups of a sample of families. The aim is to test whether the genetic linkage between a disease and its marker(s) is the same in all sub-groups. These groups are formed a priori on the basis of clinical, geographical or other criteria.
Let us assume that the total sample of families has been divided into n sub-groups (it is possible to test for the existence of as many sub-groups as families). θi denotes the true value of the recombination fraction in sub-group i. We want to test the null hypothesis H0: θ1 = θ2 = θ3 = … = θn against the alternative hypothesis H1: the values of θi are not all equal. Under H0, the quantity Q
Figure 16
follows a χ2 distribution with (n-1) degrees of freedom. The homogeneity of the sample for linkage is rejected with a type I error equal to α if Q is above the critical threshold of the χ2(n-1) distribution corresponding to α.
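Since the expression of Q is given only in Figure 16, the sketch below assumes the form usually attributed to Morton (1956): Q = 2 ln(10) [Σi Zi(θ̂i) - Z(θ̂)], where Zi is the lod score of sub-group i maximized separately and Z the lod score of the pooled sample. For brevity each sub-group is summarized by k recombinants among n phase-known meioses, which is a further simplifying assumption; all names and counts are hypothetical.

```python
import math


def lod(theta, n, k):
    """Lod score for k recombinants among n phase-known meioses (illustrative assumption)."""
    if theta == 0.0:
        return n * math.log10(2) if k == 0 else float("-inf")
    return (k * math.log10(theta)
            + (n - k) * math.log10(1 - theta)
            + n * math.log10(2))


def max_lod(n, k):
    """Maximum lod score over [0, 1/2]; the maximum is reached at min(k/n, 1/2)."""
    theta_hat = min(k / n, 0.5)
    return theta_hat, lod(theta_hat, n, k)


def predivided_q(groups):
    """Assumed statistic Q = 2 ln(10) [sum_i Z_i(theta_i_hat) - Z(theta_hat)]."""
    sum_of_maxima = sum(max_lod(n, k)[1] for n, k in groups)
    n_all = sum(n for n, _ in groups)
    k_all = sum(k for _, k in groups)
    pooled_maximum = max_lod(n_all, k_all)[1]
    q = 2 * math.log(10) * (sum_of_maxima - pooled_maximum)
    return q, len(groups) - 1  # statistic and degrees of freedom


# Critical values of the chi-square distribution for a type I error of 0.05.
CHI2_CRIT_05 = {1: 3.841, 2: 5.991, 3: 7.815}

# Hypothetical sub-groups: (meioses, recombinants).
q, df = predivided_q([(20, 1), (20, 9)])
print(q, df, q > CHI2_CRIT_05[df])
```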
II- 2. The "Admixture Test"
Unlike the previous test, the "admixture test" is not based on an a priori subdivision of the families. It is assumed that, among all the families studied, genetic linkage between the disease and the marker is found only in a proportion α of the families, with a recombination fraction θ < 1/2. In the remaining proportion (1-α) of families, it is assumed that there is no linkage with the marker (θ = 1/2).
For each family i of the sample, the likelihood is calculated as
Li(α, θ) = α Li(θ) + (1-α) Li(1/2), where Li(θ) is the likelihood of θ for family i. The likelihood of the pair (α, θ) is defined by the product of the likelihoods associated with all the families: L(α, θ) = Πi Li(α, θ).
We test whether α is significantly different from 1 by comparing Lmax(α = 1, θ), the maximized likelihood for θ assuming homogeneity, and Lmax(α, θ), the maximized likelihood for the two parameters α and θ (nested models).
The variable Q = 2 [Ln Lmax(α, θ) - Ln Lmax(α = 1, θ)] then follows a χ2 distribution with one degree of freedom.
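The admixture test can be sketched with a simple grid search. Dividing each family likelihood by Li(1/2) does not change Q, so the sketch works with α·10^Zi(θ) + (1-α), where Zi is the family lod score. As a simplifying assumption each family is summarized by k recombinants among n phase-known meioses; the grid resolution, the function names and the data are hypothetical, and the p-value uses the one-degree-of-freedom χ2 distribution stated above.

```python
import math


def lod(theta, n, k):
    """Family lod score for k recombinants among n phase-known meioses (0 < theta <= 1/2)."""
    return (k * math.log10(theta)
            + (n - k) * math.log10(1 - theta)
            + n * math.log10(2))


def log_likelihood_ratio(alpha, theta, families):
    """ln of prod_i [alpha * Li(theta)/Li(1/2) + (1 - alpha)]."""
    total = 0.0
    for n, k in families:
        value = alpha * 10 ** lod(theta, n, k) + (1 - alpha)
        if value <= 0:
            return float("-inf")
        total += math.log(value)
    return total


def admixture_test(families):
    """Grid search over (alpha, theta); theta = 0 is left out to avoid log(0)."""
    thetas = [i / 100 for i in range(1, 51)]
    alphas = [i / 100 for i in range(101)]
    best_homogeneous = max(log_likelihood_ratio(1.0, t, families) for t in thetas)
    best, alpha_hat, theta_hat = max(
        (log_likelihood_ratio(a, t, families), a, t) for a in alphas for t in thetas
    )
    q = max(2 * (best - best_homogeneous), 0.0)
    p_value = math.erfc(math.sqrt(q / 2))  # chi-square tail probability, 1 df
    return {"Q": q, "alpha": alpha_hat, "theta": theta_hat, "p": p_value}


# Hypothetical sample: five families without recombinants, five that look unlinked.
families = [(10, 0)] * 5 + [(10, 5)] * 5
print(admixture_test(families))
```

For a homogeneous sample the maximum is reached at α = 1 and Q is 0, so homogeneity is not rejected.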
II- 4. References
1. Legeai-Mallet L, Margaritte-Jeannin P, Clerget-Darpoux F et al. Genetic heterogeneity of hereditary multiple exostoses. Hum Genet 1997; 99: 298-302.
2. Morton N. The detection and estimation of linkage between the genes for elliptocytosis and the Rh blood type. Am J Hum Genet 1956; 8: 80-96.
3. Smith CAB. Testing for heterogeneity of recombination values in human genetics. Ann Hum Genet 1963; 27: 175-182.
III- Statistical properties of the method of lod scores
The conditions of application which underlie these properties (sequentiality, segregation of a simple single-gene disease in nuclear families in which all the members are genotyped for a genetic marker, and non-ambiguity of the test) are not met in practice. The table below shows the change in these conditions of application. We discuss here the impact of these changes on the statistical properties.
III- 1. The test procedure
The test procedure used in the method of lod scores is sequential (Wald, 1947). Information, i.e. the number of families in the sample, is accumulated until it is possible to decide between the H0 and H1 hypotheses:
H0: genetic independence, θ = 1/2
H1: linkage at θ1, 0 ≤ θ1 < 1/2
The value of the lod score of the sample at θ1, Z(θ1) = log10 [L(θ1)/L(1/2)], indicates the relative probabilities of observing the sample under H1 and under H0. Thus, a lod score of 3 implies that the probability of observing the sample is 1000 times greater under H1 than under H0 ("lod = logarithm of the odds").
The decision thresholds of the test are usually set at -2 and +3, so that if:
Z(θ1) ≥ 3, H0 is rejected and linkage is concluded.
Z(θ1) ≤ -2, linkage is rejected for θ1.
-2 < Z(θ1) < 3, it is impossible to decide between H0 and H1. It is necessary to go on accumulating information.
For the -2 and +3 thresholds selected, it can be shown that:
The type I error α < 10^-3
The type II error β < 10^-2
The reliability 1-ρ > 0.95 ∀ θ1
The power P(θ) > 0.80 ∀ θ1 if the true value of θ < 0.10
III- 1.1. Impact of non-sequentiality
In general, one is working on a sample of families of a fixed size. This problem of non-sequentiality was raised by Smith (1959) and investigated by Chotai (1984) and Guihenneuc (1991), who showed that the type I error of the test is not increased but, on the contrary, reduced.
Furthermore, the power will obviously depend on the size of the sample. It also depends on the parameters of the genetic model (penetrances, frequency of the morbid allele, degree of dominance), on the types of family analysed (nuclear or extended families), on the informativity of the markers, on what is known about the phase of the alleles at the disease locus and the marker locus, and on the value of the recombination fraction between these two loci.
If the genetic model of the transmission of the disease and its parameters are fully known, the power of the method is greater the easier it is to detect recombination between the disease locus and a marker locus; in other words, the more easily the genotype at each of the two loci, and also the haplotype (i.e. the combination of the alleles from the two loci carried on the same chromosome segment), can be identified from the phenotype. At the disease locus, the genotype can be deduced unambiguously from the phenotype if there is a rare dominant gene with total penetrance for the heterozygote and zero penetrance for the normal homozygote (no phenocopy). The power diminishes as the degree of dominance and the penetrance decline, and as the gene frequency and the proportion of phenocopies increase (Ott, 1991).
Figure 17
Figure 18
At the marker locus, the power is greater the higher the degree of heterozygosity, or in other words, the more polymorphic the marker. If we consider the two loci together, the amount of knowledge about the haplotype transmitted is greater if there are a large number of generations. Finally, the proximity of the two loci increases the power of detection of the genetic linkage.
Multipoint linkage analysis, which uses several reference markers near to each other on a given chromosome segment, increases the power of the method by increasing the informativity of the meioses. In general, it is used to pinpoint the location of a morbid locus once it has been established that genetic linkage is present.
III- 1.2. Maximization of the lod score over the [0,
1/2] interval
(Ref: Génin E., Ann Hum Genet,1995,59:123-132)
However, in practice, the test is never carried out for a
single value of θ1, but is done as follows: the lod score
is calculated for various values of θ1, the maximum lod
score Zmax is calculated and the test is applied to Zmax
.A criterion of +3 or even less, is used to conclude that
linkage is occurring, based on the argument that risk
remains sufficiently small. The probability of the posthoc non linkage is never calculated.
The fact of considering an alternative hypothesis by
using the maximum lod score, Zmax (which amounts to
testing H0: θ = 1/2 versus H1: θ < 1/2) actually reduces
the reliability of the test considerably. Thus, the
probability ρ that there is no linkage when a Zmax of + 3
has been obtained can be as high as 16.4%; i.e. more
than three times the probability calculated by Morton
(1955).
for a dominant disease in a sample of nuclear families
with two children). Reliability =1-ρ.
The example of the conflicting results obtained for
Alzheimer’s disease is a good illustration of the
usefulness of calculating the probability of linkage post
hoc. Alzheimer’s disease is a form of dementia
characterized by loss of memory and of cognitive
function. Only a few families have multiple cases, but
within this sub-group of families, the distribution of the
patients is compatible with the hypothesis of the
intervention of a dominant mutation on an autosomal
gene. Analyses of genetic linkage by the method of lod
scores were therefore carried out to localize the gene
involved. In 1987, a maximum lod score of +2.46 was
obtained using a marker of chromosome 21 in a large
genealogy with numerous members affected (family
FAD4), and this at first led people to conclude that the
mutation responsible was located on chromosome 21
(St Georges-Hyslop et coll. 1987). For many years,
research into this disease was therefore focused on this
chromosome. Five years later however, several
different teams provided a very significant
demonstration of linkage with chromosome 14
markers. The very high lod scores that were obtained
showed that most of the early familial forms were due
to a mutation of a chromosome 14 gene 14
(Schellenberg et coll. 1992, St Georges-Hyslop et coll.
1992). In particular, in the case of family FAD4, a lod
score of +5.21 was obtained with markers for this
region. In view of the observations obtained for
chromosome 21 markers in FAD4, the post-hoc
probability that there was no linkage was 1/3. It is
likely that if this calculation had been done in 1987, the
existence of a mutation on chromosome 21 in this
family would have looked less convincing.
Furthermore, it has now been shown that the gene
implicated is located on chromosome 14.
The table below shows the probability that linkage does
not exist as a function of the Zmax obtained.
Figure 19
The relationship between ρ and Zmax depends on the
type of family structure and the determinism of the
disease (in this case the calculation has been carried out
III-1.3. References
1. Génin E, Martinez M, Clerget-Darpoux F. Posterior
probability of linkage and maximal lod score. Ann Hum
Genet 1995; 59: 123-132.
2. Schellenberg GD, Bird T, Wijsman E et al. Genetic
linkage evidence for a Familial Alzheimer's disease locus
on chromosome 14. Science 1992; 258: 668-671.
3. St George-Hyslop PH, Haines J, Rogaev E et al. Genetic
evidence for a novel familial Alzheimer's disease locus on
chromosome 14. Nature Genet 1992; 2: 330-334.
4. St George-Hyslop PH, Tanzi RE, Polinsky RJ et al. The
genetic defect causing Alzheimer's disease maps on
chromosome 21. Science 1987; 235: 885-890.
III-2. Genotype information
III-2.1. Ambiguity in phenotype-genotype relationships
at the disease locus
The original lod score method was applied to the study
of nuclear families (the parents and their children), and
this made it easy to deduce the genotype at each of the
loci for each member of the family. Since it is the
phenotypes that can be observed, this means that the
phenotype/genotype correspondence was known. In
particular, when the analysis was carried out between a
"disease" locus and a "marker" locus, the disease was
assumed to be monogenic, due to a rare allele of an
autosomal or sex-linked gene, with complete
penetrance (probability of being affected equal to 1 for
people carrying one copy of the allele for dominant
diseases, or two copies for recessive diseases). Gametic
equilibrium was also assumed to exist between the
alleles at the "disease" locus and the "marker" locus.
The method, the properties of which were fully
established on the basis of these hypotheses, has been
extended over the past twenty years to more varied and
complex situations, but without questioning its
underlying properties. In particular, it is applied to
diseases whose determinism is poorly known or even
totally unknown, and which are studied in large
pedigrees in which some of the members have an
unknown phenotype. This leads us to investigate the
power of the test under various models and its
robustness to modeling errors.
It should be stressed that the lod score, which is
thought of above all as a function of the recombination
fraction and used to estimate this parameter, also
depends on the value of the genetic parameters at the
disease locus, i.e. the frequency of the alleles at this
locus and the penetrances (probabilities of being
affected) associated with each of the genotypes.
We evaluated the effects that an error in these
parameters produces on the linkage test and on the
estimation of the recombination fraction
(Clerget-Darpoux et al, 1986, 1992, 1993).
- Loss of power: The power to detect linkage can
be very severely reduced if there is an error
concerning the relative penetrance of each of the
genotypes, i.e. concerning the ratio of the probabilities
of being affected in those who carry two copies of the
morbid allele, those who have a single copy and
those who do not carry it at all (the "phenocopies").
- False exclusion of linkage: The robustness of the
method to misspecification of the parameter values
is not symmetrical with regard to the two
hypotheses being tested. We have shown that the lod
score is always greatest for the correct values of the
parameters and that it can be considerably reduced if
these have been wrongly specified. As a
consequence, an error in the values of the parameters
does not lead to a false conclusion of linkage,
although it can wrongly lead to the exclusion of
linkage. This is particularly the case if the proportion
of phenocopies is underestimated.
- Bias in the recombination fraction: The estimation
of the recombination fraction is very sensitive to any
error in the value of any of the parameters. In
addition, the effects of the errors on the gene
frequency and on the penetrance values are usually
additive, because in most studies these parameters
are linked by the constraint imposed by the
prevalence of the disease in the population.
III-2.2. Ambiguity in the marker genotype
To calculate a lod score between a disease locus and a
marker locus, it is necessary to take into consideration
all the possible genotypic configurations at each of
the loci and to write the probabilities of these
configurations. If some individuals have not been
genotyped for the genetic marker, the probability of
each possible genotype must be calculated. To do this,
it is necessary to specify the allele frequencies of
the marker.
Any error in the allele frequencies, in particular the
under-estimation of the frequency of an allele found in
the patients, artificially increases the values of the lod
score and can therefore lead to a false conclusion that
there is genetic linkage (false positives) (Ott, 1991;
Freimer et al, 1993; Knapp et al, 1993).
The increasingly frequent use of very extensive
pedigrees, in which only individuals of the last
generation are typed, calls for great caution in
interpreting positive results.
III-2.3. Gametic disequilibrium between alleles at the
disease locus and at the marker locus
An association between a susceptibility gene and a
marker can lead to bias in the estimation of the
recombination fraction. In particular, the lod score
method assumes that the sample has not been selected
on the marker. However, when there is an
association, selection based on disease status
implicitly involves selection on the marker. Furthermore,
the calculation assumes that each combination of
alleles at the two loci is equally probable in the
parents, and this is not true if there is an association.
Failing to take into account in the analysis the
disequilibrium existing between disease alleles and
marker alleles induces a very great under-estimation of
the lod score (in other words, a marked reduction in the
power of the linkage test) and a very slight
under-estimation of the recombination fraction
(Clerget-Darpoux, 1982).
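To make the notion of disequilibrium between disease and marker alleles (section III-2.3) more concrete, the following minimal sketch computes the four haplotype frequencies for a biallelic disease locus (alleles D/d) and a biallelic marker (alleles M/m). The allele labels, function names and numerical values are hypothetical, and the disequilibrium coefficient d is defined here, as an assumption of the example, as the excess of the D-M haplotype frequency over the product of the allele frequencies.

```python
def haplotype_frequencies(p_disease: float, p_marker: float, d: float = 0.0) -> dict:
    """Frequencies of the four haplotypes for a disease locus (D/d)
    and a marker locus (M/m), given a disequilibrium coefficient d.

    d = 0 corresponds to equilibrium: each haplotype frequency is
    then the product of the two allele frequencies.
    """
    freqs = {
        ("D", "M"): p_disease * p_marker + d,
        ("D", "m"): p_disease * (1.0 - p_marker) - d,
        ("d", "M"): (1.0 - p_disease) * p_marker - d,
        ("d", "m"): (1.0 - p_disease) * (1.0 - p_marker) + d,
    }
    if any(f < -1e-12 for f in freqs.values()):
        raise ValueError("d is incompatible with these allele frequencies")
    return freqs


def marker_freq_on_disease_haplotypes(p_disease: float, p_marker: float, d: float) -> float:
    """Frequency of marker allele M among haplotypes carrying D."""
    freqs = haplotype_frequencies(p_disease, p_marker, d)
    return freqs[("D", "M")] / p_disease


if __name__ == "__main__":
    # Hypothetical values: rare disease allele, common marker allele.
    print(marker_freq_on_disease_haplotypes(0.01, 0.2, d=0.0))    # equilibrium
    print(marker_freq_on_disease_haplotypes(0.01, 0.2, d=0.004))  # association
```

Under equilibrium the marker allele has the same frequency on disease-bearing haplotypes as in the population. With a positive d it is over-represented among them, which is why selecting families through patients implicitly selects on the marker.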
III-3. The problem of multiple tests
One of the difficulties encountered in the statistical
interpretation of linkage analyses of complex diseases
arises from the fact that, more or less explicitly, the
data are generally subjected to multiple tests: several
clinical classifications, several genetic markers,
several models, several samples. The stopping
criteria usually used in the lod score test clearly no
longer have the same statistical significance
when several tests are applied simultaneously to the
same sample or to several samples. E. Thompson
(1984) investigated this problem in the case of a
monogenic disease for which genetic linkage is tested
using several markers located on
different chromosomes (and therefore independent).
The situation is much more complex for multifactorial
diseases, because the multiplicity of the tests has
several types of impact and these are not independent
(Clerget-Darpoux et al, 1990). Multiple tests could be
taken into account by adjusting the stopping
criterion of the lod score test. However, on the one
hand, it is not always clear from publications which
tests have actually been carried out, and on the other,
such an adjustment can make the test too conservative.
This is why we think that the replication strategy
should be favored. If a positive result is replicated in a
new sample (using the same classification, the same
marker, the same transmission model), this provides a
reliable assessment of significance.
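One simple way of adjusting the lod score criterion for multiple tests can be sketched as follows. It is a Bonferroni-style correction that assumes n independent tests and uses the bound P(lod >= T | no linkage) <= 10^(-T), which holds for a lod score computed at a fixed recombination fraction. Both assumptions, as well as the function names, are made for this example; it is not presented as the adjustment studied in the works cited above, and it illustrates why such corrections tend to be conservative.

```python
import math


def adjusted_lod_threshold(single_test_threshold: float, n_tests: int) -> float:
    """Threshold keeping the overall false-positive bound equal to the
    single-test bound, assuming n_tests independent tests."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return single_test_threshold + math.log10(n_tests)


def false_positive_bound(threshold: float, n_tests: int) -> float:
    """Upper bound on the probability that at least one of n_tests
    lod scores reaches the threshold when there is no linkage."""
    return min(1.0, n_tests * 10.0 ** (-threshold))


if __name__ == "__main__":
    for n in (1, 10, 100):
        t = adjusted_lod_threshold(3.0, n)
        print(n, t, false_positive_bound(t, n))
```

Under these assumptions, ten independent tests would raise the usual threshold of 3 to 4. When the tests are correlated (several models or classifications applied to the same sample), the correction is stricter than necessary.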
III-4. References
1. Chotai J. On the lod score method in linkage analysis.
Ann Hum Genet 1984; 48: 359-378.
2. Clerget-Darpoux F. Bias of the estimated recombination
fraction and lod score due to an association between a
disease gene and a marker gene. Ann Hum Genet 1982;
46: 363-372.
3. Clerget-Darpoux F, Bonaïti-Pellié C, Hochez J. Effects of
misspecifying genetic parameters in lod score analysis.
Biometrics 1986; 42: 393-399.
4. Clerget-Darpoux F, Babron MC, Bonaïti-Pellié C.
Assessing the effect of multiple linkage tests in complex
diseases. Genet Epidemiol 1990; 7: 245-253.
5. Clerget-Darpoux F, Bonaïti-Pellié C. Strategies based on
marker information for the study of human diseases. Ann
Hum Genet 1992; 56: 145-153.
6. Clerget-Darpoux F, Bonaïti-Pellié C. An exclusion map
covering the whole genome: a new challenge for genetic
epidemiologists? Am J Hum Genet 1993; 52: 442-443.
7. Freimer NB, Sandkuijl LA, Blower SM. Incorrect
specification of marker allele frequencies: effect on
linkage analysis. Am J Hum Genet 1993; 56: 1102-1110.
8. Guihenneuc C, Prum B, Clerget-Darpoux F, Bonaïti-Pellié
C. Remarques sur la méthode du lod score en génétique.
Pub Inst Stat Univ Paris 1990; 35: 19-37.
9. Knapp M, Seuchter SA, Bauer MP. The effect of
misspecifying allele frequencies in incompletely typed
families. Genet Epidemiol 1993; 10: 413-418.
10. Morton NE. Sequential tests for the detection of linkage.
Am J Hum Genet 1955; 7: 277-318.
11. Ott J. Analysis of human genetic linkage, 2nd edition.
Johns Hopkins University Press, 1991.
12. Smith CAB. Some comments on the statistical methods
used in linkage investigations. Am J Hum Genet 1959; 11:
289-304.
13. Wald A. Sequential analysis. New York: Wiley, 1947.
This article should be referenced as such:
Clerget-Darpoux F. Genetic Linkage Analysis. Atlas Genet
Cytogenet Oncol Haematol. 2002; 6(4):323-333.