Estimation Of The Recombination Fraction

If the test, on a sample of families, has demonstrated linkage between the A
and B loci, then one may want to estimate the recombination fraction θ for
these loci.

The estimated value of θ is the value which maximizes the lod score function
Z(θ). This is equivalent to taking the value of θ for which the probability of
the observations made in the sample is greatest.
Recombination Fraction For A Disease Locus
And A Marker Locus

Let us assume we are dealing with a disease carried by a single gene,
determined by an allele, g0, located at a locus G (g0: harmful allele, G0:
normal allele).

We would like to be able to situate locus G relative to a marker locus T, which
is known to occupy a given position on the genome. To do this, we can use
families with one or several individuals affected and in which the genotype of
each member of the family is known with regard to the marker T.

In order to be able to use the lod scores method described above, what is
needed is to be able to extrapolate from the phenotype of the individuals
(affected, not affected) to their genotype at locus G (or their genotype
probabilities at locus G).

What we need to know is:
o the frequency of g0
o the penetrance vector f1, f2, f3

f1 = Pr(affected | g0g0)

f2 = Pr(affected | g0G0)

f3 = Pr(affected | G0G0)

It will often happen that the information available for the marker is likewise
phenotypic rather than genotypic in nature. Once again, all possible genotypes
must be envisaged.

As a general rule, the information available about a family concerns the
phenotype. To calculate the likelihood of θ, we must envisage all the possible
genotype configurations at each of the loci for this family, write the
likelihood of θ for each configuration, and weight it by the probability of
this configuration given the phenotypes of the individuals at A and B.

Knowledge of the genetic parameters at each of the loci (gene frequency,
penetrance values) is therefore necessary before we can estimate θ.

Estimation of L as a function of θ and f

Allele distribution. If the frequency of D is .01, Hardy-Weinberg equilibrium
gives

Pr(Dd) = 2 x .01 x .99

Genotypes of founder couples are (usually) treated as independent.

Pr(Dd, dd) = (2 x .01 x .99) x (.99)^2

[Figure: founder couple, individual 1 (Dd) and individual 2 (dd)]

Pedigree analyses usually suppose that, given the genotype at all loci, and in
some cases age and sex, the chance of having a particular phenotype depends
only on genotype at one locus, and is independent of all other factors:
genotypes at other loci, environment, genotypes and phenotypes of relatives,
etc.

Complete penetrance: pr(affected | DD ) = 1

Incomplete penetrance: pr(affected | DD ) = .8

[Figure: pedigree of five individuals, numbered 1 to 5, with genotypes DD, Dd
and dd]

Assume penetrances pr(affected | dd) = .1, pr(affected | Dd) = .3 and
pr(affected | DD) = .8, and that allele D has frequency .01.

The probability of this pedigree is the product:

(2 x .01 x .99 x .7) x (2 x .01 x .99 x .3) x (1/2 x 1/2 x .9)
x (2 x 1/2 x 1/2 x .7) x (1/2 x 1/2 x .8)
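
The product above can be checked numerically. The following is a minimal sketch, not a general pedigree algorithm: it assumes, reading the five factors of the product, that the two founders are Dd (one unaffected, one affected) and that their three children are dd (unaffected), Dd (unaffected) and DD (affected). All function names are hypothetical.

```python
from math import prod

# Values taken from the text: frequency of allele D and the three penetrances.
FREQ_D = 0.01
PENETRANCE = {"DD": 0.8, "Dd": 0.3, "dd": 0.1}


def founder_prob(genotype, p=FREQ_D):
    """Hardy-Weinberg probability of a founder genotype."""
    q = 1.0 - p
    return {"DD": p * p, "Dd": 2 * p * q, "dd": q * q}[genotype]


def child_prob_from_het_parents(genotype):
    """Mendelian probability of a child's genotype in a Dd x Dd mating."""
    return {"DD": 0.25, "Dd": 0.5, "dd": 0.25}[genotype]


def phenotype_prob(genotype, affected):
    """Penetrance if affected, one minus the penetrance otherwise."""
    f = PENETRANCE[genotype]
    return f if affected else 1.0 - f


def pedigree_probability():
    """Product of founder, transmission and penetrance terms."""
    # (genotype, affected?) as assumed from the factors of the product.
    founders = [("Dd", False), ("Dd", True)]
    children = [("dd", False), ("Dd", False), ("DD", True)]
    terms = [founder_prob(g) * phenotype_prob(g, a) for g, a in founders]
    terms += [child_prob_from_het_parents(g) * phenotype_prob(g, a)
              for g, a in children]
    return prod(terms)


if __name__ == "__main__":
    print(pedigree_probability())
```

Each individual contributes one factor: a founder contributes a Hardy-Weinberg term, a child contributes a transmission term, and both are multiplied by the probability of the observed phenotype given the genotype.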

Two-locus founder probabilities are typically calculated assuming linkage
equilibrium, i.e. independence of genotypes across loci.

If D and d have frequencies .01 and .99 at one locus, and T and t have
frequencies .25 and .75 at a second, linked locus, this assumption means that
DT, Dt, dT and dt have frequencies .01 x .25, .01 x .75, .99 x .25 and .99 x
.75 respectively. Together with Hardy-Weinberg, this implies that

pr(DdTt) = (2 x .01 x .99) x (2 x .25 x .75)
         = 2 x (.01 x .25) x (.99 x .75)
         + 2 x (.01 x .75) x (.99 x .25).
This last expression adds haplotype pair probabilities.
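
As a quick check of this identity, the sketch below builds the four haplotype frequencies under linkage equilibrium and sums the two possible phases of the double heterozygote. The function names are illustrative only, and the default frequencies are the ones used in the text.

```python
from itertools import product


def haplotype_freqs(freq_D=0.01, freq_T=0.25):
    """Haplotype frequencies under linkage equilibrium:
    each one is the product of the two allele frequencies."""
    locus_1 = {"D": freq_D, "d": 1.0 - freq_D}
    locus_2 = {"T": freq_T, "t": 1.0 - freq_T}
    return {a + b: pa * pb
            for (a, pa), (b, pb) in product(locus_1.items(), locus_2.items())}


def prob_double_het(freq_D=0.01, freq_T=0.25):
    """pr(DdTt) as a sum over the two phases, DT/dt and Dt/dT."""
    h = haplotype_freqs(freq_D, freq_T)
    return 2 * h["DT"] * h["dt"] + 2 * h["Dt"] * h["dT"]


if __name__ == "__main__":
    print(haplotype_freqs())
    print(prob_double_het())
```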

[Figure: pedigree with father DT/dt, mother dt/dt and child DT/dt]

Initially, this must be done with haplotypes, so that account can be taken of
recombination. Then terms like that below are summed over possible phases.
Here only the father can exhibit recombination: mother is uninformative.

pr(kid DT/dt | pop DT/dt & mom dt/dt) = pr(kid DT | pop DT/dt) x pr(kid
dt | mom dt/dt) = (1-θ)/2 x 1.

Two Loci: Penetrance

In all standard linkage programs, different parts of the phenotype are
conditionally independent given all genotypes, and two-locus penetrances split
into products of one-locus penetrances. For example, assume the penetrances
for DD, Dd and dd given earlier, and that T and t are two alleles at a
co-dominant marker locus. Then

Pr(affected & Tt | DD, Tt) = Pr(affected | DD, Tt) x Pr(Tt | DD, Tt)
= 0.8 x 1

[Figure: pedigree showing the two-locus genotypes Dd Tt, Dd Tt, Dd Tt, dd tt,
dd tt and Dd tt]

Pr(all data | θ) = pr(parents' data | θ) x pr(kids' data | parents' data, θ)
= pr(parents' data) x {[((1-θ)/2)^3 x θ/2]/2 + [(θ/2)^3 x (1-θ)/2]/2}
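
The likelihood above can be turned into a lod score and maximized over θ. The sketch below is an illustration under stated assumptions: one doubly heterozygous parent of unknown phase, with the children split into two gamete classes (3 and 1 by default, as in the formula). It drops the constant pr(parents' data) and uses a simple grid search. The function names are hypothetical.

```python
from math import log10


def likelihood_phase_unknown(theta, n_a=3, n_b=1):
    """Likelihood, up to the constant pr(parents' data), averaged over the
    two equally likely phases: n_a children in one gamete class and n_b in
    the other."""
    non_rec, rec = (1.0 - theta) / 2.0, theta / 2.0
    return 0.5 * non_rec**n_a * rec**n_b + 0.5 * rec**n_a * non_rec**n_b


def lod(theta, n_a=3, n_b=1):
    """Lod score Z(theta) = log10[L(theta) / L(1/2)]."""
    return log10(likelihood_phase_unknown(theta, n_a, n_b)
                 / likelihood_phase_unknown(0.5, n_a, n_b))


def estimate_theta(n_a=3, n_b=1, steps=500):
    """Grid search on [0, 1/2] for the value of theta maximizing the lod."""
    grid = [0.5 * i / steps for i in range(steps + 1)]
    # The likelihood can be zero at theta = 0; skip such points (log of 0).
    valid = [t for t in grid if likelihood_phase_unknown(t, n_a, n_b) > 0]
    return max(valid, key=lambda t: lod(t, n_a, n_b))


if __name__ == "__main__":
    for t in (0.05, 0.1, 0.2, 0.3, 0.5):
        print(t, round(lod(t), 4))
    print("theta estimate:", estimate_theta())
```

With a 3:1 split and unknown phase, the expression is proportional to θ(1-θ)[(1-θ)² + θ²], which increases up to θ = 1/2, so this single family gives a lod score that never exceeds zero. Evidence for linkage only accumulates when several families, or larger sibships, are combined.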
I- 5. RECOMBINATION FRACTION FOR A DISEASE LOCUS AND A MARKER LOCUS
Let us assume we are dealing with a disease carried by a single gene, determined by an
allele, g0, located at a locus G (g0: harmful allele, G0: normal allele).
We would like to be able to situate locus G relative to a marker locus T, which is known
to occupy a given position on the genome. To do this, we can use families with one or
several individuals affected and in which the genotype of each member of the family is
known with regard to the marker T.
In order to be able to use the lod scores method described above, what is needed
is to be able to extrapolate from the phenotype of the individuals (affected, not affected)
to their genotype at locus G (or their genotype probabilities at locus G). What we need
to know is:
1. the frequency of g0
2. the penetrance vector f1, f2, f3
f1 = Pr(affected | g0g0)
f2 = Pr(affected | g0G0)
f3 = Pr(affected | G0G0)
Figure 11
It will often happen that the information available for the marker is likewise phenotypic
rather than genotypic in nature. Once again, all possible genotypes must be envisaged.
As a general rule, the information available about a family concerns the phenotype. To
calculate the likelihood of θ, we must envisage all the possible genotype configurations at
each of the loci for this family, write the likelihood of θ for each configuration, and
weight it by the probability of this configuration given the phenotypes of the
individuals at A and B.
Knowledge of the genetic parameters at each of the loci (gene frequency, penetrance
values) is therefore necessary before we can estimate θ (Clerget-Darpoux et al (5)).
Calculating the lod scores, despite being simple in theory, is in practice a
lengthy and tedious business. In 1955, Morton provided a set of tables giving the lod
scores for various values of θ for a disease locus and a marker locus for nuclear families
with sibship sizes of 2 to 7. However, the situations envisaged were very restrictive. In
particular, it was assumed that the disease was determined by a rare, fully penetrant
dominant or recessive gene.
"LIPED", written by Ott in 1974 (6), was the pioneering software in linkage analysis. It is
able to carry out this calculation in an extended pedigree for any values of q, f1, f2 and f3,
and for penetrance as a function of age.
The "Linkage" program of Lathrop et al, 1984 (7,8) is the one most often used for gene
mapping. It can be used to carry out multipoint analysis.
All the software we have described is based on the same recursive algorithm (Elston
and Stewart), which means that it can be used to investigate pedigrees of any size.
However, because it envisages all the possible haplotype combinations of markers, it is
limited by the number of markers that can be taken into account.
In contrast, "Genehunter" (9), which is based on a Markov chain principle, is limited not
by the number of markers taken into consideration in the analysis, but by the size of the
family structure.
The very recently developed software package "Allegro" (10) can apply information from
a large number of markers and extended family structures.
Analysis of genetic linkage has made it possible to construct a genetic map by locating
new polymorphisms relative to one another on the genome. The measurement used on the
genetic map is not the recombination fraction, which is not an additive quantity, but the
genetic distance, which we will define below.
I- 6. LINKAGE ANALYSIS FOR THREE LOCI : THE PHENOMENON OF
INTERFERENCE
(V. Bailey, 1961)
Now let us consider three loci A, B and C. Let the recombination fraction between A and
B be θ1, that between B and C be θ2 and that between A and C be θ3.
Figure 12
Let us consider the double recombinant event, firstly between A and B, and secondly
between B and C. Let R12 be the probability of this event. If the crossings-over occur
independently in segments AB and BC, then:
R12 = θ1 θ2
If this is not the case, an interference phenomenon is occurring and
R12 = C θ1 θ2 where C ≠ 1
If C < 1 the interference is said to be positive: crossings-over in segment AB inhibit
those in segment BC.
If C > 1 the interference is said to be negative: crossings-over in segment AB
promote those in segment BC.
Let us consider the case of a triply heterozygous individual.
Such an individual can produce 8 types of gametes.
Figure 13
Figure 14
Figure 15
We can write that
θ3 = θ1 + θ2 - 2 R12
θ3 = θ1 + θ2 - 2 C θ1 θ2
If C = 1, θ3 = θ1 + θ2 - 2 θ1 θ2
The recombination fraction is a non-additive measurement. However, when C = 1 we can write
(1 - 2θ3) = (1 - 2θ1)(1 - 2θ2)
If x(θ) = k Log (1 - 2θ)
then we have x(θ3) = x(θ1) + x(θ2)
and for k = -1/2, x(θ) ≈ θ for small values of θ.
x(θ) = -1/2 Log (1 - 2θ) is an additive measurement.
It is known as the genetic distance, and is measured in Morgans. It can be shown that x
measures the mean number of crossings-over.
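The additivity of x can be verified numerically. The sketch below assumes that "Log" denotes the natural logarithm, in which case x(θ) is the map function usually attributed to Haldane. The function names are illustrative.

```python
from math import exp, log


def theta_ac(theta_ab, theta_bc, c=1.0):
    """theta3 = theta1 + theta2 - 2 C theta1 theta2 (C = 1: no interference)."""
    return theta_ab + theta_bc - 2.0 * c * theta_ab * theta_bc


def map_distance(theta):
    """x(theta) = -1/2 ln(1 - 2 theta), in Morgans; requires theta < 1/2."""
    return -0.5 * log(1.0 - 2.0 * theta)


def recombination_fraction(x):
    """Inverse of map_distance: theta = (1 - exp(-2x)) / 2."""
    return 0.5 * (1.0 - exp(-2.0 * x))


if __name__ == "__main__":
    t1, t2 = 0.10, 0.20
    t3 = theta_ac(t1, t2)                     # no interference
    print("theta3:", t3)                      # smaller than t1 + t2
    print("x(theta3):", map_distance(t3))
    print("x(theta1) + x(theta2):", map_distance(t1) + map_distance(t2))
```

Without interference the two printed distances agree, whereas θ3 itself is smaller than θ1 + θ2. With C different from 1 the equality no longer holds for this particular x.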
Test for the presence of interference
Let us consider a sample of families genotyped at loci A, B and C. Let Lc be the
greatest likelihood for θ1, θ2, θ3 and L1 the greatest likelihood when we impose the
constraint C = 1
(i.e. θ3 = θ1 + θ2 - 2 θ1 θ2)
Then -2 Log (L1/Lc) follows a χ² distribution, with one degree of freedom.
II- GENETIC HETEROGENEITY OF LOCALIZATION
The analysis of genetic linkage can be complicated by the fact that mutations of several
genes, located at different places on the genome, can give rise to the same disorder. This
is known as genetic heterogeneity of localization. One of the following two tests is used
to identify heterogeneity of this type, the "Predivided sample test" or the "Admixture
Test". The first test is usually only appropriate if there is a good family stratification
criterion or if each family individually has high informativity.
II- 1. THE PREDIVIDED SAMPLE TEST
This test is intended to demonstrate linkage heterogeneity in different sub-groups of a
sample of families. The aim is to test whether the genetic linkage between a disease and
its marker(s) is the same in all sub-groups. These groups are formed ad hoc on the basis
of clinical or geographical criteria, etc.
Let us assume that the total sample of families has been divided into n sub-groups (it is
possible to test for as many sub-groups as there are families). θi denotes the true value of
the recombination fraction of sub-group i.
We want to test the null hypothesis H0: θ1 = θ2 = θ3 = … = θn against the alternative
hypothesis H1: the values of θi are not all equal.
Under H0, the quantity Q
Figure 16
follows a χ² distribution with (n-1) degrees of freedom. The homogeneity of the sample
for linkage is rejected with a type I error equal to α if Q is above the critical
threshold χ²(n-1) corresponding to α.
II- 2. THE ADMIXTURE TEST
Unlike the previous test, the "admixture test" is not based on an ad hoc subdivision of the
families. It is assumed that among all the families studied genetic linkage between the
disease and the marker is found only in a proportion α of the families, with a
recombination fraction θ < 1/2. In the remaining (1-α) families, it is assumed that there is
no linkage with the marker (θ = 1/2).
For each family i of the sample, the likelihood is calculated
Li(α, θ) = α Li(θ) + (1-α) Li(1/2),
where Li(θ) is the likelihood of θ for family i. The likelihood of the couple (α, θ) is
defined by the product of the likelihoods associated with all the families:
L(α, θ) = Πi Li(α, θ)
We test to find out whether α is significantly different from 1 by comparing Lmax(α = 1, θ),
the maximized likelihood for θ assuming homogeneity, and Lmax(α, θ), the
maximized likelihood for the two parameters α and θ (nested models).
Then the variable Q = 2 [Ln Lmax(α, θ) - Ln Lmax(α = 1, θ)]
follows a χ² distribution with one degree of freedom.
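The admixture likelihood can be sketched as follows. This is a toy illustration, not the procedure of any published program: each family is assumed to be phase-known and summarized by hypothetical counts of recombinant and non-recombinant meioses, so that Li(θ) = θ^r (1-θ)^n, and the maximization is done by a plain grid search. All names and data are invented for the example.

```python
from math import log


def family_likelihood(theta, n_rec, n_nonrec):
    """Hypothetical phase-known family: Li(theta) = theta^r (1 - theta)^n."""
    return theta**n_rec * (1.0 - theta)**n_nonrec


def log_lik(alpha, theta, families):
    """Ln L(alpha, theta) = sum_i Ln[alpha Li(theta) + (1 - alpha) Li(1/2)]."""
    return sum(
        log(alpha * family_likelihood(theta, r, n)
            + (1.0 - alpha) * family_likelihood(0.5, r, n))
        for r, n in families
    )


def admixture_test(families, steps=100):
    """Return (Q, alpha_hat, theta_hat) with
    Q = 2 [Ln Lmax(alpha, theta) - Ln Lmax(alpha = 1, theta)]."""
    thetas = [0.5 * i / steps for i in range(1, steps + 1)]  # skip theta = 0
    alphas = [j / steps for j in range(steps + 1)]
    best = max((log_lik(a, t, families), a, t) for a in alphas for t in thetas)
    best_homogeneous = max(log_lik(1.0, t, families) for t in thetas)
    return 2.0 * (best[0] - best_homogeneous), best[1], best[2]


if __name__ == "__main__":
    # (recombinants, non-recombinants) per family: two families look tightly
    # linked, two look unlinked. These counts are made up.
    sample = [(0, 8), (0, 8), (4, 4), (4, 4)]
    q, alpha_hat, theta_hat = admixture_test(sample)
    print("Q =", q, "alpha =", alpha_hat, "theta =", theta_hat)
```

Because α = 1 belongs to the grid, Q is never negative here; it is then compared with the χ² threshold given in the text.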
II- 3. GENERALIZATION OF THE ADMIXTURE TEST
In some single-gene diseases, several genes have been shown to exist at different
locations. This is true, for example, of multiple exostoses, for which 3 genes have
been identified successively on 3 different chromosomes. The "admixture test" is then
extended to determine the proportion of families in which each of the three genes is
implicated (Legeai-Mallet et al, 1997), and the possibility that there is a fourth gene.
The three locations on chromosomes 8, 19 and 11 were denoted E1, E2 and E3, and the
proportions of families concerned α1, α2 and α3 respectively. α4 was used to represent
the proportion of the families in which another location was involved.
For each family i of the sample, the likelihood was calculated using the observed
segregation within the family of the markers available in each of the three regions,
according to the clinical status of each of its members.
Li(E1, E2, E3, α1, α2, α3 | Fi) = α1 [L(E1 | Fi) / L(E1 = 1/2 | Fi)]
+ α2 [L(E2 | Fi) / L(E2 = 1/2 | Fi)]
+ α3 [L(E3 | Fi) / L(E3 = 1/2 | Fi)] + α4
For all the families
L(E1, E2, E3, α1, α2, α3 | Ft) = Πi Li(E1, E2, E3, α1, α2, α3 | Fi)
Each αi can be tested to see if it is equal to 0, and then the corresponding non-null αi and
Ei values are estimated.
It is also possible to calculate the probability that the gene implicated is at E1, E2 or E3
for each of the families in the sample. The post hoc probability makes use of the
estimated αi proportions, but also the specific observations in this family.
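One natural way to obtain such post hoc probabilities is Bayes' rule applied to the terms of the family likelihood. The text does not spell out the formula, so the sketch below is an assumption: each term αk x (likelihood ratio at Ek) is divided by the sum of all four terms. The function name and the likelihood ratios in the example are hypothetical; the proportions are the estimates quoted in the text.

```python
def posterior_location_probs(alphas, likelihood_ratios):
    """Post hoc probabilities for one family.

    alphas: estimated proportions (alpha1, alpha2, alpha3, alpha4).
    likelihood_ratios: L(Ek | Fi) / L(Ek = 1/2 | Fi) for k = 1, 2, 3.
    Returns four probabilities: gene at E1, E2, E3, or elsewhere.
    """
    weights = [a * lr for a, lr in zip(alphas[:3], likelihood_ratios)]
    weights.append(alphas[3])  # no linkage to any region: the ratio is 1
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    alphas = (0.48, 0.24, 0.28, 0.0)   # proportions reported in the text
    informative = (50.0, 0.1, 0.2)     # made-up ratios favouring E1
    uninformative = (1.0, 1.0, 1.0)    # markers carry no information
    print(posterior_location_probs(alphas, informative))
    print(posterior_location_probs(alphas, uninformative))
```

For a family whose markers carry no information, all ratios equal 1 and the post hoc probabilities simply reproduce the estimated proportions.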
The sample investigated has been shown to consist of three types of families: in 48% of
families, the gene is located on chromosome 8, in 24% of them on chromosome 19, and
in 28% of families the gene is located on chromosome 11. There was no evidence of a
fourth location in this sample.
The post hoc probabilities of belonging to one of these 3 sub-groups were then estimated:
the probability that the gene implicated would be on chromosome 8 was over 90% for 5
families, that it would be on chromosome 19 for 3 of them, and that it would be on
chromosome 11 for 4 families. For the other families, the situation was less clear-cut: the
post hoc probabilities are similar to the a priori probabilities because of the paucity of
information provided by the markers used.
III- 1.2. MAXIMIZATION OF THE LOD SCORE OVER THE [0, 1/2] INTERVAL
(E. Génin, Ann Hum Genet,1995,59:123-132)
However, in practice, the test is never carried out for a single value θ1, but is done as
follows: the lod score is calculated for various values of θ1, the maximum lod score Zmax
is determined and the test is applied to Zmax. A criterion of +3, or even less, is used to
conclude that there is linkage, based on the argument that the risk of error remains
sufficiently small. The post hoc probability of no linkage is never calculated.
Considering a composite alternative hypothesis by using the maximum lod score, Zmax
(which amounts to testing H0: θ = 1/2 versus H1: θ < 1/2), actually reduces the reliability
of the test considerably. Thus, the probability that there is no linkage when a Zmax of
+3 has been obtained can be as high as 16.4%, i.e. more than three times the probability
calculated by Morton (1955).
The table below shows the probability that linkage does not exist as a function of the Zmax
obtained.
Figure 19
The relationship between this probability and Zmax depends on the type of family structure
and the mode of inheritance of the disease (in this case the calculation has been carried
out for a dominant disease in a sample of nuclear families with two children). The
reliability is 1 minus this probability.
The conflicting results obtained for Alzheimer's disease are a good illustration of the
usefulness of calculating the post hoc probability of linkage.
Alzheimer's disease is a form of dementia characterized by loss of memory and of
cognitive function. Only a few families have multiple cases, but within this sub-group of
families, the distribution of the patients is compatible with the hypothesis of a
dominant mutation in an autosomal gene. Linkage analyses by the lod score method were
therefore carried out to localize the gene involved. In 1987, a maximum lod score of
+2.46 was obtained using a chromosome 21 marker in a large pedigree with numerous
affected members (family FAD4), and this at first led people to conclude that the mutation
responsible was located on chromosome 21 (St George-Hyslop et al. 1987). For many years,
research into this disease was therefore focused on this chromosome. Five years later,
however, several different teams provided a very significant demonstration of linkage with
chromosome 14 markers. The very high lod scores that were obtained showed that most of
the early-onset familial forms were due to a mutation of a gene on chromosome 14
(Schellenberg et al. 1992, St George-Hyslop et al. 1992). In particular, in the case of
family FAD4, a lod score of +5.21 was obtained with markers for this region. In view of
the observations obtained for chromosome 21 markers in FAD4, the post hoc probability
that there was no linkage was 1/3. It is likely that if this calculation had been done in
1987, the existence of a mutation on chromosome 21 in this family would have looked less
convincing. Furthermore, it has now been shown that the gene implicated is located on
chromosome 14.
III-3. THE PROBLEM OF MULTIPLE TESTS
One of the difficulties encountered in the statistical interpretation of linkage analyses
of complex diseases is that, in general and more or less explicitly, the data are
subjected to multiple tests: several clinical classifications, several genetic markers,
several models, several samples. Clearly, the decision thresholds usually used in the lod
score test no longer have the same statistical significance when several tests are applied
simultaneously to the same sample or to several samples. E. Thompson (1984) has
investigated this problem in the case of a disease involving a single gene for which the
genetic linkage is tested using several markers located on different chromosomes (and
therefore independent). The situation is much more complex for multifactorial diseases,
because the multiplicity of the tests has several types of impact and these are not
independent (Clerget-Darpoux et al, 1990).
Multiple tests could be taken into account by adjusting the decision threshold of
the lod score test. However, on the one hand, it is not always clear from the publications
which tests have actually been carried out, and on the other, this can make the test too
conservative. This is why we think that the replication strategy should be favored.
If a positive result is replicated in a new sample (using the same classification, the same
marker and the same transmission model), this provides a reliable level of significance.