QTL analysis in Mouse Crosses

How many genes?
Mapping mouse traits, cont.
Lecture 3, Statistics 246
January 27, 2004
Inferring linkage and mapping markers
We now turn to deciding when two marker loci
are linked, and if so, estimating the map distance
between them. Then we go on and create a full
(marker) map of each chromosome, relative to
which we can map trait genes. With these
preliminaries completed, we can map trait loci.
The LOD score
Suppose that we have two marker loci, and we don’t
know whether or not they are linked. A natural way to
address this question is to carry out a formal test of
the null hypothesis H: r=1/2 against the alternative
K: r< 1/2, using the marker data from our cross.
The test statistic almost always used in this context is
log10 of the ratio of the likelihood at the maximum
likelihood estimate r̂ to that at the null, r=1/2, i.e.
$$\mathrm{LOD} = \log_{10}\left\{ \frac{L(\hat{r})}{L(1/2)} \right\}$$
Calculating the LOD score
Recall that the (log) likelihood here is based on the
multinomial distribution for the allocation of n=132
intercross mice into their nine 2-locus genotypic
categories. As we saw earlier, it can be written
$$\log_{10} L(r) = \sum_i n_i \log_{10} p_i(r)$$
and so we take the difference between this function
evaluated at r̂ and at r=1/2, which is
$$\mathrm{LOD} = \sum_i n_i \log_{10} \left\{ p_i(\hat{r}) / q_i \right\}$$
where q_i is 1/16, 1/8 or 1/4, depending on i.
Null probabilities of 2-locus genotypes
L1 \ L2    A       H       B
A          1/16    1/8     1/16
H          1/8     1/4     1/8
B          1/16    1/8     1/16
This is just putting r = 1/2 in an earlier table.
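As an illustration of the LOD calculation, here is a minimal Python sketch. It assumes the standard no-interference probabilities p_i(r) for the nine 2-locus intercross genotypes (the "earlier table" is not reproduced in this lecture, so the formulas in `two_locus_probs` are filled in here as an assumption), it uses made-up counts for n = 132 mice, and it replaces a proper iterative maximizer with a crude grid search.

```python
import math

def two_locus_probs(r):
    """Probabilities of the nine 2-locus F2 genotypes, keyed by (locus 1, locus 2).

    Assumed model: each parent passes on a recombinant gamete with probability r.
    """
    par = (1 - r) ** 2 / 4             # AA, BB: two non-recombinant gametes
    dbl = r ** 2 / 4                   # AB, BA: two recombinant gametes
    one = r * (1 - r) / 2              # AH, HA, HB, BH: exactly one recombinant gamete
    hh = ((1 - r) ** 2 + r ** 2) / 2   # HH: zero or two recombinant gametes
    return {("A", "A"): par, ("A", "H"): one, ("A", "B"): dbl,
            ("H", "A"): one, ("H", "H"): hh,  ("H", "B"): one,
            ("B", "A"): dbl, ("B", "H"): one, ("B", "B"): par}

def log10_lik(r, counts):
    """Multinomial log10 likelihood, up to a constant not depending on r."""
    p = two_locus_probs(r)
    return sum(n * math.log10(p[g]) for g, n in counts.items() if n > 0)

def lod_score(counts, grid=5000):
    """Return (r_hat, LOD) using a grid search over r in (0, 1/2]."""
    rs = [0.5 * k / grid for k in range(1, grid + 1)]
    r_hat = max(rs, key=lambda r: log10_lik(r, counts))
    return r_hat, log10_lik(r_hat, counts) - log10_lik(0.5, counts)

# Hypothetical counts for 132 intercross mice (not the lecture's data).
counts = {("A", "A"): 28, ("A", "H"): 4,  ("A", "B"): 0,
          ("H", "A"): 5,  ("H", "H"): 58, ("H", "B"): 6,
          ("B", "A"): 1,  ("B", "H"): 3,  ("B", "B"): 27}

r_hat, lod = lod_score(counts)
print(f"r_hat = {r_hat:.3f}, LOD = {lod:.1f}")
```

Setting r = 1/2 in `two_locus_probs` gives back the null probabilities 1/16, 1/8 and 1/4 tabulated above.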
Exercise: Suggest some different test statistics to discriminate between the
null H and the alternative K. How do they perform in comparison to the LOD?
Using the LOD score
Normal statistical practice would have us set a type 1 error rate in a given
context (cross, sample size), and determine the cut-off for the LOD which
would achieve approximately the desired error rate under the null hypothesis.
This approach is rarely adopted in genetics, where tradition dictates the use
of more stringent thresholds, which take into account the multiple testing
common in linkage mapping. The tradition was originally motivated by a Bayesian
argument, and in fact, Bayesian approaches to linkage analysis are
increasingly popular. Let us use Bayes’ formula in the form
log10 posterior odds = log10 prior odds + LOD,
where the odds are for linkage. With 20 chromosomes, which we might
assume to be approximately the same size, and not too long, the prior probability of two
random loci being on the same chromosome, and hence linked, is about
1/20. In order to overcome these prior odds against linkage, and achieve
reasonable posterior odds, say 100:1, we would want a LOD of roughly 3 or more.
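The arithmetic behind this argument can be checked with a few lines of Python. This is only a sketch of the formula log10 posterior odds = log10 prior odds + LOD; the function names are illustrative, and the prior probability 1/20 is the rough figure used above.

```python
import math

def posterior_odds(lod, prior_prob):
    """Posterior odds for linkage, given a LOD score and a prior probability of linkage."""
    prior_odds = prior_prob / (1 - prior_prob)
    return prior_odds * 10 ** lod

def lod_needed(target_odds, prior_prob):
    """LOD required to reach the target posterior odds for linkage."""
    prior_odds = prior_prob / (1 - prior_prob)
    return math.log10(target_odds / prior_odds)

prior = 1 / 20                      # prior probability of linkage, i.e. prior odds of 1:19
print(posterior_odds(3, prior))     # roughly 53 to 1
print(lod_needed(100, prior))       # roughly 3.3
```

With this prior, a LOD of 3 gives posterior odds of about 50:1, and 100:1 needs a LOD a little above 3, so the traditional threshold is best read as a round number.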
Linkage groups
And so it has come to pass that a LOD must be >3 to get
people’s attention. We’ll be a little more precise later.
The next step is to define what are called linkage groups.
These partition the markers into classes, every pair of markers
being either closely linked (i.e. r ≈ 0), or being connected by a
chain of markers, each consecutive pair of which is closely
linked. In practice, we might define closely linked to be
something like
a) r̂ < c1, and b) LOD(r̂) > c2, where e.g. c1 = 0.2, c2 = 3.
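One way to read this definition is that linkage groups are the connected components of a graph whose edges join closely linked pairs of markers. The sketch below implements that reading with a small union-find; the marker names and the pairwise estimates are invented for illustration, and `linkage_groups` is a hypothetical helper, not a function from any package.

```python
def linkage_groups(markers, pairwise, c1=0.2, c2=3.0):
    """Partition markers into linkage groups.

    pairwise maps (marker1, marker2) -> (r_hat, lod). Two markers are joined
    when r_hat < c1 and lod > c2; groups are the resulting connected components.
    """
    parent = {m: m for m in markers}

    def find(m):
        # Follow parent pointers to the group representative (with path halving).
        while parent[m] != m:
            parent[m] = parent[parent[m]]
            m = parent[m]
        return m

    for (a, b), (r_hat, lod) in pairwise.items():
        if r_hat < c1 and lod > c2:
            parent[find(a)] = find(b)

    groups = {}
    for m in markers:
        groups.setdefault(find(m), []).append(m)
    return sorted(groups.values())

# Hypothetical estimates: M1-M3 are not closely linked directly,
# but are connected through the chain M1-M2-M3.
markers = ["M1", "M2", "M3", "M4", "M5"]
pairwise = {("M1", "M2"): (0.05, 12.0), ("M2", "M3"): (0.10, 8.0),
            ("M1", "M3"): (0.25, 2.5),  ("M3", "M4"): (0.45, 0.2),
            ("M4", "M5"): (0.08, 9.0)}
groups = linkage_groups(markers, pairwise)
print(groups)
```

Loosening or tightening `c1` and `c2` changes which edges exist, and hence how the markers fall into groups.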
Forming linkage groups, cont.
When one tries to form linkage groups, it is not unusual to have
to vary c1 and c2 a little, until all markers fall into a group of
more than just one marker. When this is done, it is hoped that
the linkage groups correspond to chromosomes. If the
chromosome number of the species is known, and that
coincides with the number of linkage groups, this is a
reasonable presumption. But much can happen to dash this
hope: one may have two linkage groups corresponding to
different arms of the same chromosome, and not know that;
one can have a marker at the end of one chromosome “linked”
to a marker at the end of another chromosome, though this
should be rare if there is plenty of data; and so on.
Ordering linkage groups
Next we want to order the markers in a linkage group (ideally,
on a chromosome). How do we do that? An initial ordering can
be obtained by starting with one of the markers, M1 say, from the most
distant pair, distance here being recombination fraction or map
distance. Call M2 the closest marker to M1, and continue in this
way.
Now we want to confirm our ordering. One way is to calculate a
(maximized) log likelihood for every ordering, and select the
one with the largest log likelihood. But if we have (say) 11
markers on a chromosome, this is 11! ≈ 4 × 10^7 orders. What
people often do is take moving k-tuples of markers, and
optimize the order of each, e.g. with k = 3 or 4. Whichever
strategy one adopts, multi (i.e. >2) locus methods are needed.
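The initial ordering heuristic can be sketched as follows. This is one reading of the recipe (start at one end of the most distant pair, then repeatedly append the unplaced marker closest to the last one placed); the marker names and positions are hypothetical, and `initial_order` is an illustrative helper.

```python
import math

def initial_order(markers, dist):
    """Greedy initial ordering: dist(m1, m2) returns a pairwise distance."""
    pairs = [(a, b) for i, a in enumerate(markers) for b in markers[i + 1:]]
    start, _ = max(pairs, key=lambda p: dist(*p))   # one end of the most distant pair
    order = [start]
    remaining = set(markers) - {start}
    while remaining:
        nxt = min(remaining, key=lambda m: dist(order[-1], m))
        order.append(nxt)
        remaining.remove(nxt)
    return order

# Hypothetical marker positions in cM, supplied in scrambled order.
position = {"D1": 0, "D2": 12, "D3": 30, "D4": 41, "D5": 70}
scrambled = ["D3", "D1", "D5", "D2", "D4"]
order = initial_order(scrambled, lambda a, b: abs(position[a] - position[b]))
print(order)

# Exhaustive search over all orders of 11 markers is what makes confirmation expensive.
print(math.factorial(11))   # 39916800
```

With noisy estimated distances the greedy order can be wrong, which is why it is treated only as a starting point for the likelihood comparison.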
Likelihoods for 3-locus data
Suppose that we have 3 markers M1, M2 and M3 in that order. How do we
calculate the log likelihood of the associated 3-locus marker data from our
intercross?
Recalling the discussion preceding the Punnett square of the last lecture,
the parental haplotypes here are a1a2a3 and b1b2b3, while there would be no
fewer than 6 forms of recombinant haplotypes:
the four single recombinants a1a2b3, a1b2b3, b1b2a3 and b1a2a3,
and the two double recombinants a1b2a3 and b1a2b3.
Proceeding as before, we calculate the probability of each of these in terms
of the recombination fractions r1 and r2 across intervals M1-M2 and M2-M3,
respectively. For simplicity, we assume the Poisson model, with
independence of recombination across disjoint intervals. For example,
a1a2a3 would have probability (1-r1)(1-r2)/2, a1a2b3 would have probability
(1-r1)r2/2, while a1b2a3 would have probability r1r2/2, so that the 8
haplotype probabilities sum to 1.
We would do this for every one of the 8 paternal and 8 maternal haplotypes,
and then collect them up to assign probabilities for each of the 3^3 = 27 3-locus
genotypes (AAA, AAH, …, BBB), and maximize the multinomial likelihood in
the parameters r1 and r2. This is just as in the 2-locus case.
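The bookkeeping just described can be sketched in Python. The code assumes independence of recombination across the two intervals, gives each gamete a factor 1/2 for the choice of parental strand (so that the eight haplotype probabilities sum to one), and collapses the 8 × 8 pairs of paternal and maternal haplotypes into the 27 observable genotypes. Function names are illustrative.

```python
from itertools import product

def haplotype_probs(r1, r2):
    """Probability of each of the 8 three-locus haplotypes in a gamete."""
    probs = {}
    for h in product("ab", repeat=3):
        p = 0.5                                # which parental strand at locus 1
        p *= r1 if h[0] != h[1] else 1 - r1    # recombination in interval M1-M2?
        p *= r2 if h[1] != h[2] else 1 - r2    # recombination in interval M2-M3?
        probs[h] = p
    return probs

def genotype_probs(r1, r2):
    """Probability of each of the 27 three-locus genotypes (AAA, AAH, ..., BBB)."""
    hap = haplotype_probs(r1, r2)
    code = {("a", "a"): "A", ("b", "b"): "B", ("a", "b"): "H", ("b", "a"): "H"}
    geno = {}
    for (pat, p_pat), (mat, p_mat) in product(hap.items(), repeat=2):
        g = "".join(code[(x, y)] for x, y in zip(pat, mat))
        geno[g] = geno.get(g, 0.0) + p_pat * p_mat
    return geno

geno = genotype_probs(0.1, 0.2)
print(len(geno), round(sum(geno.values()), 12))
print(geno["AAA"], geno["HHH"])
```

The multinomial log likelihood is then the sum over genotypes of count times log probability, maximized numerically over r1 and r2.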
Multilocus linkage: #loci >3
It should have become clear by now that the strategy just
outlined is not going to work too easily when there are (say) 11
loci in a linkage group.
In that case, haplotypes are strings of the form a1a2b3 … a10b11,
where there are just 2 parental and 2^11 - 2 distinct recombinant
haplotypes. The number of combinations of paternal and maternal
haplotypes is the square of the total number of haplotypes, 2^22,
and they must be mapped into 3^11
11-locus genotypes, and a multinomial MLE carried out to
estimate 10 recombination fractions. What can be done?
In 1987 the first large scale human genetic map was published,
and at the same time a new algorithm was announced for both
human pedigrees and experimental crosses, such as our
intercross. This algorithm made use of hidden Markov models,
and for the first time allowed full likelihood calculations in our
current context without the exponential blow-up just described.
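To give a feel for why a hidden Markov model avoids the blow-up, here is a minimal forward-algorithm sketch for one intercross mouse. It is not the published algorithm, only an illustration under simplifying assumptions: no interference, no genotyping error, and four hidden states per locus (the ordered pair of paternal and maternal alleles), of which the two heterozygous states are observed as the same genotype H. All names are illustrative.

```python
def intercross_likelihood(genotypes, rec_fracs):
    """Likelihood of one F2 mouse's ordered marker genotypes.

    genotypes: list of 'A', 'H', 'B' or None (missing), one entry per marker.
    rec_fracs: recombination fractions between adjacent markers.
    """
    states = [("a", "a"), ("a", "b"), ("b", "a"), ("b", "b")]
    observed_as = {("a", "a"): "A", ("a", "b"): "H", ("b", "a"): "H", ("b", "b"): "B"}

    def emit(state, g):
        # Missing data are compatible with every hidden state.
        return 1.0 if g is None or observed_as[state] == g else 0.0

    def trans(s, t, r):
        # The paternal and maternal haplotypes recombine independently.
        pat = r if s[0] != t[0] else 1 - r
        mat = r if s[1] != t[1] else 1 - r
        return pat * mat

    alpha = [0.25 * emit(s, genotypes[0]) for s in states]
    for g, r in zip(genotypes[1:], rec_fracs):
        alpha = [emit(t, g) * sum(a * trans(s, t, r) for a, s in zip(alpha, states))
                 for t in states]
    return sum(alpha)

# Eleven markers cost ten small updates, rather than a sum over 2^22 haplotype pairs.
mouse = ["A", "A", "H", "H", None, "H", "B", "B", "B", "B", "B"]
print(intercross_likelihood(mouse, [0.1] * 10))
```

The likelihood for the whole cross is the product of such terms over mice, and the recombination fractions can then be estimated by maximizing it, for example with EM.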
Multilocus mapping: no details
I’m not going to cover this topic in detail this year, as I discussed it
a few years ago, and those interested can read it there:
www.stat.berkeley.edu/users/terry/Classes/s260.1998/index.html
We will meet hidden Markov models again pretty soon, as they
have become a common feature of statistical genetics and
computational biology since the early 1980s.
Now suppose that we have ordered our marker loci as just
described, either by maximizing the likelihood within linkage
groups over all orders, or by doing so in moving windows of
size 3-5. How do we look at the result?
Checking the map, after removal of bad markers
[Figure: pairwise recombination fractions and LOD scores; est.rf, plot.rf (from an R package)]
Top triangle is a transform of the recombination fraction, namely
-4(1 + log2 r).
Bottom triangle contains the LOD scores at the maximum likelihood
estimate of recombination fraction.
Notice the “bad” bits in the top LH and bottom RH corners.
Checking existing genetic maps
As indicated earlier, the markers in our cross came from MIT,
and they were already mapped. Most researchers would
simply use the pre-existing map, as this would usually (but not
always) be based on many more recombinations than could be
expected in a single cross. Why might we not just do the same?
Well, existing maps are rarely completely error-free, and one
should always look at one’s own data.
An added benefit of looking at one’s own data in relation to an
existing map is that this should bring to light markers with a
large number of genotyping errors, assuming the map is
correct.
Interplay between error detection and maps
• Genotyping errors in mouse crosses can usually
only be detected with the appearance of unusual
numbers of close recombination events
• This depends entirely on the quality of the map
• The availability of the mouse genome sequence
allows us to check genetic maps against the
physical maps: we locate the (unique) PCR
primers for our microsatellite markers. This has
brought a new era in quality of maps (includes
human genetic maps!).
The next slide depicts the genetic map we used.
Locations of our markers
After a commercial, we move on to mapping coat color genes.
R/qtl
Authors: Karl Broman, Hao Wu, Gary Churchill, Saunak Sen, & Brian Yandell
Benefits of using R/qtl
• Lots of graphics
• Good error detection with accompanying graphics
• Single and two QTL mapping (and interaction terms)
• Choice of several input formats
– Includes Mapmaker format
• Many alternatives for mapping methods
• Many different models for phenotypes, e.g.
standard normal, nonparametric model, binary
traits
Why map coat color genes in our C57/BL6 x NOD F2 intercross?
• the locations of these genes are known
• even with a modest number of mice we should
be able to map these genes easily
• it is a useful check that everything is as it should
be with our data
• and finally, it is a good exercise for us.
Exercise. Look up the agouti and albino loci at the
Mouse Genome Informatics database.
Recall our earlier Punnett square
Segregation data at a “random” marker
Phenotype by genotype at D12Mit51
(complete data only)
           A     B     H
Agouti    19    18    35
Black      8     3    18
White      9     7    12
Mapping a segregating trait
We turn now to mapping the two coat color genes segregating in
our cross, beginning with the albino locus, and then the agouti
locus. To do so, we need a genetic model, that is, we need to
know or guess the relation between genotypes at our trait loci
and phenotypes, which is embodied in the notion of a
penetrance function.
Looking at the preceding table, the albino trait segregates just
as though governed by a recessive gene, so we postulate a
locus with a recessive and a dominant allele for it. Although this
is not precisely the case for the non-agouti trait, it is almost, and
we do likewise.
Later we will consider their interaction.
Probabilities of albino-marker genotypes (×4)
Recall that the NOD mouse (A) is homozygous for the albino
allele, while the C57/BL6 (B) is homozygous for the non-albino
allele. We can collapse an earlier table to get:
Colour \ M    A            H              B
Albino        (1-r)^2      2r(1-r)        r^2
Full color    1-(1-r)^2    2 - 2r(1-r)    1-r^2
Here r is the recombination fraction between a marker and the albino locus.
Segregation data at the marker closest to Tyrc
Phenotype by genotype at D7Mit126
@ 50 cM (the Tyrc locus is at 44 cM)
           A     B     H
Agouti     3    19    47
Black      0    10    19
White     21     0     1
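As a rough check, the counts at D7Mit126 can be combined with the albino-marker genotype probabilities given earlier (each table entry divided by 4) to estimate r and a LOD score. This is a sketch only: it pools agouti and black as "full colour", uses a simple grid search in place of an iterative maximizer, and need not reproduce the values obtained from the full data set with R/qtl.

```python
import math

# Counts at D7Mit126: marker genotype -> (albino, full colour = agouti + black).
counts = {"A": (21, 3 + 0), "H": (1, 47 + 19), "B": (0, 19 + 10)}

def cell_probs(r):
    """Marker genotype -> (P(albino, genotype), P(full colour, genotype))."""
    return {"A": ((1 - r) ** 2 / 4, (1 - (1 - r) ** 2) / 4),
            "H": (2 * r * (1 - r) / 4, (2 - 2 * r * (1 - r)) / 4),
            "B": (r ** 2 / 4, (1 - r ** 2) / 4)}

def log10_lik(r):
    """Multinomial log10 likelihood, up to a constant not depending on r."""
    p = cell_probs(r)
    return sum(n * math.log10(q)
               for g in counts
               for n, q in zip(counts[g], p[g]) if n > 0)

grid = [k / 2000 for k in range(1, 1001)]    # r in (0, 1/2]
r_hat = max(grid, key=log10_lik)
lod = log10_lik(r_hat) - log10_lik(0.5)
print(f"r_hat = {r_hat:.4f}, LOD = {lod:.1f}")
```

The single white heterozygote and the three coloured A homozygotes are the recombinants that keep the estimate of r away from zero.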
Mapping the albino locus
Plot of LOD score at each marker along the genome
Chromosome 7 genotypes for the albino mice.
A: homozygous NOD, B: homozygous B6,
H: heterozygote. Genotypes are read down.
Pale blue shading is conserved NOD haplotype.
D7Mit128 is near the Tyrc locus.
Honesty in advertising, and LOD thresholds
There is more material in preparation here.
Please revisit this space in a day or so.
Approximate probabilities of agouti-marker genotypes (×4)
Recall that the C57/BL6 (B) is homozygous for non-agouti,
while the NOD (A) is homozygous agouti. Ignoring the 1/16 of
the intercross who would exhibit the non-agouti trait (and be
black) if they weren’t albino, we get the following approximate
table, where 1/16 of the mice will be misclassified. Here r is the
recombination fraction between a marker and the agouti locus.
Colour \ M    A         H              B
Non-black     1-r^2     2 - 2r(1-r)    1-(1-r)^2
Black         r^2       2r(1-r)        (1-r)^2
Segregation data at the marker closest to the agouti locus
Phenotype by genotype at D2Mit48
@ 87 cM (agouti locus is at 89 cM)
           A     B     H
Agouti    24     2    46
Black      0    28     1
White      5     6    14
Mapping the agouti locus
Plot of LOD score at each marker along the genome
Chromosome 2 genotypes for the black progeny.
Mauve shading indicates conserved C57/BL6 haplotype.
Marker D2Mit48 is very close to the agouti locus.
Conclusion: single locus mapping
• agouti locus (A,a alleles) on Chr 2 at 89.9 cM
• albino locus (C,c alleles) on Chr 7 at 44 cM
(now known as Tyrc gene)
• In the data set:
– at 89 cM on Chr 2 with a LOD score > 20
• Marker D2Mit48 (8th marker on Chr 2)
– at 43 cM on Chr 7 with a LOD score > 20
• Marker D7Mit126 (4th marker on Chr 7)
The method worked for agouti, even though
1/16th of the mice were misclassified
Acknowledgement
These last 3 lectures would not have been possible
without the very substantial input of Melanie Bahlo and
Tom Brodnicki of the Walter & Eliza Hall Institute of
Medical Research, Melbourne Australia.
Tom (together with people from the WEHI mouse
facility) carried out the cross, and did all the
phenotyping, while Melanie did all the data analysis
presented, and contributed a lot to the presentation.
Overall, responsibility for the presentation (especially
all the errors!) remains mine.
General exercise
Go through the last 3 lectures and redo as many of
the calculations as you can for the case of
a backcross rather than an intercross.
You will find it all simpler, and in every case,
closed form expressions appear, where we
needed iterative methods for the intercross.
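As a starting point for the exercise, the simplest case is the 2-locus LOD score in a backcross. Assuming fully informative markers, each of the n backcross progeny is either recombinant or not, so the likelihood is binomial, the estimate is the fraction of recombinants (capped at 1/2), and the LOD has a closed form. The function name is illustrative.

```python
import math

def backcross_lod(n_recombinant, n_total):
    """Closed-form MLE of r and LOD score for two markers in a backcross."""
    n_non = n_total - n_recombinant
    r_hat = min(n_recombinant / n_total, 0.5)   # restrict the estimate to [0, 1/2]

    def log10_lik(r):
        # Binomial log10 likelihood; terms with a zero count are skipped.
        out = 0.0
        if n_recombinant > 0:
            out += n_recombinant * math.log10(r)
        if n_non > 0:
            out += n_non * math.log10(1 - r)
        return out

    return r_hat, log10_lik(r_hat) - log10_lik(0.5)

r_hat, lod = backcross_lod(10, 100)
print(f"r_hat = {r_hat:.2f}, LOD = {lod:.2f}")
```

No grid search or iteration is needed here, in contrast with the intercross, where the double heterozygotes hide the number of recombinant gametes.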