Chapter 9
Applications of probability
In this chapter we use the tools of elementary probability to investigate problems of several kinds.
First, we study the “language of life” by focusing on the universal genetic code. We ask how that
code (consisting of four letters, called nucleotides or bases) is distributed over the 20 or so “words”
(amino acids) that make up functional units, the proteins. We find that there are interesting
properties of that code that make some amino acids more likely to occur than others (given random
sequences of bases). Next, we turn attention to how the sentences made up of these elementary
words are shaped by enzymes that cut DNA. We then turn to genetics, and ask how traits are
inherited and passed from one generation to the next. We explore how information about family
traits can help to answer questions about the likelihood of a genetic disease being transmitted. We
end the chapter with other examples of biological and non-biological applications of the laws of
chance.
9.1 The genetic code
All living things are made of basic building blocks called proteins. These proteins come in a rich
diversity, and each of these is in turn composed of elementary units called amino acids. A protein
is created from a linear chain of such units, assembled one unit at a time in a long string. As it
is being produced, the chain folds back on itself to produce a three-dimensional structure, like a
beaded necklace that curls and folds up on itself. The “beads” (representing amino acids) come in 20
varieties. All share an identical framework, with distinct “side chains” that give them each unique
properties: some are polar, and tend to favor interactions with water, while others are hydrophobic
- i.e. favor an environment that is protected from water. Although the primary structure of the
protein constrains the overall structure that can form, the side chains influence which amino acids
like to be in proximity to which (in the folded 3D protein), and which sections of the chain form
helical or flat portions of the protein. The sequence in which these “beads” are “strung together”,
i.e. the linear sequence of amino acids, determines the final structure of the protein.
All our cells are made up of a dazzling variety of proteins. Some are structural, and some are
functional “enzyme” proteins that carry out catalytic activity. Instructions for the assembly of each
and every one of these proteins are encoded in the genetic material, i.e. in DNA (deoxyribonucleic
acid). This “instruction book” is itself a linear chain of code, composed of letters that represent
v.2005.1 - January 5, 2009
1
Math 103 Notes
[Figure: nucleotides on DNA strand showing 4 codons; amino acid sequence on protein; 3D protein structure composed of amino acid subunits]
Figure 9.1: The sequence of nucleotides on a DNA chain represents a sequence of amino acids along
the length of a protein. (DNA is actually a double-helix, with two intertwined complementary
strands, but here we show only one of these strands). The strand illustrates four typical codons,
and their corresponding amino acids (I=Isoleucine, E=Glutamic acid, H=Histidine, A=Alanine;
see the Appendix for common abbreviations of the names of the amino acids). The lower panel
shows the 3D structure formed by the folded protein chain.
the sequence of amino acids in each protein.
Let us consider one such portion of DNA that encodes a single protein, i.e. one gene. DNA is
composed of just four distinct units, called nucleotides. These four components, Adenine, Cytosine,
Guanine, and Thymine (A, C, G, T) arranged in a long “string”, one after the other, are translated
into the corresponding “necklace” of amino acids that compose a protein whenever the given gene is
activated. (We shall not discuss the details of the remarkable machinery that carries out this
translation, since our focus is on some aspects of the “code”.) DNA nucleotide “words”
have to spell out 20 distinct meanings, corresponding to the 20 amino acids listed in Table 9.1. We
first consider what features of this code follow from these simple facts.
1. DNA is composed of 4 nucleotides, (A, T, G, C).
2. These nucleotides, arranged one after the other, code for 20 amino acids.
Facts 1 and 2 imply that we cannot make a simple correspondence of one nucleotide representing
one amino acid. This implies that the code must contain longer “words”. Assuming that all words
(Rows give the first base of the codon, columns the second base; within each cell the third base runs T, C, A, G.)

First base   Second base T   Second base C   Second base A   Second base G
T            TTT Phe (F)     TCT Ser (S)     TAT Tyr (Y)     TGT Cys (C)
             TTC Phe (F)     TCC Ser (S)     TAC Tyr (Y)     TGC Cys (C)
             TTA Leu (L)     TCA Ser (S)     TAA STOP        TGA STOP
             TTG Leu (L)     TCG Ser (S)     TAG STOP        TGG Trp (W)
C            CTT Leu (L)     CCT Pro (P)     CAT His (H)     CGT Arg (R)
             CTC Leu (L)     CCC Pro (P)     CAC His (H)     CGC Arg (R)
             CTA Leu (L)     CCA Pro (P)     CAA Gln (Q)     CGA Arg (R)
             CTG Leu (L)     CCG Pro (P)     CAG Gln (Q)     CGG Arg (R)
A            ATT Ile (I)     ACT Thr (T)     AAT Asn (N)     AGT Ser (S)
             ATC Ile (I)     ACC Thr (T)     AAC Asn (N)     AGC Ser (S)
             ATA Ile (I)     ACA Thr (T)     AAA Lys (K)     AGA Arg (R)
             ATG Met (M)     ACG Thr (T)     AAG Lys (K)     AGG Arg (R)
G            GTT Val (V)     GCT Ala (A)     GAT Asp (D)     GGT Gly (G)
             GTC Val (V)     GCC Ala (A)     GAC Asp (D)     GGC Gly (G)
             GTA Val (V)     GCA Ala (A)     GAA Glu (E)     GGA Gly (G)
             GTG Val (V)     GCG Ala (A)     GAG Glu (E)     GGG Gly (G)
Table 9.1: Table of genetic code
are the same length, we consider pairs of nucleotides. Then there are
4 × 4 = 16
possible two-nucleotide “words”, and this is still not enough to represent 20 amino acids. Consider
all three-nucleotide “words”. There are
4 × 4 × 4 = 64
such words. This would more than suffice, and if we assume that there are no “nonsense” words
(that represent no amino acid at all) then it is clear that many synonyms must exist, i.e. several
distinct three-nucleotide words that all represent the same amino acid. It is now known that, indeed,
the three-letter words envisioned here form the standard genetic code, i.e. the code shared
by most organisms on our planet. From now on, we refer to the three-nucleotide “word” that codes
for an amino acid as a codon. Table 9.1 lists the amino acids with their corresponding codons.
(We also observe that there are three codons that are used to “punctuate” the code, i.e. signal the
end of a sequence coding for one protein.) Simple arguments, phrased in terms of probabilities, can
be used to study properties of this universal code of life. We shall use this biological example to
motivate some very elementary calculations in probability.
9.1.1 Simple amino acid probabilities
Let us imagine selecting one codon at random from a DNA sequence. How likely is it that the
codon represents a specific amino acid, e.g. Glycine (Gly)? To answer this question, we need to
know something about the probability of finding a given nucleotide in the three positions within a
codon. Some data about this is investigated in the problem set. Here we simplify the discussion
by assuming that each of the four nucleotides is equally likely to occur in each of the positions of
a given codon. (This is not, in fact, the case, but it serves the purpose of demonstrating essential
probability concepts.)
P(T) = P(A) = P(G) = P(C)
Since only these four nucleotides are used, it must be true that
P(T) + P(A) + P(G) + P(C) = 1
and therefore, the “equally likely” assumption means that P(T) = P(A) = P(G) = P(C) = 0.25.
This means that each of the nucleotides has a probability 1/4 of occurring in any spot. There are
4 × 4 × 4 = 64 possible combinations of three letters, as shown in Table 9.1. Out of these, four codons
(GGT, GGC, GGA, GGG) code for Glycine. The probability that a codon chosen at random codes
for Glycine is then P(Gly) = 4/64 = 1/16.
Which amino acid(s) would occur with highest probability? With lowest probability? Amino
acids that would be represented with greatest probability include Leucine (Leu), whose codons can
be any of (TTA, TTG, CTT, CTC, CTA, CTG). This means that the probability of leucine is
P(Leu) = 6/64 = 3/32. From Table 9.1 we can also see that Arginine (Arg), with codons
(CGT, CGC, CGA, CGG, AGA, AGG), and Serine (Ser), with codons (TCT, TCC, TCA, TCG, AGT, AGC),
have the same probability. The amino acids with the fewest codons are
Tryptophan (Trp), whose only codon is TGG, and Methionine (Met), whose only codon is ATG. Each of these would occur
with probability 1/64. What is the likelihood that the codon represents the end (STOP) of a protein
chain? There are three STOP sequences, so these occur with probability 3/64.
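The counting argument above can be checked mechanically. The following is a minimal Python sketch, assuming (as in the text) that all 64 codons are equally likely. The names `BASES`, `AMINO`, `CODE`, and `prob` are illustrative choices, not notation from these notes; `AMINO` is simply Table 9.1 written out as a 64-character string, with `*` standing for STOP.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

BASES = "TCAG"
# Table 9.1 as one string: first base varies slowest, third base fastest.
# One-letter amino acid codes; '*' marks a STOP codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each codon (e.g. "GGT") to its amino acid letter.
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Number of codons per amino acid, and the probability under equal likelihood.
counts = Counter(CODE.values())
prob = {aa: Fraction(n, 64) for aa, n in counts.items()}

for aa in ("G", "L", "R", "S", "W", "M", "*"):
    print(aa, counts[aa], prob[aa])
```

Using `Fraction` keeps the results as exact ratios such as 3/32 rather than rounded decimals, which makes them easy to compare with the values in the text.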
Amino acids with multiple (synonymous) codons
Since there are 64 possible distinct codons and only 20 amino acids, if the code were assigned
uniformly to all the amino acids, then on average there would be 64/20 = 3.2 (i.e. just over 3)
different codons representing a single amino acid. (We will refer to such codons as synonymous,
since they have the same interpretation). However, the assignment of codons is far from uniform. As
shown in Table 9.1, some amino acids have up to 6 synonyms (e.g. Leucine, Arginine, and Serine)
whereas others have only 1 (e.g. Methionine, Tryptophan). (The STOP mark is coded by three
distinct codons.) We could represent this information using a table, a bar graph, or a cumulative
function, just as we did for results of an experiment. An analysis of this type is carried out in the
problem set.
Mutations and codon volatility
A mutation is an event that changes one or more of the instructions encoded in a gene. The simplest
conceivable mutation is a single substitution of one nucleotide by some other nucleotide somewhere
along the DNA chain. This is called a single nucleotide mutation. Here we will assume that each
nucleotide in any codon has an equal probability of being replaced by this type of chance rare event.
We will examine how the amino acid represented by the given codon would change.
First, we observe that certain nucleotide substitutions result in synonymous codons for the same
amino acid. For example, the substitution of adenine for thymine in GGT to GGA preserves the
[Figure: each of the leucine codons CTG and TTG is linked to its nine single-nucleotide neighbors. CTG: ATG (Met), GTG (Val), TTG (Leu), CAG (Gln), CCG (Pro), CGG (Arg), CTA (Leu), CTC (Leu), CTT (Leu). TTG: ATG (Met), CTG (Leu), GTG (Val), TAG (Stop), TCG (Ser), TGG (Trp), TTA (Leu), TTC (Phe), TTT (Phe).]
Figure 9.2: A substitution in one nucleotide can change the meaning of a codon. Here we show two
codons that both represent the amino acid leucine, and the result of all possible single nucleotide
substitutions.
assignment of the amino acid glycine. The term volatility of a codon denotes the proportion of its
single nucleotide mutations that lead to different amino acids. Since only one letter of
the 3-letter codon is changed, and each of the 3 positions can be replaced by any of 3 other
nucleotides, there are 3 × 3 = 9 distinct neighbors of each codon that result from
a one-letter change. The volatility is computed as the fraction of such neighbors (excluding
the STOP codons) that result in new assignments. The volatility of a whole gene is defined as the
average volatility of its codons. In general, the higher the volatility, the more likely it is that a
mutation will result in a change in the structure of a protein. Some changes of this type would be
catastrophic, resulting in a broken or nonfunctional protein. Others would be beneficial, leading to
changes that improve the function, change the structure, or lead to new properties. One situation in
which gene volatility is desirable is the evolution of parasites that must rapidly alter their external
coats to avoid being detected and eliminated by a host immune system. The human malaria parasite,
Plasmodium falciparum, is one example in which high volatility is evident in numerous genes.
Examples
The amino acid Valine (Val) is coded by any of the four codons GTT, GTC, GTA, and GTG. Clearly,
changing the last nucleotide in any of these codons leads to a synonymous codon. A change in either of the
first two nucleotides leads to a distinct amino acid (and none leads to a termination sequence).
This means that all codons for Valine have the same volatility. To compute this volatility, we note
that the three possible substitutions in nucleotide 1 and the three possible substitutions in nucleotide 2 (for a
total of 6 possible changes) will be nonsynonymous. Thus the volatility of any of the Valine codons
is 6/9 = 0.667. Only one sequence, TGG, codes for Tryptophan. Of the 9 possible changes of a
single nucleotide, two (TAG and TGA) give STOP codons and are excluded; each of the remaining 7
leads to a different amino acid. Thus, tryptophan has volatility 7/7 = 1.
Not all codons for a given amino acid have the same volatility. In Figure 9.2 we show two codons
for leucine. The codon CTG has 5 neighbors that are not synonyms (Met, Gln, Arg, Pro, Val), so
its volatility is 5/9 = 0.556. The codon TTG has 6 nonsynonymous neighbors out of 8 (excluding
the STOP sequence TAG), so it has volatility 6/8 = 0.75.
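The volatility calculations in these examples can be reproduced with a short Python sketch. It uses the definition given above (STOP neighbors are excluded from the count); the helper names `neighbors` and `volatility` and the string encoding of Table 9.1 are illustrative assumptions, not part of these notes.

```python
from fractions import Fraction
from itertools import product

BASES = "TCAG"
# Table 9.1 as one string (one-letter codes, '*' = STOP), first base slowest.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def neighbors(codon):
    """Return the 9 codons that differ from `codon` in exactly one position."""
    return [codon[:i] + b + codon[i + 1:]
            for i in range(3) for b in BASES if b != codon[i]]

def volatility(codon):
    """Fraction of non-STOP neighbors that code for a different amino acid."""
    coding = [n for n in neighbors(codon) if CODE[n] != "*"]
    changed = [n for n in coding if CODE[n] != CODE[codon]]
    return Fraction(len(changed), len(coding))

for codon in ("GTT", "TGG", "CTG", "TTG"):
    print(codon, volatility(codon))
```

The volatility of a whole gene, as defined above, would then be the average of `volatility` over the codons of that gene.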
9.1.2 Cutting the strand: How restriction enzymes work
The genetic material in a cell is manipulated in many ways by enzymes that copy, splice, translate,
and rearrange it. Certain biomolecules, called restriction enzymes, are responsible for carefully
cutting DNA at specific “markers” based on its nucleotide sequence. In general, these enzymes look
for some multiple nucleotide pattern, and cut the strand of DNA at the location of that pattern.
(Many of these enzymes are now used in biological experiments to manipulate DNA artificially
in order to dissect it into manageable fragments for sequencing purposes.) We here consider an
example in which the probability of finding a nucleotide at a given position is used to compute the
mean length of fragments produced by a simplified DNA-cutting enzyme of this type. We simplify
the problem to its most basic level, to illustrate how probabilities of single nucleotides are combined
to answer more involved questions.
For simplicity, assume that a hypothetical restriction enzyme cuts a DNA strand repeatedly,
but only after a specific nucleotide, e.g. the base G, as shown in Figure 9.3. We ask what is the
mean length of fragments created?
[Figure: a DNA strand with a cut after the nucleotide G; annotation: P(no G here) × P(G here)]
Figure 9.3: A restriction enzyme (shown here as a “pair of scissors”) cuts a DNA strand. In this
simple example, the cut is always after the nucleotide G. To compute the mean length of fragments,
we first need to find the probability of a given length, ℓ. This is the product of probabilities that
no G is found in any of ℓ − 1 positions, followed by a G in the ℓ’th position.
Mean length of fragments of a simple restriction enzyme
We assume that all nucleotides appear randomly with equal probabilities. Then, as before,
P(G) = P(C) = P(A) = P(T) = 1/4.
A fragment of length ℓ results if the enzyme encounters ℓ − 1 bases that are NOT G, followed by base G in
the ℓ’th position. Thus, the probability of a fragment of length ℓ is:
\[
P(\text{fragment of length } \ell) = \underbrace{\frac{3}{4}\cdot\frac{3}{4}\cdots\frac{3}{4}}_{\ell-1\ \text{terms}}\cdot\frac{1}{4} = \left(\frac{3}{4}\right)^{\ell-1}\cdot\frac{1}{4}.
\]
The mean length of all fragments is found by computing
\[
\bar{\ell} = \sum_{\ell=0}^{\infty} \ell\,P(\ell)
\]
where we have used P (ℓ) to denote P(fragment of length ℓ). Then, by the above,
\[
\bar{\ell} = \sum_{\ell=0}^{\infty} \ell\left(\frac{3}{4}\right)^{\ell-1}\cdot\frac{1}{4} = \frac{1}{4}\sum_{\ell=0}^{\infty} \ell\,r^{\ell-1}
\]
where we have taken out the common factor (1/4) and used the notation r = 3/4 in the above.
In order to evaluate the above, we need a formula for the sum of the series shown here. In the
problem set, we show (using the derivative of a geometric series) that for any r such that |r| < 1,
\[
\sum_{k=0}^{\infty} k\,r^{k-1} = \frac{1}{(1-r)^2}.
\]
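As a brief sketch of why this formula holds (the standard argument, included here only as a reminder; the problem set works through the details), start from the geometric series and differentiate both sides with respect to r, which is permitted term by term when |r| < 1:
\[
\sum_{k=0}^{\infty} r^{k} = \frac{1}{1-r}
\quad\Longrightarrow\quad
\frac{d}{dr}\sum_{k=0}^{\infty} r^{k} = \sum_{k=0}^{\infty} k\,r^{k-1} = \frac{d}{dr}\left(\frac{1}{1-r}\right) = \frac{1}{(1-r)^2}.
\]
Note that the k = 0 term contributes nothing to the sum.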
Thus
\[
\bar{\ell} = \frac{1}{4}\cdot\frac{1}{(1-r)^2} = \frac{1}{4}\cdot\frac{1}{(1-(3/4))^2} = \frac{1}{4}\cdot\frac{1}{(1/4)^2} = 4
\]
Thus the mean length of the fragments is 4 nucleotides.
This result, for an enzyme that cuts after every occurrence
of G, makes intuitive sense: if
bases are equally likely, then G occurs in roughly 1/4 of the positions. Thus it would occur roughly
once per 4 nucleotides, giving rise to a 4-nucleotide fragment “on average”.
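The mean fragment length can also be checked numerically. The Python sketch below does this in two ways: by summing the first few hundred terms of the series, and by simulating a random strand and cutting it after every G. The function names, the strand length, and the random seed are illustrative assumptions; the simulation also assumes equally likely bases, as in the text.

```python
import random

def mean_length_from_series(p_cut=0.25, terms=500):
    """Partial sum of l * (1 - p_cut)**(l - 1) * p_cut for l = 1..terms."""
    r = 1.0 - p_cut
    return sum(l * r ** (l - 1) * p_cut for l in range(1, terms + 1))

def mean_length_simulated(n_bases=200_000, seed=0):
    """Cut a random strand after every G and average the fragment lengths."""
    rng = random.Random(seed)
    lengths, current = [], 0
    for _ in range(n_bases):
        current += 1
        if rng.choice("ACGT") == "G":   # the enzyme cuts after this base
            lengths.append(current)
            current = 0
    # Any incomplete fragment at the end of the strand is ignored.
    return sum(lengths) / len(lengths)

print(mean_length_from_series())
print(mean_length_simulated())
```

The series value agrees with 4 to within floating-point error, while the simulated value fluctuates slightly around 4 from one seed to another.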
9.2 Hardy-Weinberg genetics
Each of us has two entire sets of chromosomes: one set inherited from our mother, and one set
from our father. These chromosomes carry genes, the unit of genetic material that “codes” for
proteins and ultimately, through complicated biochemistry and molecular biology, determines all of
our physical traits.
We will investigate how a single gene (with two “flavors”, called alleles) is passed from one
generation to the next. We will consider a particularly simple situation, when the single gene
determines some physical trait (such as eye color). The trait (say blue or brown eyes) will be
called the phenotype, and the actual pair of genes (one on each parentally derived chromosome)
will be called the genotype.
Suppose that the gene for eye color comes in two forms that will be referred to as A and a. For
example, A might be an allele for blue eyes, whereas a could be an allele for brown eyes. Then each
individual must have one of the following pairs of combinations:
AA, aA, or aa.
(Note that aA and Aa will be considered to be equivalent here, i.e. the order of the letters in the
genotype is not important.)
Suppose we know that the fraction of all genes for eye color of type A in the population is p,
and the fraction of all genes for eye color of type a is q, where p + q = 1. (This means that there are
only two possibilities for the gene type.) Then we can interpret p and q as the probabilities
that a gene selected at random from the population will turn out to be type A (respectively a), i.e.,
P(A) = p, P(a) = q.
Now suppose we draw at random two alleles out of the (large) population. (This is like tossing
a coin twice.) We get the following cases, with the given probabilities (computed with the second
multiplication principle).
The probability of finding the genotype aA (or the equivalent Aa), say, is the same as the
probability of a AND A, and the probability of any genotype (aa, aA, AA, or Aa) is then the
product of the probabilities of its two alleles, as follows:

Genotype:      aA     AA     aa     Aa
Probability:   pq     p^2    q^2    pq
We do not distinguish between genotypes aA and Aa, since the “order” of the alleles does not
matter. Thus the probability of having a genotype which includes a and A (aA or Aa) is 2pq. If the
population size is N, then on average we would expect $Np^2$ individuals of type AA, $Nq^2$ of type
aa, and $2Npq$ individuals of the mixed type. The total probability of any of the genotypes is
\[
p^2 + 2pq + q^2 = (p + q)^2 = 1
\]
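As a small numerical illustration, the Python sketch below computes the three genotype probabilities for a given allele frequency p and the expected numbers of individuals in a population of size N. The function name and the sample values p = 3/10 and N = 1000 are hypothetical, chosen only for the example.

```python
from fractions import Fraction

def genotype_frequencies(p):
    """Probabilities of genotypes AA, Aa (either order), and aa when P(A) = p."""
    q = 1 - p
    return {"AA": p * p, "Aa": 2 * p * q, "aa": q * q}

p = Fraction(3, 10)          # hypothetical allele frequency
N = 1000                     # hypothetical population size
freqs = genotype_frequencies(p)

for genotype, f in freqs.items():
    print(genotype, f, "expected count:", N * f)
print("total probability:", sum(freqs.values()))
```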
9.2.1 Random non-assortative mating
We now examine what happens if mates are chosen (at random) and father and mother pass down
one or another copy of their alleles to the progeny. We investigate how the proportion of genes
of various types is arranged. In the table below, we show the possible genotypes of the mother
and father, and calculate the probability that mating of such individuals would occur under the
assumption that choice of mate is random - does not depend at all on “eye color”. We assume that
the allele from the father (carried by his sperm) is independent of the allele found in the mother’s
egg cell. This means that we can use the multiplicative property of probability to determine the
probability of a given combination of parental alleles (i.e. P(x, y) = P(x)P(y)).
For example, the probability that a couple chosen at random will consist of a woman of genotype
aA and a man of genotype aa is the product of the fraction of females that are of type aA and the
fraction of males that are of type aa. That is just $(2pq)(q^2) = 2pq^3$. Now let us examine the
distribution of possible offspring of various parents.
In the table, we note, for example, that in the case that the couple are both of type aA, each
parent can “donate” either a or A to the progeny, so we expect to see children of types aa, aA, AA
in the ratio 1:2:1.
We can now group together and summarize all the progeny of a given genotype, with the
probabilities that they are produced by one or another such random mating. Using this table, we
can then determine the probability of each of the three genotypes in the next generation.
                      Mother: AA (p^2)      Mother: aA (2pq)            Mother: aa (q^2)
Father: AA (p^2)      AA                    1/2 aA, 1/2 AA              Aa
                      p^4                   2p^3 q                      p^2 q^2
Father: aA (2pq)      1/2 aA, 1/2 AA        1/4 aa, 1/2 aA, 1/4 AA      1/2 aa, 1/2 Aa
                      2p^3 q                4p^2 q^2                    2p q^3
Father: aa (q^2)      Aa                    1/2 aA, 1/2 aa              aa
                      p^2 q^2               2p q^3                      q^4
Table 9.2: The frequency of progeny of various types in Hardy-Weinberg genetics can be calculated
as shown in this table. The genotype of the mother is shown across the top and the father’s genotype
is shown on the left column. The various progeny resulting from mating are the table entries, with
the probabilities directly underneath each genotype.
Problem
Find the probability that a random (Hardy Weinberg) mating will give rise to a progeny of type
AA.
Solution 1
Using Table 9.2, we see that there are only four ways that a child of type AA can result from a
mating: either both parents are AA, or one or the other parent is Aa, or both parents are Aa. Thus,
for children of type AA the probability is
\[
P(\text{child of type AA}) = p^4 + \frac{1}{2}(2pq\cdot p^2) + \frac{1}{2}(2pq\cdot p^2) + \frac{1}{4}(4p^2q^2)
\]
Simplifying leads to
\[
P(\text{child of type AA}) = p^2(p^2 + 2qp + q^2) = p^2(p + q)^2 = p^2
\]
In the problem set, we also find that the probability of a child of type aA is $2qp$, and the probability
of the child being type aa is $q^2$. We thus observe that the frequency of genotypes of the progeny
is exactly the same as that of the parents. This type of genetic makeup is termed Hardy-Weinberg
genetics.
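The claim that the genotype frequencies are unchanged in the next generation can be checked by enumerating every mating pair, in the spirit of Table 9.2. The Python sketch below is one way to do this under the same assumptions (random mating, each parent passing on either of its two alleles with probability 1/2). The function names and the sample value p = 3/10 are illustrative, not part of the notes.

```python
from fractions import Fraction
from itertools import product

def gamete_probs(genotype):
    """Probability that a parent of this genotype passes on allele A or a."""
    n_A = genotype.count("A")
    return {"A": Fraction(n_A, 2), "a": Fraction(2 - n_A, 2)}

def next_generation(p):
    """Genotype probabilities of the children of random matings, given P(A) = p."""
    q = 1 - p
    parents = {"AA": p * p, "Aa": 2 * p * q, "aa": q * q}
    child = {"AA": Fraction(0), "Aa": Fraction(0), "aa": Fraction(0)}
    # Loop over every (mother, father) pair; the mating probability is the product.
    for (mother, pm), (father, pf) in product(parents.items(), repeat=2):
        mating = pm * pf
        for (x, px), (y, py) in product(gamete_probs(mother).items(),
                                        gamete_probs(father).items()):
            key = "".join(sorted(x + y))   # order of alleles does not matter
            child[key] += mating * px * py
    return child

p = Fraction(3, 10)            # hypothetical allele frequency
print(next_generation(p))      # compare with p^2, 2pq, q^2
```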
Alternate solution
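[Figure: tree diagram. The father and the mother each have genotype AA (probability p^2) or Aa (probability 2pq); an AA parent passes on A with probability 1 and an Aa parent with probability 1/2; the child is AA with probability (pq + p^2) · (pq + p^2).]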
Figure 9.4: A tree diagram to aid the calculation of the probability that a child with genotype AA
results from random non-assortative (Hardy-Weinberg) mating.
In Figure 9.4, we show an alternate solution to the same problem using a tree diagram. Reading
from the top down, we examine all the possibilities at each branch point. A child AA cannot have
any parent of genotype aa, so both father and mother’s genotype could only have been one of AA
or Aa. Each arrow indicating the given case is accompanied by the probability of that event. (For
example, a random individual has probability 2pq of having genotype Aa, as shown on the arrows
from the father and mother to these genotypes.) Continuing down the branches, we ask with what
probability the given parent would have contributed an allele of type A to the child. For a parent
of type AA, this is certainly true, so the given branch carries probability 1. For a parent of type
Aa, the probability that A is passed down to the child is only 1/2. The combined probability is
computed as follows: we determine the probability of getting an A from the father (of type AA OR
Aa). This is P(A from father) = $(1/2)\cdot 2pq + 1\cdot p^2 = pq + p^2$. We multiply it by the similar probability
of getting A from the mother (of type AA OR Aa). (We must multiply, since we need A from the
father AND A from the mother for the genotype AA.)
Thus P(child of type AA) = $(pq + p^2)(pq + p^2) = p^2(q + p)^2 = p^2 \cdot 1 = p^2$.
It is of interest to investigate what happens when one of the assumptions we made is relaxed,
for example, when the genotype of the individual has an impact on survival or ability to reproduce.
9.3 Random walker
In this section we discuss an application of the binomial distribution to the process of a random
walk. As shown in Figure 9.5(a), we consider a straight (1 dimensional) path and an erratic walker
who takes steps randomly to the left or right. We will assume that the walker never stops. With
probability p, she takes a step towards the right, and with probability q she takes a step towards
the left. (Since these are the only two choices, it must be true that p + q = 1.) In Figure 9.5(b)
we show the walker’s position, x, plotted versus the number of steps (n) she has taken. (We may as
well assume that the steps occur at regular intervals of time, so that the horizontal axis of this plot
can be thought of as a time axis.)
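[Figure: (a) a walker on the x axis (positions −1, 0, 1) stepping left with probability q and right with probability p; (b) the position x plotted against the number of steps n.]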
Figure 9.5: A random walker in 1 dimension takes a step to the right with probability p and a step
to the left with probability q.
The process described here is classic, and often attributed to a drunken wanderer. In our case, we
could consider this motion as a 1D simplification of the random tumbles and swims of a bacterium
in its turbulent environment. It is usually the case that a goal of this swim is a search for some
nutrient source, or possibly avoidance of poor environmental conditions. We shall see that if the
probabilities of left and right motion are unequal (i.e. the motion is biased in one direction or
another) this swimmer tends to drift along towards a preferred direction.
In this problem, each step has only two outcomes (analogous to a trial in a Bernoulli experiment).
We could imagine the walker tossing a coin to determine whether to move right or left. We wish to
characterize the probability of the walker being at a certain position at a given time, and to find her
expected position after n steps. Our familiarity with Bernoulli trials and the binomial distribution
will prove useful in this context.
Example
(a) What is the probability of a run of steps as follows: RLRRRLRLLL
(b) Find the probability that the walker moves k steps to the right out of a total run of n
consecutive steps.
(c) Suppose that p = q = 1/2. What is the probability that a walker starting at the origin returns
to the origin on her 10’th step?
Solution
(a) The probability of the run RLRRRLRLLL is the product $pqpppqpqqq = p^5 q^5$. Note the
similarity to the question “What is the probability of tossing HTHHHTHTTT?”
(b) This problem is identical to the problem of k heads in n tosses of a coin. The probability of
such an event is given by a term in the binomial distribution:
\[
P(k \text{ out of } n \text{ moves to right}) = C(n, k)\,p^k q^{n-k}.
\]
(c) The walker returns to the origin after 10 steps only if she has taken 5 steps to the left (total)
and 5 steps to the right (total). The order of the steps does not matter. Thus this problem
reduces to the problem (b) with 5 steps out of 10 taken to the right. The probability is thus
\[
P(\text{back at 0 after 10 steps}) = P(\text{5 out of 10 steps to right}) = C(10, 5)\,p^5 q^5 = C(10, 5)\left(\frac{1}{2}\right)^{10} = \frac{10!}{5!\,5!}\cdot\frac{1}{1024} = 0.24609
\]
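The binomial probability in parts (b) and (c) is easy to evaluate directly. The following Python sketch uses `math.comb` from the standard library for C(n, k); the function name `prob_k_right` is an illustrative choice.

```python
from math import comb

def prob_k_right(n, k, p):
    """Probability of exactly k steps to the right in n steps, with P(right) = p."""
    q = 1 - p
    return comb(n, k) * p**k * q**(n - k)

# Part (c): back at the origin after 10 steps means 5 right and 5 left.
print(comb(10, 5))                 # number of orderings of 5 R's and 5 L's
print(prob_k_right(10, 5, 0.5))
```

Summing `prob_k_right(n, k, p)` over k = 0, ..., n gives 1, as it must for a probability distribution.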
Mean position
We now ask how to determine the expected position of the walker after n steps, i.e. how the mean
value of x depends on the number of steps and the probabilities associated with each step. After 1
step, with probability p the position is x = +1 and with probability q, the position is x = −1. The
expected (mean) position after 1 move is thus
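\[
\bar{x}_1 = p(+1) + q(-1) = p - q
\]
The number of steps to the right, k, follows a binomial distribution with mean np, and after n steps
the position is x = k − (n − k) = 2k − n. Thus the mean position after n steps is
\[
\bar{x}_n = 2np - n = n(p - q).
\]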
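The formula for the mean position can be verified by summing over the binomial distribution directly. In this Python sketch, a walk with k steps to the right (and n − k to the left) ends at x = 2k − n, and each such outcome is weighted by its binomial probability. The function name and the sample values n = 10, p = 0.7 are hypothetical.

```python
from math import comb

def mean_position(n, p):
    """Expected position after n steps, summing x = 2k - n over the binomial distribution."""
    q = 1 - p
    return sum((2 * k - n) * comb(n, k) * p**k * q**(n - k) for k in range(n + 1))

n, p = 10, 0.7                      # hypothetical values for illustration
print(mean_position(n, p))          # direct sum over all outcomes
print(n * (p - (1 - p)))            # the formula n(p - q)
```

With p = q = 1/2 the mean position is 0 for every n, reflecting the absence of any drift in an unbiased walk.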
9.4 Further examples and problems
This section is intended to help with practice of concepts of permutations and combinations, and
to provide further examples of calculations of discrete probability.
The Monty Hall problem
Monty Hall was the host of a television game-show called “Let’s Make a Deal”. This example
concerns a problem in probability named in his honour. The problem attracted media attention
when it was featured in a column written by Marilyn vos Savant. The game goes as follows: There
are three doors, and a contestant is offered the opportunity to win a car (behind one of the doors).
Behind the other two doors is a less attractive option (e.g. a goat). The contestant at first selects
one door. Monty Hall (who knows what is behind each door) opens one of the other two doors
to reveal a goat and asks: “Would you like to change your mind, or to stay with your original
selection?”. The question to be answered is whether, given this information, it is a better strategy
to switch or to stay with the first door.
We analyze this problem with the tree diagram shown in Figure 9.6. The question we ask
is whether the odds of winning are greater if the contestant switches or stays with the original
selection. Starting from the top of the diagram, we show each possibility with a “branch”, and
assign a probability, assuming (as usual) that it is equally likely that the car is behind any of the
three doors, and that Monty Hall always opens one of the other doors that hides a goat.
For example, if your selection is labeled A, and it so happens that there is a goat behind A, then
the car is equally likely to be behind doors B or C (probability 1/2 each). Monty will certainly open
the other door with the goat (probability 1). In that case, switching leads to a win. We compute
the probability of those events by multiplying the assigned probabilities down the length of each
branch, and then adding the results of the relevant branches (as shown in the dotted lines at the
base of the diagram in Figure 9.6).
In case the car is actually behind the initially selected door, staying leads to a win, but as shown
in the calculation, this happens with lower probability based on the assumptions in the problem.
Therefore, the winning strategy is to switch your selection after Monty Hall opens one of the doors.
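Readers who find this conclusion surprising may want to test it by simulation. The Python sketch below plays the game many times under the stated assumptions (the car is equally likely to be behind any door, and Monty always opens a goat door other than the contestant's choice). The function names, the number of trials, and the random seed are illustrative choices.

```python
import random

def play(switch, rng):
    """Play one game; return True if the contestant wins the car."""
    doors = ["A", "B", "C"]
    car = rng.choice(doors)
    pick = "A"                      # the contestant's first choice, as in Figure 9.6
    # Monty opens a door that is neither the contestant's pick nor the car.
    opened = rng.choice([d for d in doors if d != pick and d != car])
    if switch:
        # Switch to the one remaining closed door.
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car

def win_rate(switch, trials=100_000, seed=1):
    """Fraction of games won when always switching (or always staying)."""
    rng = random.Random(seed)
    return sum(play(switch, rng) for _ in range(trials)) / trials

print("stay:  ", win_rate(switch=False))
print("switch:", win_rate(switch=True))
```

The two estimated win rates should come out close to 1/3 and 2/3 respectively, up to sampling fluctuation.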
Example 2’
A class of 12 students is divided into three equal teams of 4 to work together on three history group
projects. The first team will investigate the civilization of the Mayas (project 1), the second team
will research the Aztecs (project 2), and the third team will work on the Incas (project 3). How
many different ways are there of forming the teams? Assume that the order of individuals within a
team does not matter, but the order in which teams are picked determines which team gets project
1, or project 2, etc.
Solution
There will be a total of 3 teams formed by this subdivision. As noted above, the order of the teams
matters, since the projects assigned are distinct. We must find out how many ways there are of
choosing people from the class to fill each of these teams.
For the first team, the number of ways of selecting 4 out of 12 people is given by
\[
C(12, 4) = \frac{12!}{8!\,4!} = \frac{12 \cdot 11 \cdot 10 \cdot 9}{4 \cdot 3 \cdot 2} = 495
\]
[Figure: tree diagram for a contestant who first picks Door A. With probability 1/3 the car is in A; Monty opens B or C (probability 1/2 each); staying wins and switching loses. With probability 2/3 there is a goat in A; the car is in B or in C (probability 1/2 each), Monty opens the other goat door (probability 1), staying loses and switching wins. P(win if stay) = (1/3)(1/2) + (1/3)(1/2) = 1/3; P(win if switch) = (2/3)(1/2) + (2/3)(1/2) = 2/3.]
Figure 9.6: The Monty Hall problem.
[Remark: we use the formula for combinations here, because the arrangement of individuals within
a team is not relevant: a team composed of Mary, Bob, Jack, and Jane is the same as a team
composed of Bob, Jane, Jack, and Mary.] For each one of these ways we now have many ways of
choosing the remaining teams. Once this team has formed, we have only 8 people left to choose
from for the other teams. So for the second team, the number of combinations is
\[
C(8, 4) = \frac{8!}{4! \cdot 4!} = 70.
\]
We now have only four people left, and they have to form the last team. The total number of ways
of forming these teams is the product of the three results obtained above, i.e.
\[
495 \cdot 70 \cdot 1 = 34650.
\]
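The count can be confirmed with a few lines of Python, using `math.comb` for C(n, k). The second calculation, 12!/(4! 4! 4!), is an equivalent way of writing the same product and is included only as a cross-check; the variable names are illustrative.

```python
from math import comb, factorial

# Choose 4 of 12 for project 1, then 4 of the remaining 8, then the last 4.
step_by_step = comb(12, 4) * comb(8, 4) * comb(4, 4)

# Equivalent closed form: 12! / (4! * 4! * 4!)
closed_form = factorial(12) // (factorial(4) ** 3)

print(comb(12, 4), comb(8, 4), comb(4, 4))
print(step_by_step, closed_form)
```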
Example 3’
(a) George has a photograph of each of his 8 sisters, but a wallet-sized photo album with only 5 spaces
for these photos. How many different arrangements of these photos could George make in
the available space?
(b) What is the probability that both Mary and Jane will be included in the photo album if
each photograph is selected randomly for the display? (Assume that Mary and Jane are two of the
sisters.)
Solution
(a) Here the arrangement of the photos is relevant, so we must consider permutations. (This means
that a display in which Mary is first and Jane second is considered distinct from a display in which
Jane is first and Mary is second.) The number of ways of arranging 8 objects into 5 slots is
\[
P(8, 5) = \frac{8!}{(8 - 5)!} = 8 \cdot 7 \cdot 6 \cdot 5 \cdot 4 = 6720.
\]
(b) The probability that Mary will be selected to fill the first slot is 1/8. The probability that
she is selected for any one of the 5 slots is thus 5/8. But if Mary was selected, there would then be 7
sisters left to choose between, and 4 slots in which to place their photographs. Thus the probability
that Jane is then selected for one of these other slots is 4/7. The probability that both of the above
occur is (5/8) · (4/7) = 5/14 ≈ 0.357.
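Because the numbers here are small, both answers can be checked by brute-force enumeration. The Python sketch below lists every ordered arrangement of 5 photos chosen from 8 and counts those that contain both Mary and Jane. The names of the other six sisters are placeholders invented for the example.

```python
from fractions import Fraction
from itertools import permutations
from math import perm

# Mary and Jane are from the problem; the other six names are placeholders.
sisters = ["Mary", "Jane", "S3", "S4", "S5", "S6", "S7", "S8"]

# (a) every ordered way of filling 5 slots from 8 photos
arrangements = list(permutations(sisters, 5))
print(len(arrangements), perm(8, 5))

# (b) fraction of arrangements that include both Mary and Jane
both = sum(1 for a in arrangements if "Mary" in a and "Jane" in a)
print(Fraction(both, len(arrangements)))
```

Enumeration treats every arrangement as equally likely, which matches the assumption that each photograph is selected at random.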
9.5 Appendix
Abbrev (1-letter)   Abbrev (3-letter)   Amino acid
A                   Ala                 Alanine
R                   Arg                 Arginine
N                   Asn                 Asparagine
D                   Asp                 Aspartic acid
C                   Cys                 Cysteine
Q                   Gln                 Glutamine
E                   Glu                 Glutamic acid
G                   Gly                 Glycine
H                   His                 Histidine
I                   Ile                 Isoleucine
L                   Leu                 Leucine
K                   Lys                 Lysine
M                   Met                 Methionine
F                   Phe                 Phenylalanine
P                   Pro                 Proline
S                   Ser                 Serine
T                   Thr                 Threonine
W                   Trp                 Tryptophan
Y                   Tyr                 Tyrosine
V                   Val                 Valine
Table 9.3: Common abbreviations for the amino acids
9.6 For further reading
• Plotkin JB, Dushoff J, Fraser HB (2004) Detecting selection using a single genome sequence
of M. tuberculosis and P. falciparum, Nature 428:942-945 (April 29, 2004).
• Plotkin JB, Dushoff J (2003) Codon bias and frequency-dependent selection on the hemagglutinin epitopes of influenza A virus, PNAS 100:7152-7157 (June 10, 2003).
• Zhang J (2005) On the evolution of codon volatility, Genetics 169:495-501 (January 2005).