WEEK #20: Probabilistic Models, Diffusion and Genetics
Goals:
Textbook reading for Week #20: Study Adler Section 6.1, 6.2.
Course Information
For the rest of the course, we will be working with fundamental ideas in probability, and
how they relate to biological models, as well as to our earlier work with calculus.
The textbook references for the remaining material come from the original course text,
“Modeling the Dynamics of Life” by Adler, Chapters 6, 7, and 8.
Introduction To Probabilistic Models
From Section 6.1
Contrast the underlying assumptions of deterministic vs stochastic models.
Give two examples of deterministic population models, one with discrete time, and
one with continuous time.
Example:
A discrete-time stochastic population model could be defined by the
following statements:
• 60% chance a population will have a per capita production of 1.2 in a given year,
• 20% chance that the production rate will be 1.0 per capita (population remains the
same), and
• 20% chance that the production rate will be 1.4 per capita.
What do you think the average per capita production rate will be under this model?
Simulation
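For the purposes of comparison, let us define two models, one deterministic and one
stochastic.
Deterministic:
\[ P_{t+1} = 1.2 P_t \]
Stochastic:
\[ P_{t+1} = \begin{cases} 1.2 P_t & \text{60\% chance} \\ 1.0 P_t & \text{20\% chance} \\ 1.4 P_t & \text{20\% chance} \end{cases} \]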
If we start both models at an initial population of $P_0 = 1000$, can we predict exactly when
each model will reach $P = 150{,}000$?
For the stochastic or probabilistic model, there is an element of chance to the future
predictions about population. In an era of inexpensive computing, an easy way to explore
the predictions of a model is to simulate them. We program a computer to simulate
a random number being drawn, choose a population rate as a result of the number, and
then update the population according to that randomly-selected rate. Repeating this
work over and over, we can predict the population curve several times to understand the
possible range of behaviour in the model.
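The simulation loop described above can be sketched in a few lines of Python. This is only an illustration, not the program used to produce the course figures: the function names and the random seed are assumptions, while the growth factors and probabilities are those of the example.

```python
import random

# Growth factors and their probabilities, taken from the example above.
RATES = [1.2, 1.0, 1.4]
WEIGHTS = [0.6, 0.2, 0.2]


def deterministic_path(p0, years, rate=1.2):
    """Population for years 0..years when every year multiplies by `rate`."""
    path = [p0]
    for _ in range(years):
        path.append(path[-1] * rate)
    return path


def stochastic_path(p0, years, rng):
    """One simulated path: each year a growth factor is drawn at random."""
    path = [p0]
    for _ in range(years):
        rate = rng.choices(RATES, weights=WEIGHTS)[0]
        path.append(path[-1] * rate)
    return path


rng = random.Random(20)  # fixed seed (an arbitrary choice) so the run is repeatable
det = deterministic_path(1000, 30)
sto = stochastic_path(1000, 30, rng)
for year in range(0, 31, 5):
    print(year, round(det[year]), round(sto[year]))
```

Changing the seed and calling `stochastic_path` again gives a different curve each time, which is the "range of behaviour" the paragraph refers to.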
[Figure: two panels plotting population (Pop., 0 to 200,000) against Years (0 to 30).]
What would be some stochastic alternatives to the deterministic question “At what time
does the population reach 150,000”?
What other questions might be relevant, given the stochastic population model?
One way to help answer some of the questions about a model’s predictions is to use the
simulation approach to generate a large number of predicted populations, and then
display the range and popularity of various outcomes using a histogram.
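As a rough sketch of this approach, the code below repeats the simulation many times and tallies the first year in which the population exceeds 150,000, printing a crude text histogram. The run count of 2000, the seed, the `max_years` cutoff, and the function name are all assumptions made for illustration; the histograms in these notes are based on 100,000 runs.

```python
import random
from collections import Counter

RATES = [1.2, 1.0, 1.4]
WEIGHTS = [0.6, 0.2, 0.2]


def first_crossing_year(p0, threshold, rng, max_years=200):
    """First year in which the simulated population exceeds `threshold`."""
    p = p0
    for year in range(1, max_years + 1):
        p *= rng.choices(RATES, weights=WEIGHTS)[0]
        if p > threshold:
            return year
    return None  # threshold not reached within max_years


rng = random.Random(1)  # arbitrary seed, for repeatability
runs = 2000             # hypothetical; far fewer than the 100,000 in the notes
crossings = [first_crossing_year(1000, 150_000, rng) for _ in range(runs)]
tally = Counter(y for y in crossings if y is not None)

# Text histogram: one '#' per 10 runs that crossed in that year.
for year in sorted(tally):
    print(f"{year:3d} {'#' * (tally[year] // 10)}")
```

The same loop could instead record the population at year 30, which gives the second kind of histogram.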
[Figure: two histograms, each with vertical axis Freq (out of 100,000). "Population at Year 30": horizontal axis Population, 0 to 1,500,000. "First Year The Population Exceeded 150,000": horizontal axis Cross Year, 20 to 45.]
Which of our earlier questions can be answered based on these histograms?
As we will see several times in our study of probability, the logarithm transform is an
important tool. Consider the range of year 30 populations we simulated earlier. The
histogram had an odd, asymmetric shape to it. Compare that to the histogram of the
log transformed populations (log base 10):
[Figure: two histograms of "Population at Year 30", each with vertical axis Freq (out of 100,000). One has horizontal axis Population, 0 to 1,500,000; the other has horizontal axis Log10(Population), 4.5 to 6.5.]
On the graph, indicate the original units of the logged populations shown.
Use the histogram to determine (roughly) the average or expected population at year
30.
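One reason the logarithm helps here is that it turns the yearly multiplications into a sum: log10 of the year-30 population is log10 of the starting population plus thirty randomly chosen log growth factors. The sketch below, with assumed function names, seed, and run count, compares the average of the simulated log populations with the value predicted from the model's probabilities.

```python
import math
import random

RATES = [1.2, 1.0, 1.4]
WEIGHTS = [0.6, 0.2, 0.2]


def log10_population(p0, years, rng):
    """log10 of the final population, built up as a sum of log growth factors."""
    total = math.log10(p0)
    for _ in range(years):
        total += math.log10(rng.choices(RATES, weights=WEIGHTS)[0])
    return total


# Average log10 growth factor per year under the model's probabilities.
mean_log_rate = sum(w * math.log10(r) for r, w in zip(RATES, WEIGHTS))
expected_log = math.log10(1000) + 30 * mean_log_rate

rng = random.Random(3)  # arbitrary seed
samples = [log10_population(1000, 30, rng) for _ in range(5000)]
print(round(expected_log, 2), round(sum(samples) / len(samples), 2))
```

Note that this is the mean of the logged populations. Raising 10 to that power does not give the mean population itself, so the two histograms answer slightly different questions.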
Markov Chains
An exciting development in modeling tools for biologists and geneticists in the last 20
years is the use of Markov chains, or Markov models. They were studied heavily in the
1960s, and starting in 1989 they were used to identify different region types in DNA,
RNA, and protein sequences; today they are used to mine the ever-growing genetic
databases for alignments of DNA and RNA sequences across different organisms for
evolutionary comparison, and to identify common protein sequence and structure motifs
for drug development. Markov chains have also been used in synthetic music
generation, authorship identification, speech recognition and generation, and other
sequence classification tasks.[1]
In eukaryotes, we have long regions of so-called “junk”, or non-coding, DNA. These
sequences are not entirely random, however: certain bases are more likely than others,
and certain short subsequences are more likely than others. If we consider a simple
two-base DNA universe to start with, one way to model these sequences is with a state
diagram like the one shown below:
Here are some sample sequences from that model:
A A A A A A G A G A A A G A G A A G A A A A A A A A G A A A ...
G G A A A G A G A A G A G A G A G A A A A A A G A G A A G A ...
G A A A G A A A G A A A A G A A A A G A A A A A G A A G G A ...
A A G A A A G A G G G G A G G G A A A A A A G G A A G A G G ...
[1] Typically, these applications actually use Hidden Markov Models, or HMMs. HMMs are a
relatively simple extension of the Markov chains we will study in this course.
http://www.comp.leeds.ac.uk/roger/HiddenMarkovModels/html dev/main.html
http://www.research.ibm.com/journal/rd/453/birney.html
Question: Which of the following sequences is not likely to have come from this model?
A) G A G A A G G A G G G G A G G G A G G A A G G G G G G A A G ...
B) A A G G G G G G G A A G G A A G A G G G G G A A A A A G G G ...
C) G A A A A G G G A G A G A G G G A G A G G A A G A A G A A A ...
D) A G A G A G G A G G A A A A A G A G A G A A A A A G A G G A ...
E) A A G A A A G A G A G G A G G A G G G A G A A G A A A A A G ...
How sure can you be of your answer?
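One informal way to compare a candidate sequence with the samples is to count how often each base is followed by each other base. The sketch below does this for the first sample sequence and for candidate B; the function name is an assumption, and with only 30 bases per sequence the fractions are noisy, so this is a heuristic for building intuition rather than a test.

```python
from collections import Counter


def transition_fractions(sequence):
    """Fraction of adjacent pairs of each type (AA, AG, GA, GG) in a sequence."""
    bases = sequence.split()
    pairs = Counter(zip(bases, bases[1:]))
    total = sum(pairs.values())
    return {a + b: pairs[(a, b)] / total for a in "AG" for b in "AG"}


# First sample sequence from the model, and candidate B from the question.
sample = "A A A A A A G A G A A A G A G A A G A A A A A A A A G A A A"
candidate = "A A G G G G G G G A A G G A A G A G G G G G A A A A A G G G"

for name, seq in [("sample", sample), ("candidate B", candidate)]:
    fractions = transition_fractions(seq)
    print(name, {pair: round(f, 2) for pair, f in fractions.items()})
```

Running the same function over all four sample sequences and all five candidates gives a fuller picture than any single comparison.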
Stochastic Models of Diffusion
From Section 6.2
Diffusion is something we have studied in both discrete and continuous time, but in a
deterministic way. In fact, chemists and physicists would argue that diffusion is more
accurately a random or stochastic process, because individual molecules pass through
the membrane based on collision angles, energy levels, steady and transient polar charge
forces, etc.
Sketch:
However, we can still keep an eye on the process from a macroscopic level, by measuring
concentrations on both sides of the membrane.
Example: By repeated experiments we find that, on average, the concentration of a
toxin inside a membrane drops by 20% per hour. Draw a diagram for a stochastic model
for an individual toxin molecule over time.
We now define $p_t$ = probability that the molecule is inside the membrane at time $t$ (hours).
If we assume that the molecule is inside the membrane initially, write the next few values
for $p_t$, and then a general formula.
What kind of equation is this? How does it differ from similar models we have seen
before?
The discrete time models we have seen before are related to these stochastic ones. In
this case, they are associated by expectation.
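To make the link by expectation concrete, the sketch below iterates the probability of remaining inside, assuming the molecule stays inside with probability 0.8 each hour (the complement of the 20% drop in the example), and multiplies by a hypothetical starting count of 1000 molecules to get an expected number inside. The function name and the count of 1000 are assumptions.

```python
def inside_probabilities(steps, stay=0.8, p0=1.0):
    """p_t for t = 0..steps when the molecule stays inside with probability `stay` per hour."""
    probs = [p0]
    for _ in range(steps):
        probs.append(stay * probs[-1])
    return probs


probs = inside_probabilities(5)
for t, p in enumerate(probs):
    # Expected count for a hypothetical starting sample of 1000 molecules.
    print(t, round(p, 3), round(1000 * p))
```

The probabilities follow the same update rule as a deterministic discrete-time decay model; what changes is the interpretation of the quantity being updated.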
Question: After 2 hours, we have $p_2 = 0.64$. This means that at $t = 2$,
A) there is a 64% chance of a particular molecule still being inside the membrane.
B) exactly 64% of all the molecules we started with will still be inside the membrane.
C) approximately 64% of all the molecules we started with will still be inside the membrane.
D) there is a 64% chance that all of the molecules will still be inside the membrane.
Effects of Sample Size
Consider the following simulation graphs.
[Figure: nine panels, three each titled "Simulation of 4 molecules", "Simulation of 40 molecules", and "Simulation of 400 molecules", plotting # Molec. Inside against t from 0 to 20.]
What graph do you think we are approaching as we increase the number of molecules in our
simulation?
Here are similar simulations, but with the function $p(t) = p_0 (0.8)^t$ overlaid for comparison.
[Figure: three panels titled "Simulation of 10 molecules", "Simulation of 100 molecules", and "Simulation of 1000 molecules", plotting # Molec. Inside against t from 0 to 20.]
We can also view these simulations in tabular form. The values shown here are the same
as those used to generate the graphs above.
t               0      1      2      3      4      5      6      7      8      9      10     11
p_t = 0.8^t     1.000  0.800  0.640  0.512  0.409  0.327  0.262  0.209  0.167  0.134  0.107  0.08
Sim N0 = 10     10     9      8      7      7      6      4      3      2      1      0      0
Sim N0 = 100    100    77     69     52     36     26     22     17     17     17     12     11
Sim N0 = 1000   1000   805    641    522    401    310    238    191    151    120    92     74
You should note that the simulated values do not get closer, in absolute terms, to the
expected values as we use more molecules. For example, at t = 7, the individual simulations
were off by about 1, 4, and 18 molecules respectively for N0 = 10, 100, and 1000. Why then
does the graph for N0 = 1000 look so much closer to the ideal curve than the smaller sample simulations?
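One way to explore this question is to simulate each molecule separately and report the largest deviation from the expected curve both in molecules and as a fraction of N0, which is the scale of each graph's vertical axis. This is a minimal sketch: the function name and seed are assumptions, and each molecule is assumed to stay inside with probability 0.8 per step, independently of the others.

```python
import random


def simulate_counts(n0, steps, rng, stay=0.8):
    """Molecules still inside at t = 0..steps; each stays with probability `stay` per step."""
    counts = [n0]
    for _ in range(steps):
        remaining = sum(1 for _ in range(counts[-1]) if rng.random() < stay)
        counts.append(remaining)
    return counts


rng = random.Random(7)  # arbitrary seed
for n0 in (10, 100, 1000):
    counts = simulate_counts(n0, 11, rng)
    # Largest gap between simulated and expected counts over t = 0..11.
    worst = max(abs(c - n0 * 0.8 ** t) for t, c in enumerate(counts))
    print(n0, round(worst, 1), round(worst / n0, 3))
```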
If there are billions of molecules in a sample, what are the trade-offs of using a deterministic model instead of a stochastic model for diffusion?
Give an example of a scenario or experiment where a deterministic model for diffusion
would not be appropriate.
Markov Chain Model for Diffusion
So far we have only modeled one-way flow through the membrane. However, our earlier
studies covered the more realistic case of two-way flow through the membrane. We can
construct a stochastic model for this using a Markov chain.
Describe the assumptions in this model in words.
The original scenario, where the molecules could only leave the membrane, can also be
written as a Markov chain. Sketch the transition diagram for our original model.
Using the leave-and-enter model, assume a single molecule starts inside the membrane.
Compute the probability that it is in the membrane for the next few time points.
Find the general formula for $p_{t+1}$, the probability that an individual molecule will be
inside the membrane at time $t + 1$. It will depend on $p_t$.
Find the long-term probability that an individual molecule will be inside the cell.
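A two-state chain of this kind can be iterated numerically. In the sketch below the per-step probabilities of leaving (0.2) and entering (0.1) are hypothetical placeholders, not the values from the course's transition diagram, and the function name is likewise an assumption. The update rule says the molecule is inside at the next step either by staying inside or by entering from outside.

```python
def next_inside_probability(p, leave, enter):
    """One step of the two-state chain: stay inside, or come in from outside."""
    return (1 - leave) * p + enter * (1 - p)


# Hypothetical per-step probabilities; the values in the notes' diagram may differ.
LEAVE, ENTER = 0.2, 0.1

p = 1.0  # the molecule starts inside
for t in range(51):
    if t % 10 == 0:
        print(t, round(p, 4))
    p = next_inside_probability(p, LEAVE, ENTER)

# Fixed point of p = (1 - leave) * p + enter * (1 - p).
long_term = ENTER / (LEAVE + ENTER)
print("long-term:", round(long_term, 4))
```

Setting `enter` to 0 recovers the original leave-only model, whose long-term probability of being inside is 0.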
Here are some simulations, showing both the ideal number of molecules and simulated
number, for different numbers of initial molecules. Note the long-term stable level indicated in the graphs.
[Figure: three panels titled "Simulation of 10 molecules", "Simulation of 100 molecules", and "Simulation of 1000 molecules", plotting # Molec. Inside against t from 0 to 50.]
Comment on the agreement between the ideal probabilities and the simulation results.
Genetics
You should read the presentation of selfing and inbreeding effects, pages 498-500, on your
own. The case of self-reproduction or inbreeding is an interesting one, especially for its
effects on genetic diversity, but we do not have time to cover it in class.
Genetics is a rich source of probabilistic models. To start with, we will explore some
simple models applied to plant breeding.
Nomenclature
• An allele is a variant of a gene.
• An organism is diploid if it has two separate copies of a gene.
• A diploid organism is homozygous if both alleles for a gene are the same, and
heterozygous if the alleles are different.
• The set of alleles of an organism is its genotype. This is usually not directly
measured, but inferred from the visible or macroscopic properties of the organism,
its phenotype.
The presence of different alleles for a set of genes can affect the phenotype of an organism
in a variety of ways.
Give some examples of genotype/phenotype relationships.
Probability Theory
From Section 6.3
It will be helpful, before we go too much further, to define some nomenclature and
properties of probabilities and experiments.
Sample space for an idealized experiment:
Experiment: Measuring the height of a random student in the class.
Experiment: Throwing a dart at a wall with a dart board on it.
Experiment: Measuring the per-capita reproductive rate for a population over one year.
Event in an experiment:
Experiment: Measuring the height of a random student in the class.
Experiment: Throwing a dart at a wall with a dart board on it.
Experiment: Measuring the per-capita reproductive rate for a population over one year.
Representing Sample Space and Events
Venn diagrams and line graphs are often used to represent the sample space, events, and
their interrelationships.
Draw a Venn Diagram and line diagram for the annual per-capita reproductive rate experiment, showing the event “population grew”.
Add to the diagrams the second event “population grew at a rate r = 0.5 or higher”.
Add to the diagrams the third event “population was constant or shrank.”
Nomenclature
Define the following terms, then give some examples of each.
• Mutually exclusive events
• Union of two events
• Intersection of two events
• Complement of an event
• Collectively Exhaustive Set of events
Formal Probability Models
A probability model assigns probabilities to all events. We will write Pr(A) for the
probability of an event A.
To be self-consistent, the assigned probabilities must satisfy the following 4 requirements
(or “axioms”).
For each rule, draw a Venn diagram representing it.
1. If $S$ is the sample space, $\Pr(S) = 1$.
2. $0 \le \Pr(A) \le 1$ for any event $A$.
3. If $A$ and $B$ are mutually exclusive events, $\Pr(A \cup B) = \Pr(A) + \Pr(B)$.
4. If $A^c$ is the complement of $A$, then $\Pr(A^c) = 1 - \Pr(A)$.
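The four requirements can be checked mechanically on a small discrete model. The sketch below reuses the three production rates and probabilities from the stochastic population example as the sample space, and represents events as Python sets; the names `PROB`, `pr`, `grew`, and `constant` are assumptions chosen for illustration.

```python
from fractions import Fraction

# Sample space: the three per capita production rates from the earlier
# stochastic population example, with their stated probabilities.
PROB = {"1.0": Fraction(1, 5), "1.2": Fraction(3, 5), "1.4": Fraction(1, 5)}
S = set(PROB)


def pr(event):
    """Probability of an event, i.e. of a subset of the sample space."""
    return sum((PROB[outcome] for outcome in event), Fraction(0))


grew = {"1.2", "1.4"}  # event "population grew"
constant = {"1.0"}     # event "population stayed the same"

print(pr(S) == 1)                                       # requirement 1
print(0 <= pr(grew) <= 1)                               # requirement 2
print(pr(grew | constant) == pr(grew) + pr(constant))   # requirement 3
print(pr(S - grew) == 1 - pr(grew))                     # requirement 4
```

Exact fractions are used instead of floating-point numbers so that the equality checks are not affected by rounding.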
We see now why our earlier models had certain properties.
Example: For the diffusion model shown below, describe the sub-experiment that occurs
at each step, the sample space for each step, and what constraints there are on the
transition probabilities as a result.
If we were to run the diffusion model for 10 time steps, how would we define the sample
space?
Give examples of events we could describe based on the sample space for the 10-step
diffusion experiment.
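One possible way to define the 10-step sample space, offered here only as a sketch, is to record the molecule's location after each step, so that every outcome is a sequence of ten locations. The labels "I" and "O" and the two example events are assumptions made for illustration.

```python
from itertools import product

STEPS = 10

# Each outcome records where the molecule is after each step:
# "I" (inside) or "O" (outside).
sample_space = list(product("IO", repeat=STEPS))

# Example events, written as subsets of the sample space.
inside_at_end = [w for w in sample_space if w[-1] == "I"]
never_left = [w for w in sample_space if all(s == "I" for s in w)]

print(len(sample_space), len(inside_at_end), len(never_left))
```

Under this definition every sequence belongs to the sample space. In a model where molecules can only leave, sequences in which the molecule re-enters would simply be assigned probability zero.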