Probability Theory as Extended Logic
A short introduction into quantitative reasoning with incomplete information

• Axiomatic derivation of probability theory.
• Bayes' theorem and posterior probabilities vs. p-values and confidence intervals.
• Model selection: inferring dependence between variables.
• Prior probabilities: symmetry transformations and the maximum entropy principle.
• Stochastic processes: generating functions and the central limit theorem.

Erik van Nimwegen
Division of Bioinformatics, Biozentrum, Universität Basel, and Swiss Institute of Bioinformatics
EMBnet Basel, March 2006

"I cannot conceal the fact here that in the specific application of these rules, I foresee many things happening which can cause one to be badly mistaken if he does not proceed cautiously."
Jacob Bernoulli, Ars Conjectandi, Basel 1705

Probability theory as extended logic: E.T. Jaynes (pictured in 1982). Almost everything in this lecture can be found in his book Probability Theory: The Logic of Science. Jaynes left the book unfinished when he died in 1998; the unfinished version was available on the internet for many years (it still is), was edited by a former student, and was finally published in 2003.

From logic to extended logic

Aristotelian logic is a calculus of propositions. It tells us how to deduce the truth or falsity of certain statements from the truth or falsity of other statements. Assume: if A is true then B is true, or in symbols B|A. Then:
• A is true ⇒ B is true.
• B is false ⇒ A is false.
But in reality it is almost always necessary to reason like this:
• B is true ⇒ A becomes more plausible.
• A is false ⇒ B becomes less plausible.
Or even: if A is true then B becomes more plausible; B is true ⇒ A becomes more plausible.

R.T. Cox (1946):
1. Plausibilities are represented by real numbers and depend on the information we have, i.e. P(x|I) is the plausibility of x given our information I.
2. Plausibilities should match common sense: they should reduce to logic for statements that we know to be true or false, and should go up and down in accordance with common sense.
3. Consistency: if a plausibility can be derived in multiple ways, all ways should give the same answer.
The solution is unique and matches probability theory à la Laplace.

The two quantitative rules

(1)  P(A|I) + P(¬A|I) = 1
A certainly true statement has probability 1, a certainly false statement has probability 0, and the probability that a statement is true determines the probability that it is false.

(2)  P(AB|I) = P(A|BI) P(B|I) = P(B|AI) P(A|I)
The probability of A and B given the information I can be written either as the probability of B given I times the probability of A given B and I, or as the probability of A given I times the probability of B given A and I.
Example: the probability that there is liquid water and life on Mars is the probability that there is liquid water times the probability of life given liquid water, or the probability of life times the probability of liquid water given life.

Assigning probabilities using symmetry

• Assume n mutually exclusive and exhaustive hypotheses A_i:
  Σ_{i=1}^{n} P(A_i|I) = 1,   P(A_i A_j|I) = 0 for all i ≠ j
• Assume you know nothing else.
• Consistency now demands that P(A_i|I) = 1/n for all i.
Proof:
• Any relabelling of our hypotheses changes our problem into an equivalent problem. That is, the same information I applies to all.
• When the supplied information I is the same, the assignment of probabilities has to be the same.
• Unless all P(A_i|I) are equal, this is violated.
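As a small illustration of rule (2), the sketch below checks both factorizations of P(AB|I) on a toy joint distribution for the Mars example. The numbers are invented purely for illustration; they are not part of the lecture.

```python
# Toy check of the product rule P(AB|I) = P(A|BI) P(B|I) = P(B|AI) P(A|I).
# The joint probabilities below are made up for illustration only.
joint = {
    ("water", "life"): 0.05,
    ("water", "no life"): 0.25,
    ("no water", "life"): 0.00,
    ("no water", "no life"): 0.70,
}

p_water = sum(p for (w, _), p in joint.items() if w == "water")
p_life = sum(p for (_, l), p in joint.items() if l == "life")
p_water_and_life = joint[("water", "life")]

# Conditional probabilities derived from the joint distribution.
p_life_given_water = p_water_and_life / p_water
p_water_given_life = p_water_and_life / p_life

# Both factorizations reproduce the same joint probability.
assert abs(p_life_given_water * p_water - p_water_and_life) < 1e-12
assert abs(p_water_given_life * p_life - p_water_and_life) < 1e-12
print(p_water_and_life, p_life_given_water * p_water, p_water_given_life * p_life)
```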
Contrast with the 'frequency' interpretation of probabilities

• In orthodox probability theory a probability is associated with a random variable and records the physical tendency for something to happen in repeated trials. Example: the probability of "a coin coming up heads when thrown" is a feature of the coin and can be determined by repeated experiment.
• Quote from William Feller (An Introduction to Probability Theory and its Applications, 1950): "The number of possible distributions of cards in Bridge is almost 10^30. Usually we agree to consider them as equally probable. For a check of this convention more than 10^30 experiments would be required."
• Is this really how anyone reasons? Example: say that I tell you that I went to the store, bought a normal deck of cards, and dealt 1000 Bridge hands, making sure to shuffle well between every two deals. I found that the king and queen of hearts were always in the same hand. What would you think?

Assessing the evolutionary divergence of two genomes

A reference genome has G genes.
• A different strain of the species is isolated and we want to estimate what number g of its genes is mutated with respect to the reference genome.
• To estimate this we sequence one gene at a time from the new strain and compare it with the reference genome, classifying each sequenced gene as wild-type or mutant.
After sequencing (m + w) genes we have m mutants and w wild-types. What do we now know about the number g of all genes that are mutants?

Formalizing our information:
• We have no information on whether the two genomes are closely or distantly related, so a priori g = G is as likely as g = 0 or any other value.
• Assuming the number of mutants g is given, there is no information about which of the G genes are the mutants.
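Before this information is turned into a prior and a likelihood below, here is a small simulation of the data-generating process just described: a genome in which an assumed number of genes is mutated, from which we sequence genes one at a time without replacement. The values of G, g_true and the number of sequenced genes are arbitrary choices for illustration, not numbers from the lecture.

```python
import random

# Illustrative numbers (not from the lecture): genome size, true number of
# mutated genes, and how many genes we happen to sequence.
G = 500
g_true = 150
n_sequenced = 7

random.seed(1)
mutated_genes = set(random.sample(range(G), g_true))   # which genes are mutated
sequenced = random.sample(range(G), n_sequenced)       # the genes we sequence

m = sum(1 for gene in sequenced if gene in mutated_genes)  # observed mutants
w = n_sequenced - m                                        # observed wild-types
print(f"observed m = {m} mutants and w = {w} wild-types among {n_sequenced} genes")
```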
Assessing the evolutionary divergence of two genomes

• The prior probability P(g|I) that g genes are mutant, given our information, is:
  P(g|I) = 1/(G+1)
• Assuming g mutants, the probability that the first sequenced gene is a mutant or a wild-type is:
  P(μ|g) = g/G,   P(wt|g) = (G−g)/G
• The probabilities for the first two sequenced genes are:
  P(μ,μ|g) = g(g−1)/[G(G−1)],   P(μ,wt|g) = g(G−g)/[G(G−1)],
  P(wt,μ|g) = (G−g)g/[G(G−1)],   P(wt,wt|g) = (G−g)(G−g−1)/[G(G−1)],
  and so on.
Generally, the probability for a particular series of mutant/wild-type observations containing m mutants and w wild-types is:
  P(m,w|g) = g(g−1)⋯(g−m+1) · (G−g)(G−g−1)⋯(G−g−w+1) / [G(G−1)⋯(G−m−w+1)]
or, written with factorials,
  P(m,w|g) = g!(G−g)!(G−m−w)! / [(g−m)!(G−g−w)!G!]
We now know the prior probability P(g|I) that a certain number of genes are mutants, and we know the likelihood P(m,w|g) to observe a given string of observations given g. We want the posterior probability P(g|m,w) of g given m and w.

Bayes' theorem

We can write the joint probability of g, m, and w in terms of conditional probabilities in two ways:
  P(m,w,g|I) = P(m,w|g) P(g|I)   and   P(m,w,g|I) = P(g|m,w) P(m,w|I)
Combining these we obtain:
  P(g|m,w) = P(m,w|g) P(g|I) / P(m,w|I)
We also have:
  P(m,w|I) = Σ_{g'} P(m,w,g'|I) = Σ_{g'} P(m,w|g') P(g'|I)
Putting it all together:
  P(g|m,w) = P(m,w|g) P(g|I) / Σ_{g'} P(m,w|g') P(g'|I)
For our problem this gives:
  P(g|m,w) = g!(G−g)! / [Z(m,w) (g−m)!(G−g−w)!],   Z(m,w) = Σ_{g=0}^{G} g!(G−g)! / [(g−m)!(G−g−w)!]
[Figure: the posterior P(g|m,w) for G = 500 and (m=0,w=0), (m=0,w=1), (m=2,w=5), (m=23,w=50); the posterior sharpens as more genes are sequenced.]

All our information about g is encoded in the posterior distribution. For example, the probability for g to lie in a particular interval [a,b] is simply given by summing the probabilities:
  P(a ≤ g ≤ b | m,w) = Σ_{g=a}^{b} P(g|m,w)

The 95% posterior probability interval

For m = 2, w = 5, G = 500:
  P(g < 42 | 2,5) = Σ_{g=0}^{41} P(g|2,5) ≈ 0.025,   P(g > 325 | 2,5) = Σ_{g=326}^{500} P(g|2,5) ≈ 0.025
so 95% of the posterior probability lies in the interval [42, 325]. How does this compare to the so-called confidence intervals of orthodox statistics?
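The posterior and its 95% interval are easy to reproduce numerically. The sketch below uses the fact that the posterior is proportional to C(g,m)·C(G−g,w) (the constant factors m! and w! cancel in the normalization); for m = 2, w = 5, G = 500 it should recover an interval close to the quoted [42, 325].

```python
from math import comb

G, m, w = 500, 2, 5

# Posterior P(g | m, w) is proportional to C(g, m) * C(G - g, w).
weights = [comb(g, m) * comb(G - g, w) for g in range(G + 1)]
Z = sum(weights)
posterior = [x / Z for x in weights]

# Central 95% posterior interval: cut 2.5% of probability from each tail.
cum, lower, upper = 0.0, None, None
for g, p in enumerate(posterior):
    cum += p
    if lower is None and cum >= 0.025:
        lower = g
    if cum >= 0.975:
        upper = g
        break

print(f"95% posterior interval for g: approximately [{lower}, {upper}]")
```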
Confidence intervals

• In orthodox statistics one cannot talk about the probability of g. One can only talk about the probability of a particular sample, for example:
  P(μ,wt,μ,wt,wt|g) = g(G−g)(g−1)(G−g−1)(G−g−2) / [G(G−1)(G−2)(G−3)(G−4)]
• One therefore focuses on a statistic s, which is a function of the sample. For example, s could be the total number of mutants: s(μ,wt,μ,wt,wt) = 2.
• One then calculates the probability P(s) for the statistic to take on different values:
  P(s) = [g! / (s!(g−s)!)] · [(G−g)! / ((G−g−n+s)!(n−s)!)] · [n!(G−n)! / G!]
  with n the total number of samples. This is called the hypergeometric distribution.
Given a fixed value of g we can calculate the probability that s takes on different values.
[Figure: the hypergeometric distribution P(s) for g = 175, n = 20, G = 500.]

• We can now, for a given g, calculate the range of s values that is likely to occur. For example, we can find s_min and s_max such that:
  Σ_{s=0}^{s_min} P(s) = 0.025,   Σ_{s=s_max}^{n} P(s) = 0.025
• With probability 95%, s will then lie in the interval [s_min, s_max].

But we need an interval for g given an s, not for s given a g!
Solution: for a given s, find all values of g such that s occurs within the 95% confidence interval for that g. That is, find g_min and g_max such that:
  Σ_{s'=s}^{n} P(s'|g_min) = 0.05,   Σ_{s'=0}^{s} P(s'|g_max) = 0.05
For our example (s = 2, n = 7, G = 500) this gives g_min = 27 and g_max = 328, so the 95% confidence interval for g is [27, 328].

Point estimates

What should we do if we are forced to make one specific estimate g_est of g? For m = 2, w = 5 we could pick the value g_est = g* that maximizes the posterior P(g|m,w), which gives g* = 143. However, under the posterior, values g > 143 occur more often than values g < 143. So can't we decrease the expected "error" by choosing g_est a bit larger?

What is the "error" we want to minimize?
• Absolute deviation: E(g_est, g_true) ∝ |g_est − g_true|
• Square deviation: E(g_est, g_true) ∝ (g_est − g_true)²
• All errors equally bad: E(g_est, g_true) ∝ 1 − δ(g_est, g_true)
General solution: minimize the expected error ⟨E⟩ = Σ_{g=0}^{G} E(g_est, g) P(g|m,w). This gives:
• Absolute deviation: g_est such that Σ_{g=0}^{g_est} P(g|m,w) = Σ_{g=g_est}^{G} P(g|m,w), i.e. the median.
• Square deviation: g_est = Σ_{g=0}^{G} g P(g|m,w), i.e. the mean.
• All errors equally bad: g_est = g*, i.e. the maximum of the posterior.
For m = 2, w = 5: MAP g* = 143, median = 159, mean = 166.

Comparing two classes of genes

The G genes can be divided into R regulatory genes and N = G − R non-regulatory genes. Among the m mutants are m_r regulators and m_n non-regulators; among the w wild-types are w_r regulators and w_n non-regulators. What do we now know about the relative frequency of mutants among regulatory and non-regulatory genes?

We need the prior for g_r regulatory mutants and g_n non-regulatory mutants:
  P(g_r, g_n | R, N) = P(g_r|R) P(g_n|N) = 1 / [(R+1)(N+1)]
and the likelihood to observe the sample:
  P(m_r, w_r, m_n, w_n | g_r, R, g_n, N) = P(m_r, w_r | g_r, R) P(m_n, w_n | g_n, N)
    = { g_r!(R−g_r)!(R−m_r−w_r)! / [(g_r−m_r)!(R−g_r−w_r)!R!] } · { g_n!(N−g_n)!(N−m_n−w_n)! / [(g_n−m_n)!(N−g_n−w_n)!N!] }
So the posteriors become:
  P(g_r | m_r, w_r, R) = g_r!(R−g_r)! / [Z (g_r−m_r)!(R−g_r−w_r)!]
  P(g_n | m_n, w_n, N) = g_n!(N−g_n)! / [Z' (g_n−m_n)!(N−g_n−w_n)!]
with Z and Z' normalizing constants.

Example: R = 100, m_r = 2, w_r = 5, N = 400, m_n = 4, w_n = 23.
The probability that the fraction of regulatory mutants is larger than the fraction of non-regulatory mutants is (since N = 4R, the condition g_r/R > g_n/N is equivalent to g_n < 4g_r):
  P(g_r/R > g_n/N) = Σ_{g_r=0}^{R} Σ_{g_n=0}^{4g_r−1} P(g_r) P(g_n)
[Figure: the two posteriors P(g_r|2,5,100) and P(g_n|4,23,400), and the line g_n = 4g_r in the (g_r, g_n) plane.]
We find P(g_r/R > g_n/N) = 0.835.
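A sketch of this comparison, again using posteriors proportional to binomial coefficients; with the numbers above it should give a value close to the quoted 0.835.

```python
from math import comb

def posterior(total, m, w):
    """Posterior over the number of mutants g in a class of `total` genes,
    after observing m mutants and w wild-types: proportional to C(g,m)*C(total-g,w)."""
    weights = [comb(g, m) * comb(total - g, w) for g in range(total + 1)]
    Z = sum(weights)
    return [x / Z for x in weights]

R, mr, wr = 100, 2, 5
N, mn, wn = 400, 4, 23

post_r = posterior(R, mr, wr)
post_n = posterior(N, mn, wn)

# P(g_r/R > g_n/N): since N = 4R, this is the probability that g_n < 4*g_r.
p_larger = sum(post_r[gr] * post_n[gn]
               for gr in range(R + 1)
               for gn in range(min(4 * gr, N + 1)))
print(f"P(g_r/R > g_n/N) = {p_larger:.3f}")
```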
How would orthodox statistics answer this question?

Null hypothesis: regulators are equally likely to be mutants as non-regulatory genes.
Formally: we are given (m_r + w_r) regulators and (m_n + w_n) non-regulators, and pick the (m_r + m_n) mutants at random from the (m_r + m_n + w_r + w_n) genes. We then calculate the probability of ending up with m_r regulators among the mutants. According to the null hypothesis, the probability to draw m_r regulators among the (m_r + m_n) mutants is:
  P(m_r) = C(m_r+w_r, m_r) · C(m_n+w_n, m_n) / C(m_r+m_n+w_r+w_n, m_r+m_n)
We observe m_r = 2, w_r = 5, m_n = 4, w_n = 23. The probability of drawing at least as many regulators as observed is:
  Σ_{k=m_r}^{m_r+m_n} C(m_r+w_r, k) C(m_n+w_n, m_r+m_n−k) / C(m_r+m_n+w_r+w_n, m_r+m_n) = Σ_{k=2}^{6} C(7,k) C(27,6−k) / C(34,6) = 0.36
The p-value at which the null hypothesis is rejected is thus 0.36.

Examples of posterior probabilities and p-values of the null-hypothesis test:

  m_r  w_r  m_n  w_n    P(g_r/R > g_n/N)   p-value
   2    5    4   23          0.84           0.36
   3    4    4   23          0.95           0.14
   4    3    4   23          0.99           0.04
   2    5    2   25          0.94           0.18
   2    5    1   26          0.98           0.10
   2    5    0   27          0.99           0.04

The p-values might seem very conservative. However, remember that the p-value essentially only asks how plausible the data are under the null hypothesis.

Model selection: independent versus equal fractions of mutants

We assume that the fractions of mutants in the two classes are either equal or independent:
  P(g_r, g_n | R, N) = P(g_r, g_n | R, N, equal) P(equal) + P(g_r, g_n | R, N, indep) P(indep)
with
  P(g_r, g_n | R, N, equal) = δ(g_r/R, g_n/N) / (min(R,N)+1)    (prior assuming the fractions are equal)
  P(g_r, g_n | R, N, indep) = 1 / [(R+1)(N+1)]                  (prior assuming the fractions are independent)
We now want to calculate the posterior probabilities of the two models given the data, e.g. P(indep | m_r, w_r, m_n, w_n, R, N).

Bayes' theorem gives:
  P(indep|Data) = P(Data|indep) P(indep) / [P(Data|indep) P(indep) + P(Data|equal) P(equal)]
The probability of the data depends on g_r and g_n. Probability theory tells us that we can simply sum these nuisance parameters out of the problem:
  P(Data|indep) = Σ_{g_r=0}^{R} Σ_{g_n=0}^{N} P(Data, g_r, g_n | indep) = Σ_{g_r} Σ_{g_n} P(Data | g_r, g_n) P(g_r, g_n | indep)
  P(Data|equal) = Σ_{g_r=0}^{R} Σ_{g_n=0}^{N} P(Data, g_r, g_n | equal) = Σ_{g_r} Σ_{g_n} P(Data | g_r, g_n) P(g_r, g_n | equal)
In our case we have specifically:
  P(m_r, w_r, m_n, w_n | indep) = P(m_r, w_r | indep) P(m_n, w_n | indep)
Using
  P(m, w | indep) = Σ_{g=0}^{N} P(m,w|g) P(g|indep) = (1/(N+1)) Σ_{g=0}^{N} g!(N−g)!(N−m−w)! / [(g−m)!(N−g−w)!N!] = m! w! / (m+w+1)!
we obtain:
  P(Data|indep) = [m_r! w_r! / (m_r+w_r+1)!] · [m_n! w_n! / (m_n+w_n+1)!]
We can obtain P(Data|equal) in a similar way by summing out g_r and g_n (although the result is not a nice analytical expression).
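The sketch below puts numbers to this model comparison: the closed-form P(Data|indep), a direct sum over the line g_n = 4g_r for P(Data|equal), and, for comparison, the hypergeometric p-value from the slide above. With the example data it should produce values close to those quoted on the following slides (about 1.21·10⁻⁸, 2.16·10⁻⁸, posterior 0.64, and p ≈ 0.36).

```python
from math import comb, factorial

R, mr, wr = 100, 2, 5
N, mn, wn = 400, 4, 23

def likelihood(g, m, w, total):
    """P(m, w | g): probability of observing m mutants and w wild-types when
    sequencing m + w genes from a class of `total` genes of which g are mutated."""
    return (factorial(m) * factorial(w) * comb(g, m) * comb(total - g, w)
            / (factorial(m + w) * comb(total, m + w)))

# Independent-fractions model: exact marginal likelihood m!w!/(m+w+1)! per class.
p_indep = (factorial(mr) * factorial(wr) / factorial(mr + wr + 1)
           * factorial(mn) * factorial(wn) / factorial(mn + wn + 1))

# Equal-fractions model: average only over the line g_n = 4*g_r (since N = 4R),
# each point on the line carrying prior weight 1/(min(R, N) + 1).
p_equal = sum(likelihood(gr, mr, wr, R) * likelihood(4 * gr, mn, wn, N)
              for gr in range(R + 1)) / (min(R, N) + 1)

p_equal_post = p_equal / (p_equal + p_indep)   # posterior with equal prior odds

# Hypergeometric p-value of the null-hypothesis test, for comparison.
p_value = sum(comb(mr + wr, k) * comb(mn + wn, mr + mn - k)
              for k in range(mr, mr + mn + 1)) / comb(mr + wr + mn + wn, mr + mn)

print(f"P(Data|indep) = {p_indep:.3g}   P(Data|equal) = {p_equal:.3g}")
print(f"P(equal|Data) = {p_equal_post:.2f}   p-value = {p_value:.2f}")
```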
P(Data|indep) averages the probability of the data over all combinations of g_r and g_n, while P(Data|equal) averages it only over the line g_n = 4g_r.

For the case m_r = 2, w_r = 5, R = 100, m_n = 4, w_n = 23, N = 400:
  P(Data|indep) = 1.21·10⁻⁸,   P(Data|equal) = 2.16·10⁻⁸
If we assume that "equal" and "independent" are a priori equally likely, P(indep) = P(equal) = 1/2, the posterior becomes:
  P(equal|Data) = 2.16·10⁻⁸ · ½ / (1.21·10⁻⁸ · ½ + 2.16·10⁻⁸ · ½) = 0.64
The "equal" model is thus slightly preferred.

Examples of posterior probabilities, p-values of the null-hypothesis test, and posterior probabilities from the model selection:

  m_r  w_r  m_n  w_n    P(g_r/R > g_n/N)   p-value   P(indep|Data)
   2    5    4   23          0.84           0.36         0.36
   3    4    4   23          0.95           0.14         0.59
   4    3    4   23          0.99           0.04         0.84
   2    5    2   25          0.94           0.18         0.50
   2    5    1   26          0.98           0.10         0.64
   2    5    0   27          0.99           0.04         0.83

Note how different the two posteriors are: both the prior and the precise question asked can matter a great deal.

Splice variation in a mouse transcription unit

[Figure: a mouse transcription unit containing a cryptic exon; four different promoters are used in the transcripts that could have contained the exon. Source: http://www.spaed.unibas.ch]

Exon inclusion dependence on promoter usage

Assume that we have the following data for a given cryptic exon:
• n transcripts in total.
• P different promoters used.
• i_p: number of times the exon is included when promoter p is used.
• e_p: number of times the exon is excluded when promoter p is used.
• i: total number of times the exon is included.
• e: total number of times the exon is excluded.
Independent model: each transcript has an independent probability f to include the exon, with a uniform prior probability for f.
Dependent model: for a transcript from promoter p the probability of including the exon is f_p, with a uniform prior probability over f_p for each p.

Probability of the data under the independent model:
  P(data|indep) = ∫₀¹ f^i (1−f)^e P(f) df = ∫₀¹ f^i (1−f)^e df = i! e! / (i+e+1)!
Probability of the data under the dependent model:
  P(data|dep) = ∏_{p=1}^{P} ∫₀¹ f_p^{i_p} (1−f_p)^{e_p} df_p = ∏_{p=1}^{P} i_p! e_p! / (i_p+e_p+1)!
With Bayes' theorem:
  P(dep|data) = P(data|dep) P(dep) / [P(data|dep) P(dep) + P(data|indep) P(indep)]
Using this we estimate that between 5% and 15% of mouse internal cryptic exons are included in a promoter-dependent way.
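A sketch of this calculation using the beta-integral results above. The inclusion/exclusion counts per promoter are hypothetical numbers chosen for illustration; they are not the counts behind the Ugt1a6 example shown next.

```python
from math import factorial

def beta_evidence(i, e):
    """Integral of f^i (1 - f)^e over [0, 1] with a uniform prior: i! e! / (i + e + 1)!."""
    return factorial(i) * factorial(e) / factorial(i + e + 1)

# Hypothetical data: (included, excluded) counts for each promoter of one exon.
counts = [(5, 1), (0, 7), (2, 2), (1, 6)]

i_total = sum(i for i, e in counts)
e_total = sum(e for i, e in counts)

p_indep = beta_evidence(i_total, e_total)   # one shared inclusion probability f
p_dep = 1.0
for i, e in counts:                         # one inclusion probability f_p per promoter
    p_dep *= beta_evidence(i, e)

# Posterior probability of promoter-dependent inclusion, with equal prior odds.
p_dep_post = p_dep / (p_dep + p_indep)
print(f"P(dep|data) = {p_dep_post:.3f}")
```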
Example of promoter-dependent exon inclusion

[Figure: the observed transcript forms of Ugt1a6 (UDP glucuronosyltransferase 1 family, polypeptide A6) with their multiplicities (1x, 3x, 8x, 1x, 1x, 1x, 2x); for this cryptic exon the model comparison gives P(dep|Data) = 0.985.]

Prior probabilities: know thy ignorance

Given hypotheses H, our prior probability P(H|I) represents our information I. There are two main methods for determining such priors:
1. Invariance of the problem under a group of transformations.
2. The maximum entropy method.

Simple example, a "scale parameter":
• We want to model gene expression values e and need, before we see any data, some prior probability P(e) over expression values.
• We assume we know nothing about expression values.
• What is a reasonable prior P(e)?

What distribution P(e) expresses complete ignorance about expression levels e?
• You might be tempted to suggest the uniform distribution: P(e) = constant.
• Instead, think of transformations e → e' = f(e) that leave your state of knowledge unchanged.
• If we are really ignorant of e, it shouldn't matter whether gene expression is measured in mRNAs per cell, mRNAs per ml of solution, or light intensity as measured by some optical scanner.
• The distribution P(e) should thus be invariant under the scale on which e is measured.
We should therefore have invariance under e → e' = λe for any λ. We demand that:
  P(λe) d(λe) = P(e) de   ⇔   λ P(λe) = P(e)
Taking the derivative with respect to λ and setting λ = 1 we get:
  P'(e) = −P(e)/e   ⇔   P(e) = constant/e
Thus, instead of a uniform distribution, we find a distribution that is uniform in the logarithm of the expression level:
  P(e) de = constant · d log(e)

Assume we obtain a very large table of expression levels: what fraction of the numbers would we a priori expect to start with digit d? For the first digit to be d = 1, the number e has to lie between 1 and 2, or 10 and 20, or 0.1 and 0.2, and so on. Notice:
  ∫_{0.1}^{0.2} (c/e) de = ∫_{1}^{2} (c/e) de = ∫_{10}^{20} (c/e) de = c log(2/1)
Similarly, for the first digit to be 2:
  ∫_{0.2}^{0.3} (c/e) de = ∫_{2}^{3} (c/e) de = ∫_{20}^{30} (c/e) de = c log(3/2)
So in general the probability for the first digit to be d is:
  P(d) = log((d+1)/d) / log(10)
This is called Benford's law. The actual frequencies of first digits in various collections of numbers do follow Benford's law: carefully representing one's ignorance can already make nontrivial predictions!
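A quick sketch of Benford's law as derived above: the nine first-digit probabilities implied by the scale-invariant (log-uniform) prior. They sum to one, with P(1) ≈ 0.30 down to P(9) ≈ 0.046.

```python
from math import log10

# First-digit probabilities implied by a scale-invariant (log-uniform) prior.
benford = {d: log10((d + 1) / d) for d in range(1, 10)}

for d, p in benford.items():
    print(f"P(first digit = {d}) = {p:.3f}")
print("sum =", round(sum(benford.values()), 10))   # the probabilities sum to 1
```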
The over-expressed pathway

• A mutant strain of an organism has an altered phenotypic property X.
• We know that this is caused by over-expression of one or more genes in one of the three pathways A, B, or C.
• We measured the absolute number of mRNAs in a single cell of the mutant strain and found:
  180 mRNAs from genes in pathway A, 170 mRNAs from genes in pathway B, and 40 mRNAs from genes in pathway C.
• We can only investigate one of the pathways in detail and want to guess which one is most promising to study, assuming different kinds of prior information.

Assume we progressively receive more information:
• Stage 1: the total number of mRNAs of each pathway in the mutant cell (A: 180, B: 170, C: 40).
• Stage 2: the average number of mRNAs of each pathway in wild-type cells (A: 50, B: 100, C: 10).
• Stage 3: the average number of mRNAs per gene of each pathway in wild-type cells (A: 75, B: 10, C: 20).
At each stage, which pathway seems most likely to be over-expressed in the mutant? Let's take a poll to see if our common sense agrees.

Stage 1: you only know the total mRNA count of each pathway in the mutant (A: 180, B: 170, C: 40).

Although we have very little information to go by, if we are forced to guess which pathway is over-expressed, common sense says to pick the one with the highest number of mRNAs, i.e. pathway A. Can probability theory tell us this?
• A pathway is over-expressed if its mRNA count n_m in the mutant is bigger than the mRNA count n in a wild-type cell in the same condition.
• Being completely ignorant, our prior for n is P(n) ∝ 1/n.
• The probability that n < n_m is:
  P(n < n_m) = Σ_{n=1}^{n_m−1} P(n) ∝ log(n_m − 1)
Thus we find:
  P(n < n_m(A)) ∝ log(179),   P(n < n_m(B)) ∝ log(169),   P(n < n_m(C)) ∝ log(39)
Indeed, this says the best guess is pathway A.

Stage 2: you know the average number of mRNAs expressed in each pathway in wild-type cells (A: 50, B: 100, C: 10).

• What probability distribution P(n_A, n_B, n_C) best represents our information, namely that we only know the averages 50, 100, and 10?
• What probability distribution P(n_A, n_B, n_C) is as "ignorant" as possible while satisfying the right averages?
• Can we quantify how much ignorance a distribution represents?
The answer was given by Claude Shannon in 1948.

Definition of an ignorance function

Axioms:
1. There exists a function H[P] that assigns a real number to each probability distribution P, quantifying the ignorance associated with that distribution.
2. It is a continuous function of its arguments.
3. For the uniform distribution over n alternatives, the function h(n) = H[1/n, 1/n, ..., 1/n] should increase with n.
4. It should be consistent: if it can be calculated in multiple ways, it must always give the same result. In particular, it should be additive:
  H[p_a, p_b, p_c] = H[p_a, p_b + p_c] + (p_b + p_c) H[ p_b/(p_b+p_c), p_c/(p_b+p_c) ]
The function H[p_a, p_b, p_c] measures how ignorant we are about whether a, b, or c is the case. Now group b and c into a category d with p_d = p_b + p_c. Our ignorance about a, b, or c should equal our ignorance about a or d being the case, plus, a fraction (p_b + p_c) of the time, the additional ignorance about b or c given that d is the case.

• Assume we have a uniform probability distribution over n hypotheses, P_i = 1/n for i = 1, 2, ..., n. By definition its ignorance is h(n).
• Now divide the n possibilities into G groups, with the first g_1 hypotheses in the first group, the next g_2 in the second group, and so on. Using the consistency requirement we have:
  h(n) = H[g_1/n, g_2/n, ..., g_G/n] + Σ_{i=1}^{G} (g_i/n) h(g_i)
• If we set all g_i = g = n/G equal we get h(gG) = h(G) + h(g).
• The solution to this equation is h(n) = K log(n), where K is a constant we can choose freely.
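As a numerical check of this consistency requirement (anticipating the explicit formula derived next), the sketch below verifies that H[P] = −Σ p_i log p_i satisfies the additivity axiom for an arbitrary three-outcome distribution.

```python
from math import log

def H(ps):
    """Shannon's ignorance (entropy) of a discrete distribution, in nats."""
    return -sum(p * log(p) for p in ps if p > 0)

pa, pb, pc = 0.2, 0.5, 0.3
pd = pb + pc   # group b and c into a single category d

lhs = H([pa, pb, pc])
rhs = H([pa, pd]) + pd * H([pb / pd, pc / pd])
print(lhs, rhs)                  # the two ways of computing the ignorance agree
assert abs(lhs - rhs) < 1e-12
```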
• Now use the solution h(n) = log(n) and substitute it in the equation for general distributions:
  H[g_1/n, g_2/n, ..., g_G/n] = h(n) − Σ_{i=1}^{G} (g_i/n) h(g_i) = log(n) − Σ_i (g_i/n) log(g_i) = −Σ_i (g_i/n) log(g_i/n)
• We can now interpret the fractions p_i = g_i/n as general probabilities, so our solution finally becomes:
  H[p_1, p_2, ..., p_n] = −Σ_{i=1}^{n} p_i log(p_i)

The entropy of a distribution

• Thermodynamics: because this function has the same functional form as the entropy function of statistical physics, Shannon called it entropy.
• Yes-and-no questions: if we want to find out which hypothesis is true by asking yes/no questions, it takes on average H questions to find out (with the logarithm taken in base 2).
• Optimal coding: if a large number n of samples are taken from the distribution, the shortest description of the whole sample has size nH.
Entropy measures ignorance (Claude Shannon, 1948).

Back to the over-expression problem: the maximum entropy principle

• We will find the distribution P(n_A, n_B, n_C) that maximizes the entropy under the constraint that it has the correct average values.
• Any other distribution is inconsistent with our information: such a distribution would be less ignorant and would thus effectively assume things that we do not know.

For Stage 2 we need a distribution P(n_A, n_B, n_C) that maximizes H[P] subject to:
  Σ n_A P(n_A, n_B, n_C) = 50,   Σ n_B P(n_A, n_B, n_C) = 100,   Σ n_C P(n_A, n_B, n_C) = 10
Performing the sums over the other variables, these constraints only involve the marginal distributions:
  Σ n_A P_A(n_A) = 50,   Σ n_B P_B(n_B) = 100,   Σ n_C P_C(n_C) = 10
Since we have no information relating the pathways, the solution has independent distributions for A, B, and C:
  P(n_A, n_B, n_C) = P_A(n_A) P_B(n_B) P_C(n_C)

Our general problem therefore has the form: find the distribution P(n) such that the average matches a given value,
  Σ_n n P(n) = n_av
and the entropy is maximal,
  −Σ_n P(n) log P(n) = maximal.
This is a variational problem that can be solved using the method of Lagrange multipliers. The solution satisfies:
  δ/δP(n) [ Σ_n ( −P(n) log P(n) − μ P(n) − λ n P(n) ) ] = 0
and is given by:
  P(n) = e^{−λn} / Z
with Z a normalizing constant. Normalization requires:
  Z = Σ_n e^{−λn} = 1/(1 − e^{−λ})
Z is often called a partition function. We set λ such that the average takes on the desired value. Note that:
  −d log(Z)/dλ = Σ_n n P(n) = ⟨n⟩
so we can solve for λ from:
  e^{−λ}/(1 − e^{−λ}) = n_av   ⇔   λ = log[ (n_av + 1)/n_av ]
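A small sketch of this result: for a given average n_av, build the maximum-entropy distribution P(n) = e^{−λn}/Z with λ = log((n_av+1)/n_av), check that its mean is indeed n_av, and compare its entropy with that of a Poisson distribution of the same mean. The Poisson comparison is an added illustration (not part of the lecture): because the Poisson implicitly assumes more than just the mean, its entropy comes out lower.

```python
from math import log, exp, lgamma

n_av = 50.0
lam = log((n_av + 1) / n_av)     # maximum-entropy multiplier for mean n_av
N_MAX = 2000                     # truncate the support; the tail beyond is negligible

# Maximum-entropy (geometric) distribution P(n) = e^{-lam*n} / Z.
maxent = [exp(-lam * n) for n in range(N_MAX)]
Z = sum(maxent)
maxent = [p / Z for p in maxent]

# Poisson distribution with the same mean, computed via logs to avoid overflow.
poisson = [exp(n * log(n_av) - n_av - lgamma(n + 1)) for n in range(N_MAX)]

def mean(p):
    return sum(n * pn for n, pn in enumerate(p))

def entropy(p):
    return -sum(pn * log(pn) for pn in p if pn > 0)

print(f"maxent  mean = {mean(maxent):.2f}, entropy = {entropy(maxent):.3f} nats")
print(f"poisson mean = {mean(poisson):.2f}, entropy = {entropy(poisson):.3f} nats")
```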
So the maximum-entropy distribution given average n_av is:
  P(n) = (1/(n_av+1)) · (n_av/(n_av+1))^n
and the solutions for our case are:
  P(n_A) = (1/51)(50/51)^{n_A},   P(n_B) = (1/101)(100/101)^{n_B},   P(n_C) = (1/11)(10/11)^{n_C}
Thus the probability that a wild-type cell would have fewer mRNAs in pathway A than the 180 that the mutant has is:
  P(n_A < 180) = Σ_{n=0}^{179} (1/51)(50/51)^n = 1 − (50/51)^{180} = 0.97
For all three pathways we have:
  P(n_A < 180) = 1 − (50/51)^{180} = 0.97
  P(n_B < 170) = 1 − (100/101)^{170} = 0.82
  P(n_C < 40) = 1 − (10/11)^{40} = 0.98
Now pathway C looks the most promising! This is, roughly speaking, because the ratio n_mutant/n_av is the largest for this pathway, namely 4.

Stage 3: you know the average number of mRNAs per gene in each pathway in wild-type cells (A: 75, B: 10, C: 20).

• We now break down the total expression in terms of the number of genes with different expression levels.
• Because the information about each pathway is still independent of the others, we focus on a single pathway first (say A).
• The expression state of the genes in a pathway can be specified by a vector (n_1, n_2, n_3, ...), meaning n_1 genes with 1 mRNA, n_2 genes with 2 mRNAs, and so on.
• The total number of mRNAs in the pathway is t = Σ_i i·n_i, and the total number of expressed genes is n = Σ_i n_i.
• We again find the maximum-entropy distribution P(n_1, n_2, ...) satisfying the constraints on the averages ⟨t⟩ and ⟨n⟩.

The variational equation now gives:
  δ/δP [ Σ_{n_1,n_2,...} ( −P log P − cP − λ t P − μ n P ) ] = 0
which can be solved to give:
  log P(n_1, n_2, ...) = C − λ Σ_i i·n_i − μ Σ_i n_i
The normalization constant C is again obtained from the partition function:
  Z = Σ_{n_1,n_2,...} e^{−Σ_i n_i(λi+μ)} = ∏_{i=1}^{∞} [ Σ_{n_i=0}^{∞} e^{−n_i(λi+μ)} ] = ∏_{i=1}^{∞} 1/(1 − e^{−(λi+μ)})
To fit the constraints we again take derivatives of the partition function:
  ⟨n⟩ = −d log(Z)/dμ = Σ_{i=1}^{∞} 1/(e^{λi+μ} − 1),   ⟨t⟩ = −d log(Z)/dλ = Σ_{i=1}^{∞} i/(e^{λi+μ} − 1)

For pathway A (total 50 mRNAs, 75 mRNAs per gene on average) the constraint on the total number of mRNAs gives:
  Σ_{i=1}^{∞} i/(e^{λ_A i + μ_A} − 1) = 50
and the average number of mRNAs per gene gives, for the expected number of expressed genes:
  Σ_{i=1}^{∞} 1/(e^{λ_A i + μ_A} − 1) = 50/75
Solving this numerically we find λ_A = 0.0134, μ_A = 4.73. Similarly, for all three pathways we find the solutions:
  λ_A = 0.0134, μ_A = 4.73;   λ_B = 0.085, μ_B = 0.599;   λ_C = 0.051, μ_C = 3.708
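A sketch of the numerical solution of these two constraint equations, here for pathway A (average total 50, 75 mRNAs per gene, i.e. 50/75 expected expressed genes). It uses scipy's root finder with a log-parametrization that keeps λ and μ positive, and should land close to the quoted λ_A ≈ 0.0134, μ_A ≈ 4.73.

```python
import numpy as np
from scipy.optimize import fsolve

T_TARGET = 50.0          # constrained average total number of mRNAs
G_TARGET = 50.0 / 75.0   # constrained average number of expressed genes (total / per-gene)
I_MAX = 5000             # truncation of the sum over expression levels i

i = np.arange(1, I_MAX + 1)

def constraints(params):
    lam, mu = np.exp(params)                 # log-parametrization keeps lam, mu > 0
    x = np.clip(lam * i + mu, None, 700.0)   # cap the exponent to avoid overflow
    occ = 1.0 / (np.exp(x) - 1.0)            # mean occupation of expression level i
    return [np.sum(i * occ) - T_TARGET,      # total-mRNA constraint
            np.sum(occ) - G_TARGET]          # expressed-gene constraint

solution = fsolve(constraints, x0=np.log([0.01, 4.0]))
lam, mu = np.exp(solution)
print(f"lambda = {lam:.4f}, mu = {mu:.2f}")
```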
The general form of the distribution is:
  P(n_1, n_2, ...) = (1/Z) ∏_{k=1}^{∞} e^{−(kλ+μ) n_k}
However, what we want is the distribution P(t) of the total t = Σ_{k=1}^{∞} k·n_k.

Notice that the distribution P(n_k) for the number of genes n_k that have k mRNAs expressed is independent for each k:
  P(n_1, n_2, ...) = ∏_{k=1}^{∞} P(n_k),   P(n_k) = e^{−(kλ+μ) n_k} (1 − e^{−(kλ+μ)})
• For t not too small, the distribution P(t) of the total t = Σ_k k·n_k is thus a sum of many independent contributions. As we will see in a minute, we can therefore approximate P(t) by a Gaussian distribution:
  P(t) ≈ C exp( −(t − ⟨t⟩)² / (2σ²) )
• Using this approximation we find for the standard deviations of the distributions for A, B, and C:
  σ_A = 85.8,   σ_B = 46.0,   σ_C = 19.3

In summary, we find the following distributions for the total number t of mRNAs in wild-type cells for each pathway:
  P(t_A) ≈ C_A exp( −½ ((t_A − 50)/85.8)² ),   P(t_B) ≈ C_B exp( −½ ((t_B − 100)/46.0)² ),   P(t_C) ≈ C_C exp( −½ ((t_C − 10)/19.3)² )
The probabilities for each pathway that the mutant is over-expressed are then:
  ∫₀^{180} P(t_A) dt_A = 0.91,   ∫₀^{170} P(t_B) dt_B = 0.93,   ∫₀^{40} P(t_C) dt_C = 0.91
So now we have a slight preference for pathway B.

Summary:
• The entropy of a distribution quantifies the 'ignorance' associated with it.
• We can use the maximum entropy principle to represent our information in a given situation.
• If we know the averages ⟨f_i⟩ of a certain number of quantities f_i, the maximum-entropy distribution takes the form:
  P(h) = e^{−Σ_i λ_i f_i(h)} / Z,   Z = Σ_h e^{−Σ_i λ_i f_i(h)}
  with Z the partition function. The multipliers λ_i are set by solving the constraints, which can be done by taking derivatives of the partition function:
  ⟨f_i⟩ = −∂ log(Z)/∂λ_i

Stochastic processes

We can also use probability theory to describe processes that show variations which we can neither predict nor control. Probably the simplest example is a process in which certain events happen irregularly, but at a certain overall rate r per unit time. Examples:
• A gene being transcribed to produce a new mRNA.
• An mRNA being degraded.
• A cell reproducing.
• A cell dying.
• A mutation occurring.
• And so on.

What is the distribution P(t|r) of the time until the next transcription of a gene if we only know the overall rate r? That is, we only know that in a sufficiently large time T the number of times N that the gene is transcribed is given by r = N/T. The solution is given by the maximum entropy distribution:
  P(t|r) dt = r e^{−rt} dt
Notice that normally this distribution is derived by assuming a constant rate per unit time:
  dP(t|r)/dt = −r P(t|r)   ⇔   P(t|r) = r e^{−rt}
Thus probability theory tells us that a constant rate is the most 'ignorant' assumption.

How long does it take before n transcriptions have occurred? The probability to obtain the first transcription at t_1, the second at t_2, and so on until the nth transcription at time t, is:
  P(t_1, t_2, ..., t_{n−1}, t | r) = r^n e^{−r(t−t_{n−1})} e^{−r(t_{n−1}−t_{n−2})} ⋯ e^{−rt_1} = r^n e^{−rt}
We then integrate out the n−1 nuisance parameters:
  P(t|n,r) = r^n e^{−rt} ∫₀^t dt_1 ∫_{t_1}^t dt_2 ⋯ ∫_{t_{n−2}}^t dt_{n−1} = r^n t^{n−1} e^{−rt} / (n−1)!
This distribution of the time until the nth transcription is called a Gamma distribution.
[Figure: the Gamma distribution P(t|n,r) for n = 2, 5, 10, and 50.]
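A quick simulation of this result: drawing n independent exponential waiting times with rate r and summing them should reproduce the mean n/r and variance n/r² of the Gamma distribution above. The values of r, n and the number of trials are arbitrary illustrative choices.

```python
import random

random.seed(0)
r, n, trials = 2.0, 10, 100_000

# Sum of n exponential waiting times (rate r) per trial: the total waiting time
# until the n-th event, which follows the Gamma distribution derived above.
totals = [sum(random.expovariate(r) for _ in range(n)) for _ in range(trials)]

mean = sum(totals) / trials
var = sum((t - mean) ** 2 for t in totals) / trials
print(f"simulated mean = {mean:.3f}  (theory: n/r   = {n / r:.3f})")
print(f"simulated var  = {var:.4f}  (theory: n/r^2 = {n / r**2:.4f})")
```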
The moment generating function

For each probability distribution P(t) the associated moment-generating function G(k) is defined as:
  G(k) = ⟨e^{−kt}⟩ = ∫₀^∞ P(t) e^{−kt} dt
Formally, this is a Laplace transform of the function P(t). It is called a moment generating function because of the property:
  (−1)^n d^n G(k)/dk^n |_{k=0} = ∫₀^∞ t^n P(t) dt = ⟨t^n⟩
The generating function G_n(k) of a convolution t = t_1 + t_2 + ⋯ + t_n is the product of the generating functions:
  G_n(k) = ∫₀^∞ e^{−kt} P(t) dt = ∫ dt_1 dt_2 ⋯ dt_n e^{−kt_1 − kt_2 − ⋯ − kt_n} P(t_1) ⋯ P(t_n) = G(k)^n

The central limit theorem

Let y be the average of n values x_i, where each x_i has the same probability distribution P(x), and let G(k) be the generating function of P(x). The generating function G_y(k) for y is given by:
  G_y(k) = ∫ dx_1 dx_2 ⋯ dx_n e^{−k(x_1+x_2+⋯+x_n)/n} P(x_1) ⋯ P(x_n) = G(k/n)^n
Any smooth function raised to a very large power is dominated by its behaviour close to its maximum, so for very large n we can approximate:
  log[ G(k/n)^n ] ≈ n [ (G'(0)/G(0)) (k/n) + (k²/(2n²)) ( G''(0)/G(0) − G'(0)²/G(0)² ) ] = −⟨x⟩ k + (k²/(2n)) var(x)
  ⇔   G(k/n)^n ≈ e^{ −⟨x⟩ k + k² var(x)/(2n) }
The generating function of a Gaussian distribution is given (up to normalization) by:
  G(k) = ∫ dx e^{−kx} e^{−(x−μ)²/(2σ²)} ∝ e^{ −kμ + k² σ²/2 }
We have thus established that the generating function of the average of n variables from the same distribution is that of a Gaussian distribution with mean ⟨x⟩ and variance var(x)/n.
Conclusion: adding many independent random contributions together leads to a Gaussian distribution of the sum.

General summary

• Probability theory is the unique extension of logic to cases where our information is incomplete.
• A probability represents our state of knowledge.
• We assign probabilities by formalizing precisely what it is we do and do not know about the problem at hand.
• We can use symmetries (in equivalent situations we assign the same probabilities) to determine the probabilities.
• We can use the maximum entropy principle to determine which probability distributions correctly represent partial information.
• Bayes' theorem allows us to update our probabilities in light of data:
  P(h|Data) = P(Data|h) P(h) / Σ_{h'} P(Data|h') P(h')
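As a closing illustration of the central limit theorem above, the simulation below averages n draws from a skewed (exponential) distribution: the mean stays at ⟨x⟩, the variance shrinks as var(x)/n, and the skewness decays towards the symmetric Gaussian limit. All parameters are arbitrary illustrative choices.

```python
import random

random.seed(0)
rate, trials = 1.0, 50_000   # exponential draws with mean 1 and variance 1

def moments_of_average(n):
    """Mean, variance and skewness of the average of n exponential draws."""
    ys = [sum(random.expovariate(rate) for _ in range(n)) / n for _ in range(trials)]
    m = sum(ys) / trials
    var = sum((y - m) ** 2 for y in ys) / trials
    skew = sum((y - m) ** 3 for y in ys) / trials / var ** 1.5
    return m, var, skew

for n in (1, 10, 100):
    m, var, skew = moments_of_average(n)
    print(f"n = {n:3d}: mean = {m:.3f}, var = {var:.4f} (theory {1.0 / n:.4f}), skew = {skew:.2f}")
```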