I529: Machine Learning in Bioinformatics (Spring 2013)
Probabilistic models
Yuzhen Ye, School of Informatics and Computing, Indiana University, Bloomington
Spring 2013

Definitions
- A model is a system that simulates the object under consideration.
- A probabilistic model is one that produces different outcomes with different probabilities (BSA).

Why probabilistic models
- The biological system being analyzed may be stochastic, or noisy, or completely deterministic; but because a number of hidden variables affecting its behavior are unknown, the observed data may be best explained by a probabilistic model.

Probability
- Experiment: a procedure involving chance that leads to different results.
- Outcome: the result of a single trial of an experiment.
- Event: one or more outcomes of an experiment.
- Probability: the measure of how likely an event is, between 0 (will not occur) and 1 (will occur).
- P(A) = (the number of ways event A can occur) / (the total number of possible outcomes).

Example: a fair 6-sided dice
- Outcomes: the possible outcomes of this experiment are 1, 2, 3, 4, 5 and 6.
- Events: "1"; "6"; "even".
- Probability: all outcomes are equally likely to occur, so P(1) = P(6) = 1/6 and P(even) = 3/6 = 1/2.

Random variable
- Random variables are functions that assign a unique number to each possible outcome of an experiment.
- Example: tossing a coin. Outcome space: {heads, tails}. Define X = 1 if heads, X = 0 if tails.
- More exactly, X is a discrete random variable with P(X=1) = 1/2 and P(X=0) = 1/2.

Probability distribution
- A probability distribution is the assignment of a probability P(x) to each outcome x.
- Probability mass function (pmf): a function that gives the probability that a discrete random variable is exactly equal to some value; it is often the primary means of defining a discrete probability distribution.
- For the coin example, the pmf is P(X) = 1/2 for heads, 1/2 for tails, and 0 otherwise.
- A fair dice: outcomes are equally likely to occur, so the distribution over all six outcomes is P(x) = 1/6 for x = 1, 2, ..., 6.
- A loaded dice: outcomes are unequally likely to occur, so the distribution is P(x) = f(x) for x = 1, 2, ..., 6, with ∑ f(x) = 1.

The probability of a DNA sequence
- Event: observing a DNA sequence S = s1 s2 ... sn, with si ∈ {A, C, G, T}.
- Random sequence model (the independent and identically distributed, i.i.d., model): si occurs at random with probability P(si), independent of all other residues in the sequence, so
  P(S) = ∏_{i=1}^{n} P(si).
- This model will be used as a background model (also called a null hypothesis).
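To make the i.i.d. background model concrete, here is a minimal Python sketch; the uniform base frequencies and the example sequence are illustrative assumptions, not values from the slides.

```python
import math

# Hypothetical background base frequencies (illustrative values, not from the slides)
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

def iid_log_probability(sequence, probs):
    """Log-probability of a DNA sequence under the i.i.d. background model:
    log P(S) = sum_i log P(s_i)."""
    return sum(math.log(probs[base]) for base in sequence)

seq = "ACGTACGT"
log_p = iid_log_probability(seq, background)
print(f"P(S) = {math.exp(log_p):.3e}")   # (1/4)^8 for the uniform background
```

Working in log space avoids numerical underflow for long sequences, which is why the sketch sums log-probabilities rather than multiplying the P(si) directly.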
Joint probability
- Two experiments (two random variables) X and Y.
- P(X,Y): the joint probability (distribution) of X and Y.
- P(X,Y) = P(X|Y) P(Y) = P(Y|X) P(X).
- If P(X|Y) = P(X), then X and Y are independent.
- Example: experiment 1 (selecting a dice), experiment 2 (rolling the selected dice):
  - P(y): y = D1 or D2;
  - P(i, D1) = P(i|D1) P(D1);
  - P(i|D1) = P(i|D2) if the two experiments are independent events.

Probability density function (pdf)
- Probability density functions are for continuous rather than discrete random variables: f(x).
- A pdf must be integrated over an interval to yield a probability, since P(X = a) = 0 for any single value a:
  P(a ≤ X ≤ b) = ∫_a^b f(x) dx.
- Cumulative distribution function (cdf):
  P(X ≤ x) = ∫_{-∞}^{x} f(t) dt.

Marginal probability
- The distribution of the marginal variables (the marginal distribution) is obtained by marginalizing over the distribution of the variables being discarded (the discarded variables are marginalized out).
- Discrete case: P(X) = ∑_Y P(X|Y) P(Y); continuous case: P(x) = ∫ P(x, y) dy.
- Example: experiment 1 (selecting a dice), experiment 2 (rolling the selected dice):
  - P(y): y = D1 or D2;
  - P(i) = P(i|D1) P(D1) + P(i|D2) P(D2);
  - if P(i|D1) = P(i|D2) (independent events), then P(i) = P(i|D1) (P(D1) + P(D2)) = P(i|D1).

Conditional probability
- Conditioning the joint distribution on a particular observation.
- Conditional probability P(X|Y): the measure of how likely an event X happens under the condition Y:
  P(x|y) ≡ P(x,y) / P(y) = P(x,y) / ∫ P(x,y) dx.
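As a small illustration of marginalizing and conditioning, the sketch below works through the dice-selection example in Python; the selection probabilities and the loaded-dice values are assumptions made for the example.

```python
# Marginal and conditional probability for the dice-selection example
# (the dice and selection probabilities below are illustrative assumptions).
p_dice = {"D1": 0.5, "D2": 0.5}                       # P(y): which dice is selected
p_roll = {
    "D1": {i: 1/6 for i in range(1, 7)},              # fair dice
    "D2": {6: 0.5, **{i: 0.1 for i in range(1, 6)}},  # loaded dice
}

# Marginal: P(i) = sum_y P(i|y) P(y)
p_i = {i: sum(p_roll[y][i] * p_dice[y] for y in p_dice) for i in range(1, 7)}

# Conditional (Bayes): P(D2 | i=6) = P(6|D2) P(D2) / P(6)
p_d2_given_6 = p_roll["D2"][6] * p_dice["D2"] / p_i[6]
print(p_i[6], p_d2_given_6)   # 0.333..., 0.75
```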
Probability models
- A probabilistic model is a system that produces different outcomes with different probabilities.
- It can simulate a class of objects (events), assigning each an associated probability.
- Simple objects (processes) can be modeled with standard probability distributions.
- Example: two dices D1 and D2:
  - P(i|D1): the probability of picking i using dice D1;
  - P(i|D2): the probability of picking i using dice D2.

Typical probability distributions
- Binomial distribution
- Gaussian distribution
- Multinomial distribution
- Poisson distribution
- Dirichlet distribution

Binomial distribution
- An experiment with binary outcomes: 0 or 1.
- Probability distribution of a single experiment: P('1') = p and P('0') = 1 - p.
- Probability distribution of N tries of the same experiment:
  Bi(k '1's out of N tries) = C(N, k) p^k (1 - p)^(N - k).
- More generally, the binomial distribution gives the discrete probability of obtaining exactly n successes out of N Bernoulli trials, where each trial is true with probability p and false with probability q = 1 - p:
  P(n|N) = C(N, n) p^n q^(N - n).
- The probability of obtaining more successes than the n observed in a binomial distribution is
  P = ∑_{k=n+1}^{N} C(N, k) p^k q^(N - k).

Gaussian distribution
- When N → ∞, the binomial distribution approaches the Gaussian distribution.
- The Gaussian (or normal) distribution is a continuous probability distribution with a bell-shaped probability density function:
  f(x; μ, σ²) = (1 / √(2πσ²)) exp(-(x - μ)² / (2σ²)).
- μ is the mean (expectation); σ² is the variance (σ is the standard deviation).
- If we define a new variable u = (x - μ)/σ, we have f(u) = (1/√(2π)) exp(-u²/2), the standard normal distribution (μ = 0 and σ² = 1). (Figure from Wikipedia.)

Multinomial distribution
- An experiment with K independent outcomes, occurring with probabilities θi, i = 1, ..., K, ∑ θi = 1.
- Probability distribution of N tries of the same experiment, getting ni occurrences of outcome i (∑ ni = N, n = {ni}):
  P(n|θ) = M^{-1}(n) ∏_{i=1}^{K} θi^{ni},  where  M(n) = ∏_i ni! / (∑_k nk)! = n1! n2! ··· nK! / (∑_k nk)!.
- Example 1: a fair dice, where the outcomes 1, 2, ..., 6 are equally likely. The probability of rolling it a dozen times (12) and getting each outcome twice is
  12!/(2!)^6 × (1/6)^12 ≈ 3.4×10^-3.
- Example 2: a loaded dice, where the outcomes are unequally likely: P(6) = 0.5, P(1) = P(2) = ... = P(5) = 0.1. The probability of rolling it a dozen times and getting each outcome twice is
  12!/(2!)^6 × (0.5)^2 × (0.1)^10 ≈ 1.87×10^-4.

Poisson distribution
- The Poisson distribution gives the probability of seeing n events over some interval when events occur with a known average rate λ and independently of the time since the last event (i.e., it is a Poisson process):
  p(n) = λ^n e^{-λ} / n!.
- Example: I receive 10 emails every day on average; what is my chance of receiving no email today?
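A quick numerical check of the multinomial dice examples and the Poisson email example, sketched with scipy.stats (assuming SciPy is available):

```python
from scipy.stats import poisson, multinomial

# Poisson: 10 emails/day on average, chance of receiving none today
print(poisson.pmf(0, mu=10))                                  # e^{-10} ≈ 4.5e-05

# Multinomial: roll a dice 12 times and get each outcome exactly twice
fair   = multinomial.pmf([2] * 6, n=12, p=[1/6] * 6)          # ≈ 3.4e-03
loaded = multinomial.pmf([2] * 6, n=12, p=[0.1] * 5 + [0.5])  # ≈ 1.87e-04
print(fair, loaded)
```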
Poisson distribution for sequencing coverage modeling
- Assuming a uniform distribution of reads along the genome:
  - length of the genomic segment: L;
  - number of reads: n;
  - length of each read: l;
  - coverage: λ = n·l / L.
- How much coverage is enough (or what is sufficient oversampling)?
- Lander-Waterman model: the number of reads covering a position is Poisson distributed,
  P(x) = λ^x e^{-λ} / x!,  so the probability that a position is not covered is P(x=0) = e^{-λ}, where λ is the coverage.
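A minimal sketch of the coverage calculation; the segment length, read count, and read length below are illustrative numbers, not data from the slides.

```python
import math

# Lander-Waterman sketch: expected fraction of the genome left uncovered
L = 5_000_000      # length of genomic segment (bp)
n = 500_000        # number of reads
l = 100            # read length (bp)

coverage = n * l / L                   # lambda = n*l / L
p_uncovered = math.exp(-coverage)      # P(x = 0) = e^{-lambda}
print(f"coverage = {coverage:.1f}x, uncovered fraction ≈ {p_uncovered:.2e}")
```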
Gamma distribution
- The gamma distribution has been used to model the rate of evolution at different sites in DNA sequences.
- The gamma distribution is conjugate to the Poisson, which gives the probability of seeing n events over some interval when there is a probability p of an individual event occurring in that interval. Since the number of events in an interval is a rate, the gamma distribution is appropriate for modeling probabilities of rates:
  g(x; α, λ) = λ e^{-λx} (λx)^(α-1) / Γ(α),  for 0 < x, α, λ < ∞.

Dirichlet distribution
- In Bayesian statistics, we need distributions over probability parameters to use as prior distributions. A natural choice for a density over probabilities is the Dirichlet distribution:
  D(θ|α) = Z^{-1}(α) ∏_{i=1}^{K} θi^(αi - 1) δ(∑_{i=1}^{K} θi - 1),
  where α = (α1, α2, ..., αK), with αi > 0, are constants specifying the Dirichlet distribution, and the θi satisfy 0 ≤ θi ≤ 1 and sum to 1.
- Outcomes: θ = (θ1, θ2, ..., θK).
- The normalizing factor Z for the Dirichlet can be expressed in terms of the gamma function:
  Z(α) = ∫ ∏_{i=1}^{K} θi^(αi - 1) δ(∑_i θi - 1) dθ = ∏_i Γ(αi) / Γ(∑_i αi).
- Different α = (α1, α2, ..., αK) give different probability distributions over θ.
- For K = 2, the Dirichlet reduces to the Beta distribution.
- Example: dice factories. Dice factories produce all kinds of dices θ = (θ(1), θ(2), ..., θ(6)). A dice factory distinguishes itself from the others by its parameters α = (α1, α2, α3, α4, α5, α6). The probability of producing a dice θ in the factory α is given by D(θ|α).
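To see how different α shape the distribution over dices, here is a sketch that draws one dice from each of three hypothetical factories; the α values are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Dice factory" picture: each factory is a Dirichlet with its own alpha,
# and each sample is one dice theta = (theta_1, ..., theta_6).
factory_uniform = np.ones(6)                      # alpha = (1,...,1): all dices equally likely
factory_fairish = 50 * np.ones(6)                 # large, equal alpha: dices concentrated near fair
factory_sixes   = np.array([1, 1, 1, 1, 1, 10])   # biased toward dices loaded on 6

for alpha in (factory_uniform, factory_fairish, factory_sixes):
    theta = rng.dirichlet(alpha)                  # one dice produced by this factory
    print(np.round(theta, 3))
```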
Probabilistic models and parameter estimation
- Selecting a model:
  - a model can be anything from a simple distribution to a complex stochastic grammar with many implicit probability distributions;
  - probability distributions (Gaussian, binomial, etc.);
  - probabilistic graphical models: Markov models, hidden Markov models (HMMs), Bayesian models, stochastic grammars.
- Estimating the model parameters (learning) from large sets of trusted examples:
  - the parameters of the model have to be inferred from the data;
  - MLE (maximum likelihood estimation) and MAP (maximum a posteriori probability).
- Data → model: learning. Model → data: inference/sampling.

MLE
- Given a set of data D (the training set), find a model with parameters θ that maximizes the likelihood P(D|θ):
  θ_MLE = argmax_θ P(D|θ).
- Example: a loaded dice. To estimate the parameters θ1, θ2, ..., θ6 from N observations D = d1, d2, ..., dN, take θi = ni / N, where ni is the number of occurrences of outcome i (the observed frequencies); this is the maximum likelihood solution (BSA 11.5). We can show that P(n|θ_MLE) > P(n|θ) for any θ ≠ θ_MLE.

When to use MLE
- A drawback of MLE is that it can give poor estimates when the data are scarce.
  - E.g., if you flip a coin twice, you may only get heads; then P(tails) = 0.
- It may be wiser to apply prior knowledge (e.g., we may assume P(tails) is close to 0.5): use MAP instead.

MAP
- Here the prior is used (as in Bayesian statistics): P(θ) is the prior probability and P(θ|D) is the posterior probability,
  P(θ|D) = P(D|θ) P(θ) / P(D) = P(D|θ) P(θ) / ∑_θ P(D|θ) P(θ).
- The MAP estimate is
  θ_MAP = argmax_θ P(θ|D) = argmax_θ P(D|θ) P(θ) / P(D) = argmax_θ P(D|θ) P(θ).

Example: two dices
- Prior probabilities: fair dice 0.99; loaded dice 0.01. Loaded dice: P(6) = 0.5, P(1) = ... = P(5) = 0.1.
- Data: 3 consecutive 6's.
  - P(loaded | 3 '6's) = P(loaded) × [P(3 '6's | loaded) / P(3 '6's)] = 0.01 × (0.5^3 / C);
  - P(fair | 3 '6's) = P(fair) × [P(3 '6's | fair) / P(3 '6's)] = 0.99 × ((1/6)^3 / C).
- Model comparison by the ratio: P(loaded | 3 '6's) / P(fair | 3 '6's) < 1, so the fair dice is more likely to have generated the observation.

Learning from counts: including the prior
- Use prior knowledge when the data are scarce. For the multinomial distribution, we can use the Dirichlet distribution with parameters α as the prior.
- Posterior for the multinomial distribution with observations n:
  P(θ|n) = P(n|θ) D(θ|α) / P(n).
- The posterior is itself a Dirichlet distribution like the prior, but with different parameters. Inserting the multinomial distribution and the expression for D(θ|α) yields
  P(θ|n) = [1 / (P(n) Z(α) M(n))] ∏_i θi^(ni + αi - 1) δ(∑_i θi - 1).
- We do not have to get involved with gamma functions to finish the calculation, because we know that both P(θ|n) and D(θ|n+α) are properly normalized probability distributions over θ. This means that all the prefactors must cancel, so
  P(θ|n) = D(θ|n + α),  and hence  P(n) = Z(n + α) / (Z(α) M(n)).
- We then perform an integral to find the posterior mean estimator:
  θi^PME = ∫ θi D(θ|n + α) dθ = Z(n + α + δi) / Z(n + α) = (ni + αi) / (N + A),  where A = ∑_i αi.
- This is equivalent to adding αi as pseudo-counts to the observations ni (BSA 11.5).
- We can forget about statistics and use pseudo-counts in the parameter estimation!
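The following sketch contrasts the MLE (observed frequencies) with the posterior mean (pseudo-count) estimate for a dice; the counts and the uniform α are made-up illustration values.

```python
import numpy as np

# MLE vs. posterior-mean (pseudo-count) estimates for a dice,
# with made-up counts from a small number of rolls
counts = np.array([1, 0, 2, 0, 0, 3])   # n_i from N = 6 rolls
alpha  = np.ones(6)                      # Dirichlet pseudo-counts (uniform prior)

theta_mle = counts / counts.sum()                       # theta_i = n_i / N (zeros for unseen faces)
theta_pme = (counts + alpha) / (counts + alpha).sum()   # theta_i = (n_i + alpha_i) / (N + A)

print(theta_mle)   # [0.167 0.    0.333 0.    0.    0.5  ]
print(theta_pme)   # [0.167 0.083 0.25  0.083 0.083 0.333]
```

Note how the pseudo-counts keep unseen faces from being assigned zero probability, which is exactly the scarce-data problem MAP-style estimation is meant to address.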
Sampling
- Consider a probabilistic model with parameters θ: P(x|θ) for event x.
- Sampling: generate a large set of events xi with probability P(xi|θ).
- A random number generator (a function such as rand()) picks a number randomly from the interval [0,1) with uniform density.
- Sampling from a probabilistic model amounts to transforming P(xi|θ) into a uniform distribution: for a finite set X (xi ∈ X), find i such that
  P(x1) + ... + P(x_{i-1}) < rand(0,1) < P(x1) + ... + P(x_{i-1}) + P(xi).

Entropy
- For a probability distribution P(xi) over K events:
  H(X) = -∑_i P(xi) log P(xi).
- H is maximized for the uniform distribution P(xi) = 1/K.
- Entropy is a measure of average uncertainty.

Mutual information
- A measure of the dependence of two random variables X and Y.
- If P(X|Y) = P(X), then X and Y are independent, i.e., P(X,Y) / (P(X)P(Y)) = 1.
- M(X;Y) = ∑_{x,y} P(x,y) log [P(x,y) / (P(x)P(y))].
- M(X;Y) = 0 if X and Y are independent.
- In frequency notation, MI_ij = ∑_{x,y} f_{xi,yj} log2 [f_{xi,yj} / (f_{xi} f_{yj})]; for example,
  MI_ij = (2/5) log2[(2/5)/(1 × 2/5)] + (3/5) log2[(3/5)/(1 × 3/5)] = 0.
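A short sketch tying these last slides together: sampling from a discrete distribution by the cumulative-sum method described above, then computing entropy and mutual information; the loaded-dice probabilities and the joint table are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_discrete(p, n):
    """Sample n outcomes from a discrete distribution p by the cumulative-sum
    method: find i such that P(x1)+...+P(x_{i-1}) <= u < P(x1)+...+P(x_i)."""
    cdf = np.cumsum(p)
    u = rng.random(n)
    return np.searchsorted(cdf, u, side="right")

def entropy(p):
    """H = -sum_i p_i log2 p_i (terms with p_i = 0 contribute 0)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mutual_information(pxy):
    """M(X;Y) = sum_{x,y} P(x,y) log2[P(x,y) / (P(x)P(y))] for a joint table."""
    pxy = np.asarray(pxy, dtype=float)
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return np.sum(pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz]))

loaded = [0.1] * 5 + [0.5]
draws = sample_discrete(loaded, 10_000)            # indices 0..5 stand for faces 1..6
print(entropy([1/6] * 6), entropy(loaded))         # uniform maximizes H: log2(6) ≈ 2.585
print(mutual_information([[0.25, 0.25], [0.25, 0.25]]))   # independent -> 0
```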