Bioinformatics II: Probability and Statistics
Universität Zürich and ETH Zürich, Spring Semester 2009
Lecture 1: Basic Probability
Dr Fraser Daly, adapted from a course by Dr N. Pétrélis

Course outline
• Basic probability (lecture 1)
• Statistical estimation and testing (lecture 2)
• Markov chains (lecture 3)
• Models and algorithms in bioinformatics (lectures 4 and 5)

Suppose we are given two sequences from two species: is there a common ancestor?

ggagactgtagacagctaatgctata
gaacgccctagccacgagcccttatc

Sequence length: 26 nucleotides; 11 of the 26 positions agree. Conclusion? Was the agreement generated ‘purely by chance’ or by some other mechanism? To answer this, we need to understand the properties of random sequences.

Why do we need probability and statistics in bioinformatics?
• modeling sequence evolution (Markov chains);
• inferring phylogenetic trees (maximum likelihood trees);
• gene prediction (hidden Markov models);
• analysis of microarray data (multiple testing, multivariate statistics);
• evaluating sequence similarity in BLAST searches (extreme values, random walks);
• and much more!

Random variables
A random variable is a quantity whose value depends on a random event. For example:
1. Toss a coin. Let X = 0 if we get heads, X = 1 if we get tails. X is a random variable.
2. Select two DNA sequences at random from a database. Let Y be the number of matches. Then Y is a random variable.
3. Let Z be the lifetime of a newly fitted lightbulb. Z is a random variable.
X and Y are discrete random variables; Z is a continuous random variable.

The most important feature of a random variable is its probability distribution: this tells us the probability of the random variable taking particular values. For discrete random variables, it can be specified by the probability mass function or by the (cumulative) distribution function. For continuous random variables, it can be specified by the probability density function or by the (cumulative) distribution function.
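The match count from the opening example can be checked directly. A minimal sketch, using the two 26-nucleotide sequences from the slides:

```python
# Count how many positions agree between the two example sequences.
seq1 = "ggagactgtagacagctaatgctata"
seq2 = "gaacgccctagccacgagcccttatc"

# zip pairs up the nucleotides position by position; the generator
# yields True (counted as 1) for each matching position.
matches = sum(a == b for a, b in zip(seq1, seq2))
print(matches, "of", len(seq1), "positions agree")  # 11 of 26 positions agree
```

This reproduces the figure quoted above: 11 of 26 positions agree.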
Discrete random variables
Let X be a discrete random variable taking values in a sample space S. For i ∈ S, we define the probability mass function of X to be

  p_X(i) = P(X = i).

The (cumulative) distribution function is

  F_X(i) = P(X ≤ i) = Σ_{j ∈ S, j ≤ i} p_X(j).

Note that both p_X(i) and F_X(i) are probabilities, so they lie between 0 and 1. These functions contain all essential information about the properties and behavior of X.

Continuous random variables
For a continuous random variable Y we can also define the (cumulative) distribution function:

  F_Y(t) = P(Y ≤ t) = ∫_{x ≤ t} f_Y(x) dx,

where the function f_Y(x) is the probability density function. The probability density function is not a probability! It is always positive, but can be larger than 1.

Example
Toss a fair coin. Let X = 0 if we get heads, X = 1 if we get tails. X is a discrete random variable with sample space {0, 1}. The probability mass function is

  p_X(0) = P(heads) = 1/2,  p_X(1) = P(tails) = 1/2.

The cumulative distribution function is

  F_X(0) = P(X ≤ 0) = P(X = 0) = 1/2,
  F_X(1) = P(X ≤ 1) = P(X = 0) + P(X = 1) = 1.

Example
Suppose we generate a DNA sequence as follows: at each position 1, …, N in the sequence choose either a, c, g or t with probabilities p_a, p_c, p_g and p_t, independently of any other position. Compare two DNA sequences generated in this way. Let Y be the number of matches between them. What is the probability mass function of Y?

Fix a position in the sequences. The probability of a match at that position is

  P(two ‘a’ or two ‘c’ or two ‘g’ or two ‘t’)
    = P(two ‘a’) + P(two ‘c’) + P(two ‘g’) + P(two ‘t’)
    = p_a² + p_c² + p_g² + p_t².

Call this probability p (the match probability).

What is P(Y = k)? We must have k matches and N − k mismatches. This can happen in several different ways, each of which has probability p^k (1 − p)^(N−k). We add the probabilities to obtain our answer:

  P(Y = k) = p^k (1 − p)^(N−k) + · · · + p^k (1 − p)^(N−k).

How many terms do we add?
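The match probability p = p_a² + p_c² + p_g² + p_t² is easy to compute. A small sketch, using hypothetical nucleotide probabilities (the values 0.3/0.2/0.2/0.3 are illustrative, not from the slides):

```python
# Hypothetical nucleotide probabilities; they must sum to 1.
probs = {"a": 0.3, "c": 0.2, "g": 0.2, "t": 0.3}

# Probability that two independently generated positions carry the same
# nucleotide: sum over nucleotides of P(both positions choose it).
p_match = sum(q ** 2 for q in probs.values())
print(round(p_match, 6))  # 0.09 + 0.04 + 0.04 + 0.09 = 0.26
```

For the uniform model p_a = p_c = p_g = p_t = 0.25, the same calculation gives p = 4 × 0.25² = 0.25.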
The number of terms is the binomial coefficient

  C(N, k) = N! / (k! (N − k)!),

the number of ways of choosing k positions from a total of N positions (‘N choose k’). So:

  P(Y = k) = C(N, k) p^k (1 − p)^(N−k),  k = 0, …, N.

Any random variable with a probability mass function of this form is said to be binomially distributed with parameters N and p. Think of N independent trials, each of which is either:
· a success (with probability p), or
· a failure (with probability 1 − p).
Let Z count the number of successes. Then Z has a binomial distribution with parameters N and p: Z ∼ Bin(N, p).

Think about our N independent trials again, each with success probability p. Let X be the trial on which we have our first success. Then

  p_X(k) = P(k − 1 failures, then a success) = (1 − p)^(k−1) p,  for k = 1, 2, ….

In this case, we say that X has a geometric distribution with parameter p.

Yet another important probability distribution is the uniform distribution. Suppose our random variable Z has N possible outcomes, each of which is equally likely. Then Z has a uniform distribution, with

  p_Z(i) = 1/N  for any i ∈ S.

For example, if we choose either a, c, g or t ‘uniformly at random’, each has probability 0.25 of being chosen.

Of course, there are infinitely many different probability distributions, but some occur time and time again in applications. As well as the binomial, uniform and geometric, these include: normal, Poisson, exponential, negative binomial, chi-square, Student’s t, beta, gamma.

Events
Remember that S is our sample space: the set of possible outcomes of our ‘experiment’. An event A is something that either will or will not occur in our ‘experiment’, so that A ⊆ S.
Examples: Roll a die once. Then S = {1, 2, 3, 4, 5, 6}. The event A1 that ‘the number we roll is odd’ can be written A1 = {1, 3, 5}. The event A2 that ‘the number we roll is at least three’ can be written A2 = {3, 4, 5, 6}.

Suppose A, A1, A2 are events (subsets of our sample space). We can use set operations to construct new events.
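The binomial probability mass function above translates directly into code. A minimal sketch, applied to the opening example (N = 26 positions, uniform nucleotide model, so match probability p = 0.25):

```python
import math

# P(Y = k) = C(N, k) * p^k * (1 - p)^(N - k)
def binomial_pmf(k: int, N: int, p: float) -> float:
    return math.comb(N, k) * p**k * (1 - p) ** (N - k)

# Probability of seeing exactly 11 matches in 26 positions purely by chance,
# under the uniform model of the slides (p = 0.25).
print(binomial_pmf(11, 26, 0.25))
```

Summing binomial_pmf(k, 26, 0.25) over k = 0, …, 26 gives 1, as any probability mass function must.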
• Aᶜ: ‘A does not occur’ (complement): Aᶜ = {j ∈ S : j ∉ A}.
• A1 ∪ A2: ‘either A1 or A2 occurs’ (union): A1 ∪ A2 = {j ∈ S : j ∈ A1 or j ∈ A2}.
• A1 ∩ A2: ‘both A1 and A2 occur’ (intersection): A1 ∩ A2 = {j ∈ S : j ∈ A1 and j ∈ A2}.

Computing probabilities of events
Remember that 0 ≤ P(A) ≤ 1. Also:

  P(S) = 1,
  P(Aᶜ) = 1 − P(A),
  P(A1 ∪ A2) = P(A1) + P(A2) − P(A1 ∩ A2).

Events A1 and A2 are called mutually exclusive if they cannot happen together (the intersection A1 ∩ A2 is the empty set). In this case

  P(A1 ∪ A2) = P(A1) + P(A2).

Conditional probability
Suppose we roll a fair die once. The probability of getting an odd number (either 1, 3 or 5) is 1/2. But suppose we are told that the number we rolled was (strictly) bigger than 3. How does this affect the probability? We know the number rolled was either 4, 5 or 6. Only one of these is odd, so given our knowledge that we rolled a number bigger than 3, the probability we got an odd number is only 1/3:

  P(odd number | number bigger than 3) = 1/3.

This is a conditional probability.

More generally, let A1 and A2 be two events with P(A2) > 0. The conditional probability P(A1|A2) is defined to be the probability that event A1 occurs, given that event A2 occurs. It can be calculated using the formula

  P(A1|A2) = P(A1 ∩ A2) / P(A2).

Example
Consider our dice example again:

  P(odd number | number bigger than 3)
    = P(odd number bigger than 3) / P(number bigger than 3)
    = P({5}) / P({4, 5, 6})
    = 1/3.

Independence
Two events A1 and A2 are said to be independent if

  P(A1 ∩ A2) = P(A1) P(A2).

This is equivalent to P(A1|A2) = P(A1) and P(A2|A1) = P(A2). That is, knowing whether A2 happened tells us nothing about whether A1 happened (and vice versa). Think of two random variables as being independent if knowing the value of one tells us nothing about the value of the other.

It is not always obvious when we have independence.
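The dice calculation above can be reproduced by enumerating the sample space, which is a useful sanity check for any conditional-probability formula. A small sketch:

```python
from fractions import Fraction

# Sample space for one roll of a fair die, and the two events of interest.
S = range(1, 7)
odd = {i for i in S if i % 2 == 1}   # A1 = {1, 3, 5}
big = {i for i in S if i > 3}        # A2 = {4, 5, 6}

# P(A1 | A2) = P(A1 ∩ A2) / P(A2); every outcome has probability 1/6.
p_big = Fraction(len(big), 6)
p_odd_and_big = Fraction(len(odd & big), 6)  # intersection is {5}
print(p_odd_and_big / p_big)  # 1/3
```

Using Fraction keeps the arithmetic exact, so the answer prints as 1/3 rather than 0.3333….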
In most applications we have dependence: independence is the exception, not the rule! For example, two DNA sequences linked by evolution are dependent.

Suppose we generate two random DNA sequences, and let Y be the number of matches. If the nucleotide in each position is independently chosen with probabilities p_a, p_c, p_g and p_t, we have already seen that Y has a binomial distribution. But if there is dependence between the nucleotides chosen, we no longer have a binomial distribution. Why?

There could be many different types of dependence. Consider an extreme example. Choose the nucleotide in position 1 with probabilities p_a, p_c, p_g and p_t, then set the nucleotides in positions 2, …, N to be the same as that in position 1. The only possible sequences are

  aaa···aaa  ccc···ccc  ggg···ggg  ttt···ttt

So the only possible numbers of matches are 0 and N: Y cannot possibly have a binomial distribution.

Expectation and variance
With a random variable X (or a probability distribution) we can associate some important quantities. These give us an idea of how the random variable behaves (its location and spread):
• the expected value µ = E[X];
• the variance σ² = Var(X);
• the standard deviation σ = SD(X) = √Var(X).

Expected value
Suppose we generate two DNA sequences of length N = 1000, with nucleotides chosen independently and uniformly, that is, p_a = p_c = p_g = p_t = 0.25. Let Y be the number of matches. From before, we know that Y ∼ Bin(1000, 0.25). How many matches do we expect to see ‘on average’? Intuitively, it should be about 1000 × 0.25 = 250. In fact, if Y ∼ Bin(N, p), we can show that its expected value is E[Y] = Np. This agrees with our intuitive answer.

For any discrete random variable X with state space S, we can define its expected value µ = E[X] by

  E[X] = Σ_{i ∈ S} i P(X = i).

Example: Roll a fair die and let X be the number we see. Then

  E[X] = 1 · 1/6 + 2 · 1/6 + · · · + 6 · 1/6 = 3.5.
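The claim E[Y] = Np can be checked by simulation. A quick sketch (trial count and seed are arbitrary choices, not from the slides): generate many pairs of length-1000 uniform sequences and average the match counts, which should come out close to 250.

```python
import random

random.seed(0)  # fixed seed so the run is reproducible
N, p, trials = 1000, 0.25, 2000

def matches() -> int:
    # Each of the N positions independently matches with probability p,
    # so this draws one observation of Y ~ Bin(N, p).
    return sum(random.random() < p for _ in range(N))

mean = sum(matches() for _ in range(trials)) / trials
print(mean)  # close to N * p = 250
```

The sample mean fluctuates around 250; its standard deviation over 2000 trials is about 0.3, so a value such as 249.8 or 250.2 is typical.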
Note that the expected value is not necessarily one of the possible values of X.

Example: If Y ∼ Bin(N, p), then

  E[Y] = Σ_{i=0}^{N} i C(N, i) p^i (1 − p)^(N−i) = Np.

One interpretation: repeat the experiment many times. Take independent observations X1, …, Xn, each with the same distribution as X, and take the mean of the results you obtain:

  (X1 + X2 + · · · + Xn) / n.

As n gets large, this mean gets closer and closer to E[X]. (More precise statements can be made.)

Properties of expectation
Expectation has a very nice linearity property. Let X1, X2, …, Xn be random variables (dependent or independent), and let c1, …, cn be real numbers. Then

  E[c1 X1 + · · · + cn Xn] = c1 E[X1] + · · · + cn E[Xn].

If X and Y are independent random variables, then

  E[XY] = E[X] · E[Y].

This last result is not true in general if X and Y are dependent.

Let X be a discrete random variable and let g be any function. We can define E[g(X)], the expected value of g(X):

  E[g(X)] = Σ_{i ∈ S} g(i) P(X = i).

For example:

  E[X²] = Σ_{i ∈ S} i² P(X = i).

There is a similar formula for a continuous random variable X with probability density function f_X(x):

  E[g(X)] = ∫ g(x) f_X(x) dx.

Warning: in general, E[g(X)] ≠ g(E[X]).

Suppose that X ∼ Bin(1000, 0.25). We know that E[X] = 250, so we expect observations around 250. 251 or 249 would not be surprising results, but what about 240? 200? 170? The expectation gives us a measure of location; it is also useful to have a measure of spread. How much variability do we expect in our observations of X?

For a random variable X, we define the variance of X, σ² = Var(X):

  Var(X) = E[(X − E[X])²] = E[X²] − E[X]².

The second form is usually easier for calculations. The standard deviation of X, σ, is defined by

  σ = SD(X) = √Var(X).

A high variance indicates a high deviation from the mean.

The term (X − E[X])² is the (squared) distance between X and its expected value. In some sense, σ is the ‘average deviation of X from its mean’.
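The fair-die example ties these definitions together: E[X] and E[X²] come straight from the probability mass function, and the variance follows from the computational form Var(X) = E[X²] − E[X]². A small sketch with exact arithmetic:

```python
from fractions import Fraction

# Probability mass function of a fair die: each face has probability 1/6.
pmf = {i: Fraction(1, 6) for i in range(1, 7)}

E_X = sum(i * p for i, p in pmf.items())      # E[X]   = 7/2  (= 3.5)
E_X2 = sum(i**2 * p for i, p in pmf.items())  # E[X^2] = 91/6
var = E_X2 - E_X**2                           # Var(X) = E[X^2] - E[X]^2

print(E_X)  # 7/2
print(var)  # 35/12
```

Note that E[X²] = 91/6 ≈ 15.17 while E[X]² = 49/4 = 12.25, a concrete instance of the warning that E[g(X)] ≠ g(E[X]).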
Properties of variance
For any random variable X and constants a and b:

  Var(aX + b) = a² Var(X).

If X and Y are independent random variables, then

  Var(X + Y) = Var(X) + Var(Y).

If X and Y are dependent random variables, then

  Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y).

Covariance
We define the covariance between random variables X and Y:

  Cov(X, Y) = E[(X − E[X]) · (Y − E[Y])] = E[XY] − E[X] E[Y].

Covariance measures the linear dependence between X and Y. Properties:
• If X and Y are independent, then Cov(X, Y) = 0.
• Cov(X, X) = Var(X).
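The covariance identity Cov(X, Y) = E[XY] − E[X]E[Y] is easy to verify on a small dependent pair. A sketch (the pair is an illustrative choice, not from the slides): let X be a fair die roll and Y = 7 − X, the opposite face, so Y is completely determined by X.

```python
from fractions import Fraction

# Fair die: each face 1..6 has probability 1/6; Y = 7 - X is the opposite face.
pmf = {i: Fraction(1, 6) for i in range(1, 7)}

E_X = sum(x * p for x, p in pmf.items())            # 7/2
E_Y = sum((7 - x) * p for x, p in pmf.items())      # also 7/2
E_XY = sum(x * (7 - x) * p for x, p in pmf.items()) # E[XY] = 28/3

cov = E_XY - E_X * E_Y
print(cov)  # -35/12
```

Here Cov(X, Y) = −35/12 = −Var(X): perfect negative dependence, since a large X forces a small Y. For independent X and Y the same calculation would give 0.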